CHANGELOG

#######################################
#           Thrust v1.7.0             #
#######################################

Summary
    XXX TODO

Breaking API Changes
    XXX TODO
    Dispatch
      custom user backend systems' tag types must now inherit from the corresponding system's dispatchable template (e.g. thrust::cuda::dispatchable) instead
      of the tag struct (e.g. thrust::cuda::tag). Algorithm specializations will silently go unfound during dispatch otherwise.
      See examples/minimal_custom_backend.cu and examples/cuda/custom_temporary_allocation.cu for usage examples.
      thrust::advance and thrust::distance are no longer dispatched based on iterator system and thus may no longer be customized.
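
      The inheritance requirement above can be illustrated with a host-only sketch in
      standard C++. Every name in it (lib::dispatchable, user::my_tag, user::plain_tag,
      zero_all, dispatch_zero_all) is hypothetical and merely stands in for the
      corresponding Thrust piece; it is not Thrust's dispatch code. It assumes dispatch
      is an unqualified overload lookup on the tag type, and shows a user overload being
      preferred for a tag that inherits from the dispatchable template.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace lib {

// Hypothetical stand-in for a system's dispatchable template.
template <typename Derived>
struct dispatchable {};

// Library default: matches any tag derived from dispatchable<Derived>.
template <typename Derived, typename Iterator>
std::string zero_all(dispatchable<Derived>&, Iterator first, Iterator last) {
  for (; first != last; ++first) *first = 0;
  return "library default";
}

}  // namespace lib

namespace user {

// A custom tag inherits from the dispatchable template, passing itself.
struct my_tag : lib::dispatchable<my_tag> {};

// A tag that customizes nothing.
struct plain_tag : lib::dispatchable<plain_tag> {};

// User specialization: an exact match on my_tag beats the library default,
// which needs a derived-to-base conversion.
template <typename Iterator>
std::string zero_all(my_tag&, Iterator first, Iterator last) {
  for (; first != last; ++first) *first = 0;
  return "user specialization";
}

}  // namespace user

// Unqualified call: argument-dependent lookup searches namespace user and
// namespace lib (the namespace of the tag's base class).
template <typename Tag>
std::string dispatch_zero_all(Tag tag, std::vector<int>& v) {
  return zero_all(tag, v.begin(), v.end());
}

// Usage: the customized tag reaches the user overload, the plain tag does not.
inline void demo_dispatch() {
  std::vector<int> a(3, 7), b(3, 7);
  assert(dispatch_zero_all(user::my_tag(), a) == "user specialization");
  assert(dispatch_zero_all(user::plain_tag(), b) == "library default");
}
```

      In this sketch the library default stays reachable for any tag that supplies no
      overload of its own, which is why only the customized algorithm changes behavior.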

    Iterators
      iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated
      iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor)
      iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade)
      iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access)
      all iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently not possible
      floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type

    Other
      placeholder expressions may no longer include the comma operator

New Features
    XXX TODO
    Functions
        free
        get_temporary_buffer
        malloc
        merge_by_key
        partition with stencil
        partition_copy with stencil
        return_temporary_buffer
        set_difference_by_key
        set_intersection_by_key
        set_symmetric_difference_by_key
        set_union_by_key
        stable_partition with stencil
        stable_partition_copy with stencil
        tabulate
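
        As an illustration of what a by-key merge computes, here is a host-only sketch
        in standard C++. It is not Thrust's implementation or interface: the names
        keyed_result and merge_by_key_sketch are hypothetical, the element types are
        fixed to int keys and char values for brevity, and keys are assumed to be
        sorted with operator<. Each value travels with its key, and on equivalent keys
        the elements of the first input precede those of the second.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: merged keys and the values that travelled with them.
struct keyed_result {
  std::vector<int> keys;
  std::vector<char> values;
};

// Merge two key-sorted inputs; ties are resolved in favor of the first input.
inline keyed_result merge_by_key_sketch(const std::vector<int>& keys1,
                                        const std::vector<char>& vals1,
                                        const std::vector<int>& keys2,
                                        const std::vector<char>& vals2) {
  assert(keys1.size() == vals1.size());
  assert(keys2.size() == vals2.size());
  keyed_result out;
  std::size_t i = 0, j = 0;
  while (i < keys1.size() && j < keys2.size()) {
    if (keys2[j] < keys1[i]) {  // strictly smaller: take from the second input
      out.keys.push_back(keys2[j]);
      out.values.push_back(vals2[j]);
      ++j;
    } else {                    // smaller or equivalent: take from the first input
      out.keys.push_back(keys1[i]);
      out.values.push_back(vals1[i]);
      ++i;
    }
  }
  // Append whatever remains of either input.
  for (; i < keys1.size(); ++i) {
    out.keys.push_back(keys1[i]);
    out.values.push_back(vals1[i]);
  }
  for (; j < keys2.size(); ++j) {
    out.keys.push_back(keys2[j]);
    out.values.push_back(vals2[j]);
  }
  return out;
}
```

        For keys {1, 3, 5} with values {a, b, c} and keys {1, 2, 5} with values
        {x, y, z}, the sketch yields keys {1, 1, 2, 3, 5, 5} and values {a, x, y, b, c, z}.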

    Types
        XXX TODO

    Device support
        XXX TODO

New Examples
    uninitialized_vector
    XXX TODO

Other Enhancements
    XXX TODO
    XXX TODO stateful dispatch
    XXX TODO THRUST_DEVICE_SYSTEM_CPP
    XXX TODO CUDA merge performance
    XXX TODO CUDA set operation performance
    XXX TODO TBB reduce_by_key performance
    XXX TODO TBB algorithm parallelization
    better support for user allocators in vectors
    sparse_vector example is now implemented with merge_by_key instead of sort_by_key
    eliminate various warnings

Bug Fixes
    XXX TODO
    #248 fix broken counting_iterator<float> behavior with OpenMP
    #231, #209 fix set operation failures with CUDA
    #187 fix incorrect occupancy calculation with CUDA
    #153 fix broken multigpu behavior with CUDA
    #142 eliminate warning produced by thrust::random::taus88 and MSVC 2010
    #208 correctly initialize elements in temporary storage when necessary
    #16 fix compilation error when sorting bool with CUDA
    #10 fix ambiguous overloads of reinterpret_tag

Known Issues
    XXX TODO

Acknowledgments
    Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA
    Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA
    Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm

#######################################
#           Thrust v1.6.0             #
#######################################

Summary
    Thrust v1.6.0 provides an interface for customization and extension and a new
    backend system based on the Threading Building Blocks library. With this
    new interface, programmers may customize the behavior of specific algorithms
    as well as control the allocation of temporary storage or invent entirely new
    backends. These enhancements also allow multiple different backend systems
    such as CUDA and OpenMP to coexist within a single program. Support for TBB
    allows Thrust programs to integrate more naturally into applications which
    may already employ the TBB task scheduler.

Breaking API Changes
    The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h>
    thrust::experimental::cuda::pinned_allocator has been moved to thrust::cuda::experimental::pinned_allocator
    The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM
    The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA
    The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP
    thrust::host_space_tag has been renamed thrust::host_system_tag
    thrust::device_space_tag has been renamed thrust::device_system_tag
    thrust::any_space_tag has been renamed thrust::any_system_tag
    thrust::iterator_space has been renamed thrust::iterator_system
    

New Features
    Backend Systems
        Threading Building Blocks (TBB) is now supported
    Functions
        for_each_n
        raw_reference_cast
    Types
        pointer
        reference

New Examples
    cuda/custom_temporary_allocation
    cuda/fallback_allocator
    device_ptr
    expand
    minimal_custom_backend
    raw_reference_cast
    set_operations

Other Enhancements
    thrust::for_each now returns the end of the input range similar to most other algorithms
    thrust::pair and thrust::tuple have swap functionality
    all CUDA algorithms now support large data types
    iterators may be dereferenced in user __device__ or __global__ functions
    the safe use of different backend systems is now possible within a single binary

Bug Fixes
    #469 min_element and max_element algorithms no longer require a const comparison operator

Known Issues
    cudafe++.exe may crash when parsing TBB headers on Windows. 

#######################################
#           Thrust v1.5.2             #
#######################################

Summary
    Small bug fixes

Bug Fixes
    Fixed warning about C-style initialization of structures

#######################################
#           Thrust v1.5.1             #
#######################################

Summary
    Small bug fixes

Bug Fixes
    Fixed invalid results when sorting data referenced by permutation_iterators on CUDA

#######################################
#           Thrust v1.5.0             #
#######################################

Summary
    Thrust v1.5.0 introduces new programmer productivity and performance
    enhancements. New functionality for creating anonymous "lambda" functions has
    been added. A faster host sort provides 2-10x faster performance for sorting
    arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides
    2.5x-3.0x speedup over the host sort using a quad-core CPU. When sorting
    arithmetic types with the OpenMP backend the combined performance improvement
    is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x
    (8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster
    performance.

Breaking API Changes
    device_ptr<void> no longer unsafely converts to device_ptr<T> without an
    explicit cast. Use the expression
    device_pointer_cast(static_cast<int*>(void_ptr.get()))
    to convert, for example, device_ptr<void> to device_ptr<int>.

New Features
    Functions
        stencil-less transform_if

    Types
        lambda placeholders

New Examples
    lambda

Other Enhancements
    host sort is 2-10x faster for arithmetic types
    OMP sort provides speedup over host sort
    reduce_by_key is 2-3x faster
    reduce_by_key no longer requires O(N) temporary storage
    CUDA scan algorithms are 10-40% faster
    host_vector and device_vector are now documented
    out-of-memory exceptions now provide detailed information from CUDART
    improved histogram example
    device_reference now has a specialized swap
    reduce_by_key and scan algorithms are compatible with discard_iterator

Removed Functionality

Bug Fixes
     #44 allow host_vector to compile when value_type uses __align__
    #198 allow adjacent_difference to permit safe in-situ operation
    #303 make thrust thread-safe
    #313 avoid race conditions in device_vector::insert
    #314 avoid unintended ADL invocation when dispatching copy
    #365 fix merge and set operation failures

Known Issues
    None

Acknowledgments
    Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived.
    Thanks to Jean-Francois Bastien for suggesting a fix for issue 303.

#######################################
#           Thrust v1.4.0             #
#######################################

Summary
    Thrust v1.4.0 provides support for CUDA 4.0 in addition to many feature
    and performance improvements.  New set theoretic algorithms operating on
    sorted sequences have been added.  Additionally, a new fancy iterator
    allows discarding redundant or otherwise unnecessary output from
    algorithms, conserving memory storage and bandwidth.

Breaking API Changes
    Eliminations
        thrust/is_sorted.h
        thrust/utility.h
        thrust/set_intersection.h
        thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein
        thrust::deprecated::copy_when
        thrust::deprecated::absolute_value

New Features
    Functions
        copy_n
        merge
        set_difference
        set_symmetric_difference
        set_union

    Types
        discard_iterator

    Device support
        Compute Capability 2.1 GPUs

New Examples
    run_length_decoding

Other Enhancements
    Compilation warnings are substantially reduced in various contexts.
    The compilation times of thrust::sort, thrust::stable_sort, thrust::sort_by_key,
    and thrust::stable_sort_by_key are substantially reduced.
    A fast sort implementation is used when sorting primitive types with thrust::greater.
    The performance of thrust::set_intersection is improved.
    The performance of thrust::fill is improved on SM 1.x devices.
    A code example is now provided in each algorithm's documentation.
    thrust::reverse now operates in-place

Removed Functionality
    thrust::deprecated::copy_when
    thrust::deprecated::absolute_value
    thrust::experimental::cuda::ogl_interop_allocator
    thrust::gather and thrust::scatter from host to device and vice versa are no longer supported.
    Operations which modify the elements of a thrust::device_vector are no longer
    available from source code compiled without nvcc when the device backend is CUDA.
    Instead, use the idiom from the cpp_interop example.

Bug Fixes
    #212 set_intersection works correctly for large input sizes.
    #275 counting_iterator and constant_iterator work correctly with OpenMP as the
    backend when compiling with optimization
    #256 min and max correctly return their first argument as a tie-breaker
    #248 NDEBUG is interpreted correctly
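
    The tie-breaking rule named in #256 can be sketched in standard C++ as follows.
    This is an illustration only, not Thrust's source: item, by_key, min_sketch, and
    max_sketch are hypothetical names. When neither argument orders before the other,
    both functions return their first argument.

```cpp
#include <cassert>

// Hypothetical element type: ordered by key, with a label to tell ties apart.
struct item {
  int key;
  char label;
};

inline bool by_key(const item& a, const item& b) { return a.key < b.key; }

// Returns rhs only when it is strictly smaller, so lhs wins a tie.
template <typename T, typename Compare>
const T& min_sketch(const T& lhs, const T& rhs, Compare comp) {
  return comp(rhs, lhs) ? rhs : lhs;
}

// Returns rhs only when it is strictly larger, so lhs wins a tie.
template <typename T, typename Compare>
const T& max_sketch(const T& lhs, const T& rhs, Compare comp) {
  return comp(lhs, rhs) ? rhs : lhs;
}

// Usage: equivalent keys resolve to the first argument in both functions.
inline void demo_tie_break() {
  item first = {1, 'x'};
  item second = {1, 'y'};
  assert(min_sketch(first, second, by_key).label == 'x');
  assert(max_sketch(first, second, by_key).label == 'x');
}
```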

Known Issues
    nvcc may generate code containing warnings when compiling some Thrust algorithms.
    When compiling with -arch=sm_1x, some Thrust algorithms may cause nvcc to issue
    benign pointer advisories.
    When compiling with -arch=sm_1x and -G, some Thrust algorithms may fail to execute correctly.
    thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key,
    and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator.

Acknowledgments
    Thanks to David Tarjan for improving the performance of set_intersection.
    Thanks to Duane Merrill for continued help with sort.
    Thanks to Nathan Whitehead for help with CUDA Toolkit integration.

#######################################
#           Thrust v1.3.0             #
#######################################

Summary
    Thrust v1.3.0 provides support for CUDA 3.2 in addition to many feature
    and performance enhancements.
    
    Performance of the sort and sort_by_key algorithms is improved by as much 
    as 3x in certain situations.  The performance of stream compaction algorithms,
    such as copy_if, is improved by as much as 2x.  Reduction performance is 
    also improved, particularly for small input sizes.
    
    CUDA errors are now converted to runtime exceptions using the system_error
    interface.  Combined with a debug mode, also new in v1.3, runtime errors
    can be located with greater precision.

    Lastly, a few header files have been consolidated or renamed for clarity.
    See the deprecations section below for additional details.


Breaking API Changes
    Promotions
        thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface
        thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface
        thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface
        thrust::next::gather has been renamed thrust::gather
        thrust::next::gather_if has been renamed thrust::gather_if
        thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy
    Deprecations
        thrust::copy_when has been renamed thrust::deprecated::copy_when
        thrust::absolute_value has been renamed thrust::deprecated::absolute_value
        The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead
        The header thrust/utility.h is now deprecated; use thrust/swap.h instead
        The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead
    Eliminations
        thrust::deprecated::gather
        thrust::deprecated::gather_if
        thrust/experimental/arch.h and the functions therein
        thrust/sorting/merge_sort.h
        thrust/sorting/radix_sort.h

New Features
    Functions
        exclusive_scan_by_key
        find
        find_if
        find_if_not
        inclusive_scan_by_key
        is_partitioned
        is_sorted_until
        mismatch
        partition_point
        reverse
        reverse_copy
        stable_partition_copy

    Types
        system_error and related types
        experimental::cuda::ogl_interop_allocator
        bit_and, bit_or, and bit_xor

    Device support
        gf104-based GPUs

New Examples
    opengl_interop.cu
    repeated_range.cu
    simple_moving_average.cu
    sparse_vector.cu
    strided_range.cu

Other Enhancements
    Performance of thrust::sort and thrust::sort_by_key is substantially improved for primitive key types
    Performance of thrust::copy_if is substantially improved
    Performance of thrust::reduce and related reductions is improved
    THRUST_DEBUG mode added
    Callers of Thrust functions may detect error conditions by catching thrust::system_error, which derives from std::runtime_error
    The number of compiler warnings generated by Thrust has been substantially reduced
    Comparison sort now works correctly for input sizes > 32M
    min & max usage no longer collides with <windows.h> definitions
    Compiling against the OpenMP backend no longer requires nvcc
    Performance of device_vector initialized in .cpp files is substantially improved in common cases
    Performance of thrust::sort_by_key on the host is substantially improved

Removed Functionality
    nvcc 2.3 is no longer supported

Bug Fixes
    Debug device code now compiles correctly
    thrust::uninitialized_copy and thrust::uninitialized_fill now dispatch constructors on the device rather than the host

Known Issues
    #212 set_intersection is known to fail for large input sizes
    partition_point is known to fail for 64-bit types with nvcc 3.2

Acknowledgments
    Thanks to Duane Merrill for contributing a fast CUDA radix sort implementation
    Thanks to Erich Elsen for contributing an implementation of find_if
    Thanks to Andrew Corrigan for contributing changes which allow the OpenMP backend to compile in the absence of nvcc
    Thanks to Andrew Corrigan, Cliff Woolley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports
    Thanks to Cliff Woolley for help with testing

#######################################
#           Thrust v1.2.1             #
#######################################

Summary
    Small fixes for compatibility with CUDA 3.1

Known Issues
    inclusive_scan & exclusive_scan may fail with very large types
    the Microsoft compiler may fail to compile code using both sort and binary search algorithms
    uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
    # 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
    default_random_engine::discard is not accelerated with nvcc 2.3
    nvcc 3.1 may fail to compile code using types derived from thrust::subtract_with_carry_engine, such as thrust::ranlux24 & thrust::ranlux48.

#######################################
#           Thrust v1.2.0             #
#######################################

Summary
    Thrust v1.2 introduces support for compilation to multicore CPUs
    and the Ocelot virtual machine, and several new facilities for
    pseudo-random number generation.  New algorithms such as set
    intersection and segmented reduction have also been added.  Lastly,
    improvements to the robustness of the CUDA backend ensure
    correctness across a broad set of (uncommon) use cases.

Breaking API Changes
    thrust::gather's interface was incorrect and has been removed.
    The old interface is deprecated but will be preserved for Thrust
    version 1.2 at thrust::deprecated::gather &
    thrust::deprecated::gather_if. The new interface is provided at
    thrust::next::gather & thrust::next::gather_if.  The new interface
    will be promoted to thrust:: in Thrust version 1.3. For more details,
    please refer to this thread:
    http://groups.google.com/group/thrust-users/browse_thread/thread/f5f0583cb97b51fd

    The thrust::sorting namespace has been deprecated in favor of the
    top-level sorting functions, such as thrust::sort() and
    thrust::sort_by_key().

New Features
    Functions
        reduce_by_key
        set_intersection
        tie
        unique_copy
        unique_by_key
        unique_copy_by_key

    Types
        Random Number Generation
            discard_block_engine
            default_random_engine
            linear_congruential_engine
            linear_feedback_shift_engine
            minstd_rand
            minstd_rand0
            normal_distribution (experimental)
            ranlux24
            ranlux48
            ranlux24_base
            ranlux48_base
            subtract_with_carry_engine
            taus88
            uniform_int_distribution
            uniform_real_distribution
            xor_combine_engine
        Functionals
            project1st
            project2nd

    Fancy Iterators
        permutation_iterator
        reverse_iterator

    Device support
        Add support for multicore CPUs via OpenMP
        Add support for Fermi-class GPUs
        Add support for Ocelot virtual machine

New Examples
    cpp_integration
    histogram
    mode
    monte_carlo
    monte_carlo_disjoint_sequences
    padded_grid_reduction
    permutation_iterator
    row_sum
    run_length_encoding
    segmented_scan
    stream_compaction
    summary_statistics
    transform_iterator
    word_count

Other Enhancements
    added vector functions operator!=, rbegin, crbegin, rend, crend, data, & shrink_to_fit
    integer sorting performance is improved when max is large but (max - min) is small and when min is negative
    performance of inclusive_scan() and exclusive_scan() is improved by 20-25% for primitive types
    support for nvcc 3.0

Removed Functionality
    removed support for equal between host & device sequences
    removed support for gather() and scatter() between host & device sequences

Bug Fixes
    # 8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time
    # 42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms
    # 46 gather & scatter handle any space iterators correctly
    # 51 thrust::experimental::arch functions gracefully handle unrecognized GPUs
    # 52 avoid collisions with common user macros such as BLOCK_SIZE
    # 62 provide better documentation for device_reference
    # 68 allow built-in CUDA vector types to work with device_vector in pure C++ mode
    # 102 eliminated a race condition in device_vector::erase
    various compilation warnings eliminated

Known Issues
   inclusive_scan & exclusive_scan may fail with very large types
   the Microsoft compiler may fail to compile code using both sort and binary search algorithms
   uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
   # 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
   default_random_engine::discard is not accelerated with nvcc 2.3

Acknowledgments
   Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection
   Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot
   Thanks to Tom Bradley for contributing an implementation of normal_distribution
   Thanks to Joseph Rhoads for contributing the example summary_statistics

#######################################
#           Thrust v1.1.1             #
#######################################

Summary
    Small fixes for compatibility with CUDA 2.3a and Mac OSX Snow Leopard.

#######################################
#           Thrust v1.1.0             #
#######################################

Summary
    Thrust v1.1 introduces fancy iterators, binary search functions, and
    several specialized reduction functions.  Experimental support for
    segmented scan has also been added.

Breaking API Changes
    counting_iterator has been moved into the thrust namespace (previously thrust::experimental)

New Features
    Functions
        copy_if
        lower_bound
        upper_bound
        vectorized lower_bound
        vectorized upper_bound
        equal_range
        binary_search
        vectorized binary_search
        all_of
        any_of
        none_of
        minmax_element
        advance
        inclusive_segmented_scan (experimental)
        exclusive_segmented_scan (experimental)

    Types
        pair
        tuple
        device_malloc_allocator

    Fancy Iterators
        constant_iterator
        counting_iterator
        transform_iterator
        zip_iterator

New Examples
    computing the maximum absolute difference between vectors
    computing the bounding box of a two-dimensional point set
    sorting multiple arrays together (lexicographical sorting)
    constructing a summed area table
    using zip_iterator to mimic an array of structs
    using constant_iterator to increment array values

Other Enhancements
    added pinned memory allocator (experimental)
    added more methods to host_vector & device_vector (issue #4)
    added variant of remove_if with a stencil argument (issue #29)
    scan and reduce use cudaFuncGetAttributes to determine grid size
    exceptions are reported when temporary device arrays cannot be allocated 

Bug Fixes
     #5 make vector work for larger data types
     #9 stable_partition_copy doesn't respect OutputIterator concept semantics
    #10 scans should return OutputIterator
    #16 make algorithms work for larger data types
    #27 dispatch radix_sort even when comp=less<T> is explicitly provided

Known Issues
    Using functors with Thrust entry points may not compile on Mac OSX with gcc-4.0.1
    uninitialized_copy & uninitialized_fill dispatch constructors on the host rather than the device.
    inclusive_scan, inclusive_scan_by_key, exclusive_scan, and exclusive_scan_by_key may fail when used with large types with the CUDA 3.1 driver


#######################################
#           Thrust v1.0.0             #
#######################################

Breaking API Changes
    Rename top level namespace komrade to thrust.
    Move partition_copy() & stable_partition_copy() into thrust::experimental namespace until we can easily provide the standard interface.
    Rename range() to sequence() to avoid collision with Boost.Range.
    Rename copy_if() to copy_when() due to semantic differences with C++0x copy_if().

New Features
    Add C++0x style cbegin() & cend() methods to host_vector & device_vector.
    Add transform_if function.
    Add stencil versions of replace_if() & replace_copy_if().
    Allow counting_iterator to work with for_each().
    Allow types with constructors in comparison sort & reduce.

Other Enhancements
    merge_sort and stable_merge_sort are now 2 to 5x faster when executed on the parallel device.

Bug Fixes
    Work around an issue where an incremented iterator causes nvcc to crash. (Komrade issue #6)
    Fix an issue where const_iterators could not be passed to transform. (Komrade issue #7)