Releases: rapidsai/cudf
v24.10.00
🚨 Breaking Changes
- Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
- Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
- Fix empty cluster handling in tdigest merge (#16675) @jihoonson
- Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
- Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
- Remove arrow_io_source (#16607) @vyasr
- Remove legacy Arrow interop APIs (#16590) @vyasr
- Remove NativeFile support from cudf Python (#16589) @vyasr
- Revert "Make proxy NumPy arrays pass isinstance check in `cudf.pandas`" (#16586) @Matt711
- Align public utility function signatures with pandas 2.x (#16565) @mroeschke
- Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
- Refactor dictionary encoding in PQ writer to migrate to the new `cuco::static_map` (#16541) @mhaseeb123
- Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
- enable list to be forced as string in JSON reader. (#16472) @karthikeyann
- Disallow cudf.Series to accept column in favor of `._from_column` (#16454) @mroeschke
- Align groupby APIs with pandas 2.x (#16403) @mroeschke
- Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
- Align Index APIs with pandas 2.x (#16361) @mroeschke
- Add `stream` param to stream compaction APIs (#16295) @JayjeetAtGithub
🐛 Bug Fixes
- Add license to the pylibcudf wheel (#16976) @raydouglass
- Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16950) @shrshi
- Add dask-cudf workaround for missing `rename_axis` support in cudf (#16899) @rjzamora
- Update oldest deps for `pyarrow` & `numpy` (#16883) @galipremsagar
- Update labeler for pylibcudf (#16868) @vyasr
- Revert "Refactor mixed_semi_join using cuco::static_set" (#16855) @mhaseeb123
- Fix metadata after implicit array conversion from Dask cuDF (#16842) @rjzamora
- Add cudf.pandas dependencies.yaml to update-version.sh (#16840) @raydouglass
- Use cupy 12.2.0 as oldest dependency pinning on CUDA 12 ARM (#16808) @bdice
- Revert "Fix empty cluster handling in tdigest merge (#16675)" (#16800) @jihoonson
- Intentionally leak thread_local CUDA resources to avoid crash (part 1) (#16787) @kingcrimsontianyu
- Fix `cov`/`corr` bug in dask-cudf (#16786) @rjzamora
- Fix slice_strings wide strings logic with multi-byte characters (#16777) @davidwendt
- Fix nvbench output for sha512 (#16773) @davidwendt
- Allow read_csv(header=None) to return int column labels in `mode.pandas_compatible` (#16769) @mroeschke
- Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
- Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) (#16712) @mroeschke
- Use merge base when calculating changed files (#16709) @KyleFromNVIDIA
- Ensure we pass the has_nulls tparam to mixed_join kernels (#16708) @abellina
- Add boost-devel to Java CI Docker image (#16707) @jlowe
- [BUG] Add gpu node type to cudf-pandas 3rd-party integration nightly CI job (#16704) @Matt711
- Fix typo in column_factories.hpp comment from 'depth 1' to 'depth 2' (#16700) @a-hirota
- Fix Series.to_frame(name=None) setting a None name (#16698) @mroeschke
- Disable gtests/ERROR_TEST during compute-sanitizer memcheck test (#16691) @davidwendt
- Enable batched multi-source reading of JSONL files with large records (#16687) @shrshi
- Handle `ordered` parameter in `CategoricalIndex.__repr__` (#16683) @galipremsagar
- Fix loc/iloc.setitem[:, loc] with non cupy types (#16677) @mroeschke
- Fix empty cluster handling in tdigest merge (#16675) @jihoonson
- Fix `cudf::rank` not getting enough params (#16666) @JayjeetAtGithub
- Fix slowdown in `CategoricalIndex.__repr__` (#16665) @galipremsagar
- Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
- Fix slowdown in DataFrame repr in jupyter notebook (#16656) @galipremsagar
- Preserve Series name in duplicated method. (#16655) @bdice
- Fix interval_range right child non-zero offset (#16651) @mroeschke
- fix libcudf wheel publishing, make package-type explicit in wheel publishing (#16650) @jameslamb
- Revert "Hide all gtest symbols in cudftestutil (#16546)" (#16644) @robertmaynard
- Fix integer overflow in indexalator pointer logic (#16643) @davidwendt
- Allow for binops between two differently sized DecimalDtypes (#16638) @mroeschke
- Move pragma once in rolling/jit/operation.hpp. (#16636) @bdice
- Fix overflow bug in low-memory JSON reader (#16632) @shrshi
- Add the missing `num_aggregations` axis for `groupby_max_cardinality` (#16630) @PointKernel
- Fix strings::detail::copy_range when target contains nulls (#16626) @davidwendt
- Fix function parameters with common dependency modified during their evaluation (#16620) @ttnghia
- bug-fix: Don't enable the CUDA language if testing was requested when finding cudf (#16615) @cryos
- bug-fix: cudf/io/json.hpp use after move (#16609) @NicolasDenoyelle
- Remove CUDA whole compilation ODR violations (#16603) @robertmaynard
- MAINT: Adapt to numpy hiding flagsobject away (#16593) @seberg
- Revert "Make proxy NumPy arrays pass isinstance check in `cudf.pandas`" (#16586) @Matt711
- Switch python version to `3.10` in `cudf.pandas` pandas test scripts (#16559) @galipremsagar
- Hide all gtest symbols in cudftestutil (#16546) @robertmaynard
- Update the java code to properly deal with lists being returned as strings (#16536) @revans2
- Register `read_parquet` and `read_csv` with dask-expr (#16535) @rjzamora
- Change cudf::empty_like to not include offsets for empty strings columns (#16529) @davidwendt
- Fix DataFrame reductions with median returning scalar instead of Series (#16527) @mroeschke
- Allow DataFrame.sort_values(by=) to select an index level (#16519) @mroeschke
- Fix `date_range(start, end, freq)` when end-start is divisible by freq (#16516) @mroeschke
- Preserve array name in MultiIndex.from_arrays (#16515) @mroeschke
- Disallow indexing by selecting duplicate labels (#16514) @mroeschke
- Fix `.replace(Index, Index)` raising a TypeError (#16513) @mroeschke
- Check index bounds in compact protocol reader. (#16493) @bdice
- Fix build failures with GCC 13 (#16488) @PointKernel
- Fix all-empty input column for strings split APIs (#16466) @davidwendt
- Fix segmented-sort overlapped input/output indices (#16463) @davidwendt
- Fix merge conflict for auto merge 16447 (#16449) @davidwendt
📖 Documentation
- Fix links in Dask cuDF documentation (#16929) @rjzamora
- Improve aggregation documentation (#16822) @PointKernel
- Add best practices page to Dask cuDF docs (#16821) @rjzamora
- [DOC] Update Pylibcudf doc strings (#16810) @Matt711
- Recommending `miniforge` for conda install (#16782) @mmccarty
- Add labeling pylibcudf doc pages (#16779) @mroeschke
- Migrate dask-cudf README improvements to dask-cudf sphinx docs (#16765) @rjzamora
- [DOC] Remove out of date section from cudf.pandas docs (#16697) @Matt711
- Add performance tips to cudf.pandas FAQ. (#16693) @bdice
- Update documentation for Dask cuDF (#16671) @rjzamora
- Add missing pylibcudf strings docs (#16471) @brandon-b-miller
- DOC: Refresh pylibcudf guide (#15856) @lithomas1
🚀 New Features
- Build `cudf-polars` with `build.sh` (#16898) @brandon-b-miller
- Add polars to "all" dependency list. (#16875) @bdice
- nvCOMP GZIP integration (#16770) @vuule
- [FEA] Add support for `cudf.NamedAgg` (#16744) @Matt711
- Add experimental `filesystem="arrow"` support in `dask_cudf.read_parquet` (#16684) @rjzamora
- Relax Arrow pin (#16681) @vyasr
- Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
- Move NDS-H examples into benchmarks (#16663) @JayjeetAtGithub
- [FEA] Add third-party library integration testing of cudf.pandas to cudf (#16645) @Matt711
- Make isinstance check pass for proxy ndarrays (#16601) @Matt711
- [FEA] Add an environment variable to fail on fallback in `cudf.pandas` (#16562) @Matt711
- [FEA] Add support for `cudf.unique` (#16554) @Matt711
- [FEA] Support named aggregations in `df.groupby().agg()` (#16528) @Matt711
- Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
- enable list to be forced as string in JSON reader. (#16472) @karthikeyann
- Remove cuDF dependency from pylibcudf column from_device tests (#16441) @brandon-b-miller
- Enable cudf.pandas REPL and -c command support (#16428) @bdice
- Setup pylibcudf package (#16299) @lithomas1
- Add a libcudf/thrust-based TPC-H derived datagen (#16294) @JayjeetAtGithub
- Make proxy NumPy arrays pass isinstance check in `cudf.pandas` (#16286) @Matt711
- Add skiprows and nrows to parquet reader (#16214) @lithomas1
- Upgrade to nvcomp 4.0.1 (#16076) @vuule
- Migrate ORC reader to pylibcudf (#16042) @lithomas1
- JSON reader validation of values (#15968) @karthikeyann
- Implement exposed null mask APIs in pylibcudf (#15908) @charlesbluca
- Word-based nvtext::minhash function (#15368) @davidwendt
🛠️ Improvements
- Make tests deterministic (#16910) @galipremsagar
- Update update-version.sh to use packaging lib (#16891) @AyodeAwe
- Pin polars for 24.10 and update polars test suite xfail list (#16886) @wence-
- Add in support for setting delim when parsing JSON through java (#16867) (#16880) @revans2
- Remove unnecessary flag from build.sh (#16879) @vyasr
- Ignore numba warning specific to ARM runners (#16872) @galipremsagar
- Display deltas for `cudf.pandas` test summary (#16864) @galipremsagar
- Switch to using native `traceback` (#16851) @galipremsagar
- JSON tree algorithm code reorg (#16836) @karthikeyann
- Add string.repeats API to pylibcudf (#16834) @mroeschke
- Use CI workflow branch 'branch-24.10' again (#16832) @jameslamb
- Rename the NDS-H benchmark binaries (#16831) @JayjeetAtGithub
- Add string.findall APIs t...
[NIGHTLY] v24.12.00
🚨 Breaking Changes
- Deprecate support for directly accessing logger (#16964) @vyasr
- Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
🐛 Bug Fixes
- Fix ORC reader when using `device_read_async` while the destination device buffers are not ready (#17074) @ttnghia
- Adding assertion to check for regular JSON inputs of size greater than `INT_MAX` bytes (#17057) @shrshi
- Disable kvikio remote I/O to avoid openssl dependencies in JNI build (#17026) @pxLi
- Fix `host_span` constructor to correctly copy `is_device_accessible` (#17020) @vuule
- Add pinning for pyarrow in wheels (#17018) @vyasr
- Use std::optional for host types (#17015) @robertmaynard
- Fix write_json to handle empty string column (#16995) @karthikeyann
- Restore export of nvcomp outside of wheel builds (#16988) @KyleFromNVIDIA
- Allow melt(var_name=) to be a falsy label (#16981) @mroeschke
- Fix astype from tz-aware type to tz-aware type (#16980) @mroeschke
- Use `libcudf` wheel from PR rather than nightly for `polars-polars` CI test job (#16975) @brandon-b-miller
- Fix order-preservation in pandas-compat unsorted groupby (#16942) @wence-
- Fix cudf::strings::findall error with empty input (#16928) @davidwendt
- Fix JsonLargeReaderTest.MultiBatch use of LIBCUDF_JSON_BATCH_SIZE env var (#16927) @davidwendt
- Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16923) @shrshi
- Respect groupby.nunique(dropna=False) (#16921) @mroeschke
- Update all rmm imports to use pylibrmm/librmm (#16913) @Matt711
- Fix order-preservation in cudf-polars groupby (#16907) @wence-
- Add a shortcut for when the input clusters are all empty for the tdigest merge (#16897) @jihoonson
- Properly handle the mapped and registered regions in `memory_mapped_source` (#16865) @vuule
- Fix performance regression for generate_character_ngrams (#16849) @davidwendt
- Fix regex parsing logic handling of nested quantifiers (#16798) @davidwendt
- Compute whole column variance using numerically stable approach (#16448) @wence-
📖 Documentation
- docs: change 'CSV' to 'csv' in python/custreamz/README.md to match kafka.py (#17041) @a-hirota
- [DOC] Document limitation using `cudf.pandas` proxy arrays (#16955) @Matt711
- [DOC] Document environment variable for failing on fallback in `cudf.pandas` (#16932) @Matt711
🚀 New Features
- Add profilers to CUDA 12 conda devcontainers (#17066) @vyasr
- Migrate Min Hashing APIs to pylibcudf (#17021) @Matt711
- Reorganize `cudf_polars` expression code (#17014) @brandon-b-miller
- Migrate nvtext jaccard API to pylibcudf (#17007) @Matt711
- Migrate nvtext generate_ngrams APIs to pylibcudf (#17006) @Matt711
- Switched BINARY_OP Benchmarks from GoogleBench to NVBench (#16963) @lamarrr
- [FEA] Migrate nvtext/edit_distance APIs to pylibcudf (#16957) @Matt711
- Switched AST benchmarks from GoogleBench to NVBench (#16952) @lamarrr
- Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
- Add an example to demonstrate multithreaded `read_parquet` pipelines (#16828) @mhaseeb123
- Implement `extract_datetime_component` in `libcudf`/`pylibcudf` (#16776) @brandon-b-miller
- Add cudf::strings::find_re API (#16742) @davidwendt
🛠️ Improvements
- Remove unused hash helper functions (#17056) @PointKernel
- Move `flatten_single_pass_aggs` to its own TU (#17053) @PointKernel
- Replace deprecated cuco APIs with updated versions (#17052) @PointKernel
- Refactor ORC dictionary encoding to migrate to the new `cuco::static_map` (#17049) @mhaseeb123
- Move pylibcudf/libcudf/wrappers/decimals to pylibcudf/libcudf/fixed_point (#17048) @mroeschke
- make conda installs in CI stricter (part 2) (#17042) @jameslamb
- Clean up hash-groupby `var_hash_functor` (#17034) @PointKernel
- Add json APIs to pylibcudf (#17025) @mroeschke
- make conda installs in CI stricter (#17013) @jameslamb
- Pylibcudf: pack and unpack (#17012) @madsbk
- Remove unneeded pylibcudf.libcudf.wrappers.duration usage in cudf (#17010) @mroeschke
- Remove unused import (#17005) @Matt711
- Add string.convert.convert_urls APIs to pylibcudf (#17003) @mroeschke
- Add release tracking to project automation scripts (#17001) @jarmak-nv
- Add string.convert.convert_lists APIs to pylibcudf (#16997) @mroeschke
- Performance optimization of JSON validation (#16996) @karthikeyann
- Add string.convert.convert_ipv4 APIs to pylibcudf (#16994) @mroeschke
- Add string.convert_floats APIs to pylibcudf (#16990) @mroeschke
- Add string.convert.convert_fixed_type APIs to pylibcudf (#16984) @mroeschke
- Add docstrings and test for strings.convert_durations APIs for pylibcudf (#16982) @mroeschke
- Turn on `xfail_strict = true` for all python packages (#16977) @wence-
- Add string.convert.convert_datetime/convert_booleans APIs to pylibcudf (#16971) @mroeschke
- Deprecate support for directly accessing logger (#16964) @vyasr
- Expunge NamedColumn (#16962) @wence-
- Add clang-tidy to CI (#16958) @vyasr
- Address all remaining clang-tidy errors (#16956) @vyasr
- Apply clang-tidy autofixes (#16949) @vyasr
- Use nvcomp wheel instead of bundling nvcomp (#16946) @KyleFromNVIDIA
- Refactor the `cuda_memcpy` functions to make them more usable (#16945) @vuule
- Add string.split APIs to pylibcudf (#16940) @mroeschke
- clang-tidy fixes part 3 (#16939) @vyasr
- clang-tidy fixes part 2 (#16938) @vyasr
- clang-tidy fixes part 1 (#16937) @vyasr
- Add string.wrap APIs to pylibcudf (#16935) @mroeschke
- Add string.translate APIs to pylibcudf (#16934) @mroeschke
- Add string.find_multiple APIs to pylibcudf (#16920) @mroeschke
- Batch memcpy the last offsets for output buffers of str and list cols in PQ reader (#16905) @mhaseeb123
- reduce wheel build verbosity, narrow deprecation warning filter (#16896) @jameslamb
- Improve aggregation device functors (#16884) @PointKernel
- Upgrade pandas pinnings to support `2.2.3` (#16882) @galipremsagar
- Fix 24.10 to 24.12 forward merge (#16876) @bdice
- Manually resolve conflicts in between branch-24.12 and branch-24.10 (#16871) @galipremsagar
- Add in support for setting delim when parsing JSON through java (#16867) @revans2
- Reapply `mixed_semi_join` refactoring and bug fixes (#16859) @mhaseeb123
- Add string padding and side_type APIs to pylibcudf (#16833) @mroeschke
- Organize parquet reader mukernel non-nullable code, introduce manual block scans (#16830) @pmattione-nvidia
- Remove superfluous use of std::vector for std::future (#16829) @kingcrimsontianyu
- Rework `read_csv` IO to avoid reading whole input with a single `host_read` (#16826) @vuule
- Add remaining string.char_types APIs to pylibcudf (#16788) @mroeschke
- Avoid public constructors when called with columns to avoid unnecessary validation (#16747) @mroeschke
- Use `changed-files` shared workflow (#16713) @KyleFromNVIDIA
- Refactor `histogram` reduction using `cuco::static_set::insert_and_find` (#16485) @srinivasyadav18
- Use numba-cuda>=0.0.13 (#16474) @gmarkall
v24.08.03
🚨 Breaking Changes
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add compile option to enable large strings support (#16037) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Rename strings multiple target replace API (#15898) @davidwendt
- Pinned vector factory that uses the global pool (#15895) @vuule
- Apply clang-tidy autofixes (#15894) @vyasr
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
🐛 Bug Fixes
- Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
- Add `flatbuffers` to `libcudf` build (#16446) @galipremsagar
- Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
- Enable prefetching in cudf.pandas.install() (#16439) @bdice
- Enable prefetching before `runpy` (#16427) @galipremsagar
- Support thread-safe for `prefetch_config::get` and `prefetch_config::set` (#16425) @ttnghia
- Fix a `pandas-2.0` missing attribute error (#16416) @galipremsagar
- [Bug] Remove loud `NativeFile` deprecation noise for `read_parquet` from S3 (#16415) @rjzamora
- Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
- Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
- Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
- Require fixed width types for casting in `cudf-polars` (#16381) @brandon-b-miller
- Fix docstring of `DataFrame.apply` (#16351) @galipremsagar
- Make bool raise for more cudf objects (#16311) @mroeschke
- Rename `.devcontainer`s for CUDA 12.5 (#16293) @jakirkham
- Fix split_record for all empty strings column (#16291) @davidwendt
- Fix logic in to_arrow for empty list column (#16279) @wence-
- [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
- Add custom name setter and getter for proxy objects in `cudf.pandas` (#16234) @Matt711
- Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
- Disable large string support for Java build (#16216) @jlowe
- Remove CCCL patch for PR 211. (#16207) @bdice
- Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
- Fix `memory_usage` when calculating nested list column (#16193) @mroeschke
- Support at/iat indexers in cudf.pandas (#16177) @mroeschke
- Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
- Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- interpolate returns new column if no values are interpolated (#16158) @mroeschke
- Use provided memory resource for allocating mixed join results. (#16153) @bdice
- Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
- Use size_t to allow large conditional joins (#16127) @bdice
- Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
- Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
- Add support for proxy `np.flatiter` objects (#16107) @Matt711
- Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
- Support `pd.read_pickle` and `pd.to_pickle` in `cudf.pandas` (#16105) @Matt711
- Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
- Fix `is_monotonic_*` APIs to include `nan's` (#16085) @galipremsagar
- More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
- fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
- Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
- Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
- Fix a size overflow bug in hash groupby (#16053) @PointKernel
- Fix `atomic_ref` scope when multiple blocks are updating the same output (#16051) @vuule
- Fix initialization error in to_arrow for empty string views (#16033) @wence-
- Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
- Fix the pool size alignment issue (#16024) @PointKernel
- Improve multibyte-split byte-range performance (#16019) @davidwendt
- Fix target counting in strings char-parallel replace (#16017) @davidwendt
- Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
- Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Fix Cython typo preventing proper inheritance (#15978) @vyasr
- Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
- Fix nunique for `MultiIndex`, `DataFrame`, and all NA case with `dropna=False` (#15962) @mroeschke
- Explicitly build for all GPU architectures (#15959) @vyasr
- Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
- Add array_interface to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
- Allow tests to be built when stream util is disabled (#15933) @robertmaynard
- Fix JSON multi-source reading when total source size exceeds `INT_MAX` bytes (#15930) @shrshi
- Fix `dask_cudf.read_parquet` regression for legacy timestamp data (#15929) @rjzamora
- Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
- Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
- Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
- Handling for `NaN` and `inf` when converting floating point to fixed point types (#15885) @ttnghia
- Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
- Avoid unnecessary `Index` cast in `IndexedFrame.index` setter (#15843) @charlesbluca
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Fix multi-replace target count logic for large strings (#15807) @davidwendt
- Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
- Allow anonymous user in devcontainer name. (#15784) @bdice
- Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr
📖 Documentation
- Add docstring for from_dataframe (#16260) @mroeschke
- Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
- Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
- Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
- cudf.pandas documentation improvement (#15948) @Matt711
- Reland "Fix docs for IO readers and strings_convert" (#15872)" (#15941) @lithomas1
- Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
- DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
- Improve options docs (#15888) @bdice
- DOC: add linkcode to docs (#15860) @raybellwaves
- DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
- Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
- Update PandasCompat.py to resolve references (#15704) @raybellwaves
🚀 New Features
- Creation of CI artifacts for cudf-polars wheels (#16680) @wence-
- Warn on cuDF failure when `POLARS_VERBOSE` is true (#16308) @brandon-b-miller
- Add `drop_nulls` in `cudf-polars` (#16290) @brandon-b-miller
- [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
- Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
- Publish cudf-polars nightlies (#16213) @lithomas1
- Modify `make_host_vector` and `make_device_uvector` factories to optionally use pinned memory and kernel copy (#16206) @vuule
- Migrate lists/set_operations to pylibcudf (#16190) @Matt711
- Migrate lists/filling to pylibcudf (#16189) @Matt711
- Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
- Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
- Migrate lists/modifying to pylibcudf (#16185) @Matt711
- Migrate lists/filtering to pylibcudf (#16184) @Matt711
- Migrate lists/sorting to pylibcudf (#16179) @Matt711
- Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
- Migrate pylibcudf lists gathering (#16170) @Matt711
- Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
- Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
- Promote has_nested_columns to cudf public API (#16131) @robertmaynard
- Promote IO support queries to cudf API (#16125) @robertmaynard
- cudf::merge public API now support passing a user stream (#16124) @robertmaynard
- Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
- Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
- `cudf-polars` string slicing (#16082) @brandon-b-miller
- Migrate Parquet reader to pylibcudf (#16078) @lithomas1
- Migrate lists/c...
v24.08.02
🚨 Breaking Changes
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add compile option to enable large strings support (#16037) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Rename strings multiple target replace API (#15898) @davidwendt
- Pinned vector factory that uses the global pool (#15895) @vuule
- Apply clang-tidy autofixes (#15894) @vyasr
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
🐛 Bug Fixes
- Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
- Add `flatbuffers` to `libcudf` build (#16446) @galipremsagar
- Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
- Enable prefetching in cudf.pandas.install() (#16439) @bdice
- Enable prefetching before `runpy` (#16427) @galipremsagar
- Support thread-safe for `prefetch_config::get` and `prefetch_config::set` (#16425) @ttnghia
- Fix a `pandas-2.0` missing attribute error (#16416) @galipremsagar
- [Bug] Remove loud `NativeFile` deprecation noise for `read_parquet` from S3 (#16415) @rjzamora
- Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
- Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
- Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
- Require fixed width types for casting in `cudf-polars` (#16381) @brandon-b-miller
- Fix docstring of `DataFrame.apply` (#16351) @galipremsagar
- Make bool raise for more cudf objects (#16311) @mroeschke
- Rename `.devcontainer`s for CUDA 12.5 (#16293) @jakirkham
- Fix split_record for all empty strings column (#16291) @davidwendt
- Fix logic in to_arrow for empty list column (#16279) @wence-
- [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
- Add custom name setter and getter for proxy objects in `cudf.pandas` (#16234) @Matt711
- Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
- Disable large string support for Java build (#16216) @jlowe
- Remove CCCL patch for PR 211. (#16207) @bdice
- Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
- Fix `memory_usage` when calculating nested list column (#16193) @mroeschke
- Support at/iat indexers in cudf.pandas (#16177) @mroeschke
- Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
- Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- interpolate returns new column if no values are interpolated (#16158) @mroeschke
- Use provided memory resource for allocating mixed join results. (#16153) @bdice
- Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
- Use size_t to allow large conditional joins (#16127) @bdice
- Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
- Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
- Add support for proxy `np.flatiter` objects (#16107) @Matt711
- Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
- Support `pd.read_pickle` and `pd.to_pickle` in `cudf.pandas` (#16105) @Matt711
- Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
- Fix `is_monotonic_*` APIs to include `nan's` (#16085) @galipremsagar
- More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
- fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
- Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
- Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
- Fix a size overflow bug in hash groupby (#16053) @PointKernel
- Fix `atomic_ref` scope when multiple blocks are updating the same output (#16051) @vuule
- Fix initialization error in to_arrow for empty string views (#16033) @wence-
- Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
- Fix the pool size alignment issue (#16024) @PointKernel
- Improve multibyte-split byte-range performance (#16019) @davidwendt
- Fix target counting in strings char-parallel replace (#16017) @davidwendt
- Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
- Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Fix Cython typo preventing proper inheritance (#15978) @vyasr
- Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
- Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False (#15962) @mroeschke
- Explicitly build for all GPU architectures (#15959) @vyasr
- Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
- Add array_interface to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
- Allow tests to be built when stream util is disabled (#15933) @robertmaynard
- Fix JSON multi-source reading when total source size exceeds INT_MAX bytes (#15930) @shrshi
- Fix dask_cudf.read_parquet regression for legacy timestamp data (#15929) @rjzamora
- Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
- Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
- Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
- Handling for NaN and inf when converting floating point to fixed point types (#15885) @ttnghia
- Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
- Avoid unnecessary Index cast in IndexedFrame.index setter (#15843) @charlesbluca
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Fix multi-replace target count logic for large strings (#15807) @davidwendt
- Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
- Allow anonymous user in devcontainer name. (#15784) @bdice
- Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr
📖 Documentation
- Add docstring for from_dataframe (#16260) @mroeschke
- Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
- Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
- Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
- cudf.pandas documentation improvement (#15948) @Matt711
- Reland "Fix docs for IO readers and strings_convert" (#15872)" (#15941) @lithomas1
- Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
- DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
- Improve options docs (#15888) @bdice
- DOC: add linkcode to docs (#15860) @raybellwaves
- DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
- Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
- Update PandasCompat.py to resolve references (#15704) @raybellwaves
🚀 New Features
- Warn on cuDF failure when POLARS_VERBOSE is true (#16308) @brandon-b-miller
- Add drop_nulls in cudf-polars (#16290) @brandon-b-miller
- [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
- Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
- Publish cudf-polars nightlies (#16213) @lithomas1
- Modify make_host_vector and make_device_uvector factories to optionally use pinned memory and kernel copy (#16206) @vuule
- Migrate lists/set_operations to pylibcudf (#16190) @Matt711
- Migrate lists/filling to pylibcudf (#16189) @Matt711
- Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
- Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
- Migrate lists/modifying to pylibcudf (#16185) @Matt711
- Migrate lists/filtering to pylibcudf (#16184) @Matt711
- Migrate lists/sorting to pylibcudf (#16179) @Matt711
- Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
- Migrate pylibcudf lists gathering (#16170) @Matt711
- Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
- Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
- Promote has_nested_columns to cudf public API (#16131) @robertmaynard
- Promote IO support queries to cudf API (#16125) @robertmaynard
- cudf::merge public API now supports passing a user stream (#16124) @robertmaynard
- Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
- Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
- cudf-polars string slicing (#16082) @brandon-b-miller
- Migrate Parquet reader to pylibcudf (#16078) @lithomas1
- Migrate lists/count_elements to pylibcudf (#16072) @Matt711
- Migrate lists/extrac...
v24.08.00
🚨 Breaking Changes
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing stream param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Remove mr param from write_csv and write_json (#16231) @JayjeetAtGithub
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Return FrozenList for Index.names (#16047) @galipremsagar
- Add compile option to enable large strings support (#16037) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Rename strings multiple target replace API (#15898) @davidwendt
- Pinned vector factory that uses the global pool (#15895) @vuule
- Apply clang-tidy autofixes (#15894) @vyasr
- Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow (#15875) @mhaseeb123
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
🐛 Bug Fixes
- Add flatbuffers to libcudf build (#16446) @galipremsagar
- Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
- Enable prefetching in cudf.pandas.install() (#16439) @bdice
- Enable prefetching before runpy (#16427) @galipremsagar
- Support thread-safe for prefetch_config::get and prefetch_config::set (#16425) @ttnghia
- Fix a pandas-2.0 missing attribute error (#16416) @galipremsagar
- [Bug] Remove loud NativeFile deprecation noise for read_parquet from S3 (#16415) @rjzamora
- Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
- Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
- Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
- Require fixed width types for casting in cudf-polars (#16381) @brandon-b-miller
- Fix docstring of DataFrame.apply (#16351) @galipremsagar
- Make bool raise for more cudf objects (#16311) @mroeschke
- Rename .devcontainers for CUDA 12.5 (#16293) @jakirkham
- Fix split_record for all empty strings column (#16291) @davidwendt
- Fix logic in to_arrow for empty list column (#16279) @wence-
- [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
- Add custom name setter and getter for proxy objects in cudf.pandas (#16234) @Matt711
- Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
- Disable large string support for Java build (#16216) @jlowe
- Remove CCCL patch for PR 211. (#16207) @bdice
- Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
- Fix memory_usage when calculating nested list column (#16193) @mroeschke
- Support at/iat indexers in cudf.pandas (#16177) @mroeschke
- Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
- Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- interpolate returns new column if no values are interpolated (#16158) @mroeschke
- Use provided memory resource for allocating mixed join results. (#16153) @bdice
- Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
- Use size_t to allow large conditional joins (#16127) @bdice
- Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
- Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
- Add support for proxy np.flatiter objects (#16107) @Matt711
- Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
- Support pd.read_pickle and pd.to_pickle in cudf.pandas (#16105) @Matt711
- Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
- Fix is_monotonic_* APIs to include nan's (#16085) @galipremsagar
- More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
- fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
- Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
- Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
- Fix a size overflow bug in hash groupby (#16053) @PointKernel
- Fix atomic_ref scope when multiple blocks are updating the same output (#16051) @vuule
- Fix initialization error in to_arrow for empty string views (#16033) @wence-
- Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
- Fix the pool size alignment issue (#16024) @PointKernel
- Improve multibyte-split byte-range performance (#16019) @davidwendt
- Fix target counting in strings char-parallel replace (#16017) @davidwendt
- Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
- Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Fix Cython typo preventing proper inheritance (#15978) @vyasr
- Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
- Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False (#15962) @mroeschke
- Explicitly build for all GPU architectures (#15959) @vyasr
- Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
- Add array_interface to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
- Allow tests to be built when stream util is disabled (#15933) @robertmaynard
- Fix JSON multi-source reading when total source size exceeds INT_MAX bytes (#15930) @shrshi
- Fix dask_cudf.read_parquet regression for legacy timestamp data (#15929) @rjzamora
- Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
- Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
- Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
- Handling for NaN and inf when converting floating point to fixed point types (#15885) @ttnghia
- Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
- Avoid unnecessary Index cast in IndexedFrame.index setter (#15843) @charlesbluca
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Fix multi-replace target count logic for large strings (#15807) @davidwendt
- Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
- Allow anonymous user in devcontainer name. (#15784) @bdice
- Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr
📖 Documentation
- Add docstring for from_dataframe (#16260) @mroeschke
- Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
- Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
- Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
- cudf.pandas documentation improvement (#15948) @Matt711
- Reland "Fix docs for IO readers and strings_convert" (#15872)" (#15941) @lithomas1
- Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
- DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
- Improve options docs (#15888) @bdice
- DOC: add linkcode to docs (#15860) @raybellwaves
- DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
- Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
- Update PandasCompat.py to resolve references (#15704) @raybellwaves
🚀 New Features
- Warn on cuDF failure when POLARS_VERBOSE is true (#16308) @brandon-b-miller
- Add drop_nulls in cudf-polars (#16290) @brandon-b-miller
- [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
- Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
- Publish cudf-polars nightlies (#16213) @lithomas1
- Modify make_host_vector and make_device_uvector factories to optionally use pinned memory and kernel copy (#16206) @vuule
- Migrate lists/set_operations to pylibcudf (#16190) @Matt711
- Migrate lists/filling to pylibcudf (#16189) @Matt711
- Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
- Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
- Migrate lists/modifying to pylibcudf (#16185) @Matt711
- Migrate lists/filtering to pylibcudf (#16184) @Matt711
- Migrate lists/sorting to pylibcudf (#16179) @Matt711
- Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
- Migrate pylibcudf lists gathering (#16170) @Matt711
- Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
- Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
- Promote has_nested_columns to cudf public API (#16131) @robertmaynard
- Promote IO support queries to cudf API (#16125) @robertmaynard
- cudf::merge public API now supports passing a user stream (#16124) @robertmaynard
- Add TPC-H inspired examples for Libcudf (#16088) @JayjeetAtGithub
- Installed cudf header use cudf::allocate_like (#16087) @robertmaynard
- cudf-polars string slicing (#16082) @brandon-b-miller
- Migrate Parquet reader to pylibcudf (#16078) @lithomas1
- Migrate lists/count_elements to pylibcudf (#16072) @Matt711
- Migrate lists/extract to pylibcudf (#16071) @Matt711
- Move common string utilities to pu...
v24.06.01
🚨 Breaking Changes
- Deprecate Groupby.collect (#15808) @galipremsagar
- Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
- Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
- Raise errors for unsupported operations on certain types (#15712) @galipremsagar
- Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
- Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
- Remove legacy JSON reader from Python (#15538) @bdice
- Removing all batching code from parquet writer (#15528) @mhaseeb123
- Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
- Remove deprecated strings offsets_begin (#15454) @davidwendt
- Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
- Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
- Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
- Align date_range defaults with pandas, support tz (#15139) @mroeschke
🐛 Bug Fixes
- Backport: Use size_t to allow large conditional joins (#16127) (#16133) @bdice
- Backport #16045 to 24.06 (#16102) @vyasr
- Backport #16038 to 24.06 (#16101) @vyasr
- Backport: Fix segfault in conditional join (#16094) (#16100) @bdice
- Add patch for incorrect cuco noexcept clauses (#16077) @vyasr
- Revert "Fix docs for IO readers and strings_convert" (#15872) @vyasr
- Remove problematic call of index setter to unblock dask-cuda CI (#15844) @charlesbluca
- Use rapids_cpm_nvtx3 to get same nvtx3 target state as rmm (#15840) @robertmaynard
- Return boolean from config_host_memory_resource instead of throwing (#15815) @abellina
- Add temporary dask-cudf workaround for categorical sorting (#15801) @rjzamora
- Fix row group alignment in ORC writer (#15789) @vuule
- Raise error when sorting by categorical column in dask-cudf (#15788) @rjzamora
- Upgrade arrow to 16.1 (#15787) @galipremsagar
- Add support for PandasArray for pandas<2.1.0 (#15786) @galipremsagar
- Limit runtime dependency to libarrow>=16.0.0,<16.1.0a0 (#15782) @pentschev
- Fix cat.as_ordered not propagating correct size (#15780) @mroeschke
- Handle mixed-like homogeneous types in isin (#15771) @galipremsagar
- Fix id_vars and value_vars not accepting string scalars in melt (#15765) @mroeschke
- Fix DatetimeIndex.loc for all types of ordering cases (#15761) @galipremsagar
- Fix arrow versioning logic (#15755) @vyasr
- Avoid running sanitizer on Java test designed to cause an error (#15753) @jlowe
- Handle empty dataframe object with index present in setitem of loc (#15752) @galipremsagar
- Eliminate circular reference in DataFrame/Series.iloc/loc (#15749) @mroeschke
- Cap the absolute row index per pass in parquet chunked reader. (#15735) @nvdbaranec
- Fix Index.repeat for datetime64 types (#15722) @galipremsagar
- Fix multibyte check for case convert for large strings (#15721) @davidwendt
- Fix get_loc to properly fetch results from an index that is in decreasing order (#15719) @galipremsagar
- Return same type as the original index for .loc operations (#15717) @galipremsagar
- Correct static builds + static arrow (#15715) @robertmaynard
- Raise errors for unsupported operations on certain types (#15712) @galipremsagar
- Fix ColumnAccessor caching of nrows if empty previously (#15710) @mroeschke
- Allow None when nan_as_null=False in column constructor (#15709) @galipremsagar
- Refine CudaTest.testCudaException in case throwing wrong type of CudaError under aarch64 (#15706) @sperlingxx
- Fix maxima of categorical column (#15701) @rjzamora
- Add proxy for inplace operations in cudf.pandas (#15695) @galipremsagar
- Make nan_as_null behavior consistent across all APIs (#15692) @galipremsagar
- Fix CI s3 api command to fetch latest results (#15687) @galipremsagar
- Add NumpyExtensionArray proxy type in cudf.pandas (#15686) @galipremsagar
- Properly implement binaryops for proxy types (#15684) @galipremsagar
- Fix copy assignment and the comparison operator of rmm_host_allocator (#15677) @vuule
- Fix multi-source reading in JSON byte range reader (#15671) @shrshi
- Return int64 when pandas compatible mode is turned on for get_indexer (#15659) @galipremsagar
- Fix Index contains for error validations and float vs int comparisons (#15657) @galipremsagar
- Preserve sub-second data for time scalars in column construction (#15655) @galipremsagar
- Check row limit size in cudf::strings::join_strings (#15643) @davidwendt
- Enable sorting on column with nulls using query-planning (#15639) @rjzamora
- Fix operator precedence problem in Parquet reader (#15638) @etseidl
- Fix decoding of dictionary encoded FIXED_LEN_BYTE_ARRAY data in Parquet reader (#15601) @etseidl
- Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
- Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
- Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
- Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
- Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
- Ignore new cupy warning (#15574) @vyasr
- Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
- Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
- Fix deprecation warnings for json legacy reader (#15563) @davidwendt
- Fix millisecond resampling in cudf Python (#15560) @mroeschke
- Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
- Fix a JNI bug in JSON parsing fixup (#15550) @revans2
- Remove conda channel setup from wheel CI image script. (#15539) @bdice
- cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
- Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
- Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
- nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
- Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
- Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
- Add new patch to hide more CCCL APIs (#15493) @vyasr
- Make improvements in pandas-test reporting (#15485) @galipremsagar
- Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
- Only use data_type constructor with scale for decimal types (#15472) @wence-
- Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
- Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
- Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
- Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
- Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
- Handle case of scan aggregation in groupby-transform (#15450) @wence-
- Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
- Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
- Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
- Support implicit array conversion with query-planning enabled (#15378) @rjzamora
- Fix arrow-based round trip of empty dataframes (#15373) @wence-
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- Remove boundscheck=False setting in cython files (#15362) @wence-
- Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
- Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
- Disable dask-expr in docs builds. (#15343) @bdice
- Apply the cuFile error work around to data_sink as well (#15335) @vuule
- Fix parquet predicate filtering with column projection (#15113) @karthikeyann
- Check column type equality, handling nested types correctly. (#14531) @bdice
📖 Documentation
- Fix docs for IO readers and strings_convert (#15842) @bdice
- Update cudf.pandas docs for GA (#15744) @beckernick
- Add contributing warning about circular imports (#15691) @er-eis
- Update libcudf developer guide for strings offsets column (#15661) @davidwendt
- Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
- DOC: add pandas intersphinx mapping (#15531) @raybellwaves
- rm-dup-doc in frame.py (#15530) @raybellwaves
- Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
- Doc: interleave columns pandas compat (#15383) @raybellwaves
- Simplified README Examples (#15338) @wkaisertexas
- Add debug tips section to libcudf developer guide (#15329) @davidwendt
- Fix and clarify notes on result ordering (#13255) @shwina
🚀 New Features
- Add JNI bindings for zstd compression of NVCOMP. (#15729) @firestarman
- Fix spaces around CSV quoted strings (#15727) @thabetx
- Add default pinned pool that falls back to new pinned allocations (#15665) @vuule
- Overhaul ops-codeowners coverage (#15660) @raydouglass
- Concatenate dictionary of objects along axis=1 (#15623) @er-eis
- Construct pylibcudf columns from objects supporting __cuda_array_interface__ (#15615) @brandon-b-miller
- Expose some Parquet per-column configuration options via the python API (#15613) @etseidl
- Migrate string find operations to pylibcudf (#15604) @brandon-b-miller
- Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer (#15600) @etseidl
- Reading multi-line JSON in string columns using runtime configurable delimiter (#15556) @shrshi
- Remove p...
v24.06.00
🚨 Breaking Changes
- Deprecate Groupby.collect (#15808) @galipremsagar
- Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
- Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
- Raise errors for unsupported operations on certain types (#15712) @galipremsagar
- Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
- Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
- Remove legacy JSON reader from Python (#15538) @bdice
- Removing all batching code from parquet writer (#15528) @mhaseeb123
- Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
- Remove deprecated strings offsets_begin (#15454) @davidwendt
- Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
- Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
- Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
- Align date_range defaults with pandas, support tz (#15139) @mroeschke
🐛 Bug Fixes
- Revert "Fix docs for IO readers and strings_convert" (#15872) @vyasr
- Remove problematic call of index setter to unblock dask-cuda CI (#15844) @charlesbluca
- Use rapids_cpm_nvtx3 to get same nvtx3 target state as rmm (#15840) @robertmaynard
- Return boolean from config_host_memory_resource instead of throwing (#15815) @abellina
- Add temporary dask-cudf workaround for categorical sorting (#15801) @rjzamora
- Fix row group alignment in ORC writer (#15789) @vuule
- Raise error when sorting by categorical column in dask-cudf (#15788) @rjzamora
- Upgrade arrow to 16.1 (#15787) @galipremsagar
- Add support for PandasArray for pandas<2.1.0 (#15786) @galipremsagar
- Limit runtime dependency to libarrow>=16.0.0,<16.1.0a0 (#15782) @pentschev
- Fix cat.as_ordered not propagating correct size (#15780) @mroeschke
- Handle mixed-like homogeneous types in isin (#15771) @galipremsagar
- Fix id_vars and value_vars not accepting string scalars in melt (#15765) @mroeschke
- Fix DatetimeIndex.loc for all types of ordering cases (#15761) @galipremsagar
- Fix arrow versioning logic (#15755) @vyasr
- Avoid running sanitizer on Java test designed to cause an error (#15753) @jlowe
- Handle empty dataframe object with index present in setitem of loc (#15752) @galipremsagar
- Eliminate circular reference in DataFrame/Series.iloc/loc (#15749) @mroeschke
- Cap the absolute row index per pass in parquet chunked reader. (#15735) @nvdbaranec
- Fix Index.repeat for datetime64 types (#15722) @galipremsagar
- Fix multibyte check for case convert for large strings (#15721) @davidwendt
- Fix get_loc to properly fetch results from an index that is in decreasing order (#15719) @galipremsagar
- Return same type as the original index for .loc operations (#15717) @galipremsagar
- Correct static builds + static arrow (#15715) @robertmaynard
- Raise errors for unsupported operations on certain types (#15712) @galipremsagar
- Fix ColumnAccessor caching of nrows if empty previously (#15710) @mroeschke
- Allow None when nan_as_null=False in column constructor (#15709) @galipremsagar
- Refine CudaTest.testCudaException in case throwing wrong type of CudaError under aarch64 (#15706) @sperlingxx
- Fix maxima of categorical column (#15701) @rjzamora
- Add proxy for inplace operations in cudf.pandas (#15695) @galipremsagar
- Make nan_as_null behavior consistent across all APIs (#15692) @galipremsagar
- Fix CI s3 api command to fetch latest results (#15687) @galipremsagar
- Add NumpyExtensionArray proxy type in cudf.pandas (#15686) @galipremsagar
- Properly implement binaryops for proxy types (#15684) @galipremsagar
- Fix copy assignment and the comparison operator of rmm_host_allocator (#15677) @vuule
- Fix multi-source reading in JSON byte range reader (#15671) @shrshi
- Return int64 when pandas compatible mode is turned on for get_indexer (#15659) @galipremsagar
- Fix Index contains for error validations and float vs int comparisons (#15657) @galipremsagar
- Preserve sub-second data for time scalars in column construction (#15655) @galipremsagar
- Check row limit size in cudf::strings::join_strings (#15643) @davidwendt
- Enable sorting on column with nulls using query-planning (#15639) @rjzamora
- Fix operator precedence problem in Parquet reader (#15638) @etseidl
- Fix decoding of dictionary encoded FIXED_LEN_BYTE_ARRAY data in Parquet reader (#15601) @etseidl
- Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
- Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
- Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
- Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
- Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
- Ignore new cupy warning (#15574) @vyasr
- Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
- Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
- Fix deprecation warnings for json legacy reader (#15563) @davidwendt
- Fix millisecond resampling in cudf Python (#15560) @mroeschke
- Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
- Fix a JNI bug in JSON parsing fixup (#15550) @revans2
- Remove conda channel setup from wheel CI image script. (#15539) @bdice
- cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
- Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
- Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
- nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
- Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
- Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
- Add new patch to hide more CCCL APIs (#15493) @vyasr
- Make improvements in pandas-test reporting (#15485) @galipremsagar
- Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
- Only use data_type constructor with scale for decimal types (#15472) @wence-
- Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
- Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
- Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
- Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
- Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
- Handle case of scan aggregation in groupby-transform (#15450) @wence-
- Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
- Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
- Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
- Support implicit array conversion with query-planning enabled (#15378) @rjzamora
- Fix arrow-based round trip of empty dataframes (#15373) @wence-
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- Remove boundscheck=False setting in cython files (#15362) @wence-
- Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
- Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
- Disable dask-expr in docs builds. (#15343) @bdice
- Apply the cuFile error work around to data_sink as well (#15335) @vuule
- Fix parquet predicate filtering with column projection (#15113) @karthikeyann
- Check column type equality, handling nested types correctly. (#14531) @bdice
📖 Documentation
- Fix docs for IO readers and strings_convert (#15842) @bdice
- Update cudf.pandas docs for GA (#15744) @beckernick
- Add contributing warning about circular imports (#15691) @er-eis
- Update libcudf developer guide for strings offsets column (#15661) @davidwendt
- Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
- DOC: add pandas intersphinx mapping (#15531) @raybellwaves
- rm-dup-doc in frame.py (#15530) @raybellwaves
- Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
- Doc: interleave columns pandas compat (#15383) @raybellwaves
- Simplified README Examples (#15338) @wkaisertexas
- Add debug tips section to libcudf developer guide (#15329) @davidwendt
- Fix and clarify notes on result ordering (#13255) @shwina
🚀 New Features
- Add JNI bindings for zstd compression of NVCOMP. (#15729) @firestarman
- Fix spaces around CSV quoted strings (#15727) @thabetx
- Add default pinned pool that falls back to new pinned allocations (#15665) @vuule
- Overhaul ops-codeowners coverage (#15660) @raydouglass
- Concatenate dictionary of objects along axis=1 (#15623) @er-eis
- Construct `pylibcudf` columns from objects supporting `__cuda_array_interface__` (#15615) @brandon-b-miller
- Expose some Parquet per-column configuration options via the python API (#15613) @etseidl
- Migrate string `find` operations to `pylibcudf` (#15604) @brandon-b-miller
- Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer (#15600) @etseidl
- Reading multi-line JSON in string columns using runtime configurable delimiter (#15556) @shrshi
- Remove public gtest dependency from libcudf conda package (#15534) @robertmaynard
- Fea/move to latest nanoarrow (#15526) @robertmaynard
- Migrate string `case` operations to `pylibcudf` (#15489) @brandon-b-miller
- Add Parquet encoding statistics to column chunk metadata (#15452) @etseidl
- Implement JNI fo...
[NIGHTLY] v24.08.00
🚨 Breaking Changes
- Align Index init APIs with pandas 2.x (#16362) @mroeschke
- Align Series APIs with pandas 2.x (#16333) @mroeschke
- Add missing `stream` param to dictionary factory APIs (#16319) @JayjeetAtGithub
- Deprecate dtype= parameter in reduction methods (#16313) @mroeschke
- Remove squeeze argument from groupby (#16312) @mroeschke
- Align more DataFrame APIs with pandas (#16310) @mroeschke
- Remove `mr` param from `write_csv` and `write_json` (#16231) @JayjeetAtGithub
- Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195) @mhaseeb123
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- Deprecate Arrow support in I/O (#16132) @lithomas1
- Return `FrozenList` for `Index.names` (#16047) @galipremsagar
- Add compile option to enable large strings support (#16037) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Rename strings multiple target replace API (#15898) @davidwendt
- Pinned vector factory that uses the global pool (#15895) @vuule
- Apply clang-tidy autofixes (#15894) @vyasr
- Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (#15875) @mhaseeb123
- Expose stream parameter to public rolling APIs (#15865) @srinivasyadav18
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Remove legacy JSON reader and concurrent_unordered_map.cuh. (#15813) @bdice
🐛 Bug Fixes
- Ensure managed memory is supported in cudf.pandas. (#16552) @bdice
- Add `flatbuffers` to `libcudf` build (#16446) @galipremsagar
- Fix parquet_field_list read_func lambda capture invalid this pointer (#16440) @davidwendt
- Enable prefetching in cudf.pandas.install() (#16439) @bdice
- Enable prefetching before `runpy` (#16427) @galipremsagar
- Support thread-safe for `prefetch_config::get` and `prefetch_config::set` (#16425) @ttnghia
- Fix a `pandas-2.0` missing attribute error (#16416) @galipremsagar
- [Bug] Remove loud `NativeFile` deprecation noise for `read_parquet` from S3 (#16415) @rjzamora
- Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406) @davidwendt
- Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401) @mroeschke
- Don't export bs_thread_pool (#16398) @KyleFromNVIDIA
- Require fixed width types for casting in `cudf-polars` (#16381) @brandon-b-miller
- Fix docstring of `DataFrame.apply` (#16351) @galipremsagar
- Make bool raise for more cudf objects (#16311) @mroeschke
- Rename `.devcontainer`s for CUDA 12.5 (#16293) @jakirkham
- Fix split_record for all empty strings column (#16291) @davidwendt
- Fix logic in to_arrow for empty list column (#16279) @wence-
- [BUG] Make name attr of Index fast slow attrs (#16270) @Matt711
- Add custom name setter and getter for proxy objects in `cudf.pandas` (#16234) @Matt711
- Fall back when casting a timestamp to numeric in cudf-polars (#16232) @brandon-b-miller
- Disable large string support for Java build (#16216) @jlowe
- Remove CCCL patch for PR 211. (#16207) @bdice
- Add single offset to an empty ListArray in cudf::to_arrow (#16201) @davidwendt
- Fix `memory_usage` when calculating nested list column (#16193) @mroeschke
- Support at/iat indexers in cudf.pandas (#16177) @mroeschke
- Fix unused-return-value debug build error in from_arrow_stream_test.cpp (#16168) @davidwendt
- Fix cudf::strings::replace_multiple hang on empty target (#16167) @davidwendt
- Refactor from_arrow_device/host to use resource_ref (#16160) @harrism
- interpolate returns new column if no values are interpolated (#16158) @mroeschke
- Use provided memory resource for allocating mixed join results. (#16153) @bdice
- Run DFG after verify-alpha-spec (#16151) @KyleFromNVIDIA
- Use size_t to allow large conditional joins (#16127) @bdice
- Allow only scale=0 fixed-point values in fixed_width_column_wrapper (#16120) @davidwendt
- Fix pylibcudf Table.num_rows for 0 columns case and add interop to docs (#16108) @lithomas1
- Add support for proxy `np.flatiter` objects (#16107) @Matt711
- Ensure cudf objects can astype to any type when empty (#16106) @mroeschke
- Support `pd.read_pickle` and `pd.to_pickle` in `cudf.pandas` (#16105) @Matt711
- Fix unnecessarily strict check in parquet chunked reader for choosing split locations. (#16099) @nvdbaranec
- Fix `is_monotonic_*` APIs to include `nan's` (#16085) @galipremsagar
- More safely parse CUDA versions when subprocess output is contaminated (#16067) @brandon-b-miller
- fast_slow_proxy: Don't import assert_eq at top-level (#16063) @wence-
- Prevent bad ColumnAccessor state after .sort_index(axis=1, ignore_index=True) (#16061) @mroeschke
- Fix ArrowDeviceArray interface to pass address of event (#16058) @zeroshade
- Fix a size overflow bug in hash groupby (#16053) @PointKernel
- Fix `atomic_ref` scope when multiple blocks are updating the same output (#16051) @vuule
- Fix initialization error in to_arrow for empty string views (#16033) @wence-
- Fix the int32 overflow when computing page fragment sizes for large string columns (#16028) @mhaseeb123
- Fix the pool size alignment issue (#16024) @PointKernel
- Improve multibyte-split byte-range performance (#16019) @davidwendt
- Fix target counting in strings char-parallel replace (#16017) @davidwendt
- Support IntervalDtype in cudf.from_pandas (#16014) @mroeschke
- Fix memory size in create_byte_range_infos_consecutive (#16012) @davidwendt
- Hide visibility of non public symbols (#15982) @robertmaynard
- Fix Cython typo preventing proper inheritance (#15978) @vyasr
- Fix convert_dtypes with convert_integer=False/convert_floating=True (#15964) @mroeschke
- Fix nunique for `MultiIndex`, `DataFrame`, and all NA case with `dropna=False` (#15962) @mroeschke
- Explicitly build for all GPU architectures (#15959) @vyasr
- Preserve column type and class information in more DataFrame operations (#15949) @mroeschke
- Add array_interface to cudf.pandas numpy.ndarray proxy (#15936) @mroeschke
- Allow tests to be built when stream util is disabled (#15933) @robertmaynard
- Fix JSON multi-source reading when total source size exceeds `INT_MAX` bytes (#15930) @shrshi
- Fix `dask_cudf.read_parquet` regression for legacy timestamp data (#15929) @rjzamora
- Fix offsetalator when accessing over 268 million rows (#15921) @davidwendt
- Fix debug assert in rowgroup_char_counts_kernel (#15902) @davidwendt
- Fix categorical conversion from chunked arrow arrays (#15886) @vyasr
- Handling for `NaN` and `inf` when converting floating point to fixed point types (#15885) @ttnghia
- Manual merge of Branch 24.08 from 24.06 (#15869) @galipremsagar
- Avoid unnecessary `Index` cast in `IndexedFrame.index` setter (#15843) @charlesbluca
- Fix large strings handling in nvtext::character_tokenize (#15829) @davidwendt
- Fix multi-replace target count logic for large strings (#15807) @davidwendt
- Fix JSON parsing memory corruption - Fix Mixed types nested children removal (#15798) @karthikeyann
- Allow anonymous user in devcontainer name. (#15784) @bdice
- Add support for additional metaclasses of proxies and use for ExcelWriter (#15399) @vyasr
📖 Documentation
- Improve Polars docs (#16820) @bdice
- Add docstring for from_dataframe (#16260) @mroeschke
- Update libcudf compiler requirements in contributing doc (#16103) @davidwendt
- Add libcudf public/detail API pattern to developer guide (#16086) @davidwendt
- Explain line profiler and how to know which functions are GPU-accelerated. (#16079) @bdice
- cudf.pandas documentation improvement (#15948) @Matt711
- Reland "Fix docs for IO readers and strings_convert" (#15872)" (#15941) @lithomas1
- Document how to use cudf.pandas in tandem with multiprocessing (#15940) @wence-
- DOC: Add documentation for cudf.pandas in the Developer Guide (#15889) @Matt711
- Improve options docs (#15888) @bdice
- DOC: add linkcode to docs (#15860) @raybellwaves
- DOC: use intersphinx mapping in pandas-compat ext (#15846) @raybellwaves
- Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766) @dagardner-nv
- Update PandasCompat.py to resolve references (#15704) @raybellwaves
🚀 New Features
- Creation of CI artifacts for cudf-polars wheels (#16680) @wence-
- Warn on cuDF failure when `POLARS_VERBOSE` is true (#16308) @brandon-b-miller
- Add `drop_nulls` in `cudf-polars` (#16290) @brandon-b-miller
- [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288) @abellina
- Implement support for scan_ndjson in cudf-polars (#16263) @lithomas1
- Publish cudf-polars nightlies (#16213) @lithomas1
- Modify `make_host_vector` and `make_device_uvector` factories to optionally use pinned memory and kernel copy (#16206) @vuule
- Migrate lists/set_operations to pylibcudf (#16190) @Matt711
- Migrate lists/filling to pylibcudf (#16189) @Matt711
- Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188) @brandon-b-miller
- Use resource_ref for upstream in stream_checking_resource_adaptor (#16187) @harrism
- Migrate lists/modifying to pylibcudf (#16185) @Matt711
- Migrate lists/filtering to pylibcudf (#16184) @Matt711
- Migrate lists/sorting to pylibcudf (#16179) @Matt711
- Add missing methods to lists/list_column_view.pxd in pylibcudf (#16175) @Matt711
- Migrate pylibcudf lists gathering (#16170) @Matt711
- Move kernel vis over to CUDF_HIDDEN (#16165) @robertmaynard
- Add groupby_max multi-threaded benchmark (#16154) @srinivasyadav18
- Promote has_nested_columns to cudf public API (#16131) @robertmaynard
- Promote IO support queries to cudf API (#16125) @robertmaynard
- cudf::merge public API now support passing a user stream (#16124) @robertmaynard
- Add TPC-H inspired examples for Libcudf (#16088) @ja...
v24.04.01
🚨 Breaking Changes
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Change strings_column_view::char_size to return int64 (#15197) @davidwendt
- Upgrade to `arrow-14.0.2` (#15108) @galipremsagar
- Add support for `pandas-2.2` in `cudf` (#15100) @galipremsagar
- Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
- Align MultiIndex.get_indexder with pandas 2.2 change (#15059) @mroeschke
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Deprecate groupby fillna (#15000) @mroeschke
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
- Add `pandas-2.x` support in `cudf` (#14916) @galipremsagar
- Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
🐛 Bug Fixes
- Fix an issue with creating a series from scalar when `dtype='category'` (#15476) @galipremsagar
- Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
- [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
- Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
- Avoid importing dask-expr if "query-planning" config is `False` (#15340) @rjzamora
- Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
- Fix OOB read in `inflate_kernel` (#15309) @vuule
- Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
- Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
- Fix Doxygen check (#15289) @KyleFromNVIDIA
- Reintroduce PANDAS_GE_220 import (#15287) @wence-
- Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
- Fix Parquet decimal64 stats (#15281) @etseidl
- Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
- Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
- Cleanup `hostdevice_vector` and add more APIs (#15252) @ttnghia
- Fix number of rows in randomly generated lists columns (#15248) @vuule
- Fix wrong output for `collect_list`/`collect_set` of lists column (#15243) @ttnghia
- Fix testchunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
- Fix accessing `.columns` by an external API (#15212) @galipremsagar
- [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
- Update labeler and codeowner configs for CMake files (#15208) @PointKernel
- Avoid dict normalization in `__dask_tokenize__` (#15187) @rjzamora
- Fix memcheck error in distinct inner join (#15164) @PointKernel
- Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
- Fix `ListColumn.to_pandas()` to retain `list` type (#15155) @galipremsagar
- Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
- Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
- Remove `const` from `range_window_bounds::_extent`. (#15138) @mythrocks
- DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
- Correctly handle output for `GroupBy.apply` when chunk results are reindexed series (#15109) @brandon-b-miller
- Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
- Fix reductions when DataFrame has MulitIndex columns (#15097) @mroeschke
- Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
- Add support for arrow `large_string` in `cudf` (#15093) @galipremsagar
- Fix `sort_values` pytest failure with pandas-2.x regression (#15092) @galipremsagar
- Resolve path parsing issues in `get_json_object` (#15082) @SurajAralihalli
- Fix bugs in handling of delta encodings (#15075) @etseidl
- Fix `is_device_write_preferred` in `void_sink` and `user_sink_wrapper` (#15064) @vuule
- Eliminate duplicate allocation of nested string columns (#15061) @vuule
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Fix `Index.difference` to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
- Fix `DataFrame.sort_index` to respect `ignore_index` on all axis (#14995) @galipremsagar
- Raise for pyarrow array that is tz-aware (#14980) @mroeschke
- Direct `SeriesGroupBy.aggregate` to `SeriesGroupBy.agg` (#14971) @rjzamora
- Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
- unset `CUDF_SPILL` after a pytest (#14958) @galipremsagar
- Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
- Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
- Fix reading offset for data stream in ORC reader (#14911) @ttnghia
- Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
- Fix dask token normalization (#14829) @rjzamora
- Fix 24.04 versions (#14825) @raydouglass
- Ensure slow private attrs are maybe proxies (#14380) @mroeschke
📖 Documentation
- Ignore DLManagedTensor in the docs build (#15392) @davidwendt
- Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
- Temporarily disable docs errors. (#15265) @bdice
- Update `developer_guide.md` with new guidance on quoted internal includes (#15238) @harrism
- Fix broken link for developer guide (#15025) @sanjana098
- [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
- Update cudf.pandas FAQ. (#14940) @bdice
- Optimize doc builds (#14856) @vyasr
- Add developer guideline to use east const. (#14836) @bdice
- Document how cuDF is pronounced (#14753) @pentschev
- Notes convert to Pandas-compat (#12641) @Touutae-lab
🚀 New Features
- Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
- Use JNI pinned pool resource with cuIO (#15255) @abellina
- Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
- Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
- [JNI] rmm based pinned pool (#15219) @abellina
- Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
- Enable creation of columns from scalar (#15181) @vyasr
- Use NVTX from GitHub. (#15178) @bdice
- Implement `segmented_row_bit_count` for computing row sizes by segments of rows (#15169) @ttnghia
- Implement search using pylibcudf (#15166) @vyasr
- Add distinct left join (#15149) @PointKernel
- Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
- Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
- Automate include grouping order in .clang-format (#15063) @harrism
- Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
- API for JSON unquoted whitespace normalization (#15033) @shrshi
- Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
- Implement replace in pylibcudf (#15005) @vyasr
- Add distinct key inner join (#14990) @PointKernel
- Implement rolling in pylibcudf (#14982) @vyasr
- Implement joins in pylibcudf (#14972) @vyasr
- Implement scans and reductions in pylibcudf (#14970) @vyasr
- Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
- Implement groupby in pylibcudf (#14945) @vyasr
- Support casting of Map type to string in JSON reader (#14936) @karthikeyann
- POC for whitespace removal in input JSON data using FST (#14931) @shrshi
- Support for LZ4 compression in ORC and Parquet (#14906) @vuule
- Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
- Migrate unary operations to pylibcudf (#14850) @vyasr
- Migrate binary operations to pylibcudf (#14821) @vyasr
- Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
- Support CUDA 12.2 (#14712) @jameslamb
🛠️ Improvements
- Backport: Relax protobuf lower bound to 3.20. (#15506) (#15610) @bdice
- Use `conda env create --yes` instead of `--force` (#15403) @bdice
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Enable branch testing for `cudf.pandas` (#15316) @galipremsagar
- Replace black with ruff-format (#15312) @mroeschke
- This fixes an NPE when trying to read empty JSON data by adding a new API for missing information (#15307) @revans2
- Address poor performance of Parquet string decoding (#15304) @etseidl
- Update script input name (#15301) @AyodeAwe
- Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
- Add timeout for `cudf.pandas` pandas tests (#15284) @galipremsagar
- Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
- Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
- Implement grouped product scan (#15254) @wence-
- Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
- Implement DataFrame|Series.squeeze (#15244) @mroeschke
- Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
- Remove create_chars_child_column utility (#15241) @davidwendt
- Update dlpack to version 0.8 (#15237) @dantegd
- Improve performance in JSON reader when `mixed_types_as_string` option is enabled (#15236) @shrshi
- Remove row conversion code from libcudf (#15234) @ttnghia
- Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
- Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeSC...
v24.04.00
🚨 Breaking Changes
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Change strings_column_view::char_size to return int64 (#15197) @davidwendt
- Upgrade to `arrow-14.0.2` (#15108) @galipremsagar
- Add support for `pandas-2.2` in `cudf` (#15100) @galipremsagar
- Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
- Align MultiIndex.get_indexder with pandas 2.2 change (#15059) @mroeschke
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Deprecate groupby fillna (#15000) @mroeschke
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
- Add `pandas-2.x` support in `cudf` (#14916) @galipremsagar
- Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
🐛 Bug Fixes
- Fix an issue with creating a series from scalar when `dtype='category'` (#15476) @galipremsagar
- Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
- [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
- Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
- Avoid importing dask-expr if "query-planning" config is `False` (#15340) @rjzamora
- Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
- Fix OOB read in `inflate_kernel` (#15309) @vuule
- Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
- Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
- Fix Doxygen check (#15289) @KyleFromNVIDIA
- Reintroduce PANDAS_GE_220 import (#15287) @wence-
- Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
- Fix Parquet decimal64 stats (#15281) @etseidl
- Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
- Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
- Cleanup `hostdevice_vector` and add more APIs (#15252) @ttnghia
- Fix number of rows in randomly generated lists columns (#15248) @vuule
- Fix wrong output for `collect_list`/`collect_set` of lists column (#15243) @ttnghia
- Fix testchunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
- Fix accessing `.columns` by an external API (#15212) @galipremsagar
- [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
- Update labeler and codeowner configs for CMake files (#15208) @PointKernel
- Avoid dict normalization in `__dask_tokenize__` (#15187) @rjzamora
- Fix memcheck error in distinct inner join (#15164) @PointKernel
- Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
- Fix `ListColumn.to_pandas()` to retain `list` type (#15155) @galipremsagar
- Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
- Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
- Remove `const` from `range_window_bounds::_extent`. (#15138) @mythrocks
- DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
- Correctly handle output for `GroupBy.apply` when chunk results are reindexed series (#15109) @brandon-b-miller
- Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
- Fix reductions when DataFrame has MulitIndex columns (#15097) @mroeschke
- Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
- Add support for arrow `large_string` in `cudf` (#15093) @galipremsagar
- Fix `sort_values` pytest failure with pandas-2.x regression (#15092) @galipremsagar
- Resolve path parsing issues in `get_json_object` (#15082) @SurajAralihalli
- Fix bugs in handling of delta encodings (#15075) @etseidl
- Fix `is_device_write_preferred` in `void_sink` and `user_sink_wrapper` (#15064) @vuule
- Eliminate duplicate allocation of nested string columns (#15061) @vuule
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Fix `Index.difference` to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
- Fix `DataFrame.sort_index` to respect `ignore_index` on all axis (#14995) @galipremsagar
- Raise for pyarrow array that is tz-aware (#14980) @mroeschke
- Direct `SeriesGroupBy.aggregate` to `SeriesGroupBy.agg` (#14971) @rjzamora
- Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
- unset `CUDF_SPILL` after a pytest (#14958) @galipremsagar
- Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
- Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
- Fix reading offset for data stream in ORC reader (#14911) @ttnghia
- Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
- Fix dask token normalization (#14829) @rjzamora
- Fix 24.04 versions (#14825) @raydouglass
- Ensure slow private attrs are maybe proxies (#14380) @mroeschke
📖 Documentation
- Ignore DLManagedTensor in the docs build (#15392) @davidwendt
- Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
- Temporarily disable docs errors. (#15265) @bdice
- Update `developer_guide.md` with new guidance on quoted internal includes (#15238) @harrism
- Fix broken link for developer guide (#15025) @sanjana098
- [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
- Update cudf.pandas FAQ. (#14940) @bdice
- Optimize doc builds (#14856) @vyasr
- Add developer guideline to use east const. (#14836) @bdice
- Document how cuDF is pronounced (#14753) @pentschev
- Notes convert to Pandas-compat (#12641) @Touutae-lab
🚀 New Features
- Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
- Use JNI pinned pool resource with cuIO (#15255) @abellina
- Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
- Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
- [JNI] rmm based pinned pool (#15219) @abellina
- Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
- Enable creation of columns from scalar (#15181) @vyasr
- Use NVTX from GitHub. (#15178) @bdice
- Implement `segmented_row_bit_count` for computing row sizes by segments of rows (#15169) @ttnghia
- Implement search using pylibcudf (#15166) @vyasr
- Add distinct left join (#15149) @PointKernel
- Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
- Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
- Automate include grouping order in .clang-format (#15063) @harrism
- Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
- API for JSON unquoted whitespace normalization (#15033) @shrshi
- Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
- Implement replace in pylibcudf (#15005) @vyasr
- Add distinct key inner join (#14990) @PointKernel
- Implement rolling in pylibcudf (#14982) @vyasr
- Implement joins in pylibcudf (#14972) @vyasr
- Implement scans and reductions in pylibcudf (#14970) @vyasr
- Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
- Implement groupby in pylibcudf (#14945) @vyasr
- Support casting of Map type to string in JSON reader (#14936) @karthikeyann
- POC for whitespace removal in input JSON data using FST (#14931) @shrshi
- Support for LZ4 compression in ORC and Parquet (#14906) @vuule
- Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
- Migrate unary operations to pylibcudf (#14850) @vyasr
- Migrate binary operations to pylibcudf (#14821) @vyasr
- Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
- Support CUDA 12.2 (#14712) @jameslamb
🛠️ Improvements
- Use `conda env create --yes` instead of `--force` (#15403) @bdice
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Enable branch testing for `cudf.pandas` (#15316) @galipremsagar
- Replace black with ruff-format (#15312) @mroeschke
- This fixes an NPE when trying to read empty JSON data by adding a new API for missing information (#15307) @revans2
- Address poor performance of Parquet string decoding (#15304) @etseidl
- Update script input name (#15301) @AyodeAwe
- Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
- Add timeout for `cudf.pandas` pandas tests (#15284) @galipremsagar
- Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
- Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
- Implement grouped product scan (#15254) @wence-
- Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
- Implement DataFrame|Series.squeeze (#15244) @mroeschke
- Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
- Remove create_chars_child_column utility (#15241) @davidwendt
- Update dlpack to version 0.8 (#15237) @dantegd
- Improve performance in JSON reader when `mixed_types_as_string` option is enabled (#15236) @shrshi
- Remove row conversion code from libcudf (#15234) @ttnghia
- Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
- Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
- Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
- Clean...