Releases: modin-project/modin

Modin 0.19.0

09 Mar 21:08
0.19.0
8d3db2b

This release introduces Modin's new, experimental NumPy API. It also features
many bug fixes, improvements to documentation, and performance optimizations,
including faster initialization with NumPy arrays.

Key Features and Updates Since 0.18.0

  • Stability and Bugfixes
    • FIX-#0000: Fix a typo in expr.py (#5757)
    • FIX-#1227: Avoid RecursionError for __int__ and __float__ (#5502)
    • FIX-#1503: Proper implementation of Series.values (#5469)
    • FIX-#2320: Raise exceptions in read_csv in some cases with skipfooter!=0 (#5522)
    • FIX-#2493: Defaults to pandas for read_csv if lineterminator!=None (#5515)
    • FIX-#2494: Defaults to pandas for read_csv if escapechar!=None (#5521)
    • FIX-#2508: Defaults to pandas for read_csv if dialect!=None (#5512)
    • FIX-#3080: read_csv with HDK backend doesn't handle duplicated columns (#5639)
    • FIX-#3305: Fix read_excel when usecols and index_cols parameters are provided (#5508)
    • FIX-#3620: Fix construction of dataframe from index (#5490)
    • FIX-#3928: Fix column insertion into empty data frame (#5103)
    • FIX-#4154: add value_counts method for SeriesGroupBy and DataFrameGroupBy (#5453)
    • FIX-#4186: Fix __repr__ of Modin categorical Series (#5516)
    • FIX-#4640: Fix __repr__ when display.max_rows=None (#5504)
    • FIX-#5165: make 'groupby' handle non-str 'by' columns (#5411)
    • FIX-#5273: Make ParquetFileToRead a named tuple (#5352)
    • FIX-#5430: Make groupby work on empty frames (#5442)
    • FIX-#5436: Fix '.index' extraction for an empty frame (#5431)
    • FIX-#5473: Fixed a bug that ignored positional arguments in DataFrameGroupBy.take() (#5474)
    • FIX-#5477: Fix TypeError: read_sas() takes 1 positional argument but 2 were given (#5465)
    • FIX-#5488: Remove usage of deprecated numpy types (#5487)
    • FIX-#5492: Fix Series.values when Series.dtype==ExtensionDtype (#5493)
    • FIX-#5514: pin sphinx<6.0.0 (#5513)
    • FIX-#5531: Fix failure when inserting a 2D python list into a frame (#5555)
    • FIX-#5537: disable empty-groupby handling logic in experimental mode (#5538)
    • FIX-#5539: Allow partitioning to adapt to the shape changes caused by '.merge' (#5556)
    • FIX-#5545: Aligned with pandas default 'groupby.skew' results for invalid data (#5558)
    • FIX-#5552: Fix sort_values when data is over-partitioned. (#5553)
    • FIX-#5561: CalciteSerializer does not support unsigned integers (#5563)
    • FIX-#5568: Pin 'fastparquet<2023.1.0' (#5569)
    • FIX-#5581: Don't use deprecated inplace parameter for set_axis function (#5579)
    • FIX-#5589: Do not trigger metadata materialization on 'filter' (#5588)
    • FIX-#5597: pin sqlalchemy<1.4.46 as pandas does to fix CI (#5593)
    • FIX-#5598: make PyArrowDataset.files work for 3.0.0 <= pyarrow < 8.0.0 (#5592)
    • FIX-#5600: Copy '.dtypes' on 'df.copy()' (#5601)
    • FIX-#5604: Fix dictionary groupby aggregation for a single col partition case (#5605)
    • FIX-#5608: Pin openpyxl<3.1.0 (#5603)
    • FIX-#5610: Add default to pandas implementation for qcut (#5611)
    • FIX-#5621: Do not preserve suboptimal partitioning on keep_partitioning=False (#5622)
    • FIX-#5625: Fix set_index with modin series. (#5630)
    • FIX-#5628: BUG: HDK: Unable to concatenate tables with different number of non-numeric columns (#5673)
    • FIX-#5629: Make read_sql alias compatible with snowflake. (#5631)
    • FIX-#5650: Restore the right dtype for applying Series.cat (#5651)
    • FIX-#5665: Fix operations that flatten an array, as well as handling of where argument in such operations (#5668)
    • FIX-#5698: Read list of parquet files (#5725)
    • FIX-#5702: Fix passing RangeIndex to loc. (#5719)
    • FIX-#5714: BUG: Empty frames concatenation with inner join is not valid (#5715)
    • FIX-#5720: Ensure that modin.numpy arrays propagate NaN values when computing mean (#5735)
    • FIX-#5721: Fix loc[tuple] on multiindex. (#5726)
    • FIX-#5730: Add repr, len, size, and make dtype changing lazy. (#5731)
    • FIX-#5733: Allow all Modin objects in all Modin object constructors, and make sure copy=False works (#5736)
    • FIX-#5742: BUG: HDK: Binary operations on strings are not supported (#5743)
    • FIX-#5761: Add _exp, _sqrt to query compiler (#5762)
  • Performance enhancements
    • PERF-#5182: Precompute dtypes when performing binary operations in certain cases (#5494)
    • PERF-#5183: Compute dtypes when performing from_labels operation (#5478)
    • PERF-#5247: Make MultiIndex use memory more efficiently (#5632)
    • PERF-#5369: GroupBy.skew implementation via MapReduce pattern (#5318)
    • PERF-#5484: speed up read_csv; compute metadata after skipping rows (#5482)
    • PERF-#5549: copy dtypes for invert op (#5541)
    • PERF-#5550: Don't trigger axes computation in to_pandas function (#5544)
    • PERF-#5551: Preserve index and columns on _repartition (#5543)
    • PERF-#5554: Implement drop_duplicates via new duplicated (#5587)
    • PERF-#5557: Don't trigger axes computation in pivot_table (#5546)
    • PERF-#5573: Don't trigger axes computation in columnarize function (#5548)
    • PERF-#5575: Don't trigger axes computation in reset_index function (#5547)
    • PERF-#5586: Precompute resulting '.merge' partitioning based on the arguments (#5585)
    • PERF-#5589: Do not trigger 'dtypes' materialization for '.filter()' (#5595)
    • PERF-#5596: Do not trigger index materialization for '.merge' result (#5619)
    • PERF-#5613: Optimize duplicated in case there is only one column partition (#5640)
    • FIX-#5641: Add fastpath for numpy arrays to dataframe constructor (#5655)
    • PERF-#5657: Don't trigger axes computation when accessing .str.* methods (#5658)
    • PERF-#5660: Don't trigger axes computation when accessing cat.codes (#5661)
    • PERF-#5680: Don't trigger axes computation when doing binary operations (#5681)
    • PERF-#5682: Don't trigger axes computation when calling isin (#5683)
    • PERF-#5690: move read_callback from dispatchers into parsers (#5689)
    • PERF-#5691: Set item via .loc without converting a Series to np.array (#5693)
    • PERF-#5700: Treat numpy arrays more efficiently at df.__setitem__ (#5708)
    • PERF-#5705: Preserve metadata when applying Series.cat.codes (#5706)
    • PERF-#5709: Avoid re-putting a distributed Series to the engine's object store at .map() (#5704)
    • PERF-#5710: Avoid re-putting a distributed Series to the engine's object store at .isin() (#5707)
  • Refactor Codebase
    • REFACTOR-#0000: make deploy functions in virtual_partition.py files private (#5455)
    • REFACTOR-#1531: move default_to_pandas into base query_compiler class (#5479)
    • REFACTOR-#3883: Unify tests execution approach in the Github workflow files (#5520)
    • REFACTOR-#3948: Use __constructor__ in DataFrame and Series classes (#5485)
    • REFACTOR-#5275: Deduplicate code for Ray and Unidist engines (#5457)
    • REFACTOR-#5370: Move merge_asof implementation to base query compiler. (#5371)
    • REFACTOR-#5393: remove unused '_VIEW_IS_COPY_WARNING' global var (#5392)
    • REFACTOR-#5416: fix FutureWarning: the mangle_dupe_cols keyword is deprecated for read_excel (#5415)
    • REFACTOR-#5434: Define public interfaces in modin.core.execution.dask module (#5418)
    • REFACTOR-#5459: Install code linters through conda and unpin flake8 (#5450)
    • REFACTOR-#5462: Update execution.ray public api with virtual partitions (#5456)
    • REFACTOR-#5467: remove FutureWarning for df.iloc[:, i] = newvals (#5468)
    • REFACTOR-#5471: add FutureWarning for DataFrameGroupBy.backfill (#5472)
    • REFACTOR-#5475: Update execution.unidist public api with virtual partitions (#5476)
    • REFACTOR-#5535: remove duplication for 'columnarize' method (#5534)
    • REFACTOR-#5607: Fix missing formatting with 'black' (#5606)
    • REFACTOR-#5685: add RayWrapper.put implementation (#5686)
    • REFACTOR-#5687: add UnidistWrapper.put implementation (#5688)
    • REFACTOR-#5703: align 'DaskWrapper.deploy' behavior with others (#5701)
    • REFACTOR-#5718: add columns parameter for get_dtypes function (#5717)
  • Update testing suite
    • TEST-#0000: correct behavior of CI for push action (#5748)
    • TEST-#5420: port asv benchmarks for Repr, MaskBool, isNull, dropNa and equals functions (#5421)
    • TEST-#5444: reduce Series' shape for TimeReindex asv bench (#5443)
    • TEST-#5448: reduce DataFrame's shape for 'time_merge_default' asv bench (#5446)
    • TEST-#5451: reduce shapes for TimeLevelAlign, TimeStack and TimeUnstack ASV benchmarks (#5452)
    • TEST-#5540: add module level setup function for ASV benchmarks (#5530)
    • TEST-#5664: speedup Post Run conda-incubator/setup-miniconda@v2 step on Windows (#5662)
    • TEST-#5747: Synchronize jobs between push.yml and ci.yml that are used to measure test coverage (#5745)
    • TEST-#5764: run test-asv-benchmarks CI job only for PRs (#5765)
  • Documentation improvements
    • DOCS-#3803: Update "building modin from source" docs (#5480)
    • DOCS-#5157: Add a note regarding poor perf of the first op with Modin on Ray (#5491)
    • DOCS-#5463: Add jupyter tutorials for Modin on Unidist (#5464)
    • DOCS-#5498: mention 'DataFrame._repartition' API at docs (#5499)
  • New Features
    • FEAT-#5147: implement xs (#5143)
    • FEAT-#5423: Add a NumPy API to Modin (#5422)
    • FEAT-#5481: Implement dictionary groupby aggregation via TreeReduce (#5503)
    • FEAT-#5559: Upgrade pandas to 1.5.3 (#5560)
    • FEAT-#5562: Upgrade pyhdk to 0.3.1 (#5564)
    • FEAT-#5620: Synchronize parameters of apply_full_axis with broadcast_apply_full_axis (#5637)
    • FEAT-#5666: Support logic operations on modin numpy arrays (#5667)
    • FEAT-#5751: Bump pyhdk version to 0.4 (#5752)
    • FEAT-#5753: Add math functions necessary for picoGPT (#5756)
    • FEAT-#5754: Add np.linalg operations (#5755)
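
Two entries above (PERF-#5369 and FEAT-#5481) describe groupby aggregations built on a MapReduce/TreeReduce pattern. The sketch below illustrates that general pattern in plain Python for a groupby mean. It is not Modin's implementation: the function names, the list-of-rows "partitions", and the choice of mean rather than skew are all assumptions made to keep the example small.

```python
from collections import defaultdict
from functools import reduce


def map_partition(rows):
    """Map step: compute per-group [sum, count] for one partition of (key, value) rows."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        partial[key][0] += value
        partial[key][1] += 1
    return dict(partial)


def merge_partials(left, right):
    """Reduce step: combine two partial results group by group."""
    merged = {key: list(acc) for key, acc in left.items()}
    for key, (total, count) in right.items():
        acc = merged.setdefault(key, [0.0, 0])
        acc[0] += total
        acc[1] += count
    return merged


def groupby_mean(partitions):
    """Run map on every partition, reduce the partials, then finalize the mean."""
    partials = [map_partition(p) for p in partitions]  # independent, so parallelizable
    combined = reduce(merge_partials, partials, {})
    return {key: total / count for key, (total, count) in combined.items()}


# Hypothetical data: two row partitions of (group key, value) pairs.
partitions = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0), ("b", 6.0), ("c", 5.0)],
]
result = groupby_mean(partitions)
```

The map step only needs its own partition, which is why this shape of aggregation can avoid gathering a whole axis before computing the result.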

Contributors

@AndreyPavlenko...

Modin 0.18.1

26 Jan 06:32
0.18.1
9068fbc

This release includes pandas 1.5.3 support and a bunch of bug fixes.

Key Features and Updates Since 0.18.0

  • Stability and Bugfixes
    • FIX-#1227: Avoid RecursionError for __int__ and __float__ (#5502)
    • FIX-#1503: Proper implementation of Series.values (#5469)
    • FIX-#2320: Raise exceptions in read_csv in some cases with skipfooter!=0 (#5522)
    • FIX-#2493: Defaults to pandas for read_csv if lineterminator!=None (#5515)
    • FIX-#2494: Defaults to pandas for read_csv if escapechar!=None (#5521)
    • FIX-#2508: Defaults to pandas for read_csv if dialect!=None (#5512)
    • FIX-#3620: Fix construction of dataframe from index (#5490)
    • FIX-#3928: Fix column insertion into empty data frame (#5103)
    • FIX-#4186: Fix __repr__ of Modin categorical Series (#5516)
    • FIX-#5165: make 'groupby' handle non-str 'by' columns (#5411)
    • FIX-#5273: Make ParquetFileToRead a named tuple (#5352)
    • FIX-#5436: Fix '.index' extraction for an empty frame (#5431)
    • FIX-#5473: Fixed a bug that ignored positional arguments in DataFrameGroupBy.take() (#5474)
    • FIX-#5477: Fix TypeError: read_sas() takes 1 positional argument but 2 were given (#5465)
    • FIX-#5488: Remove usage of deprecated numpy types (#5487)
    • FIX-#5492: Fix Series.values when Series.dtype==ExtensionDtype (#5493)
    • FIX-#5514: pin sphinx<6.0.0 (#5513)
    • FIX-#5531: Fix failure when inserting a 2D python list into a frame (#5555)
    • FIX-#5568: Pin 'fastparquet<2023.1.0' (#5569)
  • New Features
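
Three of the read_csv fixes listed above (FIX-#2493, FIX-#2494, FIX-#2508) follow one rule: when lineterminator, escapechar, or dialect is set to something other than None, read_csv defaults to pandas. A minimal sketch of that rule is shown here. The names FALLBACK_PARAMS, needs_pandas_fallback, and choose_reader are hypothetical and do not come from Modin; the real dispatch code is not reproduced in these notes.

```python
# Hypothetical helper names; this only models the rule stated in the entries above.
FALLBACK_PARAMS = ("lineterminator", "escapechar", "dialect")


def needs_pandas_fallback(**read_csv_kwargs):
    """Return the parameter names that would trigger defaulting to pandas."""
    return [
        name for name in FALLBACK_PARAMS
        if read_csv_kwargs.get(name) is not None
    ]


def choose_reader(**read_csv_kwargs):
    """Pick a reader label: 'pandas' if any trigger is set, else 'parallel'."""
    triggers = needs_pandas_fallback(**read_csv_kwargs)
    return ("pandas", triggers) if triggers else ("parallel", [])


default_case = choose_reader(sep=",")
fallback_case = choose_reader(escapechar="\\", dialect="excel")
```

Here default_case selects the parallel path because none of the three parameters is set, while fallback_case reports which parameters forced the pandas path.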

Contributors

@AndreyPavlenko
@YarShev
@anmyachev
@dchigarev
@vnlitvinov
@Retribution98

Modin 0.18.0

12 Dec 17:29
0.18.0
ba7ab8e

This release includes support for MPI backend using Unidist, improvements to the shuffling mechanism,
SQL query execution on the HDK backend (currently pyhdk==0.3), support for pandas 1.5.2 and external query compilers.
It also includes many bug fixes and some performance enhancements.

Key Features and Updates Since 0.17.0

  • Stability and Bugfixes
    • FIX-#3823: Fix TypeError when creating Series from SparseArray (#5377)
    • FIX-#4100: Fall back to Pandas on row drop (#4937)
    • FIX-#4636: Allows read_parquet to detect column partitioning in non-local filesystems (#5192)
    • FIX-#4859: Add support for PyArrow Dictionary Arrays to type mapping (#4864)
    • FIX-#4859: Add support for PyArrow Dictionary Arrays to type mapping (#5271)
    • FIX-#5016: Suppress spammy ray task errors. (#5298)
    • FIX-#5114: Change mask name to resolve namespace conflict with numpy mask (#5215)
    • FIX-#5137: df.info failure with default columns (#5251)
    • FIX-#5138: df_categories_equals typo (#5250)
    • FIX-#5171: Allow xgboost >= 1.7.0. (#5195)
    • FIX-#5186: set_index case with multiindex (#5190)
    • FIX-#5187: Fixed RecursionError in OmnisciLaunchParameters.get() (#5199)
    • FIX-#5204: Fix binary operations with a dictionary (#5205)
    • FIX-#5208: Support ray==2.1.0 (#5283)
    • FIX-#5232: Stop changing original series names during binary ops. (#5249)
    • FIX-#5234: Use query compiler str_repeat. (#5235)
    • FIX-#5236: Allow binary operations with custom classes. (#5237)
    • FIX-#5238: Make rmul really rmul instead of mul. (#5246)
    • FIX-#5240: Fix dask[complete] syntax in conda environment files (#5241)
    • FIX-#5252: Disable notebook tests until access control issues are resolved for modin-test bucket (#5257)
    • FIX-#5277: Fix internal execute function (#5278)
    • FIX-#5284: Move ray, redis, tqdm, xgboost packages from pip to conda deps (#5270)
    • FIX-#5285: Check for both pyarrow and fastparquet when reading parquet format (#5297)
    • FIX-#5306: Fix code scanning alert - Use of the return value of a procedure (#5307)
    • FIX-#5308: Allow custom execution with no known engine. (#5379)
    • FIX-#5319: Do not use deprecated '.iteritems()' (#5320)
    • FIX-#5325: Fix read_csv_glob with non-empty parse_dates dict (#5339)
    • FIX-#5327: Bump mypy cap to fix CI. (#5328)
    • FIX-#5364: Fix get_indices internal function (#5355)
    • FIX-#5380: Fix warning about setting _cache attribute. (#5381)
    • FIX-#5398: Resolve length 1 nonNA partition issue, and off by one error in sort (#5400)
    • FIX-#5405: Pin ray>=1.13.0 (#5390)
  • Performance enhancements
    • PERF-#5225: Do not convert 'value' to a list at '.insert()' (#5226)
    • PERF-#5268: Call get on all partitions at once in to_pandas (#4776)
  • Refactor Codebase
    • REFACTOR-#5202: Pass loc arguments to query compiler. (#5305)
    • REFACTOR-#5262: Update the examples to the latest version of the omniscripts (#5263)
    • REFACTOR-#5287: Remove code to test getting TypeError for Series.dropna (#5288)
    • REFACTOR-#5294: Fix code scanning alert - Potentially uninitialized local variable (#5383)
    • REFACTOR-#5299: Variable defined multiple times error found by CodeQL (#5300)
    • REFACTOR-#5301: Fix code scanning alert - Duplicate key in dict literal (#5302)
    • REFACTOR-#5303: Fix code scanning alert - Unused local variable (#5304)
    • REFACTOR-#5310: Remove some hasattr('columns') checks. (#5311)
    • REFACTOR-#5312: Let lazy query compilers check for astype and drop errors. (#5313)
    • REFACTOR-#5322: Remove python3.7 related code from read_csv_glob (#5323)
    • REFACTOR-#5330: Remove BaseIO._read (#5329)
    • REFACTOR-#5332: Define PQ_INDEX_REGEX as class variable (#5333)
    • REFACTOR-#5334: Make _validate as classmethod (#5331)
    • REFACTOR-#5335: Remove unnecessary lambdas (#5336)
    • REFACTOR-#5359: Fix code scanning alert - File is not always closed (#5362)
    • REFACTOR-#5363: Introduce partition constructor; move add_to_apply_calls impl in base class (#5354)
    • REFACTOR-#5382: Use pandas.util.cache_readonly for __constructors__ (#5368)
    • REFACTOR-#5386: Move partition.split implementation in base class (#5384)
    • REFACTOR-#5391: Improve setup function in TimeDropDuplicatesDataframe (#5389)
    • REFACTOR-#5413: Check Index.dtype instead of isinstance(obj, Int64Index) (#5406)
  • Update testing suite
    • TEST-#2073: Check that read_csv can use a parse_dates dict. (#4572)
    • TEST-#4562: In windows CI, try to start ray a few times (#5101)
    • TEST-#4821: Monkeypatch cache_readonly to avoid errors in doc_checker.py (#5365)
    • TEST-#5123: Add CodeQL workflow for GitHub code scanning (#5222)
    • TEST-#5219: Relax matplotlib and coverage pins (#5216)...

Modin 0.17.1

25 Nov 13:55
0.17.1
7f801ad

This release includes pandas 1.5.2 support and a bunch of bug fixes.

Key Features and Updates Since 0.17.0

  • Stability and Bugfixes
    • FIX-#4100: Fall back to Pandas on row drop (#4937)
    • FIX-#4636: allows read_parquet to detect column partitioning in non-local filesystems (#5192)
    • FIX-#5138: df_categories_equals typo (#5250)
    • FIX-#5186: set_index case with multiindex (#5190)
    • FIX-#5187: Fixed RecursionError in OmnisciLaunchParameters.get() (#5199)
    • FIX-#5204: fix binary operations with a dictionary (#5205)
    • FIX-#5232: Stop changing original series names during binary ops. (#5249)
    • FIX-#5234: Use query compiler str_repeat. (#5235)
    • FIX-#5236: Allow binary operations with custom classes. (#5237)
    • FIX-#5252: Disable notebook tests until access control issues are resolved for modin-test bucket (#5257)
  • New Features

Contributors

@AndreyPavlenko
@Billy2551
@RehanSD
@YarShev
@anmyachev
@dchigarev
@mvashishtha
@noloerino

Modin 0.17.0

11 Nov 14:45
e50cec1

This release includes support for pyhdk 0.2. It also includes many bug fixes and some performance enhancements.

Key Features and Updates Since 0.16.0

  • Stability and Bugfixes
    • FIX-#3764: Ensure df.loc with a scalar out of bounds appends to df (#3765)
    • FIX-#4016, FIX-#4086, FIX-#4039: Fall back to pandas in case of duplicate column names (#4896)
    • FIX-#4023: Fall back to pandas in case of MultiIndex columns (#5149)
    • FIX-#4660: Fix fillna when Modin series object is an argument (#4674)
    • FIX-#5034: Handle lists in df.get() (#5035)
    • FIX-#5097: Stop using deprecated mangle_dup_cols. (#5104)
    • FIX-#5098: Stop using append internally. (#5100)
    • FIX-#5099: Fix PandasQueryCompiler.groupby_mean with timestamp in by (#5140)
    • FIX-#5112: allows empty partition to be passed into query_compiler.dt_prop_map (#5133)
    • FIX-#5128: Fix reading parquet directory from s3. (#5129)
    • FIX-#5150: Sync row labels after read_csv when index_col is False (#5151)
    • FIX-#5158: Synchronize metadata before to_parquet (#5161)
    • FIX-#5168: module 'collections' has no attribute 'Sequence' in dataframe protocol (#5169)
    • FIX-#5174: Pin xgboost < 1.7. (#5175)
    • FIX-#5180: Do not set OMP_NUM_THREADS=1 on modin.pandas init (#5181)
    • FIX-#5184: Fix get_dummies to respect passed columns to be encoded (#5185)
    • FIX-#5188: Fix getitem_bool when the key is Series with empty partition (#5189)
    • FIX-#5206: pin mypy<0.990 (#5207)
    • FIX-#5208: pin ray version under 2.1.0 (#5209)
  • Performance enhancements
    • PERF-#5029: Don't use _compute_axis_labels_and_lengths for computing _row_lengths/_column_widths (#5030)
    • PERF-#5087: use cache for widths/lengths/index/columns if possible (#5031)
    • PERF-#5162: precompute new row/column lengths in '._reorder_labels' (#5144)
  • Refactor Codebase
    • REFACTOR-#4631: Add mypy checks for modin.distributed (#5109)
    • REFACTOR-#5079: Add mypy checks for modin.core.dataframe.base (#5110)
    • REFACTOR-#5092: Fix future warning for set_axis function (#5093)
  • Update testing suite
    • TEST-#4982: Require format for PR descriptions instead of commit descriptions (#5117)
    • TEST-#5124: Disable codecov comments. (#5125)
    • TEST-#5135: Return CI back after accidental removal (#5136)
    • TEST-#5172: Add fuzzydata logs to artifacts (#5173)
  • Benchmarking enhancements
    • BENCH: add some cases for join and merge ops from pandas (#5021)
    • TEST-#5102: Add HDK benchmarks to github workflows (#5063)
  • Documentation improvements
    • DOCS-#3634: Fix examples related to ProgressBar usage (#5119)
    • DOCS-#5019: Update HDK on native documentation (#5088)
    • DOCS-#5095: Remove release note checkbox from PR template (#5096)
    • DOCS-#5105: Update release procedure (#5106)
  • New Features
    • FEAT-#5120: Update to pyhdk 0.2 (#5121)
    • FEAT-#5141: Implement 2D insertion of Modin DFs in .__setitem__ (#5142)
    • FEAT-#5145: Upgrade pandas to 1.5.1 (#5146)

Contributors

@AndreyPavlenko
@Billy2551
@RehanSD
@YarShev
@anmyachev
@dchigarev
@devin-petersohn
@ienkovich
@mvashishtha
@noloerino
@pyrito
@rosdyana
@shalearkane
@suhailrehman
@vnlitvinov

Modin 0.16.2

21 Oct 21:00
3f01114

This release includes pandas 1.5.1 support and two bug fixes.

Key Features and Updates

  • Stability and Bugfixes
    • FIX-#4016, FIX-#4086, FIX-#4039: Fall back to pandas in case of duplicate column names (#4896)
    • FIX-#5128: Fix reading parquet directory from s3. (#5129)
  • New Features

Contributors

@AndreyPavlenko
@mvashishtha
@YarShev

Modin 0.16.1

11 Oct 21:08
0.16.1
98a9694

This release features a bug fix, as well as fixes for deprecation warnings introduced by pandas 1.5.

Key Features and Updates

  • Stability and Bugfixes
    • FIX-#5034: Handle lists in df.get() (#5035)
    • FIX-#5098: Stop using append internally. (#5100)
    • FIX-#5097: Stop using deprecated mangle_dup_cols. (#5104)
  • Refactor Codebase
    • REFACTOR-#5092: Fix future warning for set_axis function (#5093)

Contributors

@mvashishtha
@pyrito
@anmyachev
@vnlitvinov

Modin 0.16.0

05 Oct 19:50
621bc10

This release includes support for pandas 1.5, support for the latest version of Dask, and backwards compatibility with Python 3.6 and pandas 1.1. Additionally, it includes many performance enhancements, bug fixes, and documentation improvements.

Key Features and Updates

  • Stability and Bugfixes
    • FIX-#4570: Replace np.bool -> np.bool_ (#4571)
    • FIX-#4543: Fix read_csv in case skiprows=<0, []> (#4544)
    • FIX-#4059: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
    • FIX-#4589: Pin protobuf<4.0.0 to fix ray (#4590)
    • FIX-#4577: Set attribute of Modin dataframe to updated value (#4588)
    • FIX-#4411: Fix binary_op between datetime64 Series and pandas timedelta (#4592)
    • FIX-#4604: Fix groupby + agg in case when multicolumn can arise (#4642)
    • FIX-#4582: Inherit custom log layer (#4583)
    • FIX-#4639: Fix storage_options usage for read_csv and read_csv_glob (#4644)
    • FIX-#4593: Ensure Modin warns when setting columns via attributes (#4621)
    • FIX-#4584: Enable pdb debug when running cloud tests (#4585)
    • FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set (#4603)
    • FIX-#4641: Reindex pandas partitions in df.describe() (#4651)
    • FIX-#2064: Fix iloc/loc assignment when dataframe is empty (#4677)
    • FIX-#4634: Check for FrozenList as by in df.groupby() (#4667)
    • FIX-#4680: Fix read_csv that started defaulting to pandas again in case of reading from a buffer and when a buffer has a non-zero starting position (#4681)
    • FIX-#4491: Wait for all partitions in parallel in benchmark mode (#4656)
    • FIX-#4358: MultiIndex loc shouldn't drop levels for full-key lookups (#4608)
    • FIX-#4658: Expand exception handling for read_* functions from s3 storages (#4659)
    • FIX-#4672: Fix incorrect warning when setting frame.index or frame.columns (#4721)
    • FIX-#4686: Propagate metadata and drain call queue in unwrap_partitions (#4697)
    • FIX-#4652: Support categorical data in from_dataframe (#4737)
    • FIX-#4756: Correctly propagate storage_options in read_parquet (#4764)
    • FIX-#4657: Use fsspec for handling s3/http-like paths instead of s3fs (#4710)
    • FIX-#4676: drain sub-virtual-partition call queues (#4695)
    • FIX-#4782: Exclude certain non-parquet files in read_parquet (#4783)
    • FIX-#4808: Set dtypes correctly after column rename (#4809)
    • FIX-#4811: Apply dataframe -> not_dataframe functions to virtual partitions (#4812)
    • FIX-#4099: Use mangled column names but keep the original when building frames from arrow (#4767)
    • FIX-#4838: Bump up modin-spreadsheet to latest master (#4839)
    • FIX-#4840: Change modin-spreadsheet version for notebook requirements (#4841)
    • FIX-#4835: Handle Pathlike paths in read_parquet (#4837)
    • FIX-#4872: Stop checking the private ray mac memory limit (#4873)
    • FIX-#4914: base_lengths should be computed from base_frame instead of self in copartition (#4915)
    • FIX-#4848: Fix rebalancing partitions when NPartitions == 1 (#4874)
    • FIX-#4927: Fix dtypes computation in dataframe.filter (#4928)
    • FIX-#4907: Implement radd for Series and DataFrame (#4908)
    • FIX-#4945: Fix _take_2d_positional that loses indexes due to filtering empty dataframes (#4951)
    • FIX-#4818, PERF-#4825: Fix where by using the new n-ary operator (#4820)
    • FIX-#3983, FIX-#4107: Materialize 'rowid' columns when selecting rows by position (#4834)
    • FIX-#4845: Fix KeyError from __getitem_bool for single row dataframes (#4845)
    • FIX-#4734: Handle Series.apply when return type is a DataFrame (#4830)
    • FIX-#4983: Set frac to None in _sample when n=0 (#4984)
    • FIX-#4993: Return _default_to_pandas in df.attrs (#4995)
    • FIX-#5043: Fix execute function in ASV utils failing if len(partitions) == 0 (#5044)
    • FIX-#4597: Refactor Partition handling of func, args, kwargs (#4715)
    • FIX-#4996: Evaluate BenchmarkMode at each function call (#4997)
    • FIX-#4022: Fixed empty data frame with index (#4910)
    • FIX-#4090: Fixed check if the index is trivial (#4936)
    • FIX-#4966: Fix to_timedelta to return Series instead of TimedeltaIndex (#5028)
    • FIX-#5042: Fix series getitem with invalid strings (#5048)
    • FIX-#4691: Fix binary operations between virtual partitions (#5049)
    • FIX-#5045: Fix ray virtual_partition.wait with duplicate object refs (#5058)
  • Performance enhancements
    • PERF-#4182: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
    • PERF-#4288: Improve perf of groupby.mean for narrow data (#4591)
    • PERF-#4772: Remove df.copy call from from_pandas since it is not needed for Ray and Dask (#4781)
    • PERF-#4325: Improve perf of multi-column assignment in __setitem__ when no new column names are assigning (#4455)
    • PERF-#3844: Improve perf of drop operation (#4694)
    • PERF-#4727: Improve perf of concat operation (#4728)
    • PERF-#4705: Improve perf of arithmetic operations between Series objects with shared .index (#4689)
    • PERF-#4703: Improve performance in accessing ser.cat.categories, ser.cat.ordered, and ser.__array_priority__ (#4704)
    • PERF-#4305: Parallelize read_parquet over row groups (#4700)
    • PERF-#4773: Compute lengths and widths in put method of Dask partition like Ray does (#4780)
    • PERF-#4732: Avoid overwriting already-evaluated PandasOnRayDataframePartition._length_cache and PandasOnRayDataframePartition._width_cache (#4754)
    • PERF-#4862: Don't call compute_sliced_len.remote when row_labels/col_labels == slice(None) (#4863)
    • PERF-#4713: Stop overriding the ray MacOS object store size limit (#4792)
    • PERF-#4944: Avoid default_to_pandas in Series.cat.codes, Series.dt.tz, and Series.dt.to_pytimedelta (#4833)
    • PERF-#4851: Compute dtypes for binary operations that can only return bool type and the right operand is not a Modin object (#4852)
    • PERF-#4842: copy should not trigger any previous computations (#4843)
    • PERF-#4849: Compute dtypes in concat also for ROW_WISE case when possible (#4850)
    • PERF-#4929: Compute dtype when using Series.dt accessor (#4930)
    • PERF-#4892: Compute lengths in rebalance_partitions when possible (#4893)
    • PERF-#4794: Compute caches in _propagate_index_objs (#4888)
    • PERF-#4860: PandasDataframeAxisPartition.deploy_axis_func should be serialized only once (#4861)
    • PERF-#4890: PandasDataframeAxisPartition.drain should be serialized only once (#4891)
    • PERF-#4870: Avoid index materialization in __getattribute__ and __getitem__ (#4911)
    • PERF-#4886: Use lazy index and columns evaluation in query method (#4887)
    • PERF-#4866: iloc function that is used in partition.mask should be serialized only once (#4901)
    • PERF-#4920: Avoid index and cache computations in take_2d_labels_or_positional unless they are needed (#4921)
    • PERF-#4999: don't call apply in virtual partition's drain_call_queue if call_queue is empty (#4975)
    • PERF-#4268: Implement partition-parallel getitem for bool Series masks (#4753)
    • PERF-#5017: reset_index shouldn't trigger index materialization if possible (#5018)
    • PERF-#4963: Use partition width/length methods instead of _compute_axis_labels_and_lengths if index is already known (#4964)
    • PERF-#4940: Optimize categorical dtype check in concatenate (#4953)
  • Benchmarking enhancements
    • TEST-#5066: Add outer join case for TimeConcat benchmark (#5067)
    • TEST-#5083: Add merge op with categorical data (#5084)
    • FEAT-#4706: Add Modin ClassLogger to PandasDataframePartitionManager (#4707)
    • TEST-#5014: Simplify adding new ASV benchmarks (#5015)
    • TEST-#5064: Update TimeConcat benchmark with new parameter ignore_index (#5065)
    • TEST-#5068: Add binary op benchmark for Series (#5069)
  • Refactor Codebase
    • REFACTOR-#4530: Standardize access to physical data in partitions (#4563)
    • REFACTOR-#4534: Replace logging meta class with class decorator (#4535)
    • REFACTOR-#4708: Delete combine dtypes (#4709)
    • REFACTOR-#4629: Add type annotations to modin/config (#4685)
    • REFACTOR-#4717: Improve PartitionMgr.get_indices() usage (#4718)
    • REFACTOR-#4730: make Indexer immutable (#4731)
    • REFACTOR-#4774: remove _build_treereduce_func call from _compute_dtypes (#4775)
    • REFACTOR-#4750: Delete BaseDataframeAxisPartition.shuffle (#4751)
    • REFACTOR-#4722: Stop suppressing undefined name lint (#4723)
    • REFACTOR-#4832: unify split_result_of_axis_func_pandas (#4831)
    • REFACTOR-#4796: Introduce constant for reduced column name (#4799)
    • REFACTOR-#4000: Remove code duplication for PandasOnRayDataframePartitionManager (#4895)
    • REFACTOR-#3780: Remove code duplication for PandasOnDaskDataframe (#3781)
    • REFACTOR-#4530: Unify access to physical data for any partition type (#4829)
    • REFACTOR-#4978: Align modin/core/execution/dask/common/__init__.py with modin/core/execution/ray/common/__init__.py (#4979)
    • REFACTOR-#4949: Remove code duplication in default2pandas/dataframe.py and default2pandas/any.py (#4950)
    • REFACTOR-#4976: Rename RayTask to RayWrapper in accordance with Dask (#4977)
    • REFACTOR-#4885: De-duplicated take_2d_labels_or_positional methods (#4883)
    • REFACTOR-#5005: Use finalize method instead of list comprehension + drain_call_queue (#5006)
    • REFACTOR-#5001: Remove jenkins stuff (#5002)
    • REFACTOR-#5026: Change exception names to simplify grepping (#5027)
    • REFACTOR-#4970: Rewrite base implementations of a partition's width/length (#4971)
    • REFACTOR-#4942: Remove call method in favor of register due to duplication (#4943)
    • REFACTOR-#4922: Helpers for take_2d_labels_or_positional (#4865)
    • REFACTOR-#5024: Make _row_lengths and `_column...

Modin 0.15.3

07 Sep 16:41
138a954

This release adds support for pandas 1.4.4 and includes a bunch of
bugfixes.

Key Features and Updates

  • Stability and Bugfixes
    • FIX-#4593: Ensure Modin warns when setting columns via attributes (#4621)
    • FIX-#4604: Fix groupby + agg in case when multicolumn can arise (#4642)
    • FIX-#4641: Reindex pandas partitions in df.describe() (#4651)
    • FIX-#4634: Check for FrozenList as by in df.groupby() (#4667)
    • FIX-#2064: Fix iloc/loc assignment when dataframe is empty (#4677)
    • FIX-#4658: Expand exception handling for read_* functions from s3 storages (#4659)
    • FIX-#4672: Fix incorrect warning when setting frame.index or frame.columns (#4721)
    • FIX-#4686: Propagate metadata and drain call queue in unwrap_partitions (#4697)
    • FIX-#4680: Fix read_csv that started defaulting to pandas again in case of reading from a buffer and when a buffer has a non-zero starting position (#4681)
    • FIX-#4808: Set dtypes correctly after column rename (#4809)
    • FIX-#4811: Apply dataframe -> not_dataframe functions to virtual partitions (#4812)
    • FIX-#4848: Fix rebalancing partitions when NPartitions == 1 (#4874)
    • FIX-#4838: Bump up modin-spreadsheet to latest master (#4839)
    • FIX-#4840: Change modin-spreadsheet version for notebook requirements (#4841)
    • FIX-#4657: Use fsspec for handling s3/http-like paths instead of s3fs (#4710)
    • FIX-#4639: Fix storage_options usage for read_csv and read_csv_glob (#4644)
  • Update testing suite
    • TEST-#4875: XFail tests failing due to file gone missing (#4876)
  • Dependencies

Contributors

@helmeleegy
@YarShev
@anmyachev
@pyrito
@prutskov
@jbrockmendel
@mvashishtha
@RehanSD
@vnlitvinov

Modin 0.15.2

25 Jun 00:03
0.15.2

This release adds support for pandas 1.4.3, pins protobuf < 4.0.0 to ensure compatibility with
ray < 1.13, and includes a bugfix for modifying columns via attribute access.

Key Features and Updates

  • Stability and Bugfixes
    • FIX-#4589: Pin protobuf<4.0.0 to fix ray (#4590)
    • FIX-#4577: Set attribute of Modin dataframe to updated value (#4588)
  • Dependencies
    • FEAT-#4598: Add support for pandas 1.4.3 (#4599)

Contributors

@mvashishtha
@pyrito
@RehanSD