
Support Pandas==2.x in Apache Beam. #27221

Closed
2 of 15 tasks
JorgeCardona opened this issue Jun 22, 2023 · 6 comments · Fixed by #28422, #28454, #28499 or #28636
Labels: done & done (issue has been reviewed after it was closed for verification, followups, etc.), new feature, P2, python

Comments

@JorgeCardona

JorgeCardona commented Jun 22, 2023

What happened?

Beam doesn't work with pandas>=2.0.0

Reproducible Example

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()
pipeline = beam.Pipeline(options=options)
data = pipeline | 'CreateData' >> beam.Create(['Pandas, 2.0,2'])
data | 'PrintData' >> beam.Map(print)
pipeline.run()

Issue Description

Pipeline fails with a runtime error: `AttributeError: type object 'Series' has no attribute 'append'`. Sample stacktrace:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 8
      4 options = PipelineOptions()
      6 pipeline = beam.Pipeline(options=options)
----> 8 data = pipeline | 'CreateData' >> beam.Create(['Pandas, 2.0,2'])
     10 data | 'PrintData' >> beam.Map(print)
     11 # Run the pipeline

File /usr/local/lib/python3.11/site-packages/apache_beam/transforms/ptransform.py:1092, in _NamedPTransform.__ror__(self, pvalueish, _unused)
   1091 def __ror__(self, pvalueish, _unused=None):
-> 1092   return self.transform.__ror__(pvalueish, self.label)

File /usr/local/lib/python3.11/site-packages/apache_beam/transforms/ptransform.py:614, in PTransform.__ror__(self, left, label)
    612 pvalueish = _SetInputPValues().visit(pvalueish, replacements)
    613 self.pipeline = p
--> 614 result = p.apply(self, pvalueish, label)
    615 if deferred:
    616   return result

File /usr/local/lib/python3.11/site-packages/apache_beam/pipeline.py:666, in Pipeline.apply(self, transform, pvalueish, label)
    664 old_label, transform.label = transform.label, label
    665 try:
--> 666   return self.apply(transform, pvalueish)
    667 finally:
    668   transform.label = old_label

File /usr/local/lib/python3.11/site-packages/apache_beam/pipeline.py:674, in Pipeline.apply(self, transform, pvalueish, label)
    670 # Attempts to alter the label of the transform to be applied only when it's
    671 # a top-level transform so that the cell number will not be prepended to
    672 # every child transform in a composite.
    673 if self._current_transform() is self._root_transform():
--> 674   alter_label_if_ipython(transform, pvalueish)
    676 full_label = '/'.join(
    677     [self._current_transform().full_label, label or
    678      transform.label]).lstrip('/')
    679 if full_label in self.applied_labels:

File /usr/local/lib/python3.11/site-packages/apache_beam/utils/interactive_utils.py:71, in alter_label_if_ipython(transform, pvalueish)
     59 """Alters the label to an interactive label with ipython prompt metadata
     60 prefixed for the given transform if the given pvalueish belongs to a
     61 user-defined pipeline and current code execution is within an ipython kernel.
   (...)
     68 `Cell {prompt}: {original_label}`.
     69 """
     70 if is_in_ipython():
---> 71   from apache_beam.runners.interactive import interactive_environment as ie
     72   # Tracks user defined pipeline instances in watched scopes so that we only
     73   # alter labels for any transform to pvalueish belonging to those pipeline
     74   # instances, excluding any transform to be applied in other pipeline
     75   # instances the Beam SDK creates implicitly.
     76   ie.current_env().track_user_pipelines()

File /usr/local/lib/python3.11/site-packages/apache_beam/runners/interactive/interactive_environment.py:41
     39 from apache_beam.runners.direct import direct_runner
     40 from apache_beam.runners.interactive import cache_manager as cache
---> 41 from apache_beam.runners.interactive.messaging.interactive_environment_inspector import InteractiveEnvironmentInspector
     42 from apache_beam.runners.interactive.recording_manager import RecordingManager
     43 from apache_beam.runners.interactive.sql.sql_chain import SqlChain

File /usr/local/lib/python3.11/site-packages/apache_beam/runners/interactive/messaging/interactive_environment_inspector.py:26
     23 # pytype: skip-file
     25 import apache_beam as beam
---> 26 from apache_beam.runners.interactive.utils import as_json
     27 from apache_beam.runners.interactive.utils import obfuscate
     30 class InteractiveEnvironmentInspector(object):

File /usr/local/lib/python3.11/site-packages/apache_beam/runners/interactive/utils.py:33
     30 import pandas as pd
     32 import apache_beam as beam
---> 33 from apache_beam.dataframe.convert import to_pcollection
     34 from apache_beam.dataframe.frame_base import DeferredBase
     35 from apache_beam.internal.gcp import auth

File /usr/local/lib/python3.11/site-packages/apache_beam/dataframe/convert.py:33
     31 from apache_beam.dataframe import expressions
     32 from apache_beam.dataframe import frame_base
---> 33 from apache_beam.dataframe import transforms
     34 from apache_beam.dataframe.schemas import element_typehint_from_dataframe_proxy
     35 from apache_beam.dataframe.schemas import generate_proxy

File /usr/local/lib/python3.11/site-packages/apache_beam/dataframe/transforms.py:33
     31 from apache_beam import transforms
     32 from apache_beam.dataframe import expressions
---> 33 from apache_beam.dataframe import frames  # pylint: disable=unused-import
     34 from apache_beam.dataframe import partitionings
     35 from apache_beam.utils import windowed_value

File /usr/local/lib/python3.11/site-packages/apache_beam/dataframe/frames.py:1231
   1224       return func(*args, **kwargs)
   1226     return func(self, *args, **kwargs)
   1229 @populate_not_implemented(pd.Series)
   1230 @frame_base.DeferredFrame._register_for(pd.Series)
-> 1231 class DeferredSeries(DeferredDataFrameOrSeries):
   1232   def __repr__(self):
   1233     return (
   1234         f'DeferredSeries(name={self.name!r}, dtype={self.dtype}, '
   1235         f'{self._render_indexes()})')

File /usr/local/lib/python3.11/site-packages/apache_beam/dataframe/frames.py:1338, in DeferredSeries()
   1331 transpose = frame_base._elementwise_method('transpose', base=pd.Series)
   1332 shape = property(
   1333     frame_base.wont_implement_method(
   1334         pd.Series, 'shape', reason="non-deferred-result"))
   1336 @frame_base.with_docs_from(pd.Series)
   1337 @frame_base.args_to_kwargs(pd.Series)
-> 1338 @frame_base.populate_defaults(pd.Series)
   1339 def append(self, to_append, ignore_index, verify_integrity, **kwargs):
   1340   """``ignore_index=True`` is not supported, because it requires generating an
   1341   order-sensitive index."""
   1342   if not isinstance(to_append, DeferredSeries):

File /usr/local/lib/python3.11/site-packages/apache_beam/dataframe/frame_base.py:600, in populate_defaults.<locals>.wrap(func)
    599 def wrap(func):
--> 600   base_argspec = getfullargspec(unwrap(getattr(base_type, func.__name__)))
    601   if not base_argspec.defaults:
    602     return func

AttributeError: type object 'Series' has no attribute 'append'
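
The root of the traceback can be reproduced without a pipeline: `frame_base.populate_defaults` inspects `pd.Series.append` at import time, roughly as sketched below (a simplified illustration, not Beam's exact code):

```python
import inspect
import pandas as pd

# frame_base.populate_defaults does roughly this while the
# DeferredSeries class body is being defined:
attr = getattr(pd.Series, "append", None)

if attr is None:
    # pandas >= 2.0 removed Series.append, so Beam's unguarded
    # getattr(pd.Series, "append") raises AttributeError during
    # `import apache_beam` (via the interactive-runner import chain).
    print("Series.append was removed in this pandas version")
else:
    # pandas < 2.0: the argspec lookup succeeds.
    print(inspect.getfullargspec(attr))
```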

Expected Behavior

The code should produce the same result as it does with pandas 1.5.3.

Installed Versions

Replace this line with the output of pd.show_versions()

Version that fails: pandas 2.0.2

Version that works: pandas 1.5.3

Response when I reported this issue on the pandas project:

pandas-dev/pandas#53799 (comment)

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@AnandInguva
Contributor

Yes, Apache Beam doesn't yet support pandas 2.x, since pandas 2.x removed `append` from `Series` and `DataFrame`.
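
For context, a minimal sketch of the migration pandas itself recommends for user code (plain pandas, not Beam's deferred API): `append` calls become `pd.concat`:

```python
import pandas as pd

s1 = pd.Series([1, 2])
s2 = pd.Series([3])

# pandas < 2.0 (removed in 2.0):
#   combined = s1.append(s2, ignore_index=True)

# Equivalent that works on both pandas 1.x and 2.x:
combined = pd.concat([s1, s2], ignore_index=True)
print(combined.tolist())  # [1, 2, 3]
```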

@AnandInguva AnandInguva added P2 and removed P1 labels Jun 23, 2023
@AnandInguva
Contributor

Moving this to P2 since it's a feature request rather than a bug.

@AnandInguva AnandInguva changed the title [Bug]: Incompatible apache-beam==2.48.0 with Pandas==2.0.2 Incompatible apache-beam==2.48.0 with Pandas==2.0.2 Jun 23, 2023
@tvalentyn tvalentyn changed the title Incompatible apache-beam==2.48.0 with Pandas==2.0.2 Support Pandas==2.x in Apache Beam. Jul 11, 2023
@tvalentyn
Contributor

Thanks for reporting! This is a known issue, but to my knowledge no one is actively working on it. Contributions to fix it, or further investigation into what it would take to support pandas 2.x, are welcome.

@tjni
Contributor

tjni commented Aug 6, 2023

I started working on this and could use some opinions on the best way to proceed.

  1. For methods that have been removed from `Series` or `DataFrame` in pandas, would we like to reimplement them? I presume so, but I am not sure what Beam's deprecation strategy is for these APIs.

  2. I found at least one method (`Series.mad` and `DataFrame.mad`) that I am not confident I can reimplement without missing something subtle. The migration guide is light on details and does not cover nuances like the behavior under the various argument values. I am getting closer to figuring it out from reading the source code, but I'd love to get some other unbiased opinions on how to handle this.
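
For item 2, the removed method computed the mean absolute deviation. A hypothetical reimplementation of the default case only (1-D numeric data, `skipna=True`, no `axis`/`level` handling) might look like:

```python
import pandas as pd

def mad(s: pd.Series) -> float:
    """Mean absolute deviation as Series.mad() computed it for the
    default arguments: the mean of |x - mean(x)|, skipping NaNs.

    This is a sketch only; it deliberately ignores the axis, level,
    and skipna options the original method accepted.
    """
    return (s - s.mean()).abs().mean()

print(mad(pd.Series([1.0, 2.0, 3.0, 4.0])))  # 1.0
```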

@caneff
Contributor

caneff commented Sep 12, 2023

.take-issue

Cleaning this up. The goal is that the Pandas API stays consistent with Pandas. So I'm gating removals on Pandas version.
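
A hypothetical sketch of that kind of version gating (the names and structure here are illustrative, not Beam's actual code):

```python
import pandas as pd

# Parse the installed major version once, e.g. "2.0.2" -> 2.
PD_MAJOR = int(pd.__version__.split(".")[0])

if PD_MAJOR < 2:
    # Only register deferred wrappers for APIs that still exist in
    # this pandas version, e.g. Series.append (removed in 2.0).
    def append(self, to_append, ignore_index=False, verify_integrity=False):
        ...  # wrap the underlying pd.Series.append here
```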

@tvalentyn
Contributor

Thanks a lot for stepping in to help with this effort, @caneff .
