Feature request: Series.flatmap, DataFrame.flatmap #8517

Closed
kay1793 opened this issue Oct 9, 2014 · 21 comments
Labels
Reshaping · Usage Question

Comments

@kay1793

kay1793 commented Oct 9, 2014

I'm working on some language analysis, using pandas to munge the data and grab some descriptive stats. This is just an illustrative example; I'm doing all kinds of slightly different things.

Suppose I have a series containing chunks of text, and I want to turn each row into multiple rows, preserving the index values. Here are the naive results:

In [53]: s=pd.Series(['This is text No 1.', 'and here is no. 2','and 3'],index=['Alice','Bob','Alice'])
    ...: s
Out[53]: 
Alice    This is text No 1.
Bob       and here is no. 2
Alice                 and 3
dtype: object

In [54]: s.map(lambda x: x.split(' '))
Out[54]: 
Alice    [This, is, text, No, 1.]
Bob       [and, here, is, no., 2]
Alice                    [and, 3]
dtype: object

In [55]: s.apply(lambda x: pd.Series(x.split(' ')))
Out[55]: 
          0     1     2    3    4
Alice  This    is  text   No   1.
Bob     and  here    is  no.    2
Alice   and     3   NaN  NaN  NaN

What I'd like to be able to do is this (made-up example):

In [67]: s.flatmap(lambda x: x.split(' '))
Out[67]: 
Alice    This
Alice      is
Alice    text
Alice      No
Alice      1.
Bob       and
Bob      here
Bob        is
Bob       no.
Bob         2
Alice     and
Alice       3
dtype: object

In general, I'd like to be able to explode a single row in a dataframe into multiple rows by transforming one column value into multiple values, each becoming a new row with the values of the other columns preserved. For example:

In [69]: df=pd.DataFrame([['2014-01-01','Alice',"A B"],['2014-01-02','Bob','C D']],columns=['dt','name','text'])
    ...: df
Out[69]: 
           dt   name text
0  2014-01-01  Alice  A B
1  2014-01-02    Bob  C D

In [70]: df.flatmap(lambda x: x.split(), on='text')
Out[70]: 
           dt   name text
0  2014-01-01  Alice  A
1  2014-01-01  Alice  B
2  2014-01-01    Bob  C
3  2014-01-01    Bob  D

Perhaps there's another way to do this, but that's how my natural instinct suggests it should be done; flatmap is a fairly universal concept.
Groupby already does similar things based on return type. It doesn't have to be limited to groupby, though.

@jreback
Contributor

jreback commented Oct 9, 2014

This is pretty straightforward (and pretty fast) to do directly. #8428 will make the first part (the .split()) slightly easier, as it will return a frame directly:

In [8]: s.str.split(r'\s+').apply(pd.Series).unstack().dropna()
Out[8]: 
0  Alice    This
   Bob       and
   Alice     and
1  Alice      is
   Bob      here
   Alice       3
2  Alice    text
   Bob        is
3  Alice      No
   Bob       no.
4  Alice      1.
   Bob         2
dtype: object

@jreback added the Usage Question and Reshaping labels Oct 9, 2014
@kay1793
Author

kay1793 commented Oct 10, 2014

s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'])
s.str.split(r'\s+').apply(pd.Series).unstack().T.dropna().reset_index(1, drop=True)
Out[36]: 
0    This
0     and
0     and
1      is
1    here
1       3
2    text
2      is
3      No
3     no.
4      1.
4       2
dtype: object

That does work with slight tweaking, thanks. However, I have about 15 MB of text chunks, and this doesn't seem very fast compared with split()ting a text string of that size; it's orders of magnitude slower.
Actually, it exhausts my box's memory when I try to run it on all of them, consuming several GB.
If one string has an unusually large word count, this generates tons of useless NaNs for every shorter string. If it can't handle 15 MB of data... is there a less hacky way?

This seems like a natural operation to have supported. In map-reduce terms, pandas is great at processing rows, but converting one row into multiple rows (the map stage) just isn't there as a stock operation.

@jreback
Contributor

jreback commented Oct 10, 2014

Not really sure what you are doing. If you don't need the indices, then just iterate, split, and concatenate.

@immerrr
Contributor

immerrr commented Oct 10, 2014

As much as I like the flatMap interface, I'm not sure how you would apply it to a multi-column DataFrame, except for probably replicating whole rows (is that operation useful?).

For a Series of strings, there's nothing to vectorize, and unless you (ab)use a character-wise representation of strings (which you can't do with pandas), there's not much you can do to beat the performance of something along the lines of:

In [38]: s = pd.Series(['This is text No 1.', 'and here is no. 2','and 3']); s
Out[38]: 
0    This is text No 1.
1     and here is no. 2
2                 and 3
dtype: object

In [39]: mapper = lambda x: pd.lib.list_to_object_array(x.split())            

In [40]: vals = [(k, mapper(v)) for k, v in s.iteritems()]                    

In [41]: pd.Series(                                                           
    np.concatenate([v[1] for v in vals]),
    np.concatenate([np.repeat(v[0], len(v[1])) for v in vals]),
    name=s.name)
Out[41]: 
0    This
0      is
0    text
0      No
0      1.
1     and
1    here
1      is
1     no.
1       2
2     and
2       3
dtype: object

@kay1793
Author

kay1793 commented Oct 10, 2014

"is that operation useful?" it is for what I'm trying to do, I'd argure for text data it would be very useful. Perhaps with the index gaining another level to numerate the generated sub-results within the original index label.

Thanks for the example; lots of footwork, but it doesn't throw away memory like jreback's hack. You're both basically saying: do it somewhere else, then throw it in a DataFrame. Not awesome, but of course always an option.

groupby.apply() converts returned series into multiple rows, doesn't it? Why can't pandas have similar functionality for a DataFrame apply-style operation?
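
A minimal sketch of that groupby behavior: when apply returns a Series per group, the results are concatenated with the group key as an outer index level, which is flatmap-like when each group is a single row (the split here is just for illustration):

import pandas as pd

s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'])
# grouping by a unique index makes each group a single row; apply
# returning a Series then expands that row into multiple output rows,
# with the group key prepended as an outer index level
s.groupby(level=0).apply(lambda g: pd.Series(g.iloc[0].split()))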

@jreback
Contributor

jreback commented Oct 10, 2014

@kay1793 why don't you show a copy/pastable, end-to-end example of what you are actually trying to do. Are you trying to, say, get word counts? Or a dummies matrix? (That's what my soln does, but if you have a large word frequency it WILL blow up memory; that assumes you want to track the index, which apparently you don't.)

@immerrr's soln is just fine for what you are showing. But then you are not harnessing the power of pandas at all; it's just list manipulation.

An example is worth 1000's of words.

@kay1793
Author

kay1793 commented Oct 10, 2014

Fair enough.

How do I get from this:

In [8]: df=pd.DataFrame([['2014-01-01T00:01:00','Alice',"It was a dark and stormy night. Alice and bob were at their computer"],
   ...:                  ['2014-01-02T00:02:23','Bob',"Detective Morgan was 9 feet tall. Then the mirror broke. He cut his left foot on the shards"]],columns=['ts','name','text'])
   ...: df

to this:

In [10]: df=pd.DataFrame([['2014-01-01T00:01:00','Alice',"It was a dark and stormy night"],
    ...:                  ['2014-01-01T00:01:00','Alice',"Alice and bob were at their computer"],
    ...:                  ['2014-01-02T00:02:23','Bob',"Detective Morgan was 9 feet tall"],
    ...:                  ['2014-01-02T00:02:23','Bob',"Then the mirror broke"],
    ...:                  ['2014-01-02T00:02:23','Bob',"He cut his left foot on the shards"]],
    ...:                  columns=['ts','name','text'],index=pd.MultiIndex.from_arrays([[0,0,1,1,1],[0,1,0,1,2]]))
    ...: df
    ...: 
Out[10]: 
                      ts   name                                  text
0 0  2014-01-01T00:01:00  Alice        It was a dark and stormy night
  1  2014-01-01T00:01:00  Alice  Alice and bob were at their computer
1 0  2014-01-02T00:02:23    Bob     Detective Morgan was 9 feet tall 
  1  2014-01-02T00:02:23    Bob                 Then the mirror broke
  2  2014-01-02T00:02:23    Bob    He cut his left foot on the shards

The index is unique; nothing else is, in general. One chunk may have 50 (or 1000) paragraphs, others just one. Once I get it in this form, I can easily do summary stats and selection queries, looking for patterns. Basically, I'm using a pandas dataframe as my corpus.

@immerrr
Contributor

immerrr commented Oct 11, 2014

But then you are not harnessing the power of pandas at all. Its just list manipulation.

Which was exactly my point: if we're doing native object manipulation, there's not much pandas can help with performance-wise.

As for the example, it's helpful indeed. So the desired output is equivalent to taking out that one column, flatmapping it as a Series while adding an integer index level, and joining it back with the rest of the dataframe on the original index.
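
A sketch of that recipe, assuming df is the ts/name/text frame from the previous comment and that the parts are delimited by '. ' (the names here are illustrative, not a proposed API):

import pandas as pd

# split each text cell, remembering (original row label, part number)
triples = [(i, j, part)
           for i, text in df['text'].items()
           for j, part in enumerate(text.split('. '))]
orig, sub, parts = zip(*triples)

# repeat the non-text columns once per part, then attach the integer
# sub-level and the flattened text
out = df.drop('text', axis=1).reindex(list(orig))
out.index = pd.MultiIndex.from_arrays([list(orig), list(sub)])
out['text'] = list(parts)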

@kay1793
Author

kay1793 commented Oct 11, 2014

Which was exactly my point: if we're doing native object manipulation, there's not much pandas can help with performance-wise.

I think the poor performance of solution #1 was due to bad complexity; otherwise, non-vectorized is good enough for my data sizes.

the desired output is equivalent to <...>

Yes, but with 2 sugars please.

I've come to think of pandas as THE way to work with data, and now that I'm working with a corpus, it lacks some operations that seem generally useful and common. The groupby point I made earlier seems solid to me; I hope you'll agree.

@jreback
Contributor

jreback commented Oct 20, 2015

Closing, but pls reopen if the issue persists.

@jreback closed this as completed Oct 20, 2015
@rcompton

rcompton commented Feb 2, 2016

Just came across this and I think it's worth reopening.

Some sugar for flatmap/explode would be nice. There are several workarounds circulating, and none of them are quite the same.

@jreback
Contributor

jreback commented Feb 12, 2016

We would take a cookbook entry of the recipes (the one in https://stackoverflow.com/questions/31080258/pysparks-flatmap-in-pandas is most idiomatic).
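
One recipe along those lines, a sketch of a generic Series flatmap (flatmap_series is a hypothetical name, not necessarily the exact code in the linked answer):

import itertools
import pandas as pd

def flatmap_series(s, func):
    # func maps a value to an iterable of values; each index label is
    # repeated once per emitted value, mirroring Spark's flatMap
    chunks = [list(func(v)) for v in s]
    index = s.index.repeat([len(c) for c in chunks])
    return pd.Series(list(itertools.chain.from_iterable(chunks)), index=index)

# e.g. flatmap_series(s, str.split) yields one word per row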

@timotta

timotta commented Sep 17, 2016

This is how I'm doing it for now. It's not the most performant way, but it works:

import pandas as pd

def flatmap(df, func):
    # func maps one row (as a {column: value} dict) to a list of
    # output-row dicts
    rows = []
    cols = df.columns
    for r in df.values:
        row = dict(zip(cols, r))  # rebuild {column: value} for this row
        rows.extend(func(row))
    return pd.DataFrame.from_records(rows)

# example: one output row per genre in the row's genreList
flatfunc = lambda r: [{"userId": r["userId"], "rating": r["rating"], "genre": genre}
                      for genre in r["genreList"]]
flattened = flatmap(df, flatfunc)

@wesm
Member

wesm commented Sep 17, 2016

I'm supportive of adding a flatmap function to pandas. Even more so, I would like to have a first-class list/array type.

@rcalsaverini

rcalsaverini commented Oct 11, 2016

Is this being considered? I think having a flatMap or bind operation would be very interesting and useful (besides being theoretically satisfying for people with a functional programming background, but that isn't really an argument).

I had a use-case today. I have a dataframe of events, with start time and end time:

event  start_time           end_time
A      2014-01-01 12:00:00  2014-01-01 13:45:00
B      2014-01-01 12:45:00  2014-01-01 12:55:00
...    ...                  ...

And we'd like to transform this into a dataframe that has a row for each minute during which an event was actually happening:

event  time
A      2014-01-01 12:00:00
A      2014-01-01 12:01:00
A      2014-01-01 12:02:00
A      2014-01-01 12:03:00
...    ...
A      2014-01-01 13:44:00
A      2014-01-01 13:45:00
B      ...

There are lots of ways to do this, but if we had a flatMap method this would be almost trivial.

def make_time_table(row):
    return pandas.DataFrame(
        {'event': row['event'], 'time': time}
        for time in pandas.date_range(...)
    )

df.flatMap(make_time_table)

Or something like it.
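
A runnable version of that sketch, assuming start_time and end_time are already datetime64 columns and events are minute-aligned (the date_range arguments are my guess at what was elided above):

import pandas as pd

def make_time_table(row):
    return pd.DataFrame({
        'event': row['event'],
        # one output row per minute from start to end, inclusive
        'time': pd.date_range(row['start_time'], row['end_time'], freq='min'),
    })

# without a flatMap method, concat over iterrows fills the same role
expanded = pd.concat([make_time_table(row) for _, row in df.iterrows()],
                     ignore_index=True)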

@rcompton

I've been using @timotta's function above, which has been surprisingly fast.

@hkhatod

hkhatod commented Dec 19, 2017

I have long text in one of the dataframe columns. I need to be able to split the text into 2 or more columns, depending on the length of that cell.

@allComputableThings

Was this closed because of opposition to storing lists in dataframes/series?

I don't think flatmap requires that, and the lack of flatmap is a fairly big problem with non-trivial and non-performant workarounds. (Scala, Spark, and many other functional systems have it as a first-class operation.)

@jreback
Contributor

jreback commented Mar 14, 2019

see #16538, which has a PR associated with it
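
For later readers: that work shipped in pandas 0.25 as Series.explode and DataFrame.explode, which cover the examples in this thread, e.g.:

import pandas as pd

s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'],
              index=['Alice', 'Bob', 'Alice'])
s.str.split().explode()   # one word per row, index labels repeated

df = pd.DataFrame([['2014-01-01', 'Alice', 'A B'],
                   ['2014-01-02', 'Bob', 'C D']],
                  columns=['dt', 'name', 'text'])
df.assign(text=df['text'].str.split()).explode('text')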

@tantofish

tantofish commented May 31, 2019

This is how I'm doing it for now. It's not the most performant way, but it works:

import pandas as pd

def flatmap(df, func):
    # func maps one row (as a {column: value} dict) to a list of
    # output-row dicts
    rows = []
    cols = df.columns
    for r in df.values:
        row = dict(zip(cols, r))  # rebuild {column: value} for this row
        rows.extend(func(row))
    return pd.DataFrame.from_records(rows)

# example: one output row per genre in the row's genreList
flatfunc = lambda r: [{"userId": r["userId"], "rating": r["rating"], "genre": genre}
                      for genre in r["genreList"]]
flattened = flatmap(df, flatfunc)

Works great for me. Thanks @timotta!

Slightly modified to simplify it and to be more similar to the original poster's desired syntax:

def flatmap(self, func):
    # func maps one row (as a Series) to a list of output-row records
    rows = []
    for idx, row in self.iterrows():
        rows.extend(func(row))
    return pd.DataFrame.from_records(rows)

pd.DataFrame.flatmap = flatmap  # monkey-patch so df.flatmap(...) works

df.flatmap(lambda x: ....)
