Feature request: Series.flatmap, DataFrame.flatmap #8517

Closed
kay1793 opened this issue Oct 9, 2014 · 21 comments
Labels
Reshaping · Usage Question

Comments

@kay1793

kay1793 commented Oct 9, 2014

I'm working on some language analysis, using pandas to munge the data and grab some descriptive stats. This is just an illustrative example; I'm doing all kinds of slightly different things.

Suppose I have a series containing chunks of text, and I want to turn each row into multiple rows, preserving the index values. Here are the naive results:

In [53]: s=pd.Series(['This is text No 1.', 'and here is no. 2','and 3'],index=['Alice','Bob','Alice'])
    ...: s
Out[53]: 
Alice    This is text No 1.
Bob       and here is no. 2
Alice                 and 3
dtype: object

In [54]: s.map(lambda x: x.split(' '))
Out[54]: 
Alice    [This, is, text, No, 1.]
Bob       [and, here, is, no., 2]
Alice                    [and, 3]
dtype: object

In [55]: s.apply(lambda x: pd.Series(x.split(' ')))
Out[55]: 
          0     1     2    3    4
Alice  This    is  text   No   1.
Bob     and  here    is  no.    2
Alice   and     3   NaN  NaN  NaN

What I'd like to be able to do is this (made-up example):

In [67]: s.flatmap(lambda x: x.split(' '))
Out[67]: 
Alice    This
Alice      is
Alice    text
Alice      No
Alice      1.
Bob       and
Bob      here
Bob        is
Bob       no.
Bob         2
Alice     and
Alice       3
dtype: object

In general, I'd like to be able to explode a single row in a dataframe into multiple rows by transforming one column value into multiple values, each becoming a new row with the values of the other columns preserved. For example:

In [69]: df=pd.DataFrame([['2014-01-01','Alice',"A B"],['2014-01-02','Bob','C D']],columns=['dt','name','text'])
    ...: df
Out[69]: 
           dt   name text
0  2014-01-01  Alice  A B
1  2014-01-02    Bob  C D

In [70]: df.flatmap(lambda x: x.split(), on='text')
Out[70]: 
           dt   name text
0  2014-01-01  Alice  A
1  2014-01-01  Alice  B
2  2014-01-01    Bob  C
3  2014-01-01    Bob  D

Perhaps there's another way to do this, but that's how my natural instinct suggests it should be done; flatmap is a fairly universal concept.
Groupby already does similar things based on return type. It doesn't have to be limited to groupby, though.

@jreback
Contributor

jreback commented Oct 9, 2014

This is pretty straightforward (and pretty fast) to do directly. #8428 will make the first part (the .split()) slightly easier, as it will return a frame directly:

In [8]: s.str.split(r'\s+').apply(pd.Series).unstack().dropna()
Out[8]: 
0  Alice    This
   Bob       and
   Alice     and
1  Alice      is
   Bob      here
   Alice       3
2  Alice    text
   Bob        is
3  Alice      No
   Bob       no.
4  Alice      1.
   Bob         2
dtype: object

@jreback added the Usage Question and Reshaping labels Oct 9, 2014
@kay1793
Author

kay1793 commented Oct 10, 2014

s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'])
s.str.split(r'\s+').apply(pd.Series).unstack().T.dropna().reset_index(1, drop=True)
Out[36]: 
0    This
0     and
0     and
1      is
1    here
1       3
2    text
2      is
3      No
3     no.
4      1.
4       2
dtype: object

That does work with slight tweaking, thanks. However, I have about 15 MB of text chunks, and this doesn't seem very fast compared with split()ting a text string of that size; it's orders of magnitude slower.
Actually, it exhausts my box's memory when I try to run it on all of them, consuming several GB.
If one string has an unusually large word count, this generates tons of useless NaNs for every shorter string. If it can't handle 15 MB of data... is there a less hacky way?

This seems like a natural operation to have supported. In map-reduce terms, pandas is great at processing rows, but converting one row into multiple rows (the map stage) just isn't there as a stock operation.

@jreback
Contributor

jreback commented Oct 10, 2014

Not really sure what you are doing. If you don't need the indices, then just iterate, split, and concatenate.

@immerrr
Contributor

immerrr commented Oct 10, 2014

As much as I like the flatMap interface, I'm not sure how you would apply it to a multi-column DataFrame, except for probably replicating whole rows (is that operation useful?).

For a Series of strings, there's nothing to vectorize, and unless you (ab)use a character-wise representation of strings (which you can't do with pandas), there's not much you can do to beat the performance of something along the lines of:

In [38]: s = pd.Series(['This is text No 1.', 'and here is no. 2','and 3']); s
Out[38]: 
0    This is text No 1.
1     and here is no. 2
2                 and 3
dtype: object

In [39]: mapper = lambda x: pd.lib.list_to_object_array(x.split())            

In [40]: vals = [(k, mapper(v)) for k, v in s.iteritems()]                    

In [41]: pd.Series(                                                           
    np.concatenate([v[1] for v in vals]),
    np.concatenate([np.repeat(v[0], len(v[1])) for v in vals]),
    name=s.name)
Out[41]: 
0    This
0      is
0    text
0      No
0      1.
1     and
1    here
1      is
1     no.
1       2
2     and
2       3
dtype: object

@kay1793
Author

kay1793 commented Oct 10, 2014

"is that operation useful?" it is for what I'm trying to do, I'd argure for text data it would be very useful. Perhaps with the index gaining another level to numerate the generated sub-results within the original index label.

Thanks for the example; lots of footwork, but it doesn't throw away memory like jreback's hack. You're both basically saying: do it somewhere else, then throw it in a DataFrame. Not awesome, but of course always an option.

groupby.apply() converts returned series into multiple rows, doesn't it? Why can't pandas have similar functionality for a DataFrame apply-style operation?
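
A minimal sketch of that groupby behavior: when apply returns a Series per group, the results are concatenated with the group key as an outer index level, which is flatmap-like when each group is a single row (the split here is just for illustration):

import pandas as pd

s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'])
# grouping by a unique index makes each group a single row; apply
# returning a Series then expands that row into multiple output rows,
# with the group key prepended as an outer index level
s.groupby(level=0).apply(lambda g: pd.Series(g.iloc[0].split()))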

@jreback
Contributor

jreback commented Oct 10, 2014

@kay1793 why don't you show a copy/pastable, end-to-end example of what you are actually trying to do. Are you trying to, say, get word counts? Or a dummies matrix? (That's what my soln does, but if you have a large word frequency it WILL blow up memory; that assumes you want to track the index, which apparently you don't.)

@immerrr's soln is just fine for what you are showing. But then you are not harnessing the power of pandas at all; it's just list manipulation.

An example is worth 1000's of words.

@kay1793
Author

kay1793 commented Oct 10, 2014

Fair enough.

How do I get from this:

In [8]: df=pd.DataFrame([['2014-01-01T00:01:00','Alice',"It was a dark and stormy night. Alice and bob were at their computer"],
   ...:                  ['2014-01-02T00:02:23','Bob',"Detective Morgan was 9 feet tall. Then the mirror broke. He cut his left foot on the shards"]],columns=['ts','name','text'])
   ...: df

to this:

In [10]: df=pd.DataFrame([['2014-01-01T00:01:00','Alice',"It was a dark and stormy night"],
    ...:                  ['2014-01-01T00:01:00','Alice',"Alice and bob were at their computer"],
    ...:                  ['2014-01-02T00:02:23','Bob',"Detective Morgan was 9 feet tall"],
    ...:                  ['2014-01-02T00:02:23','Bob',"Then the mirror broke"],
    ...:                  ['2014-01-02T00:02:23','Bob',"He cut his left foot on the shards"]],
    ...:                  columns=['ts','name','text'],index=pd.MultiIndex.from_arrays([[0,0,1,1,1],[0,1,0,1,2]]))
    ...: df
    ...: 
Out[10]: 
                      ts   name                                  text
0 0  2014-01-01T00:01:00  Alice        It was a dark and stormy night
  1  2014-01-01T00:01:00  Alice  Alice and bob were at their computer
1 0  2014-01-02T00:02:23    Bob     Detective Morgan was 9 feet tall 
  1  2014-01-02T00:02:23    Bob                 Then the mirror broke
  2  2014-01-02T00:02:23    Bob    He cut his left foot on the shards

The index is unique; nothing else is, in general. One chunk may have 50 (or 1000) paragraphs, others just one. Once I get it in this form, I can easily do summary stats and selection queries, looking for patterns. Basically, I'm using a pandas dataframe as my corpus.

@immerrr
Contributor

immerrr commented Oct 11, 2014

But then you are not harnessing the power of pandas at all. Its just list manipulation.

Which was exactly my point: if we're doing native object manipulation, there's not much pandas can help with performance-wise.

As for the example, it's helpful indeed. So the desired output is equivalent to taking out that one column, flatmapping it as a Series while adding an integer index level, and joining it back with the rest of the dataframe on the original index.
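
A sketch of that recipe, assuming df is the ts/name/text frame from the previous comment and that the parts are delimited by '. ' (the names here are illustrative, not a proposed API):

import pandas as pd

# split each text cell, remembering (original row label, part number)
triples = [(i, j, part)
           for i, text in df['text'].items()
           for j, part in enumerate(text.split('. '))]
orig, sub, parts = zip(*triples)

# repeat the non-text columns once per part, then attach the integer
# sub-level and the flattened text
out = df.drop('text', axis=1).reindex(list(orig))
out.index = pd.MultiIndex.from_arrays([list(orig), list(sub)])
out['text'] = list(parts)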

@kay1793
Author

kay1793 commented Oct 11, 2014

Which was exactly my point: if we're doing native object manipulation, there's not much pandas can help with performance-wise.

I think the poor performance of solution #1 was due to bad complexity; otherwise, non-vectorized is good enough for my data sizes.

the desired output is equivalent to <...>

Yes, but with 2 sugars please.

I've come to think of pandas as THE way to work with data, and now that I'm working with a corpus, it lacks some operations that seem generally useful and common. The groupby point I made earlier seems solid to me; I hope you'll agree.

@jreback
Contributor

jreback commented Oct 20, 2015

Closing, but pls reopen if the issue persists.

@jreback closed this as completed Oct 20, 2015
@rcompton

rcompton commented Feb 2, 2016

Just came across this and I think it's worth reopening.

Some sugar for flatmap/explode would be nice. There are several workarounds circulating, and none of them are quite the same.

@jreback
Contributor

jreback commented Feb 12, 2016

We would take a cookbook entry of the recipes (the one in https://stackoverflow.com/questions/31080258/pysparks-flatmap-in-pandas is most idiomatic).
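
One recipe along those lines, a sketch of a generic Series flatmap (flatmap_series is a hypothetical name, not necessarily the exact code in the linked answer):

import itertools
import pandas as pd

def flatmap_series(s, func):
    # func maps a value to an iterable of values; each index label is
    # repeated once per emitted value, mirroring Spark's flatMap
    chunks = [list(func(v)) for v in s]
    index = s.index.repeat([len(c) for c in chunks])
    return pd.Series(list(itertools.chain.from_iterable(chunks)), index=index)

# e.g. flatmap_series(s, str.split) yields one word per row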

@timotta

timotta commented Sep 17, 2016

This is how I'm doing it for now. It's not the most performant way, but it works:

import pandas as pd

def flatmap(df, func):
    # func maps one row (as a {column: value} dict) to a list of
    # output-row dicts
    rows = []
    cols = df.columns
    for r in df.values:
        row = dict(zip(cols, r))  # rebuild {column: value} for this row
        rows.extend(func(row))
    return pd.DataFrame.from_records(rows)

# example: one output row per genre in the row's genreList
flatfunc = lambda r: [{"userId": r["userId"], "rating": r["rating"], "genre": genre}
                      for genre in r["genreList"]]
flattened = flatmap(df, flatfunc)

@wesm
Member

wesm commented Sep 17, 2016

I'm supportive of adding a flatmap function to pandas. Even more so, I would like to have a first-class list/array type.

@rcalsaverini

rcalsaverini commented Oct 11, 2016

Is this being considered? I think having a flatMap or bind operation would be very interesting and useful (besides being theoretically satisfying for people with a functional programming background, but that isn't really an argument).

I had a use-case today. I have a dataframe of events, with start time and end time:

event  start_time           end_time
A      2014-01-01 12:00:00  2014-01-01 13:45:00
B      2014-01-01 12:45:00  2014-01-01 12:55:00
...    ...                  ...

And we'd like to transform this into a dataframe that has a row for each minute during which an event was actually happening:

event  time
A      2014-01-01 12:00:00
A      2014-01-01 12:01:00
A      2014-01-01 12:02:00
A      2014-01-01 12:03:00
...    ...
A      2014-01-01 13:44:00
A      2014-01-01 13:45:00
B      ...

There are lots of ways to do this, but if we had a flatMap method this would be almost trivial.

def make_time_table(row):
    return pandas.DataFrame(
        {'event': row['event'], 'time': time}
        for time in pandas.date_range(...)
    )

df.flatMap(make_time_table)

Or something like it.
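
A runnable version of that sketch, assuming start_time and end_time are already datetime64 columns and events are minute-aligned (the date_range arguments are my guess at what was elided above):

import pandas as pd

def make_time_table(row):
    return pd.DataFrame({
        'event': row['event'],
        # one output row per minute from start to end, inclusive
        'time': pd.date_range(row['start_time'], row['end_time'], freq='min'),
    })

# without a flatMap method, concat over iterrows fills the same role
expanded = pd.concat([make_time_table(row) for _, row in df.iterrows()],
                     ignore_index=True)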

@rcompton

I've been using @timotta's function above, which has been surprisingly fast.

@hkhatod

hkhatod commented Dec 19, 2017

I have long text in one of the dataframe columns. I need to be able to split the text into 2 or more columns, depending on the length of that cell.

@allComputableThings

Was this closed because of opposition to storing lists in dataframes/series?

I don't think flatmap requires that, and the lack of flatmap is a fairly big problem with non-trivial and non-performant workarounds. (Scala, Spark, and many other functional systems have it as a first-class operation.)

@jreback
Contributor

jreback commented Mar 14, 2019

see #16538, which has a PR associated with it
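
For later readers: that work shipped in pandas 0.25 as Series.explode and DataFrame.explode, which cover the examples in this thread, e.g.:

import pandas as pd

s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'],
              index=['Alice', 'Bob', 'Alice'])
s.str.split().explode()   # one word per row, index labels repeated

df = pd.DataFrame([['2014-01-01', 'Alice', 'A B'],
                   ['2014-01-02', 'Bob', 'C D']],
                  columns=['dt', 'name', 'text'])
df.assign(text=df['text'].str.split()).explode('text')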

@tantofish

tantofish commented May 31, 2019

This is how I'm doing it for now. It's not the most performant way, but it works:

import pandas as pd

def flatmap(df, func):
    # func maps one row (as a {column: value} dict) to a list of
    # output-row dicts
    rows = []
    cols = df.columns
    for r in df.values:
        row = dict(zip(cols, r))  # rebuild {column: value} for this row
        rows.extend(func(row))
    return pd.DataFrame.from_records(rows)

# example: one output row per genre in the row's genreList
flatfunc = lambda r: [{"userId": r["userId"], "rating": r["rating"], "genre": genre}
                      for genre in r["genreList"]]
flattened = flatmap(df, flatfunc)

Works great for me. Thanks @timotta!

Slightly modified to simplify it and to be more similar to the original poster's desired syntax:

def flatmap(self, func):
    # func maps one row (as a Series) to a list of output-row records
    rows = []
    for idx, row in self.iterrows():
        rows.extend(func(row))
    return pd.DataFrame.from_records(rows)

pd.DataFrame.flatmap = flatmap  # monkey-patch so df.flatmap(...) works

df.flatmap(lambda x: ....)
