Help with Dataframe as function argument or mask assignement #28591

Closed
RomainCendre opened this issue Sep 24, 2019 · 5 comments

RomainCendre commented Sep 24, 2019

Problem description

Hi everyone,
I'm trying to write some utility functions for machine learning and I don't know the best way to do it. I have browsed a lot of topics on Stack Overflow without finding an answer.

I want to write several functions that add new data to an existing dataframe and handle some filtering of its rows, without any assignment on the caller's side.

Idea #1
# Method call
transform(inputs[mask], {'datum': 'Data'}, 'PCA', PCA())

# Method definition
def transform(dataframe, tags, out, model):
    # Check mandatory fields: every mandatory tag must be present in tags
    mandatory = ['datum']
    if not isinstance(tags, dict) or not all(elem in tags for elem in mandatory):
        raise Exception(f'Not a dict or missing tag: {mandatory}.')

    # Select the input column (a bare .loc[label] would look up a row instead)
    features = model.transform(dataframe[tags['datum']].to_numpy())
    # Store the result as a new column
    dataframe[out] = features.tolist()
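
To illustrate why Idea #1 cannot update the caller's dataframe, here is a minimal sketch on hypothetical toy data (the column values and the helper add_column are assumptions made for illustration, not part of the original code). Boolean indexing such as inputs[mask] builds a new object, so whatever the function writes lands in that object only.

```python
import pandas as pd

# Hypothetical toy data standing in for the real inputs.
inputs = pd.DataFrame({
    "Data": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    "Label": [0, 1, 0],
})
mask = inputs["Label"] == 0


def add_column(dataframe, out):
    # Hypothetical helper: mutates whatever object it receives.
    dataframe[out] = 1.0


# inputs[mask] is already a new object; .copy() only makes that explicit.
subset = inputs[mask].copy()
add_column(subset, "PCA")

print("PCA" in subset.columns)  # True
print("PCA" in inputs.columns)  # False: the original frame is untouched
```

Passing the full dataframe together with the mask, as in Idea #2, avoids this because the function then writes into the original object.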

Idea #2
# Method call
transform(inputs, {'datum': 'Data'}, 'PCA', PCA(), mask)

# Method definition
def transform(dataframe, tags, out, model, mask=None):
    # Check mandatory fields: every mandatory tag must be present in tags
    mandatory = ['datum']
    if not isinstance(tags, dict) or not all(elem in tags for elem in mandatory):
        raise Exception(f'Not a dict or missing tag: {mandatory}.')

    # Mask creation (see pandas view / copy mechanism)
    if mask is None:
        mask = [True] * len(dataframe.index)

    features = model.transform(dataframe.loc[mask, tags['datum']].to_numpy())
    # This is the assignment that raises the ValueError described below
    dataframe.loc[mask, out] = features.tolist()

Input

Row ; Data ; Label
0 ; [61953.017837947686, 9.505037089204054, 74.585... ] ; 0
1 ; [80832.69302693632, 9.524642547991316, 83.9228... ] ; 1

Expected Output

Row ; Data ; Label ; PCA
0 ; [61953.017837947686, 9.505037089204054, 74.585... ] ; 0 ; [74.585... ]
1 ; [80832.69302693632, 9.524642547991316, 83.9228... ] ; 1 ; [92.578... ]

I call 'features = model.transform(dataframe.loc[mask, tags['datum']].to_numpy())' so that the data is processed as a single matrix rather than row by row with the apply() method.

  • The first idea doesn't seem to work, because a view cannot be passed as an argument and modified from inside the function: 'inputs[mask]' goes through __getitem__ when I pass it to the function, so the function receives a copy of the dataframe.

  • The second idea doesn't seem to work either, because 'dataframe.loc[mask, out] = features.tolist()' raises ValueError: 'Must have equal len keys and value when setting with an ndarray', as it seems to deal with the mask element by element...
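
One possible way around both problems, sketched under stated assumptions: pass the whole dataframe plus the mask, wrap the per-row results in a Series indexed by the labels of the masked rows, and assign that Series as a whole column so pandas aligns it on the index. FirstTwoColumns is a hypothetical stand-in for a fitted model such as PCA, and the toy data is invented for illustration. Since .to_numpy() on a column of lists yields a 1-D object array, the sketch stacks the lists into a 2-D array first. Note that the whole-column assignment overwrites any existing values in the output column.

```python
import numpy as np
import pandas as pd


def transform(dataframe, tags, out, model, mask=None):
    """Apply model.transform to the masked rows; store one list per row in `out`."""
    if mask is None:
        mask = np.ones(len(dataframe), dtype=bool)
    mask = np.asarray(mask, dtype=bool)

    # Stack the per-row lists into a 2-D (n_rows, n_features) float array.
    matrix = np.array(dataframe.loc[mask, tags["datum"]].tolist(), dtype=float)
    features = model.transform(matrix)

    # Index the results by the labels of the masked rows.
    result = pd.Series(features.tolist(), index=dataframe.index[mask])

    # Whole-column assignment aligns on the index; unmasked rows become NaN.
    dataframe[out] = result


class FirstTwoColumns:
    """Hypothetical stand-in for a fitted model: keeps the first two features."""

    def transform(self, X):
        return X[:, :2]


# Hypothetical toy data.
inputs = pd.DataFrame({
    "Data": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    "Label": [0, 1, 0],
})
transform(inputs, {"datum": "Data"}, "PCA", FirstTwoColumns(),
          mask=inputs["Label"] == 0)
print(inputs)
```

Rows 0 and 2 receive a list each, while row 1, which the mask excludes, is left as NaN.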

EDIT:
For now I'm doing it in two steps, which is not the most intuitive way...
https://stackoverflow.com/questions/58064179/pandas-masked-dataframe-assign-2d-array

I'm out of ideas at this point...
If you have any advice, I would be thankful.
Best regards

@RomainCendre RomainCendre changed the title Mask and Dataframe Help with Dataframe as function argument or mask assignement Sep 24, 2019
@RomainCendre
Author

OK, I found a way around this by exploring the second idea.

Switching:
dataframe.loc[mask, out] = features.tolist()
to:
dataframe.loc[mask, out] = pd.Series(features.tolist())
solved it, but I don't understand why the first form isn't valid as well.

@TomAugspurger
Contributor

Is this a bug report? We recommend stackoverflow for usage questions.

@RomainCendre
Author

RomainCendre commented Sep 24, 2019 via email

@RomainCendre
Author

I looked further into the solution above, and 'dataframe.loc[mask, out] = pd.Series(features.tolist())' seems to assign only the first row and to ignore the mask...
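
A plausible explanation for this behaviour is index alignment: a Series built without an explicit index is labelled 0, 1, 2, ..., and .loc assignment matches those labels against the labels of the masked rows instead of filling them in order. The sketch below shows the effect on hypothetical scalar data (the column names by_label and by_position are made up for illustration).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Label": [0, 1, 0, 1]})
mask = df["Label"] == 1            # selects the rows labelled 1 and 3

df["by_label"] = np.nan
df["by_position"] = np.nan

values = pd.Series([10.0, 20.0])   # default index: 0, 1

# A Series on the right-hand side is aligned on index labels:
# row 1 receives values[1]; row 3 has no matching label and stays NaN.
df.loc[mask, "by_label"] = values

# A plain array is assigned positionally to the masked rows.
df.loc[mask, "by_position"] = values.to_numpy()

print(df)
```

Giving the Series the index of the masked rows, for example pd.Series(..., index=df.index[mask]), makes the labels line up with the rows being assigned.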

@mroeschke
Member

If reporting a bug, we would need a minimal, reproducible example of the buggy behavior. Feel free to reopen when you can post an example.
