
general principles of data manipulation for discussion #2509

Closed
ppalmes opened this issue Nov 1, 2020 · 10 comments

Comments

@ppalmes
Contributor

ppalmes commented Nov 1, 2020

Typically, we use data frames because of their support for different column types. However, typical data-processing operations filter rows and columns that satisfy certain constraints and then apply transformations, which may not preserve column names or the data frame structure.

Many tasks involve statistical operations that require filtering certain columns and applying stat/math functions, which forces one to transform the data into matrix form. The matrix does not preserve column names, so extra steps are needed to plug the results back into the data frame. If one is not careful, the column names may fall out of sync when going from matrix back to data frame because of the slicing operations.

If we follow the Unix pipe principle, the input and output of any filter must be a data frame. Unix uses grep to filter rows, cut to filter columns, and tr/sed/awk to transform the filtered rows/cols. For data frames, we want the filtering and math/stat operations to satisfy a closure property (meaning their output should be a data frame that preserves column names).

Here are some typical column-oriented workflows. Assume df has date, numeric, and categorical columns, and spans so many columns that enumerating them is tedious.

df |> filter-date-cols |> extract-day/hours/dayofweek
df |> filter-categorical-cols |> hot-encode
df |> filter-numeric-cols |> log/sqrt/scale or pca/ica/svd
df |> filter-numeric-cols |> summarize(mean, median, other stats)
df |> filter-numeric-cols |> filter-cols-with-missing |> summarize(freq of missing)
df |> filter-cols-with-missing |> summarize(freq)
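The filter-numeric-cols |> summarize step above can already be approximated in current DataFrames.jl with `names(df, T)` and `combine`; a minimal sketch, with made-up column names and data:

```julia
using DataFrames, Statistics

df = DataFrame(date = ["2020-11-01", "2020-11-02"],
               x = [1.0, 4.0], y = [9.0, 16.0],
               cat = ["a", "b"])

# filter-numeric-cols: keep only columns whose element type is Float64
num = select(df, names(df, Float64))

# summarize(mean, median): combine returns a one-row DataFrame,
# so column names survive (x_mean, x_median, ...)
stats = combine(num, names(num) .=> mean, names(num) .=> median)
```

Both `num` and `stats` are data frames, so the column names are preserved end to end.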

In a more complex workflow, we can filter out NA rows, filter out columns with more than 50% NAs, impute the remaining df, filter the numeric columns and transform them, filter the categorical columns and transform them, filter the date columns and transform them, and concatenate the results in one line:

df |> row-NA-rm |> col-NA-rm |>
    ( (filter-numeric |> scale) +  
       (filter-date |> extract-hour) + 
       (filter-cat |> hotbit-encode)
     ) |> CSV.write("training.csv")

Since each transformation outputs a data frame, you can extract each part, transform it, and concatenate the results in one line. It also becomes easier to read the operations horizontally rather than vertically, because you read from left to right without creating temporary variables, which are prone to bugs and logical errors.
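A rough DataFrames.jl rendering of that concatenation idea, with made-up data; the scale and one-hot steps below stand in for the hypothetical filter-numeric |> scale and filter-cat |> hotbit-encode stages:

```julia
using DataFrames, Statistics

df = DataFrame(x = [1.0, 2.0, 3.0], cat = ["a", "b", "a"])

# filter-numeric |> scale: standardize each Float64 column, keeping its name
num = select(df, names(df, Float64) .=>
                 (c -> (c .- mean(c)) ./ std(c)) .=>
                 names(df, Float64))

# filter-cat |> hotbit-encode: one indicator column per category level
hot = DataFrame([Symbol(col, "_", v) => (df[!, col] .== v)
                 for col in names(df, String) for v in unique(df[!, col])])

# concatenate the sub-results side by side: DataFrame in, DataFrame out
out = hcat(num, hot)
```

Each stage returns a data frame, so `hcat` can glue the sub-results back together without losing names.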

@bkamins
Member

bkamins commented Nov 1, 2020

It seems that what you propose is better suited for an extension package rather than DataFrames.jl itself.

For things related to your request, #2508 and #2417 are now open. They are meant to provide primitives on top of which functionality like you describe can be built.

If you have any low-level API that would complement these proposals, then we can discuss it here. Otherwise I think we can close this issue, as it is better placed in a high-level data manipulation package.

@ppalmes
Contributor Author

ppalmes commented Nov 1, 2020

yeah, i think if these two issues cover this, then it's ok to close this.

@ppalmes ppalmes closed this as completed Nov 1, 2020
@ppalmes
Contributor Author

ppalmes commented Nov 1, 2020

can we have a wrapper function that automatically preserves column names when we pass ordinary functions such as mean/mode/median to a dataframe? dfops(df, sum) -> df, or is this already addressed by combine/select? this would be helpful in pipe operations so that the dataframe structure is always the output

@bkamins
Member

bkamins commented Nov 1, 2020

I am not sure what you mean exactly, but it seems you want mapcols(sum, df).
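For instance (note that in current DataFrames.jl the function is the first argument to `mapcols`; the data here is made up):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# mapcols applies the function to every column and wraps the
# result back into a DataFrame, preserving the column names
out = mapcols(sum, df)
```

The result is a one-row DataFrame with the same column names, which is exactly the closure behavior asked for above.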

@ppalmes
Contributor Author

ppalmes commented Nov 1, 2020

oh yeah, that is handy. so if i can have a nice column filter, i can use this for summaries

@ppalmes
Contributor Author

ppalmes commented Nov 1, 2020

btw, there is mapcols but no maprows. if i do map(fn, eachrow/eachcol), it doesn't return a dataframe. it would be nice to have closure operations where you map over a dataframe and the result is a dataframe.
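One workaround today is to rebuild the data frame from the mapped rows; a sketch with made-up data:

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# map over eachrow yields a Vector of NamedTuples, not a DataFrame...
rows = map(r -> (a = 2r.a, b = 2r.b), eachrow(df))

# ...so the DataFrame constructor is needed to restore the closure property
out = DataFrame(rows)
```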

@ppalmes
Contributor Author

ppalmes commented Nov 1, 2020

basically, row and column filters that return a dataframe, plus a generic map that returns a dataframe, should cover a lot of cases.

you can have row-filter |> column-filter |> map-by-row-or-col |> col-filter |> summarize

we may not need macros in many cases if we have these operations.
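That pipeline shape can already be approximated with filter/select/combine, each of which returns a DataFrame; a sketch with hypothetical data:

```julia
using DataFrames, Statistics

df = DataFrame(x = [1.0, 2.0, 10.0], y = [4.0, 5.0, 6.0], tag = ["a", "b", "a"])

# row-filter |> column-filter |> summarize, with a DataFrame at every step
out = (filter(r -> r.x < 5, df)
       |> (d -> select(d, names(d, Float64)))
       |> (d -> combine(d, names(d, Float64) .=> mean)))
```

Because every stage is DataFrame in, DataFrame out, no macros are needed to keep the chain going.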

@bkamins
Member

bkamins commented Nov 1, 2020

btw, there is mapcols but no maprows.

it is:

select(df, All() => ByRow(your_function) => outcols)

we might have map support for AbstractDataFrame in the future, but the design of such a map is by far non-obvious, so this is left for after the 1.0 release (in particular, it is not clear what the type of the map return value should be).
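A concrete instance of that pattern (the data and the output column name :total are illustrative):

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# All() passes every column of each row positionally to the function;
# the row-wise results become the new :total column
out = select(df, All() => ByRow((a, b) -> a + b) => :total)
```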

@ppalmes
Contributor Author

ppalmes commented Nov 1, 2020

ok, it's just natural to expect that if there is mapcols, then maprows would be its corresponding row-wise transform function.

@ppalmes
Contributor Author

ppalmes commented Nov 1, 2020

i propose that all operations on a df should return a dataframe, so that it's trivial to join or concatenate subresults and the closure property holds. if you operate on an integer, you should get an integer; if you operate on a dataframe, you should get a dataframe. i just realized i meant closure property instead of closed operations. it's like type stability: any operation on a dataframe should return a dataframe, so that succeeding filters in the pipeline can expect dataframe input and dataframe output as a consistent data interchange format.
