Pandas to vw text format #2426
This is really cool! Thanks for working on this. So it looks to me that you define how to use each column of the DataFrame and then process each row accordingly.
I wonder if it makes sense to make it clearer what are column names and what are other strings in this formula? For instance it looks like namespaces and feature names aren't substituted, but feature values are. Maybe it could be like a format string where {column_name} is used in the formula in a generic way.
It seems that additionally there are limitations on how the label is constructed. Some labels have the form
Have you put any thought into how multi line examples may be handled in this sort of scheme?
Yes, the general pattern is specified in the formula. I then proceed column by column (to take advantage of vectorization) to obtain a single column in which each element is a string that defines one line. That column is then converted to a list to get the list of lines.
Yes ! I too hesitated between the current formulation and using
I did not know such a case existed (I used the Input format page from the wiki as a reference). Could you please tell me more about this pattern (or link me to a page where I can see examples)? More specifically, in
Also, I thought of using a method that checks the correctness of the formula using a regex (in the same fashion as https://hunch.net/~vw/validate.html). What do you think of it? Does this regex pattern already exist somewhere in the project (so I can reuse it and slightly accommodate it for the
Thanks 😃 !
Here is an example of the label I mentioned: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Logged-Contextual-Bandit-Example There are several different label types available, the best list at the moment is the polylabel union. However, if you use the {} approach for substitution this will allow users to structure the labels as needed and you won't need to worry about handling all of these. That validate page just handles the one label type I believe, but it is a good start. |
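A minimal sketch of the {column_name} substitution idea discussed above, in plain Python. It is not the PR's implementation: the helper name formula_to_lines and the dict-of-lists stand-in for a DataFrame are assumptions made for illustration only.

```python
import string


def formula_to_lines(columns, formula):
    """Render one VW text line per row by substituting {column_name} fields.

    `columns` is a dict mapping column names to equal-length lists
    (a hypothetical stand-in for a pandas.DataFrame).
    """
    # Collect the column names referenced in the formula.
    fields = [name for _, name, _, _ in string.Formatter().parse(formula) if name]
    missing = [name for name in fields if name not in columns]
    if missing:
        raise ValueError("columns not found: {}".format(missing))
    n_rows = len(next(iter(columns.values())))
    return [
        formula.format(**{name: columns[name][i] for name in fields})
        for i in range(n_rows)
    ]


columns = {"y": [1, -1], "idx": ["id_1", "id_2"], "x": [0.5, 2]}
lines = formula_to_lines(columns, "{y} {idx}|test_ns col_x:{x}")
print(lines)
```

Because everything outside the braces is copied verbatim, the label part can take whatever shape a given label type needs, which is the flexibility argued for above.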
…heck for absent cols at initialization, change formulas in tests
…/vowpal_wabbit into pandas_to_vw_text_format
Ok done ! I simplified the class to allow a more flexible schema for the formula.
Do not hesitate to tell me if this formula is too restrictive (or too flexible 😄 )
Sounds good
This is actually permitted. If you supply something like
It is less restrictive than that, the only characters not permitted in a feature name are
Sounds good. |
python/vowpalwabbit/pyvw.py
    def __init__(self, df, formula):
just as a thought, is there a way we can deduce or use types to drive this string formula? maybe using reflection
It may be worth pursuing a sort of code based approach to defining the formula:
"{y} {idx}|test_ns col_x:{x}"
Might become:
formula = Formula(label=SimpleLabel(ColumnBinding("y")), tag=ColumnBinding("idx"), namespaces=[Namespace(name="test_ns", features=[Feature(name="col_x", value=ColumnBinding("x"))])])
It is significantly more text to write out, but it may be easier to construct for a newcomer that has an IDE at their disposal. Thoughts?
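To make the code-based proposal concrete, here is a rough sketch of how such objects could render a row into VW text. The class names follow the comment above, but their fields and the render methods are assumptions, not the API that was eventually merged; rows are plain dicts rather than DataFrame rows.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ColumnBinding:
    column: str


def resolve(item, row):
    """Return the row's value for a ColumnBinding, or the constant itself."""
    return row[item.column] if isinstance(item, ColumnBinding) else item


@dataclass
class Feature:
    value: Any
    name: Optional[Any] = None

    def render(self, row):
        value = resolve(self.value, row)
        if self.name is None:
            return str(value)
        return "{}:{}".format(resolve(self.name, row), value)


@dataclass
class Namespace:
    features: List[Feature]
    name: str = ""

    def render(self, row):
        return "|{} {}".format(self.name, " ".join(f.render(row) for f in self.features))


@dataclass
class SimpleLabel:
    value: Any

    def render(self, row):
        return str(resolve(self.value, row))


@dataclass
class Formula:
    label: SimpleLabel
    namespaces: List[Namespace]
    tag: Optional[Any] = None

    def render(self, row):
        # The tag, when present, sits directly against the first "|".
        head = self.label.render(row) + " "
        if self.tag is not None:
            head += str(resolve(self.tag, row))
        return head + " ".join(ns.render(row) for ns in self.namespaces)


formula = Formula(
    label=SimpleLabel(ColumnBinding("y")),
    tag=ColumnBinding("idx"),
    namespaces=[Namespace(name="test_ns", features=[Feature(name="col_x", value=ColumnBinding("x"))])],
)
rows = [{"y": 1, "idx": "id_1", "x": 0.5}]
lines = [formula.render(row) for row in rows]
print(lines)
```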
The regex does not accept:
The |
python/tests/test_pyvw.py
def test_oneline_with_multiple_namespaces():
    df = pd.DataFrame({"y": [1], "a": [2], "b": [3]})
    conv = DataFrameToVW(df, "{y} |FirstNameSpace {a} |DoubleIt:2 {b}")
We could also consider an api for DataFrameToVW that takes in some sort of spec of what we want to output. Just a sketch not actual code:
[
    Namespace(name="FirstNamespace", features=[Feature(name=at("a"))]),
    Namespace(name="DoubleIt", value=2, features=[Feature(name=at("b"))]),
]
Yes @lalo, I think this is good too !
To sum up, for the following formula (which is a bit more complex to take more cases)
"{y} {tag_col}|FirstNameSpace {a} {b}:2 |DoubleIt:2 ColC:{c} |Constant :1"
We could have something like :
DataFrameToVW(
    df,
    Targets=[Label(col("y")), Tag(col("tag_col"))],
    Features=[
        Namespace(name="FirstNamespace",
                  features=[Feature(value=col("a")), Feature(name=col("b"), value=2)]),
        Namespace(name="DoubleIt", value=2,
                  features=[Feature(name="ColC", value=col("c"))]),
        Namespace(name="Constant",
                  features=[Feature(value=1)]),
    ])
In my opinion, the approach of using extensive args is more explicit and does not require complex regex checking. However it's more tedious to write, which can be a drawback if you have a lot of features.
Regarding the approach based on a string formula, the main benefit is that it is fast to write and familiar to people with a statistical background who are used to R-like formulas (as in R or the Python package statsmodels).
What do you think @jackgerrits, @lalo ?
We could also add convenience methods for example have some sort of default mapping that assumes no namespaces or values, anything more complicated than that someone has to define the explicit mapping. That and a filtered one where you send only the columns from that frame that convert into features.
Yeah I agree convenience methods for common ways to map it can be useful.
I find the Targets value a little confusing, as in the text format everything to the left of the first | is the label (minus the tag). Therefore I think there should be a top level property which takes a Label object of a specific label type. We would need to specify all of those, but we can easily just start with the SimpleLabel type.
You could also have the string formula generate this programmatic formula. Could be another nice extension convenience maybe? Maintaining the regex may not be desirable though.
Again with pseudocode to sketch out the api:
we could have something like
DataFrameToVW(
    df,
    VWMappingDefault(df.columns))
or
DataFrameToVW(
    df,
    VWMappingDefault([c for c in df.columns if c in ["a", "b"]]))  # in case df has more columns we don't care about
Then we would have to make sure that VWMappingDefault just generates the structure of having no namespace, no values, etc just grabbing whatever value from the column into a feature.
Also a benefit on leaning on extra types is that if we change or update the input format we don't break all the users of the api (i.e. changing the '|' to '#', not that this will ever happen). We would be able to transform their types into the new input format cleanly.
We could also add convenience methods for example have some sort of default mapping that assumes no namespaces or values, anything more complicated than that someone has to define the explicit mapping.
You mean no feature name ? OK for the default mapping !
I find the Targets value a little confusing as in the text format everything to the left of the first | is the label (minus the tag). Therefore I think there should be a top level property which takes a Label object of a specific label type. We would need to specify all of those, but we can easily just start with SimpleLabel type.
In the vowpal input format wiki, the pattern is [Label] [Importance] [Base] [Tag]|Namespace Features |Namespace Features ... |Namespace Features. So to the left of the first | there can also be Importance and Base. Also, as you mentioned earlier, there is also the polylabel union (for CB).
We could define, for the LHS of the first "|", the following properties/types: Label, Importance, Base, Tag and UnionLabel/PolyLabel (for the CB case), and the following properties for the RHS of the first "|": Namespace, Feature. What do you think ?
Again with pseudocode to sketch out the api:
we could have something like
DataFrameToVW(
    df,
    VWMappingDefault(df.columns))
or
DataFrameToVW(
    df,
    VWMappingDefault([c for c in df.columns if c in ["a", "b"]]))  # in case df has more columns we don't care about
Then we would have to make sure that VWMappingDefault just generates the structure of having no namespace, no values, etc just grabbing whatever value from the column into a feature.
Sounds good ! Then by default the first column of the DataFrame would be the target ? Alternatively we could ask the user to supply the column name of the target and the list of feature names, as in VWMappingDefault(y="y", x=["a", "b"]).
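A small sketch of what such a default mapping might produce: no namespace name, no feature names, just the column values. The function name default_mapping_lines and the dict-of-lists input are hypothetical and only illustrate the y / x idea from the comment above.

```python
def default_mapping_lines(columns, y, x):
    """Build 'label | v1 v2 ...' lines with an anonymous namespace.

    `columns` is a dict of equal-length lists standing in for a DataFrame,
    `y` is the target column name and `x` the list of feature column names.
    """
    lines = []
    for i in range(len(columns[y])):
        features = " ".join(str(columns[name][i]) for name in x)
        lines.append("{} | {}".format(columns[y][i], features))
    return lines


columns = {"y": [-1, -1, 1], "a": [1, 2, 3], "c": [4.5, 6.7, 9.6]}
one_feature = default_mapping_lines(columns, y="y", x=["a"])
two_features = default_mapping_lines(columns, y="y", x=["a", "c"])
print(one_feature)
print(two_features)
```

Asking for the target name explicitly avoids relying on column order, at the cost of one extra argument.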
Also a benefit on leaning on extra types is that if we change or update the input format we don't break all the users of the api (i.e. changing the '|' to '#', not that this will ever happen). We would be able to transform their types into the new input format cleanly.
Yes, I totally agree !
[Label] [Importance] [Base] [Tag]| is only true for simple_label. Unfortunately, this is a case of the wiki not quite being up to date.
LHS of the first | will depend on the label type, and therefore it must be specific per label type. This is why I suggest providing a rich label object and the tag. But RHS of the first | will be the same for every label type.
[Label] [Importance] [Base] [Tag]| is only true for simple_label. Unfortunately, this is a case of the wiki not quite being up to date.
Another case is the pattern for CB (action:cost:probability | features). Are there other cases ?
LHS of the first | will depend on the label type, and therefore it must be specific per label type. This is why I suggest providing a rich label object and the tag. But RHS of the first | will be the same for every label type.
I have been thinking about the design and I could do the following: an abstract class FeatureHandler with an abstract method process and a concrete method get_col_or_value. The following 2 classes will inherit from this abstract class:
SimpleLabel, which has an attribute name and implements process.
Feature, which has attributes name and value and implements process.
In both classes, the attributes name (and value) can either receive an object of type col (that specifies the column to extract, as in col("a")) or a value that will be used as is. The concrete method get_col_or_value will extract the column from the dataframe (if col) or build a column repeating the value if a value is passed.
The implementation of the process method will build the appropriate column of strings (pandas.Series) according to the type of the object (SimpleLabel, Feature or PolyLabel).
A third class PolyLabel with attributes action, cost, probability can easily be added.
What do you think of it ?
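The column-or-constant idea described above can be sketched as follows. This is an illustration under simplifying assumptions, not the PR's code: columns are plain lists instead of pandas.Series, there is no abstract base class, and the signatures are made up for the example.

```python
class Col:
    """Marks a string as a column name rather than a literal value."""

    def __init__(self, colname):
        self.colname = colname


def get_col_or_value(item, columns, n_rows):
    """Return the referenced column as strings, or the constant repeated n_rows times."""
    if isinstance(item, Col):
        return [str(v) for v in columns[item.colname]]
    return [str(item)] * n_rows


class Feature:
    def __init__(self, value, name=None):
        self.value = value
        self.name = name

    def process(self, columns, n_rows):
        """Build the column of 'name:value' (or bare 'value') strings."""
        values = get_col_or_value(self.value, columns, n_rows)
        if self.name is None:
            return values
        names = get_col_or_value(self.name, columns, n_rows)
        return [n + ":" + v for n, v in zip(names, values)]


columns = {"a": [1, 2, 3], "b": ["x", "y", "z"]}
named = Feature(name="ColA", value=Col("a")).process(columns, 3)
scaled = Feature(name=Col("b"), value=2).process(columns, 3)
print(named)
print(scaled)
```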
I thought of it and simplified the classes (no abstract class, and no subtypes Label/Importance/Base/Tag).
Here is the UML design of the classes I wrote :
Usages
Let's build a toy dataset
import numpy as np
import pandas as pd
df_toy = pd.DataFrame(
{
"y": [-1, -1, 1],
"p": np.random.uniform(size=3),
"c": [4.5, 6.7, 9.6],
"a": [1, 2, 3],
"b": np.random.normal(size=3)
}
)
# out
y p c a b
0 -1 0.525440 4.5 1 0.616254
1 -1 0.262586 6.7 2 -1.133934
2 1 0.705830 9.6 3 -0.452018
1. Using just colnames to specify target and features
Same idea as @lalo's VWMappingDefault. We call the method DFtoVW.from_colnames:
DFtoVW.from_colnames(y="y", X =["a"], df=df_toy).process_df()
# out
['-1 | 1', '-1 | 2', '1 | 3']
DFtoVW.from_colnames(y="y", X = set(df_toy) - set(["y", "p"]), df=df_toy).process_df()
# out
['-1 | 0.6162539137336357 1 4.5',
'-1 | -1.1339344282053312 2 6.7',
'1 | -0.45201750087182024 3 9.6']
DFtoVW.from_colnames(y=["a", 'c', "p"], X =["b"], df=df_toy, poly_label=True).process_df()
# out (p and b come from a different random draw than the table above)
['1:4.5:0.6121907426717043 | -0.43104977544310435',
'2:6.7:0.6773696976137632 | 2.45363493233382',
'3:9.6:0.5955350558877885 | 1.2190748658325201']
2. Using the interface that we talked about
DFtoVW(label=SimpleLabel(Col("y")),
namespaces=Namespace(features=[Feature(value=Col("a"))]),
df=df_toy).process_df()
# out
['-1 | 1', '-1 | 2', '1 | 3']
complex_namespaces = [
Namespace(name="FirstNamespace", features=[Feature(name="ColA", value=Col("a"))]),
Namespace(name="DoubleIt", value=2, features=[Feature(value=Col("b"))])
]
DFtoVW(label=SimpleLabel(Col("y")),
namespaces=complex_namespaces,
df=df_toy).process_df()
# out
['-1 |FirstNamespace ColA:1 |DoubleIt:2 0.6162539137336357',
'-1 |FirstNamespace ColA:2 |DoubleIt:2 -1.1339344282053312',
'1 |FirstNamespace ColA:3 |DoubleIt:2 -0.45201750087182024']
DFtoVW(label=PolyLabel(action=Col("a"), cost=Col("c"), proba=Col("p")),
namespaces=Namespace(features=[Feature(value=Col("b"))]),
df=df_toy).process_df()
# out (p and b come from a different random draw than the table above)
['1:4.5:0.6121907426717043 | -0.43104977544310435',
'2:6.7:0.6773696976137632 | 2.45363493233382',
'3:9.6:0.5955350558877885 | 1.2190748658325201']
Is that class design okay for you ?
In general I really like this! Especially scheme 2.
Yes, in general. Any label can have entirely its own form. The list of possible labels is here: vowpal_wabbit/vowpalwabbit/example.h Line 36 in 19fda26
So I think for 1, the default being
One thing that we really need to consider while looking at this design is multiline examples. I am not sure if you are familiar with them or not. So for contextual bandit examples with action dependent features, for example, there will be one line which is called the shared example and then there will be a line for every action that can be taken. Then when you learn from this in VW you actually pass a list of examples to VW.
One idea to handle this is that you define a column in the dataframe as a grouping id, and every row that has the same grouping id forms part of this list of examples. Then you need to be able to have a different formula for shared vs action examples, so you could define a column which specifies which formula to use for this row.
This bit of extra work on multiline examples turns this from very useful to extremely useful :)
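The grouping idea can be sketched like this, assuming rows are already sorted by the grouping id. The key names ("index", "type", "label", "features") and the helper group_multiline are hypothetical; the output lines follow the shared/action layout of contextual bandit examples with action dependent features.

```python
from itertools import groupby


def group_multiline(rows, id_key, type_key, formulae):
    """Turn rows into a list of multiline examples (lists of VW text lines).

    Consecutive rows sharing the same id form one example; the value in
    `type_key` selects which formula renders the row.
    """
    examples = []
    for _, group in groupby(rows, key=lambda row: row[id_key]):
        examples.append([formulae[row[type_key]](row) for row in group])
    return examples


formulae = {
    "shared": lambda row: "shared | {}".format(row["features"]),
    # An unlabeled action has an empty label, so the line starts at "|".
    "action": lambda row: "{} | {}".format(row["label"], row["features"]).strip(),
}

rows = [
    {"index": 1, "type": "shared", "label": "", "features": "s_1 s_2"},
    {"index": 1, "type": "action", "label": "0:1.0:0.5", "features": "a:1 b:1 c:1"},
    {"index": 1, "type": "action", "label": "", "features": "a:0.5 b:2 c:1"},
    {"index": 2, "type": "action", "label": "", "features": "a:1 b:0.5"},
    {"index": 2, "type": "action", "label": "0:0.1:0.75", "features": "a:0.5 b:1 c:2"},
]
examples = group_multiline(rows, "index", "type", formulae)
print(examples)
```

Each inner list would then be passed to VW as one multiline example.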
Unfortunately I don't know C, thus it is a bit hard for me to read the types in
Is that right ?
Nope, I was not familiar with it. I looked up the format in the wiki and found the following resources:
Are there additional resources I should know about to handle this multiline case ? I will read them carefully to understand what it is about. Edited: Ok, I found the tutorials on vowpalwabbit.org, which explain it quite well.
Ok ! Will try this approach when I understand more what's behind multilines.
Ok great ! 😄 |
Totally understandable if you aren't familiar with C/C++.
Yep I agree.
Not quite, it corresponds to
Features is a different part of the example, so it doesn't correspond to a label. Here's what I think might make sense. I think what you've proposed is really solid and works (right now) for the SimpleLabel, single line example case. A little bit more work is needed to support further labels and multiline examples. But I think we should merge an initial implementation that is just supporting SimpleLabel and then we can work on adding to it. What do you think? |
ok !
ok !
Yes sure we can merge the initial implementation with the SimpleLabel, single line example case. I need to rewrite the tests and verify the error handling process before we can merge :) I am currently working on the multilines examples and it appears a bit more clear to me. However I would be glad if you could provide me with some pointers about the dataframe that would generate the multilines examples. If I take the multilines examples from the wiki :
The part I am a bit confused about is the following: is the fact that the 2nd multiline example has a shared features line, and labels for each action that differ from those of the 1st multiline example, an intentional choice by the user or the result of the availability of the data ?
       action  cost  proba    a    b    c shared1 shared2
index
1           1   NaN    NaN  1.0  0.5  NaN     NaN     NaN
1           2   0.1   0.75  0.5  1.0  2.0     NaN     NaN
2           1   1.0   0.50  1.0  1.0  1.0     s_1     s_2
2           2   NaN    NaN  0.5  2.0  1.0     s_1     s_2
Am I missing something here ? Thanks in advance for your explanation !
Intentional choice by the user. In a contextual bandit scenario the shared example describes the features common to all actions, or in other words, the worlds features. In my mind the example you gave may come from something like this:
The key thing that makes this work in my mind though is that you can select between two different formulae based on the value of |
Yes
This part is still not clear. I thought the shared example line was just shown to explain what's possible, not that it helps choose between two different formulas. To be more concrete, let's use the analogy with the news recommender system of the tutorial on contextual bandits. Let's define :
We would have as many different indices as unique user connections. That's a lot of different indices for which to define a specific formula. Hence, how can the choice of using shared parameters or choosing given features be made for each index, since there are so many indices/user connections ? Thanks again for your explanation !
Shared and action convey different things to VW and both are required. (As labels get more complex we need to support more structures such as this.) The two different formulae we're talking about here have the form:
No there's only two, shared and action.
The choice is made with whether the "shared" column is true or false in this case. Index is merely used to know when to move onto the next block of examples (multiex). That part is a bit cumbersome, I'm open to ideas here. |
Thanks !
Ok, I think there was a misunderstanding on my part. From this exchange:
I understood that, for each index, the feature choices (which I incorrectly called labels) were also an intentional choice by the user. From what I now understand, only the choice of including shared parameters is up to the user. So in this example table (in one of your previous messages), the choice to retain features a and b for index 1 only is due to the availability of the data, right ?
By the way, I think the data that users might have is more likely to be formatted as the table below (as in log files). If the shared features are in a different table, it's easier to left join the shared features table to the original table (with action labels/features) than it is to insert a new line below each index (that has shared parameters).
What do you think ?
Ok ! So the formula could be:
DFtoVW(namespaces=Namespace([Feature(Col("a")), Feature(Col("b")), Feature(Col("c"))]),
       label=CBLabel(cost=Col("cost"), proba=Col("proba")),
       multilines=MultiLines(
           id=Col("index"),
           shared=Share(id=Col("shared"), features=[Feature(Col("shared_1")), Feature(Col("shared_2"))])
       )
)
(The
What do you think ?
Yes, it is sparse so it doesn't matter if features are or aren't supplied for each individual example. It looks like in the second table you provided the id's don't match to what . You wouldn't have two shared examples with the same ID. I would expect this table to read instead as:
Okay this is really interesting. So, if you think about multiline example formula definitions then I see three things as required:
So if I convert the example you gave to follow this in a generic manner then I came up with:
DFtoVW(
    multilines=True,
    formulae={
        "action":
            Formula(namespaces=Namespace(
                        [Feature(Col("a")),
                         Feature(Col("b")),
                         Feature(Col("c"))]),
                    label=CBLabel(cost=Col("cost"), proba=Col("proba"))),
        "shared":
            Formula(features=[Feature(Col("shared_1")),
                              Feature(Col("shared_2"))],
                    label=CBLabel(shared=True))
    },
    id=Col("index"),
    typeMap=Col("type"))
Do note though that now you need a column called "type" that has either "action" or "shared" in it. Also, wow we're getting pretty complex here, but I think it may be necessary to allow full expressiveness?
Ok, perfect :)
What I called
Hm, I get that you'd like a clear separation between 'action' and 'shared' but I see the following drawbacks :
More importantly it seems quite different from the previous interface. I think the progression toward greater complexity should appear as seamless as possible for the user. For instance, from really simple to hell :
DFtoVW(label=SimpleLabel(Col("y")),
       namespaces=Namespace(Feature(Col("x"))))
DFtoVW(label=CBLabel(action=Col("a"), prob=Col("p"), cost=Col("c")),
       namespaces=Namespace([Feature(Col("feat1")), Feature(Col("feat2"))]))
DFtoVW(label=CBLabel(action=Col("a"), prob=Col("p"), cost=Col("c")),
       namespaces=Namespace([Feature(Col("a")), Feature(Col("b")), Feature(Col("c"))]),
       multilines=MultiLines(id=Col("index")))
DFtoVW(label=CBLabel(cost=Col("cost"), proba=Col("proba")),
       namespaces=Namespace([Feature(Col("a")), Feature(Col("b")), Feature(Col("c"))]),
       multilines=MultiLines(
           id=Col("index"),
           shared=Share(id=Col("shared"), features=[Feature(Col("shared_1")), Feature(Col("shared_2"))])
       )
)
What do you think ?
I do not get why it is needed if shared has its dedicated columns (
haha yes 😄 but since the expected format is complex too, better this than frustrate the user (for the time spent to understand the format and do formatting + the risk of ill-formatted file) ! |
Let's forget about multiline for now. It seems natural for users to already have a representation that is compatible with the simple label, but it seems like a stretch that someone would already have a structure compatible with multiline in their dataframes. Don't get me wrong, the work you've done is great, but we can't seem to justify the need to convert multiline for now. Let's keep the scope of this PR limited and iterate before growing out the design. Does that sound good @etiennekintzler?
I've included the modifications asked ! In my opinion the only dark spots that remain are the use of SimpleLabel and the type checking. I will try to clarify these two points below.
SimpleLabel has an unique attribute What could be done instead is to just remove the
Also the type checking done in |
I think that's okay for now. We can add more strict type checking for this at a later point. |
Ok ! So I will create a Tag class to differentiate from the SimpleLabel. Is that ok for you ?
Yes I think too. Thinking of the overall design (so maybe not for this PR) maybe the reference to a column of the dataframe should be the default behavior, and constant values would be supplied using |
Since tag is just a string I think either a Col("") or a string can be provided directly? Yeah potentially about Const vs Col. Making the binding behavior explicit makes sense to me but I can also see your point |
If the tag is just a string, it would be the same tag for all examples, which is wrong, no ? Can't it just be a Col ?
ok |
As a user I kind of expect that each of the places I can use |
…ctor, make not found columns method more explicit
This pull request introduces 1 alert when merging 8fff168 into 388d551 - view on LGTM.com new alerts:
Thanks so much for all of the work you've done here! Just need to fix CI and the warning and then I think we are good to get this in!
… collections import
Glad to contribute ! I fixed the problem mentioned in the previous CI job and removed the unused import. However the current Linux CI job started 18h ago and is still in progress (while it usually finishes after 10-20 minutes). The tests seem to have passed though : https://dev.azure.com/vowpalwabbit/Vowpal%20Wabbit/_build/results?buildId=9662&view=results
Looks like the status check never called back to say it was done. Can you please push an empty commit to retrigger CI? |
Done ! |
conv = DFtoVW(
    label=SimpleLabel(Col("y")),
    tag=Col("idx"),
    namespaces=Namespace([Feature(name="", value=2)]),
Are empty feature names allowed on VW's input at all? cc @jackgerrits
I understood that it is, according to @jackgerrits's answer:
no space allow at left/right of ":" (or "*" as I saw this character in previous version). For example : "a :b" will raise error while "a:b" is of course ok
This is actually permitted. If you supply something like
| :1
then it means it is a single feature with a value of 1.0.
Yes
@etiennekintzler would you be able to update the PR description to match where we landed with the discussion? |
Thanks for the contribution, discussion and awesome work you've done here @etiennekintzler
Yep I am on it ! Just fixing one last anomaly when the method process_df is called multiple times and then push again. |
Done ! Let me know if you want more detailed explanation !
Thanks I really appreciate it ! |
This is awesome @etiennekintzler! Thanks for all of your hard work here and congrats on your first PR into VW 😄 |
Thanks a lot, I do believe that easier integration with python and its ecosystem (pandas, sklearn) will widen the user base ! NB: The class name I used Also the class Col and functions _get_col_or_value _get_all_cols and _check_type are duplicated in the source file. |
1. Overview
The goal of this PR is to fix the issue #2308.
The PR introduces a new class DFtoVW in vowpalwabbit.pyvw that takes as input the pandas.DataFrame and special types (SimpleLabel, Feature, Namespace) that specify the desired VW conversion.
These classes make extensive use of a class Col that refers to a given column in the user specified dataframe.
A simpler interface DFtoVW.from_colnames can also be used for the simple use-cases. The main benefit is that the user need not use the specific types.
Below are some usages of this class. They all rely on the following pandas.DataFrame called df:
2. Simple usage using DFtoVW.from_colnames
Let's say we want to build a VW dataset with the target need_new_roof and the feature age:
Then we can use the method process_df:
that outputs the following list:
This list can then directly be consumed by the method pyvw.model.learn.
3. Advanced usages using default constructor
The class DFtoVW also allows the following patterns in its default constructor:
To use these more complex patterns we need to import them using:
3.1. Named namespace with scaling, and named feature
Let's create a VW dataset that include a named namespace (with scaling) and a named feature:
which yields:
3.2. Multiple namespaces, multiple features, and tag
Let's create a more complex example with a tag and multiple namespaces with multiple features.
which yields:
4. Implementation details
DFtoVW and the specific types are located in vowpalwabbit/pyvw.py. The class only depends on the pandas module.
tests/test_pyvw.py
5. Extensions
chunksize and process each chunk at a time. I could also implement this functionality directly in the class using a generator. The generator would then either be consumed by a VW learning interface or be written to an external file (for conversion purposes only).
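A rough sketch of the generator idea, under the assumption that the data can be sliced into chunks of rows (for instance the DataFrame chunks that pandas.read_csv yields when given chunksize). The names iter_vw_lines and convert are hypothetical, and rows are plain dicts here to keep the example self-contained.

```python
def iter_vw_lines(rows, chunksize, convert):
    """Lazily yield VW text lines, converting `chunksize` rows at a time."""
    for start in range(0, len(rows), chunksize):
        chunk = rows[start:start + chunksize]
        for line in convert(chunk):
            yield line


def convert(chunk):
    # Hypothetical converter: one "label | feature" line per row.
    return ["{} | {}".format(row["y"], row["a"]) for row in chunk]


rows = [{"y": -1, "a": 1}, {"y": -1, "a": 2}, {"y": 1, "a": 3}]
lines = list(iter_vw_lines(rows, chunksize=2, convert=convert))
print(lines)
```

The same generator could feed a learner one line at a time or be written out with a simple loop over a file handle, so only one chunk needs to be held in memory.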