Pandas to vw text format #2426
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,8 @@ | |
from __future__ import division | ||
import pylibvw | ||
import warnings | ||
import pandas as pd | ||
import re | ||
|
||
class SearchTask(): | ||
"""Search task class""" | ||
|
@@ -1354,3 +1356,123 @@ def get_label(self, label_class=simple_label): | |
simple_label | ||
""" | ||
return label_class(self) | ||
|
||
|
||
class DataFrameToVW: | ||
"""DataFrameToVW class""" | ||
|
||
re_parse_col = re.compile(pattern="{([^{}]*)}") | ||
|
||
feature_name_pattern = r"(?:\w+[:*])" | ||
feature_value_pattern = r"{[^{}]+}" | ||
const_value_pattern = r"[\w.]+" | ||
re_check_formula = re.compile( | ||
r"(?:\s*\|?\s*{}?(?:{}|{})\s*)*".format( | ||
feature_name_pattern, feature_value_pattern, const_value_pattern | ||
) | ||
) | ||
|
||
def __init__(self, df, formula): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just as a thought, is there a way we can deduce or use types to drive this string formula? maybe using reflection There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It may be worth pursuing a sort of code based approach to defining the formula:
Might become: formula = Formula(label=SimpleLabel(ColumnBinding("y")), tag=ColumnBinding("idx"), namespaces=[Namespace(name="test_ns", features=[Feature(name="col_x", value=ColumnBinding("x"))])] It is significantly more text to write out, but it may be easier to construct for a newcomer that has an IDE at their disposal. Thoughts? |
||
""" | ||
Convert a pandas DataFrame to the vowpal wabbit format defined by the user in formula parameter. | ||
Formula is a string where the feature value of a given column is specified using | ||
the curly braces syntax (e.g: {name_of_the_column}). The part of the formula not specified | ||
in curly braces will be considered constant and repeated on each line. See examples | ||
for more details. | ||
|
||
The following column names cannot be used in the formula : | ||
- column names that contain the character '{' or '}' | ||
- the empty string '' | ||
|
||
Parameters | ||
---------- | ||
df : pandas.DataFrame | ||
The DataFrame to convert | ||
formula : str | ||
The formula specifying the desired vowpal wabbit input format. | ||
|
||
Examples | ||
-------- | ||
|
||
>>> from vowpalwabbit import DataFrameToVW | ||
>>> import pandas as pd | ||
>>> df = pd.DataFrame({"y": [0], "x": [1]}) | ||
>>> conv = DataFrameToVW(df, "{y} | {x}") | ||
>>> vw_lines = conv.process_df() | ||
|
||
>>> df2 = pd.DataFrame({"y": [0], "x": [1], "z": [2]}) | ||
>>> conv2 = DataFrameToVW(df2, '{y} |AllFeatures {x} {z}') | ||
>>> vw_lines2 = conv2.process_df() | ||
|
||
Returns | ||
------- | ||
self: DataFrameToVW | ||
""" | ||
self.df = df | ||
self.n_rows = df.shape[0] | ||
self.column_names = set(df.columns) | ||
self.formula = re.sub(r"\s+", " ", formula).strip() | ||
self.check_formula() | ||
self.check_absent_cols() | ||
|
||
def check_formula(self): | ||
""" | ||
Check if formula is of appropriate format | ||
""" | ||
match = self.re_check_formula.match(self.formula) | ||
valid_formula = match.group() == self.formula | ||
if not valid_formula: | ||
valid_part = self.formula[: match.end()] | ||
invalid_part = self.formula[match.end() :] | ||
raise ValueError( | ||
"Error parsing formula.\nValid: '{}'\nNot valid: '{}'".format( | ||
valid_part, invalid_part | ||
) | ||
) | ||
|
||
def check_absent_cols(self): | ||
""" | ||
Helper function that checks whether any column specified in the formula is missing. | ||
The function raises a ValueError if any column is absent. | ||
|
||
Raises | ||
------ | ||
ValueError | ||
If the column specified in the formula does not exist in the dataframe | ||
|
||
""" | ||
|
||
all_cols = self.re_parse_col.findall(self.formula) | ||
absent_cols = [col for col in all_cols if col not in self.column_names] | ||
if any(absent_cols): | ||
absent_cols_str = str(absent_cols)[1:-1] | ||
raise ValueError( | ||
"Column(s) {} not in the DataFrame".format(absent_cols_str) | ||
) | ||
|
||
def process_df(self): | ||
""" | ||
Convert pandas.DataFrame to a suitable vowpal wabbit input format | ||
|
||
Returns | ||
------- | ||
out | ||
The list of the lines of the DataFrame in vowpal wabbit input format | ||
|
||
""" | ||
matches = list(self.re_parse_col.finditer(self.formula)) | ||
out = pd.Series([""] * self.n_rows) | ||
|
||
current_pos = 0 | ||
for match in matches: | ||
col_name = match.group()[1:-1] | ||
start_pos, end_pos = match.span() | ||
str_part = self.formula[current_pos:start_pos] | ||
value_part = self.df[col_name].apply(str) | ||
out += str_part + value_part | ||
current_pos = end_pos | ||
out += self.formula[current_pos : len(self.formula)] | ||
|
||
return out.to_list() | ||
|
||
|
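As a rough illustration of what `process_df` does, the sketch below applies the same `{column}` substitution row by row. It is not the PR's implementation: `render_formula` is a hypothetical helper, and a plain dict of lists stands in for the DataFrame (an assumption, so the example runs without pandas).

```python
import re

# Same column-reference pattern as re_parse_col in the diff above.
RE_PARSE_COL = re.compile(r"{([^{}]*)}")


def render_formula(columns, formula):
    """Render one VW text line per row.

    `columns` maps column name -> list of values and stands in for a
    DataFrame (assumption made to keep the sketch dependency-free).
    """
    # Collapse runs of whitespace, as the constructor in the diff does.
    formula = re.sub(r"\s+", " ", formula).strip()
    names = RE_PARSE_COL.findall(formula)
    missing = [name for name in names if name not in columns]
    if missing:
        raise ValueError("Column(s) {} not in the data".format(missing))
    n_rows = len(next(iter(columns.values()))) if columns else 0
    lines = []
    for i in range(n_rows):
        # Replace each {col} with the value of that column at row i;
        # everything outside the braces is copied unchanged.
        line = RE_PARSE_COL.sub(lambda m: str(columns[m.group(1)][i]), formula)
        lines.append(line)
    return lines


data = {"y": [0, 1], "x": [1, 3], "z": [2, 4]}
print(render_formula(data, "{y} |AllFeatures {x} {z}"))
# ['0 |AllFeatures 1 2', '1 |AllFeatures 3 4']
```

The PR builds the output column-wise with pandas string concatenation instead of looping over rows; the resulting lines should be the same for this input.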
We could also consider an api for DataFrameToVW that takes in some sort of spec of what we want to output. Just a sketch not actual code:
Yes @lalo, I think this is good too !
To sum up, for the following formula (which is a bit more complex, to cover more cases)
"{y} {tag_col}|FirstNameSpace {a} {b}:2 |DoubleIt:2 ColC:{c} |Constant :1"
We could have something like :
In my opinion, the approach of using extensive args is more explicit and does not require complex regex checking. However, it is more tedious to write, which can be a drawback if you have a lot of features.
Regarding the approach based on a string formula, the main benefit is that it is fast to write and familiar to people from a statistical background who are used to R-like formulas (as in R or the Python package statsmodels).
What do you think @jackgerrits, @lalo ?
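To make the trade-off concrete, here is one hypothetical shape the explicit-args approach could take for the formula above. None of these names (`Col`, `Feature`, `Namespace`, `Spec`) come from the PR; they are assumptions for illustration, and a row is a plain dict rather than a DataFrame row.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Col:
    """Reference to a column; any other value is treated as a constant."""
    name: str


Binding = Union[Col, str, int, float]


def resolve(binding, row):
    """Return the row's value for a Col, or the constant itself, as text."""
    if isinstance(binding, Col):
        return str(row[binding.name])
    return str(binding)


@dataclass
class Feature:
    name: Optional[Binding] = None
    value: Optional[Binding] = None

    def render(self, row):
        name = "" if self.name is None else resolve(self.name, row)
        if self.value is None:
            return name
        return "{}:{}".format(name, resolve(self.value, row))


@dataclass
class Namespace:
    name: str
    features: List[Feature]
    scale: Optional[float] = None

    def render(self, row):
        head = self.name if self.scale is None else "{}:{}".format(self.name, self.scale)
        return "|{} {}".format(head, " ".join(f.render(row) for f in self.features))


@dataclass
class Spec:
    label: Binding
    namespaces: List[Namespace]
    tag: Optional[Binding] = None

    def render(self, row):
        lhs = resolve(self.label, row)
        # As in the formula above, the tag sits directly against the first bar.
        sep = " " if self.tag is None else " " + resolve(self.tag, row)
        return lhs + sep + " ".join(ns.render(row) for ns in self.namespaces)


spec = Spec(
    label=Col("y"),
    tag=Col("tag_col"),
    namespaces=[
        Namespace("FirstNameSpace", [Feature(name=Col("a")), Feature(name=Col("b"), value=2)]),
        Namespace("DoubleIt", [Feature(name="ColC", value=Col("c"))], scale=2),
        Namespace("Constant", [Feature(value=1)]),
    ],
)
row = {"y": 1, "tag_col": "id7", "a": "red", "b": "tall", "c": 0.5}
print(spec.render(row))
# 1 id7|FirstNameSpace red tall:2 |DoubleIt:2 ColC:0.5 |Constant :1
```

The spec is roughly four times longer than the one-line formula, which is the verbosity cost mentioned above; in exchange, no regex validation of a formula string is needed.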
We could also add convenience methods: for example, a default mapping that assumes no namespaces or values; for anything more complicated than that, someone has to define the explicit mapping. That, and a filtered one where you send only the columns from the frame that convert into features.
Yeah I agree convenience methods for common ways to map it can be useful.
I find the `Targets` value a little confusing, as in the text format everything to the left of the first `|` is the label (minus the tag). Therefore I think there should be a top-level property which takes a Label object of a specific label type. We would need to specify all of those, but we can easily just start with the `SimpleLabel` type.
You could also have the string formula generate this programmatic formula. Could be another nice convenience extension, maybe? Maintaining the regex may not be desirable, though.
Again with pseudocode to sketch out the api:
we could have something like
or
Then we would have to make sure that VWMappingDefault just generates the structure with no namespace, no values, etc., simply turning whatever value is in the column into a feature.
Another benefit of leaning on extra types is that if we change or update the input format, we don't break all the users of the API (e.g. changing the '|' to '#', not that this will ever happen). We would be able to transform their types into the new input format cleanly.
You mean no feature name? OK for the default mapping!
In the vowpal input format wiki, the line format is `[Label] [Importance] [Base] [Tag]|Namespace Features |Namespace Features ... |Namespace Features`. So to the left of the first `|` there can also be Importance and Base. Also, as you mentioned earlier, there is the polylabel union (for CB).
We could define, for the LHS of the first "|", the following properties/types: `Label`, `Importance`, `Base`, `Tag` and `UnionLabel`/`PolyLabel` (for the CB case), and the following properties for the RHS of the first "|": `Namespace`, `Feature`. What do you think?
Sounds good! Then by default the first column of the DataFrame would be the target? Alternatively, we could ask the user to supply the column name of the target and the list of names of the features, as in `VWMappingDefault(y="y", x=["a", "b"])`.
Yes, I totally agree!
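One possible reading of that default mapping, sketched as a helper that only builds the formula string from column names. `default_formula` is a hypothetical name, not part of the PR, and the output assumes the "no namespace, no feature name" interpretation discussed above.

```python
def default_formula(y, x):
    """Build a formula with one label column and un-namespaced features.

    `y` is the target column name and `x` the list of feature column names,
    mirroring the VWMappingDefault(y="y", x=["a", "b"]) call suggested above.
    """
    if not x:
        raise ValueError("at least one feature column is required")
    for name in [y] + list(x):
        # Same restrictions as the docstring in the diff: no braces, no empty name.
        if name == "" or "{" in name or "}" in name:
            raise ValueError("invalid column name: {!r}".format(name))
    features = " ".join("{{{}}}".format(col) for col in x)
    return "{{{}}} | {}".format(y, features)


print(default_formula(y="y", x=["a", "b"]))
# {y} | {a} {b}
```

The returned string could then be passed to the formula-based converter, so the convenience method would not need its own rendering logic.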
`[Label] [Importance] [Base] [Tag]|` is only true for `simple_label`. Unfortunately, this is a case of the wiki not quite being up to date.
The LHS of the first `|` will depend on the label type, and therefore it must be specific per label type. This is why I suggest providing a rich label object and the tag. But the RHS of the first `|` will be the same for every label type.
Another case is the pattern for CB (`action:cost:probability | features`). Are there other cases?
I have been thinking about the design, and I could do the following: an abstract class `FeatureHandler` with an abstract method `process` and a concrete method `get_col_or_value`. The following two classes will inherit from this abstract class:
- `SimpleLabel`, which has an attribute name and implements process.
- `Feature`, which has attributes name and value and implements process.
In both classes, the attributes name (and value) can receive either an object of type col (which specifies the column to extract, as in col("a")) or a value that will be used as is. The concrete method `get_col_or_value` will extract the column from the dataframe (if col) or build a column with the repeated value if a value is passed.
The implementation of the `process` method will build the appropriate column of strings (pandas.Series) according to the type of the object (SimpleLabel, Feature or PolyLabel).
A third class `PolyLabel` with attributes action, cost, probability can easily be added.
What do you think of it?
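A minimal sketch of that class design, to check that the pieces fit together. The class and method names follow the comment above, but the bodies and signatures are assumptions, and lists of strings in a plain dict stand in for `pandas.Series` columns so the example runs without pandas.

```python
from abc import ABC, abstractmethod


class col:
    """Marks a string as a column name rather than a constant value."""

    def __init__(self, name):
        self.name = name


class FeatureHandler(ABC):
    def get_col_or_value(self, data, arg, n_rows):
        """Return one string per row: the column if `arg` is a col,
        otherwise the constant repeated n_rows times."""
        if isinstance(arg, col):
            return [str(v) for v in data[arg.name]]
        return [str(arg)] * n_rows

    @abstractmethod
    def process(self, data, n_rows):
        """Build the column of strings for this element."""


class SimpleLabel(FeatureHandler):
    def __init__(self, name):
        self.name = name

    def process(self, data, n_rows):
        return self.get_col_or_value(data, self.name, n_rows)


class Feature(FeatureHandler):
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def process(self, data, n_rows):
        names = self.get_col_or_value(data, self.name, n_rows)
        values = self.get_col_or_value(data, self.value, n_rows)
        return ["{}:{}".format(n, v) for n, v in zip(names, values)]


class PolyLabel(FeatureHandler):
    """CB-style label rendered as action:cost:probability."""

    def __init__(self, action, cost, probability):
        self.parts = (action, cost, probability)

    def process(self, data, n_rows):
        cols = [self.get_col_or_value(data, p, n_rows) for p in self.parts]
        return [":".join(triple) for triple in zip(*cols)]


data = {"y": [0, 1], "x": [0.5, 1.5], "a": [1, 2], "c": [0.1, 0.9], "p": [0.5, 0.25]}
print(SimpleLabel(col("y")).process(data, 2))              # ['0', '1']
print(Feature("x_feat", col("x")).process(data, 2))        # ['x_feat:0.5', 'x_feat:1.5']
print(PolyLabel(col("a"), col("c"), col("p")).process(data, 2))  # ['1:0.1:0.5', '2:0.9:0.25']
```

Because every handler returns a column of strings, a converter could assemble a full line by joining the label column, a `|`, and the feature columns, regardless of the label type.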
I thought of it and simplified the classes (no abstract class, and no subtypes Label/Importance/Base/Tag).
Here is the UML design of the classes I wrote :
Usages
Let's build a toy dataset
1. Using just colnames to specify target and features
Same idea as @lalo's VWMappingDefault. We call the method `DFtoVW.from_colnames`:
2. Using the interface that we talked about
Is that class design okay for you ?