Pandas to vw text format #2426

Merged
Changes from 11 commits
Commits
25 commits
1774056
Add class that converts pandas.DataFrame to VW input format
etiennekintzler May 1, 2020
76b62e0
fix docstring
etiennekintzler May 1, 2020
566a81e
fix typo in test_pyvw.py
etiennekintzler May 1, 2020
3d00f04
fix docstring in pyvw.py
etiennekintzler May 1, 2020
55043af
add test to DataFrameToVW to test conversion when no target is presen…
etiennekintzler May 1, 2020
31aa448
specify col in formula using {}, enable more freedom in formatting, c…
etiennekintzler May 5, 2020
ae870db
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 5, 2020
a55705e
add check formula conformity + fix docstring. Add test for absent col…
etiennekintzler May 5, 2020
58c6017
Merge branch 'pandas_to_vw_text_format' of github.com:etiennekintzler…
etiennekintzler May 5, 2020
4d22355
fix pattern to allow decimal value
etiennekintzler May 5, 2020
66de092
fix typo in docstring of DataFrameToVW.__init__
etiennekintzler May 5, 2020
13a4441
create class based formula for the conversion of datafame to vw input…
etiennekintzler May 13, 2020
6eed274
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 13, 2020
8956151
remove abc class, did simple functions instead of inheriting from For…
etiennekintzler May 15, 2020
17dbda1
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 15, 2020
f4329c3
fix typo on import DFtoVW class
etiennekintzler May 15, 2020
d455cce
handle the different init for OrderedDict in python 2.7
etiennekintzler May 15, 2020
d280994
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 20, 2020
b469318
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 21, 2020
e1f1f56
clean docstring and fix typos, add undescore for internal function
etiennekintzler May 21, 2020
b852caf
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 22, 2020
8fff168
simplify tag parameter, add type checking for 'from_colnames' constru…
etiennekintzler May 22, 2020
ac6bd4e
fix type checking for x in 'from_colnames' constructor, remove unused…
etiennekintzler May 22, 2020
736a569
change name of function process_label_and_value to process_label_and_tag
etiennekintzler May 26, 2020
883f256
fix anomaly when calling process_df multiple times
etiennekintzler May 26, 2020
48 changes: 48 additions & 0 deletions python/tests/test_pyvw.py
@@ -2,7 +2,9 @@

from vowpalwabbit import pyvw
from vowpalwabbit.pyvw import vw
from vowpalwabbit.pyvw import DataFrameToVW
import pytest
import pandas as pd

BIT_SIZE = 18

@@ -344,3 +346,49 @@ def check_error_raises(type, argument):
    """
    with pytest.raises(type) as error:
        argument()


def test_oneline_simple_conversion():
    df = pd.DataFrame({"y": [1], "x": [2]})
    conv = DataFrameToVW(df, "{y} | {x}")
    lines_list = conv.process_df()
    first_line = lines_list[0]
    assert first_line == "1 | 2"


def test_oneline_with_column_renaming_and_tag():
    df = pd.DataFrame({"idx": ["id_1"], "y": [1], "x": [2]})
    conv = DataFrameToVW(df, "{y} {idx}| col_x:{x}")
    lines_list = conv.process_df()
    first_line = lines_list[0]
    assert first_line == "1 id_1| col_x:2"


def test_multiple_lines_conversion():
    df = pd.DataFrame({"y": [1, -1], "x": [1, 2]})
    conv = DataFrameToVW(df, "{y} | {x}")
    lines_list = conv.process_df()
    assert lines_list == ["1 | 1", "-1 | 2"]


def test_oneline_with_multiple_namespaces():
    df = pd.DataFrame({"y": [1], "a": [2], "b": [3]})
    conv = DataFrameToVW(df, "{y} |FirstNameSpace {a} |DoubleIt:2 {b}")
Collaborator

We could also consider an api for DataFrameToVW that takes in some sort of spec of what we want to output. Just a sketch not actual code:

[
    Namespace(name="FirstNamespace", features=[Feature(name=at("a"))]),
    Namespace(name="DoubleIt", value=2, features=[Feature(name=at("b"))])
]

Contributor Author

@etiennekintzler May 6, 2020

Yes @lalo, I think this is good too!
To sum up, take the following formula (which is a bit more complex, to cover more cases):
"{y} {tag_col}|FirstNameSpace {a} {b}:2 |DoubleIt:2 ColC:{c} |Constant :1"

We could have something like:

DataFrameToVW(
    df,
    Targets=[Label(col("y")), Tag(col("tag_col"))],
    Features=[
        Namespace(name="FirstNameSpace",
                  features=[Feature(value=col("a")), Feature(name=col("b"), value=2)]),
        Namespace(name="DoubleIt", value=2,
                  features=[Feature(name="ColC", value=col("c"))]),
        Namespace(name="Constant",
                  features=[Feature(value=1)]),
    ])

In my opinion, the approach based on explicit arguments is clearer and does not require complex regex checking. However, it is more tedious to write, which can be a drawback if you have a lot of features.

The main benefit of the string-formula approach is that it is fast to write and familiar to people with a statistical background, who are used to R-like formulas (as in R or the Python package statsmodels).

What do you think @jackgerrits, @lalo?
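
A minimal sketch of how the explicit, argument-based interface discussed above could turn one row into a VW line. All names here (Col, Feature, Namespace, render_line) are hypothetical illustrations rather than the classes in this PR, and a plain dict stands in for a DataFrame row so the example has no dependencies.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Col:
    """Hypothetical marker: read the value from this column of the row."""
    name: str


@dataclass
class Feature:
    value: Any                   # a Col or a constant
    name: Optional[Any] = None   # a Col, a constant, or None for an unnamed feature


@dataclass
class Namespace:
    features: List[Feature]
    name: str = ""
    value: Optional[float] = None


def resolve(item, row):
    """Return the row's value for a Col, or the constant itself."""
    return row[item.name] if isinstance(item, Col) else item


def render_line(row, label, namespaces, tag=None):
    """Build one VW text line; the tag, if any, must touch the first '|'."""
    left = str(resolve(label, row)) + " "
    if tag is not None:
        left += str(resolve(tag, row))
    parts = []
    for ns in namespaces:
        header = "|" + ns.name
        if ns.value is not None:
            header += ":" + str(ns.value)
        feats = []
        for feat in ns.features:
            value = str(resolve(feat.value, row))
            if feat.name is None:
                feats.append(value)
            else:
                feats.append(str(resolve(feat.name, row)) + ":" + value)
        parts.append(" ".join([header] + feats))
    return left + " ".join(parts)


row = {"y": 1, "tag_col": "id_1", "a": 2, "b": "feat_b", "c": 3}
spec = [
    Namespace(name="FirstNameSpace",
              features=[Feature(value=Col("a")), Feature(name=Col("b"), value=2)]),
    Namespace(name="DoubleIt", value=2,
              features=[Feature(name="ColC", value=Col("c"))]),
]
print(render_line(row, Col("y"), spec, tag=Col("tag_col")))
# 1 id_1|FirstNameSpace 2 feat_b:2 |DoubleIt:2 ColC:3
```

The trade-off mentioned above is visible here: the spec is longer than the equivalent formula string, but no regex is needed to interpret it.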

Collaborator

We could also add convenience methods, for example a default mapping that assumes no namespaces or values; for anything more complicated, the user has to define the explicit mapping. We could also add a filtered variant where you pass only the columns of the frame that should be converted into features.

Member

Yeah I agree convenience methods for common ways to map it can be useful.

I find the Targets value a little confusing as in the text format everything to the left of the first | is the label (minus the tag). Therefore I think there should be a top level property which takes a Label object of a specific label type. We would need to specify all of those, but we can easily just start with SimpleLabel type.

You could also have the string formula generate this programmatic formula. That could be another nice convenience extension. Maintaining the regex may not be desirable, though.

Collaborator

Again with pseudocode to sketch out the api:

we could have something like

DataFrameToVW(
    df,
    VWMappingDefault(df.columns))

or

DataFrameToVW(
    df,
    VWMappingDefault(filter(df.columns, ["a", "b"])))  # in case df has more columns we don't care about

Then we would have to make sure that VWMappingDefault just generates a structure with no namespaces, no values, etc., simply turning each column's value into a feature.

Another benefit of leaning on extra types is that if we change or update the input format, we don't break all the users of the api (e.g. changing the '|' to '#', not that this will ever happen). We would be able to transform their types into the new input format cleanly.

Contributor Author

@etiennekintzler May 6, 2020

> We could also add convenience methods, for example a default mapping that assumes no namespaces or values; for anything more complicated, the user has to define the explicit mapping.

You mean no feature name? OK for the default mapping!

> I find the Targets value a little confusing as in the text format everything to the left of the first | is the label (minus the tag). Therefore I think there should be a top level property which takes a Label object of a specific label type. We would need to specify all of those, but we can easily just start with SimpleLabel type.

According to the vowpal input format wiki, a line looks like [Label] [Importance] [Base] [Tag]|Namespace Features |Namespace Features ... |Namespace Features. So to the left of the first | there can also be Importance and Base. Also, as you mentioned earlier, there is the polylabel union (for CB).

We could define the following properties/types for the LHS of the first "|": Label, Importance, Base, Tag and UnionLabel/PolyLabel (for the CB case), and the following for the RHS of the first "|": Namespace, Feature. What do you think?

> Again with pseudocode to sketch out the api:
>
> we could have something like
>
> DataFrameToVW(
>     df,
>     VWMappingDefault(df.columns))
>
> or
>
> DataFrameToVW(
>     df,
>     VWMappingDefault(filter(df.columns, ["a", "b"])))  # in case df has more columns we don't care about
>
> Then we would have to make sure that VWMappingDefault just generates a structure with no namespaces, no values, etc., simply turning each column's value into a feature.

Sounds good! By default, would the first column of the DataFrame be the target? Alternatively, we could ask for the column name of the target and the list of feature column names, as in VWMappingDefault(y="y", x=["a", "b"]).

> Another benefit of leaning on extra types is that if we change or update the input format, we don't break all the users of the api (e.g. changing the '|' to '#', not that this will ever happen). We would be able to transform their types into the new input format cleanly.

Yes, I totally agree!
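
To make the default-mapping idea concrete, here is a small sketch of what such a helper might produce when given a target column name and a list of feature column names. The function name default_mapping_lines is hypothetical, and a list of dicts stands in for the DataFrame; the output uses no namespace names and no feature names, as proposed above.

```python
def default_mapping_lines(rows, y, x):
    """Render '<target> | <feature values>' for each row.

    rows: list of dicts standing in for DataFrame rows (an assumption of this sketch)
    y:    name of the target column
    x:    names of the feature columns; any other column is ignored
    """
    lines = []
    for row in rows:
        features = " ".join(str(row[col]) for col in x)
        lines.append("{} | {}".format(row[y], features))
    return lines


rows = [
    {"y": -1, "a": 1, "c": 4.5, "unused": "foo"},
    {"y": 1, "a": 3, "c": 9.6, "unused": "bar"},
]
print(default_mapping_lines(rows, y="y", x=["a", "c"]))
# ['-1 | 1 4.5', '1 | 3 9.6']
```

Passing the column names explicitly avoids relying on column order to decide which column is the target.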

Member

[Label] [Importance] [Base] [Tag]| is only true for simple_label. Unfortunately, this is a case of the wiki not quite being up to date.

LHS of the first | will depend on the label type, and therefore it must be specific per label type. This is why I suggest providing a rich label object and the tag. But RHS of the first | will be the same for every label type.
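
One way to picture this split is a sketch in which each label type renders its own left-hand side while the feature side is shared. The class names SimpleLabel and CBLabel and the helper to_line are hypothetical, the simple label only covers the label and an optional importance, and a dict stands in for a DataFrame row.

```python
class SimpleLabel:
    """Hypothetical simple label: '[Label] [Importance]' read from columns."""

    def __init__(self, label, importance=None):
        self.label = label
        self.importance = importance

    def render(self, row):
        parts = [row[self.label]]
        if self.importance is not None:
            parts.append(row[self.importance])
        return " ".join(str(part) for part in parts)


class CBLabel:
    """Hypothetical contextual-bandit label: 'action:cost:probability'."""

    def __init__(self, action, cost, probability):
        self.action = action
        self.cost = cost
        self.probability = probability

    def render(self, row):
        return "{}:{}:{}".format(
            row[self.action], row[self.cost], row[self.probability]
        )


def to_line(row, label, feature_cols):
    """Label-specific left-hand side, identical right-hand side for every label type."""
    features = " ".join(str(row[col]) for col in feature_cols)
    return "{} | {}".format(label.render(row), features)


row = {"y": -1, "w": 0.5, "a": 1, "c": 4.5, "p": 0.25, "b": 7}
print(to_line(row, SimpleLabel("y", importance="w"), ["b"]))  # -1 0.5 | 7
print(to_line(row, CBLabel("a", "c", "p"), ["b"]))            # 1:4.5:0.25 | 7
```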

Contributor Author

> [Label] [Importance] [Base] [Tag]| is only true for simple_label. Unfortunately, this is a case of the wiki not quite being up to date.

Another case is the pattern for CB (action:cost:probability | features). Are there other cases?

> LHS of the first | will depend on the label type, and therefore it must be specific per label type. This is why I suggest providing a rich label object and the tag. But RHS of the first | will be the same for every label type.

I have been thinking about the design and I could do the following: an abstract class FeatureHandler with an abstract method process and a concrete method get_col_or_value. The following two classes would inherit from this abstract class:

  • SimpleLabel, which has an attribute name and implements process.
  • Feature, which has attributes name and value and implements process.

In both classes, the attributes name (and value) can receive either an object of type col (which specifies the column to extract, as in col("a")) or a constant value that is used as is. The concrete method get_col_or_value extracts the column from the dataframe (if given a col) or builds a column that repeats the value (if given a constant).

The implementation of the process method builds the appropriate column of strings (pandas.Series) according to the type of the object (SimpleLabel, Feature or PolyLabel).

A third class, PolyLabel, with attributes action, cost and probability, can easily be added.

What do you think?

Contributor Author

@etiennekintzler May 8, 2020

I thought about it and simplified the classes (no abstract class, and no Label/Importance/Base/Tag subtypes).

Here is the UML design of the classes I wrote:

[image: UML class diagram]

Usages

Let's build a toy dataset

import numpy as np
import pandas as pd

df_toy = pd.DataFrame(
    {
        "y": [-1, -1, 1],
        "p": np.random.uniform(size=3),
        "c": [4.5, 6.7, 9.6],
        "a": [1, 2, 3],
        "b": np.random.normal(size=3)
    }
)
# out
   y         p    c  a         b
0 -1  0.525440  4.5  1  0.616254
1 -1  0.262586  6.7  2 -1.133934
2  1  0.705830  9.6  3 -0.452018

1. Using just column names to specify the target and the features
Same idea as @lalo's VWMappingDefault. We call the method DFtoVW.from_colnames:

DFtoVW.from_colnames(y="y", X =["a"], df=df_toy).process_df()
# out
['-1 | 1', '-1 | 2', '1 | 3']


DFtoVW.from_colnames(y="y", X = set(df_toy) - set(["y", "p"]), df=df_toy).process_df()
# out
['-1 | 0.6162539137336357 1 4.5',
 '-1 | -1.1339344282053312 2 6.7',
 '1 | -0.45201750087182024 3 9.6']


DFtoVW.from_colnames(y=["a", 'c', "p"], X =["b"], df=df_toy, poly_label=True).process_df()
# out
['1:4.5:0.6121907426717043 | -0.43104977544310435',
 '2:6.7:0.6773696976137632 | 2.45363493233382',
 '3:9.6:0.5955350558877885 | 1.2190748658325201']

2. Using the interface that we talked about

DFtoVW(label=SimpleLabel(Col("y")), 
       namespaces=Namespace(features=[Feature(value=Col("a"))]),
       df=df_toy).process_df()
# out
['-1 | 1', '-1 | 2', '1 | 3']


complex_namespaces = [
    Namespace(name="FirstNamespace", features=[Feature(name="ColA", value=Col("a"))]),
    Namespace(name="DoubleIt", value=2, features=[Feature(value=Col("b"))])
]
DFtoVW(label=SimpleLabel(Col("y")), 
       namespaces=complex_namespaces, 
       df=df_toy).process_df()
# out
['-1 |FirstNamespace ColA:1 |DoubleIt:2 0.6162539137336357',
 '-1 |FirstNamespace ColA:2 |DoubleIt:2 -1.1339344282053312',
 '1 |FirstNamespace ColA:3 |DoubleIt:2 -0.45201750087182024']


DFtoVW(label=PolyLabel(action=Col("a"), cost=Col("c"), proba=Col("p")),
       namespaces=Namespace(features=[Feature(value=Col("b"))]),
       df=df_toy).process_df()
# out
['1:4.5:0.6121907426717043 | -0.43104977544310435',
 '2:6.7:0.6773696976137632 | 2.45363493233382',
 '3:9.6:0.5955350558877885 | 1.2190748658325201']

Is this class design okay for you?

    lines_list = conv.process_df()
    first_line = lines_list[0]
    assert first_line == "1 |FirstNameSpace 2 |DoubleIt:2 3"


def test_oneline_without_target():
    df = pd.DataFrame({"a": [2], "b": [3]})
    conv = DataFrameToVW(df, "| {a} {b}")
    lines_list = conv.process_df()
    first_line = lines_list[0]
    assert first_line == "| 2 3"


def test_absent_col_error():
    with pytest.raises(ValueError) as value_error:
        df = pd.DataFrame({"a": [1]})
        conv = DataFrameToVW(df, "{a} | {b} {c}")
    assert "Column(s) 'b', 'c' not in the DataFrame" == str(value_error.value)
122 changes: 122 additions & 0 deletions python/vowpalwabbit/pyvw.py
@@ -4,6 +4,8 @@
from __future__ import division
import pylibvw
import warnings
import pandas as pd
import re

class SearchTask():
    """Search task class"""
@@ -1354,3 +1356,123 @@ def get_label(self, label_class=simple_label):
        simple_label
        """
        return label_class(self)


class DataFrameToVW:
    """DataFrameToVW class"""

    # Captures the column name inside each {column} placeholder
    re_parse_col = re.compile(pattern="{([^{}]*)}")

    # Raw strings avoid invalid-escape warnings for \w and \s
    feature_name_pattern = r"(?:\w+[:*])"
    feature_value_pattern = "{[^{}]+}"
    const_value_pattern = r"[\w.]+"
    re_check_formula = re.compile(
        r"(?:\s*\|?\s*{}?(?:{}|{})\s*)*".format(
            feature_name_pattern, feature_value_pattern, const_value_pattern
        )
    )

    def __init__(self, df, formula):
Collaborator

Just a thought: is there a way we can deduce or use types to drive this string formula? Maybe using reflection.

Member

It may be worth pursuing a sort of code based approach to defining the formula:

"{y} {idx}|test_ns col_x:{x}"

Might become:

formula = Formula(label=SimpleLabel(ColumnBinding("y")), tag=ColumnBinding("idx"), namespaces=[Namespace(name="test_ns", features=[Feature(name="col_x", value=ColumnBinding("x"))])])

It is significantly more text to write out, but it may be easier to construct for a newcomer who has an IDE at their disposal. Thoughts?

        """
        Convert a pandas DataFrame to the vowpal wabbit format defined by the user in the formula parameter.
        Formula is a string where the feature value of a given column is specified using
        the curly braces syntax (e.g. {name_of_the_column}). The parts of the formula not
        enclosed in curly braces are considered constant and repeated on each line. See examples
        for more details.

        The following column names cannot be used in the formula:
        - column names that contain the character '{' or '}'
        - the empty string ''

        Parameters
        ----------
        df : pandas.DataFrame
            The DataFrame to convert
        formula : str
            The formula specifying the desired vowpal wabbit input format.

        Examples
        --------

        >>> from vowpalwabbit.pyvw import DataFrameToVW
        >>> import pandas as pd
        >>> df = pd.DataFrame({"y": [0], "x": [1]})
        >>> conv = DataFrameToVW(df, "{y} | {x}")
        >>> vw_lines = conv.process_df()

        >>> df2 = pd.DataFrame({"y": [0], "x": [1], "z": [2]})
        >>> conv2 = DataFrameToVW(df2, '{y} |AllFeatures {x} {z}')
        >>> vw_lines2 = conv2.process_df()

        Returns
        -------
        self: DataFrameToVW
        """
        self.df = df
        self.n_rows = df.shape[0]
        self.column_names = set(df.columns)
        # Collapse runs of whitespace so the formula has a canonical form
        self.formula = re.sub(r"\s+", " ", formula).strip()
        self.check_formula()
        self.check_absent_cols()

    def check_formula(self):
        """
        Check if formula is of appropriate format
        """
        match = self.re_check_formula.match(self.formula)
        valid_formula = match.group() == self.formula
        if not valid_formula:
            valid_part = self.formula[: match.end()]
            invalid_part = self.formula[match.end() :]
            raise ValueError(
                "Error parsing formula.\nValid: '{}'\nNot valid: '{}'".format(
                    valid_part, invalid_part
                )
            )

    def check_absent_cols(self):
        """
        Helper function that checks whether any column specified in the formula is missing
        from the DataFrame.

        Raises
        ------
        ValueError
            If a column specified in the formula does not exist in the dataframe

        """

        all_cols = self.re_parse_col.findall(self.formula)
        absent_cols = [col for col in all_cols if col not in self.column_names]
        if absent_cols:
            # str(list)[1:-1] drops the surrounding brackets: "'b', 'c'"
            absent_cols_str = str(absent_cols)[1:-1]
            raise ValueError(
                "Column(s) {} not in the DataFrame".format(absent_cols_str)
            )

    def process_df(self):
        """
        Convert pandas.DataFrame to a suitable vowpal wabbit input format

        Returns
        -------
        out
            The list of the lines of the DataFrame in vowpal wabbit input format

        """
        matches = list(self.re_parse_col.finditer(self.formula))
        # Reuse the DataFrame's index so the concatenation below aligns row by row
        # even when df does not have a default RangeIndex
        out = pd.Series([""] * self.n_rows, index=self.df.index)

        current_pos = 0
        for match in matches:
            col_name = match.group(1)
            start_pos, end_pos = match.span()
            # Constant text between the previous placeholder and this one
            str_part = self.formula[current_pos:start_pos]
            value_part = self.df[col_name].apply(str)
            out += str_part + value_part
            current_pos = end_pos
        # Trailing constant text after the last placeholder
        out += self.formula[current_pos : len(self.formula)]

        return out.to_list()
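
For experimenting with the formula syntax outside the class, the core steps can be reproduced on a single row with the standard library alone. This is only a sketch under stated assumptions: the names COLUMN_REF, FORMULA_CHECK, split_valid_prefix, missing_columns and render_formula are made up for illustration, a dict stands in for a DataFrame row, and the patterns are copied from DataFrameToVW.

```python
import re

# Same patterns as in DataFrameToVW, written out as raw strings
COLUMN_REF = re.compile(r"{([^{}]*)}")
FORMULA_CHECK = re.compile(r"(?:\s*\|?\s*(?:\w+[:*])?(?:{[^{}]+}|[\w.]+)\s*)*")


def split_valid_prefix(formula):
    """Return (valid_part, invalid_part); invalid_part is '' for a valid formula."""
    end = FORMULA_CHECK.match(formula).end()
    return formula[:end], formula[end:]


def missing_columns(formula, columns):
    """List the placeholders in the formula that are not known column names."""
    return [name for name in COLUMN_REF.findall(formula) if name not in columns]


def render_formula(formula, row):
    """Substitute each {column} placeholder with the row's value."""
    return COLUMN_REF.sub(lambda m: str(row[m.group(1)]), formula)


print(split_valid_prefix("{y} | {x} $bad"))       # ('{y} | {x} ', '$bad')
print(missing_columns("{a} | {b} {c}", {"a"}))    # ['b', 'c']
print(render_formula("{y} {idx}| col_x:{x}", {"idx": "id_1", "y": 1, "x": 2}))
# 1 id_1| col_x:2
```

The class applies the same substitution column by column with pandas.Series concatenation instead of row by row.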