Pandas to vw text format #2426

Merged
25 commits
1774056
Add class that converts pandas.DataFrame to VW input format
etiennekintzler May 1, 2020
76b62e0
fix docstring
etiennekintzler May 1, 2020
566a81e
fix typo in test_pyvw.py
etiennekintzler May 1, 2020
3d00f04
fix docstring in pyvw.py
etiennekintzler May 1, 2020
55043af
add test to DataFrameToVW to test conversion when no target is presen…
etiennekintzler May 1, 2020
31aa448
specify col in formula using {}, enable more freedom in formatting, c…
etiennekintzler May 5, 2020
ae870db
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 5, 2020
a55705e
add check formula conformity + fix docstring. Add test for absent col…
etiennekintzler May 5, 2020
58c6017
Merge branch 'pandas_to_vw_text_format' of github.com:etiennekintzler…
etiennekintzler May 5, 2020
4d22355
fix pattern to allow decimal value
etiennekintzler May 5, 2020
66de092
fix typo in docstring of DataFrameToVW.__init__
etiennekintzler May 5, 2020
13a4441
create class based formula for the conversion of datafame to vw input…
etiennekintzler May 13, 2020
6eed274
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 13, 2020
8956151
remove abc class, did simple functions instead of inheriting from For…
etiennekintzler May 15, 2020
17dbda1
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 15, 2020
f4329c3
fix typo on import DFtoVW class
etiennekintzler May 15, 2020
d455cce
handle the different init for OrderedDict in python 2.7
etiennekintzler May 15, 2020
d280994
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 20, 2020
b469318
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 21, 2020
e1f1f56
clean docstring and fix typos, add undescore for internal function
etiennekintzler May 21, 2020
b852caf
Merge branch 'master' into pandas_to_vw_text_format
etiennekintzler May 22, 2020
8fff168
simplify tag parameter, add type checking for 'from_colnames' constru…
etiennekintzler May 22, 2020
ac6bd4e
fix type checking for x in 'from_colnames' constructor, remove unused…
etiennekintzler May 22, 2020
736a569
change name of function process_label_and_value to process_label_and_tag
etiennekintzler May 26, 2020
883f256
fix anomaly when calling process_df multiple times
etiennekintzler May 26, 2020
94 changes: 94 additions & 0 deletions python/tests/test_pyvw.py
@@ -2,7 +2,9 @@

from vowpalwabbit import pyvw
from vowpalwabbit.pyvw import vw
from vowpalwabbit.pyvw import DFtoVW, SimpleLabel, Feature, Namespace, Col
import pytest
import pandas as pd

BIT_SIZE = 18

@@ -344,3 +346,95 @@ def check_error_raises(type, argument):
    """
    with pytest.raises(type) as error:
        argument()

def test_from_colnames_constructor():
    df = pd.DataFrame({"y": [1], "x": [2]})
    conv = DFtoVW.from_colnames(y="y", x=["x"], df=df)
    lines_list = conv.process_df()
    first_line = lines_list[0]
    assert first_line == "1 | 2"


def test_feature_column_renaming_and_tag():
    df = pd.DataFrame({"idx": ["id_1"], "y": [1], "x": [2]})
    conv = DFtoVW(
        label=SimpleLabel(Col("y")),
        tag=Col("idx"),
        namespaces=Namespace([Feature(name="col_x", value=Col("x"))]),
        df=df,
    )
    first_line = conv.process_df()[0]
    assert first_line == "1 id_1| col_x:2"


def test_feature_constant_column_with_empty_name():
    df = pd.DataFrame({"idx": ["id_1"], "y": [1], "x": [2]})
    conv = DFtoVW(
        label=SimpleLabel(Col("y")),
        tag=Col("idx"),
        namespaces=Namespace([Feature(name="", value=2)]),
Collaborator

Are empty feature names allowed on VW's input at all? cc @jackgerrits

Contributor Author

As I understand it, yes, according to @jackgerrits's answer to my earlier remark:

No space is allowed to the left or right of ":" (or "*", as I saw this character in a previous version). For example, "a :b" will raise an error, while "a:b" is of course OK.

This is actually permitted. If you supply something like | :1 then it means it is a single feature with a value of 1.0.

Member

Yes

        df=df,
    )
    first_line = conv.process_df()[0]
    assert first_line == "1 id_1| :2"


def test_feature_variable_column_name():
    df = pd.DataFrame({"y": [1], "x": [2], "a": ["col_x"]})
    conv = DFtoVW(
        label=SimpleLabel(Col("y")),
        namespaces=Namespace(Feature(name=Col("a"), value=Col("x"))),
        df=df,
    )
    first_line = conv.process_df()[0]
    assert first_line == "1 | col_x:2"


def test_multiple_lines_conversion():
    df = pd.DataFrame({"y": [1, -1], "x": [1, 2]})
    conv = DFtoVW(
        label=SimpleLabel(Col("y")),
        namespaces=Namespace(Feature(value=Col("x"))),
        df=df,
    )
    lines_list = conv.process_df()
    assert lines_list == ["1 | 1", "-1 | 2"]


def test_multiple_namespaces():
    df = pd.DataFrame({"y": [1], "a": [2], "b": [3]})
    conv = DFtoVW(
        df=df,
        label=SimpleLabel(Col("y")),
        namespaces=[
            Namespace(name="FirstNameSpace", features=Feature(Col("a"))),
            Namespace(name="DoubleIt", value=2, features=Feature(Col("b"))),
        ],
    )
    first_line = conv.process_df()[0]
    assert first_line == "1 |FirstNameSpace 2 |DoubleIt:2 3"


def test_without_target():
    df = pd.DataFrame({"a": [2], "b": [3]})
    conv = DFtoVW(
        df=df, namespaces=Namespace([Feature(Col("a")), Feature(Col("b"))])
    )
    first_line = conv.process_df()[0]
    assert first_line == "| 2 3"


def test_absent_col_error():
    with pytest.raises(ValueError) as value_error:
        df = pd.DataFrame({"a": [1]})
        conv = DFtoVW(
            df=df,
            label=SimpleLabel(Col("a")),
            namespaces=Namespace(
                [Feature(Col("a")), Feature(Col("c")), Feature("d")]
            ),
        )
    expected = "In argument 'features', column(s) 'c' not found in dataframe"
    assert expected == str(value_error.value)
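
For readers following the expected strings in the tests above, here is a minimal sketch of how one VW text line appears to be laid out: an optional label, an optional tag, then one or more `|`-prefixed namespaces. The helpers `render_feature`, `render_namespace` and `render_line` are hypothetical names made up for illustration; they are not part of `vowpalwabbit.pyvw` and are not how `DFtoVW` is implemented. The sketch only mirrors the string shapes asserted in the tests, including the empty-feature-name case discussed in the review thread.

```python
def render_feature(value, name=None):
    # Hypothetical helper. name=None gives a bare token ("2"),
    # name="" gives ":2", any other name gives "name:value".
    if name is None:
        return str(value)
    return "{}:{}".format(name, value)


def render_namespace(features, name="", value=None):
    # Hypothetical helper. A namespace starts with "|", followed by an
    # optional name and an optional ":value" scaling factor.
    header = "|" + name
    if value is not None:
        header += ":{}".format(value)
    return header + " " + " ".join(features)


def render_line(namespaces, label=None, tag=""):
    # Hypothetical helper. The tag, when present, sits directly
    # against the first "|", as in "1 id_1| col_x:2".
    prefix = "" if label is None else "{} ".format(label)
    return prefix + tag + " ".join(namespaces)


# Empty feature name, as in test_feature_constant_column_with_empty_name.
line = render_line(
    [render_namespace([render_feature(2, name="")])], label=1, tag="id_1"
)
print(line)
```

Under these assumptions, a feature with `name=""` still produces a `:` separator, while a feature with no name at all is emitted as a bare token, which matches the difference between the `"1 id_1| :2"` and `"1 | 2"` expectations in the tests.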

