Pandas to vw text format (#2426)

### 1. Overview

The goal of this PR is to fix issue #2308.

The PR introduces a new class `DFtoVW` in `vowpalwabbit.pyvw` that takes as input a `pandas.DataFrame` and special types (`SimpleLabel`, `Feature`, `Namespace`) that specify the desired VW conversion. These types make extensive use of a class `Col` that refers to a given column in the user-specified dataframe.

A simpler interface, `DFtoVW.from_colnames`, can also be used for simple use cases. Its main benefit is that the user need not use the specific types.

-----

Below are some usages of this class. They all rely on the following `pandas.DataFrame` called `df`:

```python
  house_id  need_new_roof  price  sqft   age  year_built
0      id1              0   0.23  0.25  0.05        2006
1      id2              1   0.18  0.15  0.35        1976
2      id3              0   0.53  0.32  0.87        1924
```

### 2. Simple usage using `DFtoVW.from_colnames`

Let's say we want to build a VW dataset with the target `need_new_roof` and the features `age` and `year_built`:

```python
from vowpalwabbit.pyvw import DFtoVW

conv = DFtoVW.from_colnames(y="need_new_roof", x=["age", "year_built"], df=df)
```

Then we can use the method `process_df`:

```python
conv.process_df()
```

that outputs the following list:

```python
['0 | 0.05 2006',
 '1 | 0.35 1976',
 '0 | 0.87 1924']
```

This list can then be consumed directly by the method `pyvw.model.learn`.

### 3. Advanced usages using default constructor

The class `DFtoVW` also allows the following patterns in its default constructor:

- tag
- (named) namespaces, with scaling factor
- (named) features, with constant feature possible

To use these more complex patterns we need to import the corresponding types:

```python
from vowpalwabbit.pyvw import SimpleLabel, Namespace, Feature, Col
```

#### 3.1. Named namespace with scaling, and named feature

Let's create a VW dataset that includes a named namespace (with scaling) and a named feature:

```python
conv = DFtoVW(
    df=df,
    label=SimpleLabel(Col("need_new_roof")),
    namespaces=Namespace(
        name="Imperial",
        value=0.092,
        features=Feature(value=Col("sqft"), name="sqm"),
    ),
)
conv.process_df()
```

which yields:

```python
['0 |Imperial:0.092 sqm:0.25',
 '1 |Imperial:0.092 sqm:0.15',
 '0 |Imperial:0.092 sqm:0.32']
```

#### 3.2. Multiple namespaces, multiple features, and tag

Let's create a more complex example with a tag and multiple namespaces with multiple features.

```python
conv = DFtoVW(
    df=df,
    label=SimpleLabel(Col("need_new_roof")),
    tag=Col("house_id"),
    namespaces=[
        Namespace(
            name="Imperial",
            value=0.092,
            features=Feature(value=Col("sqft"), name="sqm"),
        ),
        Namespace(
            name="DoubleIt",
            value=2,
            features=[Feature(value=Col("price")), Feature(Col("age"))],
        ),
    ],
)
conv.process_df()
```

which yields:

```python
['0 id1|Imperial:0.092 sqm:0.25 |DoubleIt:2 0.23 0.05',
 '1 id2|Imperial:0.092 sqm:0.15 |DoubleIt:2 0.18 0.35',
 '0 id3|Imperial:0.092 sqm:0.32 |DoubleIt:2 0.53 0.87']
```

### 4. Implementation details

* The class `DFtoVW` and the specific types are located in `vowpalwabbit/pyvw.py`. The class only depends on the `pandas` module.
* The code includes docstrings.
* 8 tests are included in `tests/test_pyvw.py`.

### 5. Extensions

* This PR does not yet handle multiline examples and more complex label types.
* To convert very large datasets that can't fit in RAM, one can use the pandas import option `chunksize` and process one chunk at a time. I could also implement this functionality directly in the class using a generator. The generator would then either be consumed by a VW learning interface or be written to an external file (for conversion purposes only).

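
The chunk-by-chunk idea from the Extensions list could look roughly like the sketch below. It is not part of this PR: `iter_vw_lines` and `toy_convert` are hypothetical names, and the chunks are plain lists of dicts so that the snippet runs without `pandas` or `vowpalwabbit`. It only illustrates the generator pattern, where one chunk at a time is converted and its lines are yielded lazily.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

Chunk = TypeVar("Chunk")


def iter_vw_lines(
    chunks: Iterable[Chunk], convert: Callable[[Chunk], List[str]]
) -> Iterator[str]:
    """Lazily yield VW-formatted lines, converting one chunk at a time.

    Hypothetical helper (not in this PR). `convert` is any callable that
    turns a chunk into a list of VW text lines.
    """
    for chunk in chunks:
        # Only the current chunk and its converted lines are held in memory.
        yield from convert(chunk)


def toy_convert(rows: List[dict]) -> List[str]:
    # Stand-in for a real converter: formats "label | feature" lines.
    return ["{} | {}".format(row["y"], row["x"]) for row in rows]


# Two small "chunks" standing in for the pieces of a large dataset.
chunks = [
    [{"y": 1, "x": 2}, {"y": -1, "x": 3}],
    [{"y": 1, "x": 5}],
]

lines = list(iter_vw_lines(chunks, toy_convert))
```

With real data, `chunks` would presumably be the iterator returned by `pd.read_csv(path, chunksize=n)`, and `convert` something like `lambda chunk: DFtoVW.from_colnames(y="need_new_roof", x=["age"], df=chunk).process_df()`. Each yielded line could then be passed to a learner or written to a file.
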
VowpalWabbit · May 27, 2020 · 8077112
1 parent 388d551
Showing 2 changed files with 858 additions and 0 deletions.
diff --git a/python/tests/test_pyvw.py b/python/tests/test_pyvw.py
@@ -2,7 +2,9 @@
 
 from vowpalwabbit import pyvw
 from vowpalwabbit.pyvw import vw
+from vowpalwabbit.pyvw import DFtoVW, SimpleLabel, Feature, Namespace, Col
 import pytest
+import pandas as pd
 
 BIT_SIZE = 18
 
@@ -344,3 +346,95 @@ def check_error_raises(type, argument):
     """
     with pytest.raises(type) as error:
         argument()
+
+def test_from_colnames_constructor():
+    df = pd.DataFrame({"y": [1], "x": [2]})
+    conv = DFtoVW.from_colnames(y="y", x=["x"], df=df)
+    lines_list = conv.process_df()
+    first_line = lines_list[0]
+    assert first_line == "1 | 2"
+
+
+def test_feature_column_renaming_and_tag():
+    df = pd.DataFrame({"idx": ["id_1"], "y": [1], "x": [2]})
+    conv = DFtoVW(
+        label=SimpleLabel(Col("y")),
+        tag=Col("idx"),
+        namespaces=Namespace([Feature(name="col_x", value=Col("x"))]),
+        df=df,
+    )
+    first_line = conv.process_df()[0]
+    assert first_line == "1 id_1| col_x:2"
+
+
+def test_feature_constant_column_with_empty_name():
+    df = pd.DataFrame({"idx": ["id_1"], "y": [1], "x": [2]})
+    conv = DFtoVW(
+        label=SimpleLabel(Col("y")),
+        tag=Col("idx"),
+        namespaces=Namespace([Feature(name="", value=2)]),
+        df=df,
+    )
+    first_line = conv.process_df()[0]
+    assert first_line == "1 id_1| :2"
+
+
+def test_feature_variable_column_name():
+    df = pd.DataFrame({"y": [1], "x": [2], "a": ["col_x"]})
+    conv = DFtoVW(
+        label=SimpleLabel(Col("y")),
+        namespaces=Namespace(Feature(name=Col("a"), value=Col("x"))),
+        df=df,
+    )
+    first_line = conv.process_df()[0]
+    assert first_line == "1 | col_x:2"
+
+
+def test_multiple_lines_conversion():
+    df = pd.DataFrame({"y": [1, -1], "x": [1, 2]})
+    conv = DFtoVW(
+        label=SimpleLabel(Col("y")),
+        namespaces=Namespace(Feature(value=Col("x"))),
+        df=df,
+    )
+    lines_list = conv.process_df()
+    assert lines_list == ["1 | 1", "-1 | 2"]
+
+
+def test_multiple_namespaces():
+    df = pd.DataFrame({"y": [1], "a": [2], "b": [3]})
+    conv = DFtoVW(
+        df=df,
+        label=SimpleLabel(Col("y")),
+        namespaces=[
+            Namespace(name="FirstNameSpace", features=Feature(Col("a"))),
+            Namespace(name="DoubleIt", value=2, features=Feature(Col("b"))),
+        ],
+    )
+    first_line = conv.process_df()[0]
+    assert first_line == "1 |FirstNameSpace 2 |DoubleIt:2 3"
+
+
+def test_without_target():
+    df = pd.DataFrame({"a": [2], "b": [3]})
+    conv = DFtoVW(
+        df=df, namespaces=Namespace([Feature(Col("a")), Feature(Col("b"))])
+    )
+    first_line = conv.process_df()[0]
+    assert first_line == "| 2 3"
+
+
+def test_absent_col_error():
+    with pytest.raises(ValueError) as value_error:
+        df = pd.DataFrame({"a": [1]})
+        conv = DFtoVW(
+            df=df,
+            label=SimpleLabel(Col("a")),
+            namespaces=Namespace(
+                [Feature(Col("a")), Feature(Col("c")), Feature("d")]
+            ),
+        )
+    expected = "In argument 'features', column(s) 'c' not found in dataframe"
+    assert expected == str(value_error.value)
+
+