Pandas to vw text format (#2426)
### 1. Overview

The goal of this PR is to fix the issue #2308. 

The PR introduces a new class `DFtoVW` in `vowpalwabbit.pyvw` that takes as input a `pandas.DataFrame` and special types (`SimpleLabel`, `Feature`, `Namespace`) that specify the desired VW conversion.

These classes make extensive use of a class `Col` that refers to a given column in the user-specified dataframe.

A simpler interface, `DFtoVW.from_colnames`, can also be used for simple use cases. Its main benefit is that the user does not need to use the specific types.

-----

Below are some usages of this class. They all rely on the following `pandas.DataFrame` called `df` : 
```python
  house_id  need_new_roof  price  sqft   age  year_built
0      id1              0   0.23  0.25  0.05        2006
1      id2              1   0.18  0.15  0.35        1976
2      id3              0   0.53  0.32  0.87        1924
```

### 2. Simple usage using `DFtoVW.from_colnames`

Let's say we want to build a VW dataset with the target `need_new_roof` and the features `age` and `year_built`:
```python
from vowpalwabbit.pyvw import DFtoVW
conv = DFtoVW.from_colnames(y="need_new_roof", x=["age", "year_built"], df=df)
```
Then we can use the method `process_df`:
```python
conv.process_df()
```
that outputs the following list:
```python
['0 | 0.05 2006', '1 | 0.35 1976', '0 | 0.87 1924']
```
This list can then directly be consumed by the method `pyvw.model.learn`.

### 3. Advanced usages using default constructor
The class `DFtoVW` also allows the following patterns in its default constructor:
- tag
- (named) namespaces, with scaling factor
- (named) features, with constant feature possible

To use these more complex patterns, we need to import the corresponding types:
```python
from vowpalwabbit.pyvw import SimpleLabel, Namespace, Feature, Col
```
#### 3.1. Named namespace with scaling, and named feature
Let's create a VW dataset that includes a named namespace (with scaling) and a named feature:
```python
conv = DFtoVW(
        df=df,
        label=SimpleLabel(Col("need_new_roof")),
        namespaces=Namespace(name="Imperial", value=0.092, features=Feature(value=Col("sqft"), name="sqm"))
        )
conv.process_df()
```
which yields:
```python
['0 |Imperial:0.092 sqm:0.25',
 '1 |Imperial:0.092 sqm:0.15',
 '0 |Imperial:0.092 sqm:0.32']
```

#### 3.2. Multiple namespaces, multiple features, and tag
Let's create a more complex example with a tag and multiple namespaces, one of which has multiple features.
```python
conv = DFtoVW(
        df=df, 
        label=SimpleLabel(Col("need_new_roof")),
        tag=Col("house_id"),
        namespaces=[
                Namespace(name="Imperial", value=0.092, features=Feature(value=Col("sqft"), name="sqm")),
                Namespace(name="DoubleIt", value=2, features=[Feature(value=Col("price")), Feature(Col("age"))])
                ]
        )
conv.process_df()
```
which yields: 

```python
['0 id1|Imperial:0.092 sqm:0.25 |DoubleIt:2 0.23 0.05',
 '1 id2|Imperial:0.092 sqm:0.15 |DoubleIt:2 0.18 0.35',
 '0 id3|Imperial:0.092 sqm:0.32 |DoubleIt:2 0.53 0.87']
```

### 4. Implementation details
* The class `DFtoVW` and the specific types are located in `vowpalwabbit/pyvw.py`. The class only depends on the `pandas` module. 
* The code includes docstrings.
* 8 tests are included in `tests/test_pyvw.py`

### 5. Extensions
* This PR does not yet handle multiline examples or more complex label types.
* To convert very large datasets that can't fit in RAM, one can use the pandas import option `chunksize` and process one chunk at a time. I could also implement this functionality directly in the class using a generator. The generator would then either be consumed by a VW learning interface or be written to an external file (for conversion purposes only).
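
The chunked conversion mentioned in the last point could look roughly like the sketch below. It is not part of this PR: `iter_vw_lines` and `write_lines` are hypothetical names, and the demo uses plain lists as stand-ins for dataframe chunks so that it runs without pandas. The commented lines show how the arguments might be built from `pd.read_csv(..., chunksize=...)` and `DFtoVW.from_colnames`; the file name there is a placeholder.

```python
import io
from typing import Callable, Iterable, Iterator, List, TextIO, TypeVar

Chunk = TypeVar("Chunk")


def iter_vw_lines(
    chunks: Iterable[Chunk], convert_chunk: Callable[[Chunk], List[str]]
) -> Iterator[str]:
    """Lazily yield VW lines, converting one chunk at a time."""
    for chunk in chunks:
        yield from convert_chunk(chunk)


def write_lines(lines: Iterable[str], handle: TextIO) -> int:
    """Write one VW line per row to an open text handle; return the count."""
    count = 0
    for line in lines:
        handle.write(line + "\n")
        count += 1
    return count


# With pandas and this PR, the arguments could look like this (not executed here):
#   chunks = pd.read_csv("houses.csv", chunksize=10_000)  # placeholder file name
#   convert_chunk = lambda chunk: DFtoVW.from_colnames(
#       y="need_new_roof", x=["age", "year_built"], df=chunk
#   ).process_df()

# Stand-in demo: each "chunk" is a list of (label, feature) rows.
demo_chunks = [[(0, 0.05), (1, 0.35)], [(0, 0.87)]]
demo_convert = lambda rows: ["{} | {}".format(y, x) for y, x in rows]

lines = list(iter_vw_lines(demo_chunks, demo_convert))

buffer = io.StringIO()
written = write_lines(iter_vw_lines(demo_chunks, demo_convert), buffer)
```

Because `iter_vw_lines` is a generator, only one chunk's worth of lines is held in memory at a time, whether the consumer is a learner or a file writer.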
etiennekintzler authored May 27, 2020
1 parent 388d551 commit 8077112
Showing 2 changed files with 858 additions and 0 deletions.
94 changes: 94 additions & 0 deletions python/tests/test_pyvw.py
@@ -2,7 +2,9 @@

from vowpalwabbit import pyvw
from vowpalwabbit.pyvw import vw
from vowpalwabbit.pyvw import DFtoVW, SimpleLabel, Feature, Namespace, Col
import pytest
import pandas as pd

BIT_SIZE = 18

@@ -344,3 +346,95 @@ def check_error_raises(type, argument):
    """
    with pytest.raises(type) as error:
        argument()

def test_from_colnames_constructor():
    df = pd.DataFrame({"y": [1], "x": [2]})
    conv = DFtoVW.from_colnames(y="y", x=["x"], df=df)
    lines_list = conv.process_df()
    first_line = lines_list[0]
    assert first_line == "1 | 2"


def test_feature_column_renaming_and_tag():
    df = pd.DataFrame({"idx": ["id_1"], "y": [1], "x": [2]})
    conv = DFtoVW(
        label=SimpleLabel(Col("y")),
        tag=Col("idx"),
        namespaces=Namespace([Feature(name="col_x", value=Col("x"))]),
        df=df,
    )
    first_line = conv.process_df()[0]
    assert first_line == "1 id_1| col_x:2"


def test_feature_constant_column_with_empty_name():
    df = pd.DataFrame({"idx": ["id_1"], "y": [1], "x": [2]})
    conv = DFtoVW(
        label=SimpleLabel(Col("y")),
        tag=Col("idx"),
        namespaces=Namespace([Feature(name="", value=2)]),
        df=df,
    )
    first_line = conv.process_df()[0]
    assert first_line == "1 id_1| :2"


def test_feature_variable_column_name():
    df = pd.DataFrame({"y": [1], "x": [2], "a": ["col_x"]})
    conv = DFtoVW(
        label=SimpleLabel(Col("y")),
        namespaces=Namespace(Feature(name=Col("a"), value=Col("x"))),
        df=df,
    )
    first_line = conv.process_df()[0]
    assert first_line == "1 | col_x:2"


def test_multiple_lines_conversion():
    df = pd.DataFrame({"y": [1, -1], "x": [1, 2]})
    conv = DFtoVW(
        label=SimpleLabel(Col("y")),
        namespaces=Namespace(Feature(value=Col("x"))),
        df=df,
    )
    lines_list = conv.process_df()
    assert lines_list == ["1 | 1", "-1 | 2"]


def test_multiple_namespaces():
    df = pd.DataFrame({"y": [1], "a": [2], "b": [3]})
    conv = DFtoVW(
        df=df,
        label=SimpleLabel(Col("y")),
        namespaces=[
            Namespace(name="FirstNameSpace", features=Feature(Col("a"))),
            Namespace(name="DoubleIt", value=2, features=Feature(Col("b"))),
        ],
    )
    first_line = conv.process_df()[0]
    assert first_line == "1 |FirstNameSpace 2 |DoubleIt:2 3"


def test_without_target():
    df = pd.DataFrame({"a": [2], "b": [3]})
    conv = DFtoVW(
        df=df, namespaces=Namespace([Feature(Col("a")), Feature(Col("b"))])
    )
    first_line = conv.process_df()[0]
    assert first_line == "| 2 3"


def test_absent_col_error():
    with pytest.raises(ValueError) as value_error:
        df = pd.DataFrame({"a": [1]})
        conv = DFtoVW(
            df=df,
            label=SimpleLabel(Col("a")),
            namespaces=Namespace(
                [Feature(Col("a")), Feature(Col("c")), Feature("d")]
            ),
        )
    expected = "In argument 'features', column(s) 'c' not found in dataframe"
    assert expected == str(value_error.value)
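
The last test relies on a check that every column referenced through `Col` exists in the dataframe. A minimal sketch of such a check is shown below. `check_columns_exist` is a hypothetical helper, not the function used in `pyvw.py`; the message matches the one expected by the test for a single missing column, and the comma-separated form for several missing columns is an assumption.

```python
from typing import Iterable


def check_columns_exist(
    arg_name: str, required: Iterable[str], available: Iterable[str]
) -> None:
    """Raise ValueError if any required column name is absent (hypothetical helper)."""
    available_set = set(available)
    missing = [col for col in required if col not in available_set]
    if missing:
        # Assumption: several missing columns are listed as 'c', 'd', ...
        listed = ", ".join(repr(col) for col in missing)
        raise ValueError(
            "In argument '{}', column(s) {} not found in dataframe".format(
                arg_name, listed
            )
        )


# Mirrors the test: "a" exists, "c" does not. The constant feature "d" is a
# plain value rather than a Col, so it is not among the required columns.
try:
    check_columns_exist("features", required=["a", "c"], available=["a"])
except ValueError as err:
    message = str(err)
```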

