### 1. Overview

The goal of this PR is to fix issue #2308.

The PR introduces a new class `DFtoVW` in `vowpalwabbit.pyvw` that takes a `pandas.DataFrame` and special types (`SimpleLabel`, `Feature`, `Namespace`) that specify the desired VW conversion. These types make extensive use of a class `Col`, which refers to a given column of the user-specified dataframe.

A simpler interface, `DFtoVW.from_colnames`, can also be used for simple use cases. Its main benefit is that the user does not need to use the specific types.

-----

Below are some usages of this class. They all rely on the following `pandas.DataFrame` called `df`:

```python
  house_id  need_new_roof  price  sqft   age  year_built
0      id1              0   0.23  0.25  0.05        2006
1      id2              1   0.18  0.15  0.35        1976
2      id3              0   0.53  0.32  0.87        1924
```

### 2. Simple usage using `DFtoVW.from_colnames`

Let's say we want to build a VW dataset with the target `need_new_roof` and the features `age` and `year_built`:

```python
from vowpalwabbit.pyvw import DFtoVW

conv = DFtoVW.from_colnames(y="need_new_roof", x=["age", "year_built"], df=df)
```

Then we can use the method `process_df`:

```python
conv.process_df()
```

which outputs the following list:

```python
['0 | 0.05 2006', '1 | 0.35 1976', '0 | 0.87 1924']
```

This list can then be consumed directly by the method `pyvw.model.learn`.

### 3. Advanced usages using default constructor

The class `DFtoVW` also allows the following patterns in its default constructor:

- tag
- (named) namespaces, with scaling factor
- (named) features, with constant feature possible

To use these more complex patterns, we need to import the corresponding types:

```python
from vowpalwabbit.pyvw import SimpleLabel, Namespace, Feature, Col
```

#### 3.1. Named namespace with scaling, and named feature

Let's create a VW dataset that includes a named namespace (with scaling) and a named feature:

```python
conv = DFtoVW(
    df=df,
    label=SimpleLabel(Col("need_new_roof")),
    namespaces=Namespace(name="Imperial", value=0.092,
                         features=Feature(value=Col("sqft"), name="sqm"))
)
conv.process_df()
```

which yields:

```python
['0 |Imperial:0.092 sqm:0.25', '1 |Imperial:0.092 sqm:0.15', '0 |Imperial:0.092 sqm:0.32']
```

#### 3.2. Multiple namespaces, multiple features, and tag

Let's create a more complex example with a tag and multiple namespaces, each with one or more features.

```python
conv = DFtoVW(
    df=df,
    label=SimpleLabel(Col("need_new_roof")),
    tag=Col("house_id"),
    namespaces=[
        Namespace(name="Imperial", value=0.092,
                  features=Feature(value=Col("sqft"), name="sqm")),
        Namespace(name="DoubleIt", value=2,
                  features=[Feature(value=Col("price")), Feature(Col("age"))])
    ]
)
conv.process_df()
```

which yields:

```python
['0 id1|Imperial:0.092 sqm:0.25 |DoubleIt:2 0.23 0.05', '1 id2|Imperial:0.092 sqm:0.15 |DoubleIt:2 0.18 0.35', '0 id3|Imperial:0.092 sqm:0.32 |DoubleIt:2 0.53 0.87']
```

### 4. Implementation details

* The class `DFtoVW` and the specific types are located in `vowpalwabbit/pyvw.py`. The class only depends on the `pandas` module.
* The code includes docstrings.
* 8 tests are included in `tests/test_pyvw.py`.

### 5. Extensions

* This PR does not yet handle multiline examples or more complex label types.
* To convert a very large dataset that does not fit in RAM, one can use the pandas import option `chunksize` and process one chunk at a time. I could also implement this functionality directly in the class using a generator. The generator would then either be consumed by a VW learning interface or be written to an external file (for conversion purposes only).
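The chunked conversion mentioned under Extensions could look roughly like the sketch below. It is not part of this PR: `to_vw_lines` and `iter_vw_chunks` are hypothetical helper names, and `to_vw_lines` is a plain-Python stand-in for `DFtoVW.from_colnames(...).process_df()` so that the example runs with `pandas` alone. The in-memory CSV stands in for a large file on disk.

```python
import io

import pandas as pd

# Stand-in for a large CSV file; the columns mirror the example dataframe `df`.
CSV_DATA = """house_id,need_new_roof,price,sqft,age,year_built
id1,0,0.23,0.25,0.05,2006
id2,1,0.18,0.15,0.35,1976
id3,0,0.53,0.32,0.87,1924
"""


def to_vw_lines(chunk, y, x):
    """Hypothetical stand-in for DFtoVW.from_colnames(y=y, x=x, df=chunk).process_df()."""
    lines = []
    for row in chunk[[y] + x].itertuples(index=False):
        label, *features = row
        lines.append(f"{label} | " + " ".join(str(v) for v in features))
    return lines


def iter_vw_chunks(source, y, x, chunksize):
    """Yield one list of VW lines per chunk, so only one chunk is held in memory."""
    # With `chunksize` set, pandas.read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        yield to_vw_lines(chunk, y, x)


# Materialized with list() only to inspect the result of this small demo.
chunks = list(
    iter_vw_chunks(io.StringIO(CSV_DATA), "need_new_roof", ["age", "year_built"], chunksize=2)
)
print(chunks)
```

In real use the generator would be consumed lazily, for example by passing each chunk to the learner or appending it to an output file, instead of being collected into a list.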