Merge pull request #281 from maciejkula/data_builder
Data builder
Showing 13 changed files with 1,140 additions and 5 deletions.

@@ -0,0 +1,223 @@

Building datasets
=================

In this example, we'll use LightFM's built-in ``Dataset`` class to build
an interaction dataset from raw data. The goal is to demonstrate how to
go from raw data (lists of interactions and perhaps item and user
features) to ``scipy.sparse`` matrices that can be used to fit a LightFM
model.

Getting the data
----------------

We're going to use the `Book-Crossing dataset
<http://www2.informatik.uni-freiburg.de/~cziegler/BX/>`__ as our example
dataset. Let's download the data first.

.. code:: python

    import os
    import zipfile
    import csv

    import requests


    def _download(url: str, dest_path: str):
        # Stream the download to disk in 1 MB chunks.
        req = requests.get(url, stream=True)
        req.raise_for_status()

        with open(dest_path, "wb") as fd:
            for chunk in req.iter_content(chunk_size=2 ** 20):
                fd.write(chunk)


    def get_data():
        ratings_url = ("http://www2.informatik.uni-freiburg.de/"
                       "~cziegler/BX/BX-CSV-Dump.zip")

        if not os.path.exists("data"):
            os.makedirs("data")

            _download(ratings_url, "data/data.zip")

        # Decode each line as UTF-8, ignoring bytes that are not valid UTF-8.
        with zipfile.ZipFile("data/data.zip") as archive:
            return (
                csv.DictReader(
                    (x.decode("utf-8", "ignore")
                     for x in archive.open("BX-Book-Ratings.csv")),
                    delimiter=";",
                ),
                csv.DictReader(
                    (x.decode("utf-8", "ignore")
                     for x in archive.open("BX-Books.csv")),
                    delimiter=";",
                ),
            )


    def get_ratings():
        return get_data()[0]


    def get_book_features():
        return get_data()[1]

The data consists of book ratings and book details:

.. code:: python

    import json
    from itertools import islice

    ratings, book_features = get_data()

Ratings look like this:

.. code:: python

    for line in islice(ratings, 2):
        print(json.dumps(line, indent=4))

::

    {
        "User-ID": "276725",
        "ISBN": "034545104X",
        "Book-Rating": "0"
    }
    {
        "User-ID": "276726",
        "ISBN": "0155061224",
        "Book-Rating": "5"
    }

and book features look like this:

.. code:: python

    for line in islice(book_features, 1):
        print(json.dumps(line, indent=4))

::

    {
        "ISBN": "0195153448",
        "Book-Title": "Classical Mythology",
        "Book-Author": "Mark P. O. Morford",
        "Year-Of-Publication": "2002",
        "Publisher": "Oxford University Press",
        "Image-URL-S": "http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg",
        "Image-URL-M": "http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg",
        "Image-URL-L": "http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
    }

Building the ID mappings
------------------------

The first thing we need to do is to create a mapping between the user
and item ids from our input data and the indices that will be used
internally by our model.

We do this because LightFM works with user and item ids that are
consecutive non-negative integers. The ``Dataset`` class allows us to
create a mapping between the IDs we use in our systems and the
consecutive indices preferred by the model.

To do this, we create a dataset and call its ``fit`` method. The first
argument is an iterable of all user ids in our data, and the second is
an iterable of all item ids. In this case, we use generator expressions
to lazily iterate over our data and yield user and item ids:

.. code:: python

    from lightfm.data import Dataset

    dataset = Dataset()
    dataset.fit((x['User-ID'] for x in get_ratings()),
                (x['ISBN'] for x in get_ratings()))

This call will assign an internal numerical id to every user and item id
we pass in. These will be contiguous (from 0 to however many users and
items we have), and will also determine the dimensions of the resulting
LightFM model.

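If we later need to translate between our external ids and these
internal indices (for example, to interpret the model's predictions),
the dataset's ``mapping`` method returns the mapping dictionaries. A
minimal sketch; the variable names here are our own:

.. code:: python

    user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()

    # The internal index assigned to this external user id.
    print(user_id_map['276725'])
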
We can check that the mappings have been created by querying the dataset
on how many users and books it knows about:

.. code:: python

    num_users, num_items = dataset.interactions_shape()
    print('Num users: {}, num_items {}.'.format(num_users, num_items))

::

    Num users: 105283, num_items 340553.

Note that if we don't have all user and item ids at once, we can
repeatedly call ``fit_partial`` to supply additional ids. In this case,
we will use this capability to add some item feature mappings:

.. code:: python

    dataset.fit_partial(items=(x['ISBN'] for x in get_book_features()),
                        item_features=(x['Book-Author'] for x in get_book_features()))

This will create a feature for every unique author name in the dataset.

(Note that we fit some more item ids: this is to make sure our mappings
are complete even if there are items in the features dataset that are
not in the interactions set.)

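We can confirm the effect by checking the feature matrix dimensions the
dataset now expects (``item_features_shape`` is part of the same
``Dataset`` API):

.. code:: python

    num_items, num_features = dataset.item_features_shape()
    print('Num items: {}, num features: {}.'.format(num_items, num_features))
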
Building the interactions matrix
--------------------------------

Having created the mapping, we build the interaction matrix:

.. code:: python

    (interactions, weights) = dataset.build_interactions(((x['User-ID'], x['ISBN'])
                                                          for x in get_ratings()))

    print(repr(interactions))

::

    <105283x341762 sparse matrix of type '<class 'numpy.int32'>'
        with 1149780 stored elements in COOrdinate format>

This is the main input into a LightFM model: it encodes the interactions
between users and items.

||
Since we have item features, we can also create the item features | ||
matrix: | ||
|
||
.. code:: python | ||
item_features = dataset.build_item_features(((x['ISBN'], [x['Book-Author']]) | ||
for x in get_book_features())) | ||
print(repr(item_features)) | ||
:: | ||
|
||
<341762x443805 sparse matrix of type '<class 'numpy.float32'>' | ||
with 613141 stored elements in Compressed Sparse Row format> | ||
|
||
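``build_item_features`` also accepts ``(item id, {feature name: weight})``
pairs when features should carry unequal weights. A hypothetical sketch,
functionally similar to the call above:

.. code:: python

    # Hypothetical variant: give the author feature an explicit weight.
    item_features = dataset.build_item_features(
        ((x['ISBN'], {x['Book-Author']: 1.0})
         for x in get_book_features()))
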
Building a model
----------------

This is all we need to build a LightFM model:

.. code:: python

    from lightfm import LightFM

    model = LightFM(loss='bpr')
    model.fit(interactions, item_features=item_features)

::

    <lightfm.lightfm.LightFM at 0x7f5f0e8f7c88>

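Once fitted, the model can score user-item pairs using the internal
indices assigned by ``fit``. A minimal sketch of scoring the first five
items for the user at internal index 0:

.. code:: python

    import numpy as np

    # Predict scores for internal user 0 over internal items 0-4.
    scores = model.predict(0, np.arange(5), item_features=item_features)
    print(scores)
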
@@ -0,0 +1,7 @@

Dataset construction
====================

.. automodule:: lightfm.data
    :members:
    :undoc-members:
    :show-inheritance:

@@ -0,0 +1,8 @@

.PHONY: html
html:
	pweave -f pandoc2html dataset.pmd

.PHONY: rst
rst:
	pweave -f markdown dataset.pmd
	pandoc -s -t rst dataset.md -o dataset.rst