Slick is a web scraping framework based on scrapy. It adds an opinionated structure on top of scrapy and sqlalchemy, so you can quickly define models and corresponding items that are then easily queryable, joinable, and uploadable to backends such as Google Sheets.
This project also contains two example uses of Slick:
- steam scrapes steam and steamcharts for information about games.
- whisky crawls whisky sites for pricing information.
See the makefile for common commands; view help by typing make.
To get started, run make install, then make up. This will start the necessary background processes. Then type make crawl_forum or make crawl_tags to start one of the crawl spiders. When you're done, make down will stop the system; the data is persisted so you can pick back up later. If you want to remove the data, you will have to clear the persisted volume with docker volume rm.
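For example (where <volume-name> is whatever docker volume ls shows for the mysql data volume):
docker volume ls
docker volume rm <volume-name>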
You must have a Docker engine running, which you can download here. make install will build the images and create the tables.
For ease and consistency, slick is written with Docker in mind. It ships with a defined Docker environment where scrapers run, and uses docker-compose to run a local mysql instance so data persists between runs. This project assumes you are already familiar with docker and docker-compose; there's no documentation for them outside the makefile.
In scrapy, a crawl is defined as a project, which can include one or more spiders. Spiders are subclasses of scrapy.Spider and define URLs and patterns for crawling sites. To create a new crawl project, use the scrapy command line tool:
scrapy startproject [mycrawl]
At the core of scrapy are Items: glorified dictionaries composed of named fields. They are intended as a place to register how to find pieces of data on a page and how to clean them, using xpath or css selectors (and a host of other getters/processors). For more information on items and item loaders, check the scrapy docs for items.
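As a plain-scrapy sketch (not slick-specific; the item fields and selectors here are made up), defining an Item and loading it with an ItemLoader looks roughly like this:
import scrapy
from scrapy.loader import ItemLoader

class GameItem(scrapy.Item):
    # each Field registers a named piece of data to scrape from a page
    name = scrapy.Field()
    price = scrapy.Field()

def parse_game(response):
    # the loader collects values via css/xpath selectors and builds the item
    loader = ItemLoader(item=GameItem(), response=response)
    loader.add_css("name", ".game-title::text")
    loader.add_xpath("price", '//*[@class="price"]/text()')
    return loader.load_item()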
Scrapy also has the concept of middleware. Slick uses middleware for things such as tracking processed URLs and SQL transaction recovery.
Scrapy has a host of configuration options, configured in settings.py files. Note that this is where scrapy middlewares and pipelines are configured, so if you add new classes, be sure to register them there.
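For example, a settings.py registration looks roughly like this (the module paths and priority numbers are hypothetical):
# settings.py -- module paths and priorities here are hypothetical
ITEM_PIPELINES = {
    "myscrape.pipelines.MyPipeline": 300,
}
DOWNLOADER_MIDDLEWARES = {
    "myscrape.middlewares.MyDownloaderMiddleware": 543,
}
SPIDER_MIDDLEWARES = {
    "myscrape.middlewares.MySpiderMiddleware": 543,
}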
sqlalchemy is a python SQL toolkit and ORM, and is a handy way to express database tables using python classes. In the code, we refer to these as Models to distinguish them from scrapy Items.
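For reference, a bare sqlalchemy model and query look roughly like this (plain sqlalchemy, not slick's BaseModel; the table and values are made up):
import sqlalchemy
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Game(Base):
    __tablename__ = "game"
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.String(128))

# an in-memory sqlite engine, just for illustration
engine = sqlalchemy.create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
session.add(Game(name="example"))
session.commit()
games = session.query(Game).filter(Game.name == "example").all()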
Slick is an opinionated framework that glues scrapy and sqlalchemy together. It uses sqlalchemy Models to define both the DB schema for storage and functions for automatically building scrapy Items from them, and then querying and joining them without the need for redundant code.
Many spiders use a SQL url cache. URLs are inserted in a pending state and marked as crawled when successfully parsed. This is intended to help us restart crawls, but might be better done with scrapy jobs. In the meantime, you might want to clear the url cache from time to time by running
truncate url_cache;
inside mysql.
Many crawls start with a list of urls. We tend to start off by shuffling these, to make sure we don't always get stuck in the same place. By default, crawls are depth-first, which isn't always ideal. You can change it using the instructions here.
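A rough sketch of both ideas, shuffling start URLs and switching scrapy to breadth-first order via its standard scheduler settings (the spider name and URL list are made up):
import random
import scrapy

class ShuffledSpider(scrapy.Spider):
    name = "shuffled"
    # scrapy's standard settings for breadth-first (FIFO) crawling
    custom_settings = {
        "DEPTH_PRIORITY": 1,
        "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
        "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
    }

    def start_requests(self):
        urls = ["https://example.com/page/1", "https://example.com/page/2"]
        random.shuffle(urls)  # avoid always starting in the same place
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        pass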
We use two concepts to handle data as it moves through the system: Items are for data scraped from a page, and Models are for data stored in SQL.
Models are typed mysql models that use sqlalchemy as an ORM. Each field on the DB model should correspond to a piece of data that can be scraped from a page, as well as to related pieces of data via foreign keys and relationships. Fields in models are typed, which affects what kind of processing scrapy will do to them.
In slick, you create items by using the realize_item_class function from the model module:
MyItemClass = model.realize_item_class("MyItemClass", MyModelClass)
This will create an item class that automatically wires all the model attributes as fields on the item, and allows you to look up instances in the db based on a (partially) filled item.
To create items, you use the slick BaseLoader, located in the item module.
from slick import item

def load_something(response):
    loader = item.BaseLoader(MyItemClass(), response=response)
    loader.add_css('property1', '.some-class-name')
    loader.add_xpath('property2', '//*[@id="some-xpath"]')
    return loader.load_item()
This will ensure your items are wired correctly.
Like in scrapy, a scraper is a collection of spiders nested under a module directory. In slick, you will find utility modules for the various scrapy concepts, named in the singular:
- item - defines base Item and ItemLoader classes you should inherit from.
- model - defines logic for mapping models to items, and various sqlalchemy utilities.
- parser - defines utility functions for parsing and casting data from crawls to python types.
- pipeline - defines DBPipeline, which you should use as the pipeline in your scrapy settings.py; it handles translating items into their corresponding models and upserting them into the DB (see the sketch after this list).
- spider - provides mixins for use in the scrapy.Spider classes you define in your crawls.
- util - provides various utility functions, such as listing all registered models, items and spiders.
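Registering the pipeline in your scrape's settings.py might look like this (a sketch; the exact import path for DBPipeline and the priority number are assumptions, check slick's pipeline module):
# settings.py -- the DBPipeline path and priority are assumptions
ITEM_PIPELINES = {
    "slick.pipeline.DBPipeline": 300,
}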
The suggested convention is that you lay out your project with items/models/etc separated by file and module, and name them in the plural.
├── myscrape/
│   ├── __init__.py
│   ├── items.py
│   ├── models.py
│   ├── middlewares.py
│   ├── parsers.py
│   ├── settings.py
│   ├── spiders/
│   │   ├── __init__.py
│   │   ├── spiderone.py
│   │   ├── spidertwo.py
├── myscrape.py
To make item building easy, slick uses the class-building function model.realize_item_class to make items from models. Item classes created with this function will have fields for all the model's columns and relationships, with the exception of primary and foreign keys. You should use this function to build all your item classes.
import sqlalchemy
from slick import model

class MyModel(model.BaseModel):
    __tablename__ = "mymodel"
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    field = sqlalchemy.Column(sqlalchemy.String(64))

MyItem = model.realize_item_class("MyItem", MyModel)
my_item = MyItem(field='some-field')
# field is now an attribute on MyItem
If you need to wire behavior on your item classes, use inheritance:
class MySpecialItem(MyItem):
    def foo(self):
        return "bar"
slick uses the scrapy concept of ItemLoader to create all items. The slick BaseLoader is defined in the item module, and you should use it when loading items during a scrape.
See more on dependent models below.
In order to map parsed items to models that have already been inserted into the db, we need to define how to map an instantiated item to an instantiated model (pulled from the DB). This is done through the _lookup_attributes property registered on the model class. This property takes a list of attribute names that map to columns registered on the model. When looking up a model, slick will automatically create a sql query (a sqlalchemy query.filter() to be exact). As of this writing, it will simply AND all lookup attributes together, but this could easily be extended.
import sqlalchemy
from slick import model

class MyModel(model.BaseModel):
    __tablename__ = "mymodel"
    _lookup_attributes = ('field', )  # look models up by the "field" column
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    field = sqlalchemy.Column(sqlalchemy.String(64))

MyItem = model.realize_item_class("MyItem", MyModel)

# normally items are yielded during regular parsing with their fields filled in
my_item = MyItem(field="normally-looked-up-in-parsing")

# the get_from_item fn is used in the slick pipeline to do this lookup automatically
my_model = MyModel.get_from_item(my_item)
# None

inserted = model.upsert(MyModel, my_item)
my_model_again = MyModel.get_from_item(my_item)
# my_model_again == inserted
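Conceptually, the lookup slick builds from _lookup_attributes is roughly equivalent to the following (a sketch that assumes an open sqlalchemy session; the real implementation lives in slick's model module):
# conceptual equivalent of the generated lookup:
# one equality clause per lookup attribute, ANDed together
candidate = (
    session.query(MyModel)
    .filter(MyModel.field == my_item["field"])
    .one_or_none()
)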
In scraping, it's quite common that two pieces of data on different pages are related. For example, if you're crawling recipes, you might want to store recipes in one table and authors in another. In SQL, you could model this as a relation from recipes to authors using a foreign key relationship. With slick you would just define the model with the relation as you would in sqlalchemy, and instruct the item loader how to join the two using the add_dependent function:
import sqlalchemy
from sqlalchemy.orm import relationship
from slick import item, model

class Recipe(model.BaseModel):
    __tablename__ = "recipe"
    _lookup_attributes = ('name', )
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.String(128), index=True)
    author_id = sqlalchemy.Column(sqlalchemy.Integer, sqlalchemy.ForeignKey("author.id"))
    author = relationship("Author")

class Author(model.BaseModel):
    __tablename__ = 'author'
    _lookup_attributes = ('name', )
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.String(128), index=True)
    recipe = relationship("Recipe")

RecipeItem = model.realize_item_class("RecipeItem", Recipe)
AuthorItem = model.realize_item_class("AuthorItem", Author)

def load_recipe(response):
    recipe_loader = item.BaseLoader(RecipeItem(), response=response)
    author_loader = item.BaseLoader(AuthorItem(), response=response)
    author_loader.add_css('name', '.author-name')
    recipe_loader.add_css('name', '.recipe-name')
    author_item = author_loader.load_item()
    recipe_loader.add_dependent('author', author_item)
    return recipe_loader.load_item()
This will instruct the item-to-model mapper to look up dependent models when it's constructing the new recipe model from the recipe item. In this example, it will look up an author model by the dependent author we registered, i.e. looking it up by the parsed name attribute (since that's what's registered as its _lookup_attributes). If there's a match, slick pipeline processing will map to the existing entry; otherwise it will create and insert a new object. There's no need to parse or look up author_id in the item loader or anywhere in the code.
Sometimes you want to parse information from a page that you don't want to insert into the DB, perhaps because it's only used to correlate with other pieces of data in the system. You can register these placeholder fields on a model by using the slick model.Placeholder class.
import sqlalchemy
from slick import model

class PlaceholderModel(model.BaseModel):
    __tablename__ = 'placeholder'
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    placeholder = model.Placeholder()

PlaceholderItem = model.realize_item_class("PlaceholderItem", PlaceholderModel)
# the placeholder field is registered on the PlaceholderItem class
slick ships with a command line tool creator, ArgumentParser, which you can use to create easy command line tools for your scrapes. It's located in the cli module. To use it, create a python file at the root of this project, then instantiate the parser and run it. I haven't yet gotten python importing to a state where I can do this in the project directories themselves, but that will be coming.
# myscraper.py
from slick import cli

if __name__ == "__main__":
    parser = cli.ArgumentParser(name="myscraper")
    parser.run()
The cli ships with a handful of useful commands; you can interrogate it with --help. In particular, you can export data straight to Google Sheets by providing the --sheets parameter.
python myscraper.py MyModel --sheets somelongsheetsidXYZ
This will upload all objects found in MyModel to the sheet.
To get sheets authentication set up, follow the Google setup guide.
Tests aren't currently in docker, so you must have python 3 installed. Currently, the makefile assumes the binary is called python; I use pyenv to manage this and encourage you to do the same. Before you run them, install dependencies using pip install -r requirements.txt.
Tests are defined in test/.
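If the tests follow standard unittest discovery (an assumption; check the makefile's test target for the actual command), they can be run with:
python -m unittest discover test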
Mysql runs in docker, but it exposes the port on the host machine. This means that you can connect to it on your localhost. The password is in docker-compose.yaml.
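For example, connecting with the mysql client might look like this (the host, port, and user are assumptions; check docker-compose.yaml for the actual values):
mysql --host=127.0.0.1 --port=3306 --user=root --password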
When running your scrapers, you can point them at a different database by setting these environment variables:
- MYSQL_HOST
- MYSQL_USER
- MYSQL_PORT
- MYSQL_DB
- MYSQL_PASSWORD
- MYSQL_CHARSET
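For example, to point a local run at a different database (the hostname, credentials, and command here are placeholders):
MYSQL_HOST=db.example.com MYSQL_PORT=3306 MYSQL_USER=scraper MYSQL_PASSWORD=secret MYSQL_DB=mydb python myscraper.py MyModel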
Here is a quick reference of mysql commands you may want to be familiar with:
- show databases - lists the available databases
- show tables - lists the available tables
- describe {table_name} - shows a table's schema
- show create table {table_name} - shows the full structure of the table, including foreign key references and indexes.
mysqldump --host={host} --user={user} --password={password} --no-data --skip-add-drop-table {database}
will print the entire sql structure of the DB, essentially calling show create table for each table.