How to make a data source
4CAT is a modular tool. Its modules come in two varieties: data sources and processors. This article covers the former.
A data source is a collection of workers, processors and interface elements that extends 4CAT to allow scraping, processing and/or retrieving data for a given platform (such as Instagram, Reddit or Telegram). 4CAT has an API that can do most of the scaffolding around this for you, so processors can be quite lightweight and mostly focus on the analysis, while 4CAT's back-end takes care of the scheduling, determining where the output should go, et cetera.
A data source is defined as an arbitrarily named folder inside the `datasources` folder in the 4CAT root. It is recommended to use the data source ID (see below) as the folder name. However, since Python files included in the folder will be loaded as modules by 4CAT, folder names must be valid Python module names. Concretely, this means (among other things) that data source folder names cannot start with a number (hence the `fourchan` data source).
WARNING: In multiple ways, data sources can define arbitrary code that will be run either by the 4CAT server or by client-side browsers. Be careful when running a data source supplied by someone else.
A data source will at least contain the following:
- An `__init__.py` containing data source metadata and initialisation code
- A search worker, which can collect data according to provided parameters and format it as a CSV file that 4CAT can work with.
- A web tool interface, which will be loaded as part of the 'Create dataset' page to allow a user to set data source-specific parameters before creating a dataset with the search worker.
It may contain additional components:
- Any processors that are specific to datasets created by this data source
- Views for the web app that allow more advanced behaviour of the web tool interface
- Database or Sphinx index definitions
The instructions below describe how to format and create these components (work in progress!)
The data source root should contain a file `__init__.py`, which in turn defines the following:
```python
DATASOURCE = "datasource-identifier"
```
This constant defines the data source ID. This is most importantly used in `config.py` to enable the data source.
```python
def init_datasource(database, logger, queue, name):
    pass
```
This function is called when 4CAT starts, if the data source is enabled, and should set up anything the data source needs to function (e.g. queueing any recurring workers). A default implementation of this function can be used instead (and when defining your own, it is advised to still call it as part of your own implementation):
```python
from backend.lib.helpers import init_datasource
```
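Putting these together, a minimal `__init__.py` for an imaginary data source with the ID `example` could look roughly like the sketch below. The `example` ID is a placeholder for illustration, not part of 4CAT itself:

```python
# Sketch of a minimal __init__.py for a hypothetical "example" data source.
from backend.lib.helpers import init_datasource as base_init_datasource

DATASOURCE = "example"

def init_datasource(database, logger, queue, name):
    # Call 4CAT's default implementation first, as advised above, then add
    # any data source-specific setup (e.g. queueing recurring workers) here.
    base_init_datasource(database, logger, queue, name)
```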
The search worker is run when a dataset is created by a user, and collects the data for that dataset (i.e. the posts from the platform matching the given dataset parameters), writing it to the dataset result file. It is contained in an arbitrarily named Python file in the data source root (we recommend `search_[datasource].py`). The file should define a class that extends `backend.abstract.search.Search`. This class should define the following attributes and methods:
- `type`: Identifier used by the scheduler to know what code to run for jobs for this data source. Should be `[datasource-id]-search`, `datasource-id` being equal to the ID defined in `__init__.py`.
- `max_workers`: Optional, default 1. The number of search workers that may run in parallel for this data source. Usually you want to keep this at 1, unless you are confident your server can handle multiple parallel workers of this type.
- `validate_query(query, request, user) -> dict`: Called statically by the web tool whenever a new dataset is created by the user. `query` contains the form fields as set in the web interface; this method should return a sanitised version of that query, containing only fields and values relevant to this search worker. On invalid input, a `backend.lib.exceptions.QueryParametersException` should be raised, which will prompt the user to change their input and resubmit.
- `get_search_mode(self, query) -> str`: Optional. Should return `simple` or `complex`, determining which method will be called to actually retrieve data for a query. The `query` parameter is equivalent to the return value of `validate_query()`. By default, returns `complex` if `query` contains either a `body_match` or `subject_match` query.
- `get_posts_simple(self, query) -> list`: Takes a set of `query` parameters (equivalent to the return value of `validate_query()`) and uses them to retrieve a list of items. The method should return a list of dictionaries, each dictionary being a field -> value map where each field corresponds to a column in the final CSV file for the dataset.
- `get_posts_complex(self, query) -> list`: Equivalent to `get_posts_simple()`, but will be called if `get_search_mode()` returns `complex`.
- `after_search(self, posts) -> list`: Optional. If defined, this will be called after all posts have been retrieved with the methods listed above and, if appropriate, any sampling or similar. This method should return a list of posts, like `get_posts_*`. You can use it to e.g. perform additional post filtering or processing, should your data source require it.
The standard data source allows for various methods of retrieving posts: simply per post, all posts in a thread containing a matching post, or a random sample (see e.g. `/datasources/fourchan/webtool/query-form.html`). The following three methods allow this to happen. If your data source does not allow for these retrieval modes, you can define them as stubs (i.e. `pass`).
- `fetch_posts(self, post_ids) -> list`: Should be used by `get_posts_*()` to retrieve the actual item data. Takes a list of `post_ids` and retrieves data for those post IDs, e.g. via an API or a local database.
- `fetch_threads(self, thread_ids) -> list`: Retrieves all posts for the given `thread_ids`.
- `get_thread_lengths(self, thread_ids, min_length) -> dict`: Should return a dictionary with thread IDs as keys and the number of posts per thread as values, for all threads with at least `min_length` posts.
Additionally, because search workers are (after a fashion) architecturally equivalent to processors, they have access to all the attributes a processor has access to, e.g. `dataset` and `job`. See the page for processors for more information on these.
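As an illustration of how these attributes and methods fit together, the sketch below outlines a bare-bones search worker for an imaginary `example` data source. The class name, identifier and placeholder retrieval logic are assumptions for illustration only; consult an existing data source (such as `fourchan`) for a complete, working implementation.

```python
from backend.abstract.search import Search
from backend.lib.exceptions import QueryParametersException


class SearchExample(Search):
    """
    Hypothetical search worker for an 'example' data source (sketch only).
    """
    type = "example-search"  # [datasource-id]-search
    max_workers = 1  # run at most one of these workers at a time

    @staticmethod
    def validate_query(query, request, user):
        # Return a sanitised query containing only fields this worker understands
        if not query.get("query", "").strip():
            raise QueryParametersException("Please provide a non-empty query.")

        return {"query": query["query"].strip()}

    def get_posts_simple(self, query):
        # One dictionary per item; each key becomes a column in the result CSV.
        # A real worker would call e.g. self.fetch_posts() or an external API here.
        return [{"id": 1, "thread_id": 1, "body": "Example post matching '%s'" % query["query"]}]

    # This sketch supports no thread-based or sampled retrieval, so the
    # remaining retrieval methods are defined as stubs, as described above.
    def fetch_posts(self, post_ids):
        pass

    def fetch_threads(self, thread_ids):
        pass

    def get_thread_lengths(self, thread_ids, min_length):
        pass
```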
People can use 4CAT to create new datasets with your data source. To this end, the data source should define an interface through which dataset parameters may be set. Data sources should contain a folder `webtool` with the following files:
- `query-form.html`: A Jinja2 template defining the form elements that will be shown in the web interface. The submit button is defined by 4CAT's own templates; you only need to define the various form controls (e.g. inputs, radio buttons) used to set parameters for your data source's datasets. These parameters will be passed to `validate_query()` upon submission to generate a set of sanitised parameters that can be used by the search worker.
- `tool.js`: Optional. This file will be loaded when the data source is selected in the web interface and can run arbitrary JavaScript code to enhance your query form's behaviour.
- `views.py`: Optional. This can define additional views for the 4CAT Flask app. Any function defined in this file will be available as a view via `/api/datasource-call/[datasource-id]/[function name]/`. Functions should have the signature `function(request, user, **kwargs)`: `request` and `user` are objects supplied by Flask, and `**kwargs` contains all HTTP GET parameters as keyword arguments. The function should return an object (remember that in Python everything is an object), which will be serialised as JSON as the view output.
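As a sketch of what such a view could look like, the example below defines a single hypothetical function; it would then be reachable via `/api/datasource-call/[datasource-id]/status/`. The function name and returned payload are illustrative assumptions, not part of 4CAT:

```python
# Sketch of an optional views.py; the function name and payload are illustrative.
def status(request, user, **kwargs):
    # 'request' and 'user' are supplied by Flask; **kwargs contains the HTTP GET
    # parameters. The returned object is serialised to JSON as the view output.
    return {"status": "ok", "parameters": kwargs}
```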
🐈🐈🐈🐈