Distributed download scripts for Common Crawl data.
Python >= 3.7 is required. Install the dependencies with:

```
pip install -r requirements.txt
```

On Linux distributions, `libmysqlclient-dev` (or an equivalent package) is also required:

```
sudo apt install libmysqlclient-dev
```
The default config file is located at `configs/default.conf`, which lists all the modifiable entries. Their descriptions and default values are shown below:
```ini
[database]
drivername = mysql
username = user
password = password
host = localhost
port = 3306
database = common_crawl

[worker]
; The name of this worker
name = unknown
; The interval of retries in seconds
retry_interval = 5
; The number of retries before giving up
retries = 10
; The timeout of internet connections in seconds
socket_timeout = 30
; The download root path
download_path = downloaded

[schedule]
; Whether to restrict download time
enabled = false
; The start of the allowed download time
start_time = 20:00:00
; The end of the allowed download time
end_time = 07:59:59
; The interval of retries when download is restricted
retry_interval = 300
```
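Note that the default `[schedule]` window (20:00:00 through 07:59:59) wraps past midnight, so a plain `start <= now <= end` check is not enough. The helper below is only an illustrative sketch of that wrap-around check; the function name and signature are not part of the project:

```python
from datetime import datetime, time
from typing import Optional


def within_download_window(start: time, end: time, now: Optional[time] = None) -> bool:
    """Return True if `now` falls inside the allowed download window."""
    if now is None:
        now = datetime.now().time()
    if start <= end:
        # Window stays within a single day, e.g. 09:00:00 - 17:00:00.
        return start <= now <= end
    # Window wraps past midnight, e.g. 20:00:00 - 07:59:59.
    return now >= start or now <= end


# With the defaults above, 23:30 is inside the window and 12:00 is not.
assert within_download_window(time(20, 0, 0), time(7, 59, 59), time(23, 30))
assert not within_download_window(time(20, 0, 0), time(7, 59, 59), time(12, 0))
```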
Do not modify the default config file directly. Instead, create your own `local.conf` under the `configs` folder and add the modified entries there. An example of a valid local config file:
```ini
[database]
username = common_crawl
password = &WcKLEsX!
host = 10.10.1.217

[schedule]
enabled = true
start_time = 20:00:00
end_time = 07:59:59
```
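As an illustration of how such layered configuration can be consumed (not the project's actual code): `configparser` reads both files in order, so entries in `local.conf` override the defaults, and the `[database]` keys happen to match the arguments of SQLAlchemy's `URL.create()`. Whether the project actually uses SQLAlchemy is an assumption here:

```python
from configparser import ConfigParser

from sqlalchemy.engine import URL  # assumption: the DB URL is built with SQLAlchemy


def load_config() -> ConfigParser:
    # Later files override earlier ones, so local.conf only needs the
    # entries that differ from default.conf; missing files are skipped.
    config = ConfigParser()
    config.read(["configs/default.conf", "configs/local.conf"])
    return config


config = load_config()

db_url = URL.create(
    drivername=config.get("database", "drivername"),
    username=config.get("database", "username"),
    password=config.get("database", "password"),
    host=config.get("database", "host"),
    port=config.getint("database", "port"),
    database=config.get("database", "database"),
)
retries = config.getint("worker", "retries")
schedule_enabled = config.getboolean("schedule", "enabled")
```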
Run the following command from the root of the project:

```
python src/main.py
```
Always press `CTRL-C` to exit the download process. Killing it directly will cause data loss and inconsistencies in the database.
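The reason `CTRL-C` is safe is that it raises `KeyboardInterrupt` inside Python, giving the worker a chance to reset the in-flight record and close its database connection before exiting, whereas a hard kill skips that cleanup. The snippet below is only a sketch of that pattern, with hypothetical `claim_next`, `download`, `release_current`, and `close` methods, not the actual code in `src/main.py`:

```python
def run(worker):
    try:
        while True:
            task = worker.claim_next()   # hypothetical: pick a pending row and mark it downloading
            if task is None:
                break
            worker.download(task)        # hypothetical: fetch the file and mark the row finished
    except KeyboardInterrupt:
        worker.release_current()         # hypothetical: put the interrupted row back to pending
    finally:
        worker.close()                   # hypothetical: commit state and close the DB connection
```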
Field | Type | Description |
---|---|---|
id | int | Primary key. Data ID |
uri | varchar(256) | The URI of the data, which determines both the download URL and the folder structure |
size | int | The size of the data in bytes |
started_at | datetime | Download start time (CST) |
finished_at | datetime | Download end time (CST) |
download_state | tinyint | Download state: 0 = pending, 1 = downloading, 2 = finished, 3 = failed |
id_worker | int | Foreign key. The ID of the worker that downloads this data |
archive | varchar(30) | The year and month of the data on Common Crawl |
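As a rough illustration only, the table above could be expressed as a SQLAlchemy declarative model like the following (SQLAlchemy 1.4+). The table and class names, and the use of SQLAlchemy at all, are assumptions rather than the project's actual definitions:

```python
from sqlalchemy import Column, DateTime, ForeignKey, Integer, SmallInteger, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Data(Base):
    __tablename__ = "data"  # assumed table name

    id = Column(Integer, primary_key=True)        # Data ID
    uri = Column(String(256))                     # download URL path / folder structure
    size = Column(Integer)                        # size in bytes
    started_at = Column(DateTime)                 # download start time (CST)
    finished_at = Column(DateTime)                # download end time (CST)
    download_state = Column(SmallInteger)         # tinyint: 0 pending, 1 downloading, 2 finished, 3 failed
    id_worker = Column(Integer, ForeignKey("worker.id"))  # worker that downloads this data
    archive = Column(String(30))                  # e.g. "CC-MAIN-2021-10"
```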
URIs can be obtained from `wet.paths` files on the Common Crawl website. An example of a URI:

```
crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/wet/CC-MAIN-20210224165708-20210224195708-00000.warc.wet.gz
```
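To make the "download URL and folder structure" remark concrete, here is a sketch of how such a URI can be turned into both. The `https://data.commoncrawl.org/` prefix is the public Common Crawl endpoint; whether the scripts use this exact prefix is an assumption:

```python
from pathlib import Path

CC_PREFIX = "https://data.commoncrawl.org/"  # assumed download endpoint
DOWNLOAD_PATH = Path("downloaded")           # the download_path config entry

uri = ("crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/wet/"
       "CC-MAIN-20210224165708-20210224195708-00000.warc.wet.gz")

download_url = CC_PREFIX + uri    # where the file is fetched from
local_path = DOWNLOAD_PATH / uri  # the local copy mirrors the URI's folder structure

print(download_url)
print(local_path)
```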
Field | Type | Description |
---|---|---|
id | int | Primary key. Worker ID |
name | varchar(128) | The name of the worker |
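Under the same assumptions as the data-table sketch above (SQLAlchemy, inferred table name), the worker table would look roughly like:

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Worker(Base):
    __tablename__ = "worker"  # assumed table name

    id = Column(Integer, primary_key=True)  # Worker ID, referenced by data.id_worker
    name = Column(String(128))              # the `name` entry in the [worker] config section
```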