This is a multithreaded scraping script for Pastebin. It scrapes the main site for new pastes, downloads their raw content and processes them according to a user-defined output format.
Fun.
The usual dance.
pip install -r requirements.txt
Define all required specs in settings.ini. Should you decide to go with a database output, make sure the respective connector is installed. At the moment, MySQL with pymysql and SQLite with the standard built-in Python 3 connector are supported.
Also note that the file output creates a subdirectory output and dumps every paste as a separate file into it.
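To make the file output concrete, here is a minimal sketch of the "one file per paste" idea. The function name write_paste, the .txt extension and the use of the paste ID as the filename are assumptions for illustration, not necessarily what the script does.

```python
from pathlib import Path


def write_paste(paste_id: str, content: str, base_dir: str = ".",
                encoding: str = "utf-8") -> Path:
    """Write one paste to <base_dir>/output/<paste_id>.txt and return the path.

    Hypothetical helper: the naming scheme is an assumption.
    """
    out_dir = Path(base_dir) / "output"
    # Create the output subdirectory on first use; do nothing if it exists.
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{paste_id}.txt"
    target.write_text(content, encoding=encoding)
    return target
```

Calling write_paste("abc123", "some content") would leave the paste in output/abc123.txt relative to the current directory.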
ini is a highly underrated file format. Here is what each settings parameter actually does.
PasteLimit: Stop after having scraped n pastes. Set to 0 for indefinite scraping.
PBLink: URL to Pastebin or another equivalent site.
DownloadWorkers: Number of workers that download the raw paste content and further process it.
NewPasteCheckInterval: Time to wait before checking the main site for new pastes again.
IPBlockedWaitTime: Time to wait until checking the main site again after the scraper's IP has been blocked.
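A rough sketch of how these general parameters could be read with Python's built-in configparser. The section name GENERAL, the sample values and the assumption that the wait times are given in seconds are all made up for illustration; check your own settings.ini for the real names.

```python
import configparser

# Hypothetical settings.ini fragment. The section name and every value
# here are assumptions, not the project's defaults.
SAMPLE = """
[GENERAL]
PasteLimit = 0
PBLink = https://pastebin.com
DownloadWorkers = 4
NewPasteCheckInterval = 60
IPBlockedWaitTime = 3600
"""


def load_general(text: str) -> dict:
    """Parse the general parameters from ini text into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["GENERAL"]
    limit = section.getint("PasteLimit")
    return {
        # 0 means "scrape indefinitely", represented here as None.
        "paste_limit": None if limit == 0 else limit,
        "pb_link": section.get("PBLink"),
        "download_workers": section.getint("DownloadWorkers"),
        "new_paste_check_interval": section.getint("NewPasteCheckInterval"),
        "ip_blocked_wait_time": section.getint("IPBlockedWaitTime"),
    }


settings = load_general(SAMPLE)
```

Mapping PasteLimit = 0 to None keeps the "no limit" case explicit instead of relying on a magic number further down the line.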
RotationLog: Location of the log file that contains debug output.
MaxRotationSize: Size in bytes before another log file is created.
RotationBackupCount: Maximum number of log files to keep.
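The three logging parameters line up neatly with the standard library's size-based RotatingFileHandler. Whether the script uses exactly this handler is an assumption; the sketch below only shows how the values would map onto it. The function name build_logger and the logger name are hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(rotation_log: str, max_rotation_size: int,
                 rotation_backup_count: int) -> logging.Logger:
    """Create a debug logger that rotates its file by size.

    Hypothetical helper: arguments mirror RotationLog, MaxRotationSize
    and RotationBackupCount.
    """
    logger = logging.getLogger("pastebin_scraper_sketch")
    logger.setLevel(logging.DEBUG)
    handler = RotatingFileHandler(
        rotation_log,
        maxBytes=max_rotation_size,         # roll over once the file reaches this size
        backupCount=rotation_backup_count,  # keep at most this many old files
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With build_logger("scraper.log", 1048576, 5) the handler would roll over at roughly 1 MiB and keep scraper.log.1 through scraper.log.5 as backups.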
Enable: Enable formatted stdout output of paste data.
ContentDisplayLimit: Maximum number of characters to show before content is cut off (0 to display all).
ShowName: Display the paste name.
ShowLang: Display the paste language.
ShowLink: Display the complete paste link.
ShowData: Display the raw paste content.
DataEncoding: Encoding of the raw paste data.
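How the stdout switches interact is easiest to see in code. This is a sketch under assumptions: the function name format_paste, the "Name:" style labels and the trailing "..." on truncated content are invented for illustration and the real output layout may differ.

```python
def format_paste(name, lang, link, data, *, content_display_limit=0,
                 show_name=True, show_lang=True, show_link=True,
                 show_data=True, data_encoding="utf-8"):
    """Build the text printed for one paste (hypothetical layout)."""
    lines = []
    if show_name:
        lines.append(f"Name: {name}")
    if show_lang:
        lines.append(f"Lang: {lang}")
    if show_link:
        lines.append(f"Link: {link}")
    if show_data:
        # Raw paste data may arrive as bytes; decode with the configured encoding.
        if isinstance(data, bytes):
            text = data.decode(data_encoding, errors="replace")
        else:
            text = data
        # A limit of 0 means "show everything".
        if content_display_limit > 0 and len(text) > content_display_limit:
            text = text[:content_display_limit] + "..."
        lines.append(text)
    return "\n".join(lines)
```

For example, format_paste("demo", "python", "https://pastebin.com/abc", b"print('hi')", content_display_limit=5) would cut the content down to its first five characters.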
Enable: Enable MySQL output.
TableName: Main table name to insert data into.
Host: MySQL server host.
Port: MySQL server port.
Username: MySQL server user.
Password: User password.
Enable: Enable SQLite output.
Filename: Filename the db should be saved as (usually ends with .db).
TableName: Main table name to insert data into.
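For the SQLite output, a minimal sketch of how Filename and TableName might be used with the built-in sqlite3 module. The column layout (id, name, lang, link, data) and the function name store_paste are assumptions; the script's actual schema may look different.

```python
import sqlite3


def store_paste(filename: str, table_name: str, paste: dict) -> None:
    """Insert one paste into <table_name> inside the SQLite file <filename>.

    Hypothetical helper with an assumed schema.
    """
    # Table names cannot be bound as SQL parameters, so validate the name
    # before interpolating it into the statement.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    conn = sqlite3.connect(filename)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                f"CREATE TABLE IF NOT EXISTS {table_name} ("
                "id TEXT PRIMARY KEY, name TEXT, lang TEXT, link TEXT, data TEXT)"
            )
            conn.execute(
                f"INSERT OR REPLACE INTO {table_name} VALUES (?, ?, ?, ?, ?)",
                (paste["id"], paste["name"], paste["lang"],
                 paste["link"], paste["data"]),
            )
    finally:
        conn.close()
```

The paste values go through placeholders, so arbitrary paste content cannot break the statement; only the table name needs the extra check.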
If you use this thing for some cool data analysis or even research, let me know if I can help!
Inspiration for this scraper was taken from here.