Wikimedia Commons Bulk Downloader By Category
Wikimedia Commons hosts a large number of files under various Creative Commons licenses, and every file there is categorised. The files include images, audio, video, and more.
We have an audio upload tool called Spell4Wiki that lets you upload audio files to Commons. The tool also categorises the uploaded audio files by language code.
For Tamil: Category:Files uploaded by spell4wiki in ta
For English: Category:Files uploaded by spell4wiki in en
More details: Category:Files uploaded by spell4wiki
These uploaded audio files can be reused in other FOSS projects, so this script provides an easy way to download all the files in a specific category.
Note: This script is not limited to audio files; it works for other file formats as well.
This script takes a category name and, optionally, a maximum record count.
- REQUIRED: category is the Wikimedia Commons category name that contains the files, e.g. "Category:Files uploaded by spell4wiki in ta"
- OPTIONAL: max_records is the maximum number of records you want to download.
The script downloads items from the most recently uploaded to the oldest, so max_records can be used to download only the latest items.
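The newest-first listing with a record cap can be sketched against the public MediaWiki API. This is only an illustration and not necessarily how the script itself collects files: the helper names (build_params, iter_file_titles, fetch_json) are hypothetical, a negative max_records is assumed to mean "no cap", and the category timestamp (when a file was added to the category) is assumed to be a good enough stand-in for upload time.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, Optional

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_params(category: str, limit: int = 500,
                 cont: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build the query for one page of file members, newest first."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "file",        # skip subcategories and ordinary pages
        "cmsort": "timestamp",   # time the file was added to the category
        "cmdir": "desc",         # most recent first
        "cmlimit": str(limit),
    }
    params.update(cont or {})    # continuation tokens from the previous page
    return params


def fetch_json(params: Dict[str, str]) -> dict:
    """Perform one API request (hypothetical helper, needs network access)."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    headers = {"User-Agent": "commons-category-example/0.1 (add your contact)"}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_file_titles(category: str,
                     fetch: Callable[[Dict[str, str]], dict] = fetch_json,
                     max_records: int = -1,
                     limit: int = 500) -> Iterator[str]:
    """Yield file titles newest first; stop after max_records if it is >= 0."""
    yielded = 0
    cont = None
    while True:
        data = fetch(build_params(category, limit, cont))
        for member in data.get("query", {}).get("categorymembers", []):
            if 0 <= max_records <= yielded:
                return
            yield member["title"]
            yielded += 1
        cont = data.get("continue")
        # Stop when there are no more pages or the cap has been reached.
        if not cont or 0 <= max_records <= yielded:
            return
```

For example, list(iter_file_titles("Category:Files uploaded by spell4wiki in ta", max_records=10)) would request the ten most recently categorised files. Passing the fetch function as a parameter keeps the paging logic testable without network access.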
Ref: Here, "Category:Files uploaded by spell4wiki in ta" is the category name.
- Download/Clone this Repo
git clone https://github.com/manimaran96/Wiki-Commons-Bulk-Downloader-By-Category.git
- Open the config.py file in an editor and set category, max_records and limit.
Note: max_records and limit are optional.
category = "Category:Files uploaded by spell4wiki in CHECK"
max_records = -1
limit = 500
For more details, check config.py.
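Before a long run it can help to sanity-check these three values. The sketch below is hypothetical and not part of the repository: validate_config is an invented name, -1 is assumed to mean "no cap on records", and limit is assumed to be the number of items requested per page.

```python
from typing import Dict, Optional, Union


def validate_config(category: str, max_records: int = -1,
                    limit: int = 500) -> Dict[str, Union[str, int, Optional[int]]]:
    """Check the config values and return them in a normalised form."""
    category = category.strip()
    if not category.startswith("Category:"):
        raise ValueError("category must start with 'Category:'")
    if max_records < -1:
        raise ValueError("max_records must be -1 (assumed: no cap) or >= 0")
    if limit <= 0:
        raise ValueError("limit must be a positive number")
    return {
        "category": category,
        # Assumption: -1 is treated as "download everything".
        "max_records": None if max_records == -1 else max_records,
        "limit": limit,
    }
```

Calling validate_config("Category:Files uploaded by spell4wiki in ta") returns the defaults with max_records normalised to None, while a name without the "Category:" prefix raises ValueError.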
- Install the following libraries
sudo apt update
sudo apt install python3
sudo apt install python3-pip
pip install beautifulsoup4
pip install aiohttp
pip install asyncio
pip install aiofiles
- Once everything is installed, run the script.
python3 wikimedia-commons-bulk-downloader-by-category.py
If you would like to contribute to this code, please read the todo list below. Before you start, create an issue and assign it to yourself; this helps avoid duplicated work.
- Several packages have to be installed, so create a requirements.txt for them.
- Fix: Downloading more than 3000 files, or large files, may fail because of the concurrent download/scraping calls.
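One common way to approach the concurrency item above is to cap the number of simultaneous downloads with asyncio.Semaphore and retry failures a few times. The following is a minimal sketch of that idea, not the project's actual fix: download_all and its parameters are hypothetical names, and download_one stands for whatever coroutine performs a single download.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def download_all(urls: Iterable[str],
                       download_one: Callable[[str], Awaitable[None]],
                       max_concurrent: int = 8,
                       retries: int = 3,
                       backoff: float = 0.5) -> List[str]:
    """Run download_one for every URL with bounded concurrency.

    Returns the URLs that still failed after all retries.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    failed: List[str] = []

    async def guarded(url: str) -> None:
        # At most max_concurrent coroutines are inside this block at once.
        async with semaphore:
            for attempt in range(1, retries + 1):
                try:
                    await download_one(url)
                    return
                except Exception:
                    if attempt == retries:
                        failed.append(url)
                    else:
                        # Wait a little longer before each new attempt.
                        await asyncio.sleep(backoff * attempt)

    await asyncio.gather(*(guarded(url) for url in urls))
    return failed
```

With aiohttp, download_one would typically wrap a request on a shared ClientSession and write the response body to disk. A suitable value for max_concurrent is a guess and would need tuning against real runs.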
Optional
- Compress the downloaded files into a .zip archive once the download has finished.
- Build a web portal for this.
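The .zip item could be covered with Python's standard zipfile module. This is a minimal sketch under assumptions: zip_directory is a hypothetical helper, and the downloaded files are assumed to sit in a single directory tree.

```python
import zipfile
from pathlib import Path
from typing import Union


def zip_directory(source_dir: Union[str, Path],
                  archive_path: Union[str, Path]) -> int:
    """Pack every file under source_dir into archive_path; return the file count."""
    source = Path(source_dir)
    archive = Path(archive_path)
    count = 0
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            # Skip directories and the archive itself if it lives inside source_dir.
            if path.is_file() and path.resolve() != archive.resolve():
                # Store paths relative to source_dir so the archive is portable.
                zf.write(path, arcname=path.relative_to(source).as_posix())
                count += 1
    return count
```

For example, zip_directory("downloads", "downloads.zip") would archive a hypothetical downloads folder and return the number of files written.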
- If you want to get in touch with the developer, you can send an email to manimarankumar96@gmail.com or contact @manimarank on Telegram.
- Feel free to post suggestions, changes, ideas etc. on GitHub or Telegram!