This directory contains the scrapers for PracticalPlants, Permapeople, Reinsaat and Wikidata (German common names).

Requirements:

- nodejs v14.21.2
- npm
Configure your environment:

- Install dependencies:

  ```shell
  npm install && mkdir -p data
  ```

- Create a `.env.local` file from `.env.example` and fill in the required values:

  ```shell
  cp .env.example .env.local
  ```
The following command will fetch the data from the sources, merge the datasets, apply the overrides and insert the data into the database:

```shell
npm run start:full
```
If you would like to skip the fetching steps and import your csv files from Nextcloud, you can use the following command:

```shell
npm run start
```

Note: you will need the following files in the `data` directory:

- `detail.csv`: scraped from PracticalPlants
- `permapeopleRawData.csv`: scraped from Permapeople
- `reinsaatRawData.csv`: scraped from Reinsaat and merged from `reinsaatRawDataEN.csv` and `reinsaatRawDataDE.csv`
- `germanCommonNames.csv`: scraped from wikidata
The following steps describe how to use the scraper to fetch the data from the sources and insert it into the database. The steps are simplified and only the most important commands are listed. For more information, please refer to the documentation of the individual scrapers mentioned in the first paragraph of this doc.
- Fetch the data

  The scraper fetches the data from the sources and stores it in csv format in the `data` directory:

  ```shell
  npm run fetch:practicalplants
  npm run fetch:permapeople
  npm run fetch:reinsaat && npm run merge:reinsaat
  ```
  The scraped data is stored in the `data` directory:

  - `detail.csv`: contains the raw data scraped from the PracticalPlants webpage.
  - `permapeopleRawData.csv`: contains the raw data scraped from the Permapeople webpage.
  - `reinsaatRawDataEN.csv`: contains the raw data scraped from the English version of the Reinsaat webpage.
  - `reinsaatRawDataDE.csv`: contains the raw data scraped from the German version of the Reinsaat webpage.
  - `reinsaatRawData.csv`: contains the merged data from the English and German versions of the Reinsaat webpage.
  - `germanCommonNames.csv`: contains the German common names fetched from https://www.wikidata.org
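  The scraping itself lives in the project's fetch scripts. As a minimal illustrative sketch only (the URL and column names are placeholders, and it assumes a runtime with global `fetch`, i.e. Node 18+ rather than the Node 14 pin above), the general shape of such a scraper is roughly:

  ```typescript
  // Minimal sketch of a fetch step, NOT the project's actual code.
  // Placeholder URL and columns; assumes global fetch (Node 18+).
  import { writeFileSync } from "fs";

  interface PlantRow {
    name: string;
    commonName: string;
  }

  // naive csv escaping: quote every field and double inner quotes
  function toCsvLine(fields: string[]): string {
    return fields.map((f) => `"${f.replace(/"/g, '""')}"`).join(",");
  }

  async function fetchPlants(url: string): Promise<PlantRow[]> {
    const res = await fetch(url);
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    const html = await res.text();
    // a real scraper would parse the html here (e.g. with a DOM library)
    void html;
    return [];
  }

  async function main(): Promise<void> {
    const rows = await fetchPlants("https://example.org/plants"); // placeholder
    const lines = [toCsvLine(["name", "common_name"])];
    for (const row of rows) lines.push(toCsvLine([row.name, row.commonName]));
    writeFileSync("data/detail.csv", lines.join("\n"));
  }

  main().catch(console.error);
  ```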
- Merge the scraped datasets

  The scraper merges the scraped data from all sources and stores it in csv format in the `data` directory:

  - `mergedDatasets.csv`: contains the merged datasets

  This can be done with the following command:

  ```shell
  npm run merge:datasets
  ```
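  Conceptually, the merge joins the per-source rows on a shared plant name. The sketch below is an assumption about that shape, not the project's `merge:datasets` logic; it uses a naive csv parser (no quoted commas) and assumes a `name` key column:

  ```typescript
  // Illustrative merge on a shared "name" column; NOT the real merge logic.
  import { readFileSync } from "fs";

  type Row = Record<string, string>;

  // naive csv parsing for brevity: cannot handle quoted commas
  function readCsv(path: string): Row[] {
    const [header, ...lines] = readFileSync(path, "utf8").trim().split("\n");
    const cols = header.split(",");
    return lines.map((line) => {
      const values = line.split(",");
      return Object.fromEntries(cols.map((c, i) => [c, values[i] ?? ""]));
    });
  }

  // later datasets fill in / overwrite fields of earlier ones
  function mergeByName(datasets: Row[][]): Row[] {
    const byName = new Map<string, Row>();
    for (const rows of datasets) {
      for (const row of rows) {
        const key = row["name"]; // assumed shared key column
        if (!key) continue;
        byName.set(key, { ...byName.get(key), ...row });
      }
    }
    return [...byName.values()];
  }

  const merged = mergeByName([
    readCsv("data/detail.csv"),
    readCsv("data/permapeopleRawData.csv"),
    readCsv("data/reinsaatRawData.csv"),
  ]);
  console.log(`${merged.length} merged rows`);
  ```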
- Fetch German common names

  This step goes through all unique names in `mergedDatasets.csv`, fetches the German common names from https://www.wikidata.org concurrently, and then merges them into `mergedDatasets.csv`.
  If Wikidata starts responding with 429 errors, reduce `MAX_CONCURRENT_REQUESTS` to a lower number, such as 10.

  ```shell
  npm run fetch:germannames && npm run merge:germannames
  ```
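  The concurrency cap is the important knob here. As a sketch of the idea only (the worker pool and the Wikidata query below are illustrative assumptions, not the project's implementation; assumes global `fetch`, Node 18+):

  ```typescript
  // Concurrency-limited Wikidata lookups; a sketch, not the project's code.
  const MAX_CONCURRENT_REQUESTS = 10; // lower this if you get 429 errors

  async function fetchGermanName(latinName: string): Promise<string | null> {
    const url =
      "https://www.wikidata.org/w/api.php?action=wbsearchentities&format=json" +
      `&language=de&search=${encodeURIComponent(latinName)}`;
    const res = await fetch(url);
    if (res.status === 429) throw new Error("rate limited by Wikidata");
    const body = (await res.json()) as { search?: { label?: string }[] };
    return body.search?.[0]?.label ?? null;
  }

  // simple worker pool: at most MAX_CONCURRENT_REQUESTS requests in flight
  async function fetchAll(names: string[]): Promise<(string | null)[]> {
    const results: (string | null)[] = new Array(names.length).fill(null);
    let next = 0;
    async function worker(): Promise<void> {
      while (next < names.length) {
        const i = next++;
        results[i] = await fetchGermanName(names[i]);
      }
    }
    await Promise.all(
      Array.from(
        { length: Math.min(MAX_CONCURRENT_REQUESTS, names.length) },
        () => worker(),
      ),
    );
    return results;
  }

  fetchAll(["Daucus carota"]).then(console.log).catch(console.error);
  ```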
- Apply overrides

  The scraped data can contain inconsistencies and errors. To correct these mistakes, we can create override files: `data/overrides` may contain any number of csv files, which are applied consecutively to `mergedDatasets.csv` to create `finalDataset.csv`. For details see `data/overrides/README.md`.

  ```shell
  npm run apply:overrides
  ```
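  "Applied consecutively" means later override files win over earlier ones. The sketch below shows that semantics on inline rows; the `name` key column and the rule that empty cells leave values untouched are assumptions here, the real format is documented in `data/overrides/README.md`:

  ```typescript
  // Sketch of consecutive override application; assumed, not the real format.
  type Row = Record<string, string>;

  // copies every non-empty override cell onto the matching merged row
  function applyOverride(merged: Map<string, Row>, overrides: Row[]): void {
    for (const override of overrides) {
      const target = merged.get(override["name"]); // assumed key column
      if (!target) continue;
      for (const [col, value] of Object.entries(override)) {
        if (value !== "") target[col] = value;
      }
    }
  }

  // demo with inline rows instead of real csv files
  const merged = new Map<string, Row>([
    ["Daucus carota", { name: "Daucus carota", spread: "30" }],
  ]);
  applyOverride(merged, [{ name: "Daucus carota", spread: "40" }]); // file 1
  applyOverride(merged, [{ name: "Daucus carota", spread: "45" }]); // file 2 wins
  console.log(merged.get("Daucus carota")); // spread ends up as "45"
  ```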
- Insert the data into the database

  The scraper inserts the final dataset into the database:

  ```shell
  npm run insert:plants
  ```
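  Under the hood this writes the csv rows into the database configured via `.env.local`. As a hedged sketch only (the `pg` client and the table and column names are assumptions for illustration, not the project's actual schema or stack):

  ```typescript
  // Sketch of a row insert with node-postgres; table/columns are placeholders.
  import { Client } from "pg";

  type Plant = { name: string; commonName: string | null };

  async function insertPlants(plants: Plant[]): Promise<void> {
    const client = new Client(); // reads PG* connection settings from the env
    await client.connect();
    try {
      for (const plant of plants) {
        await client.query(
          "INSERT INTO plants (unique_name, common_name) VALUES ($1, $2)",
          [plant.name, plant.commonName],
        );
      }
    } finally {
      await client.end();
    }
  }

  insertPlants([{ name: "Daucus carota", commonName: "Karotte" }]).catch(
    console.error,
  );
  ```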
- Insert relations into the database

  The scraper inserts the relation data into the database. First you need to download the `Companions.csv` and `Antagonist.csv` files from the Nextcloud server, or export them yourself from the current `Plant_Relations.ods`. Copy them into the `data` directory and run:

  ```shell
  npm run insert:relations
  ```