As part of our project ‘Data Scores as Governance’ we have developed a tool to map and investigate the uses of data analytics and algorithms in public services in the UK. Little is known so far about the implementation of data-driven systems and algorithmic processes in public services, and about how citizens are increasingly ‘scored’ based on the collection and combination of data.
This repository handles all aspects of the data collection and analysis.
All development happened on Linux and the code should work on macOS as well; Windows was never tested.
See the installation instructions for boot and the download section of nodejs to install the required tools. With Node.js installed, type the following to install all dependencies:
```
npm install
```
The Sugarcube toolset is used for all data acquisition processes; they can be repeated at any time.
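For orientation, Sugarcube assembles a run from a list of plugins. Below is a minimal sketch of what such an invocation could look like for this project; the plugin names and flags are assumptions, not taken from this repository:

```
# Hypothetical invocation: search DuckDuckGo for every query in the file
# and export the results to Elasticsearch. Plugin names and flags are
# assumptions; see the Sugarcube documentation for the actual API.
$(npm bin)/sugarcube -q queries/queries.txt -p ddg_search,elastic_export
```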
The initial data scrape from DuckDuckGo was done outside of this repository, but still using Sugarcube. This script extracts the contents of the search results of the initial data set and imports them into the database. The imported data can be found in `materials/ddg-scrapes-clean.csv`.
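Judging by the cleaning command at the end of this document, the cleaned CSV has four columns; an illustrative row (the values are invented):

```
search_category,title,description,href
predictive analytics,Example council report,Short result description,https://example.gov.uk/report
```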
Scrape DuckDuckGo for search results on government websites (`site:.gov.uk`) based on the initial set of search queries.
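Each query from the initial set is presumably combined with DuckDuckGo's `site:` operator, so a single search looks something like this (the query term is an invented example):

```
"predictive analytics" site:.gov.uk
```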
Scrape DuckDuckGo for search results for auxiliary websites based on the initial set of search queries. The list of auxiliary domains is maintained in `queries/aux-sites.txt`.
Import the contents of the FOI requests into the database. The requests themselves are found in `materials/foi-requests`.
Scrape DuckDuckGo for articles from media websites. This script works slightly differently because of the number of possible scrapes: they need to run on multiple servers in parallel to reduce the total scraping time.
- Use `./scripts/british_newspapers.clj` to create the list of media domains.
- Split the domains into chunks. On every server run the following (adapt `-n r/4` to the actual number of servers; `r/4` distributes the lines round-robin into four chunks):

  ```
  split -n r/4 --additional-suffix=.txt british-papers-domains.txt papers-
  ```

- Start the scrape on each server:

  ```
  ./bin/search-media-ddg.sh queries/papers-aa.txt
  ```
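With four servers, `split` produces one chunk per server, using its default `aa`, `ab`, … suffixes. A sketch of the resulting files and of the invocation on the second server (assuming the chunks are moved into `queries/`):

```
# Chunks produced by the split invocation above
papers-aa.txt  papers-ab.txt  papers-ac.txt  papers-ad.txt

# On the second server, for example:
./bin/search-media-ddg.sh queries/papers-ab.txt
```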
Post-processing of the data is done using the following collection of scripts. They are idempotent and can be rerun at any time.
Tag all documents in the database mentioning any company that is defined in `queries/companies.txt`.
Tag all documents in the database mentioning any system that is defined in `queries/systems.txt`.
Tag all documents in the database mentioning any authority name in combination with any company or system. The lists of data are defined in `queries/authorities.txt`, `queries/companies.txt` and `queries/systems.txt`. This script also matches authority locations, which are managed in `queries/coordinates.json`. If any location is missing, the script halts; add the missing location to the list of known coordinates to continue.
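A minimal sketch of what an entry in `queries/coordinates.json` could look like; the structure and values are assumptions, only the file name comes from this repository:

```
{
  "Bristol City Council": {"lat": 51.4545, "lng": -2.5879}
}
```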
Tag all documents in the database mentioning any department name in combination with any company or system. The lists of data are defined in `queries/departments.txt`, `queries/companies.txt` and `queries/systems.txt`.
Flag a set of documents as blacklisted. They will be excluded from any further analysis and from the data-scores-map application. The list of blacklisted IDs is collected in `queries/blacklist.txt`.
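The blacklist is presumably a plain-text file with one document ID per line; the IDs below are invented placeholders:

```
5b2f9c0ea1d2
7d41aa12bc3f
```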
Generate statistics about the occurrences of companies in the existing data set. It will print a sorted CSV data set to the screen.
Generate statistics about the occurrences of systems in the existing data set. It will print a sorted CSV data set to the screen.
Generate statistics about the occurrences of departments in the existing data set. It will print a sorted CSV data set to the screen.
Generate statistics about the occurrences of authorities in the existing data set. It will print a sorted CSV data set to the screen.
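All four statistics scripts follow the same pattern; an illustrative output for the companies script (the column names and values are assumptions):

```
name,count
Experian,42
IBM,17
```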
This script scrapes a list of all news media from https://www.britishpapers.co.uk/. The resulting newspaper domains are printed to the screen. Use the script like this:

```
./scripts/british_newspapers.clj | tee ./queries/british-papers-domains.txt
```
This script is a helper to create a new local index and reindex an existing data set. This was helpful during development to experiment on a data set. Run the script like this:

```
./scripts/reindex_data.clj http://localhost:9200/data-scores-04 http://localhost:9200/data-scores-05
```
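Under the hood this presumably drives Elasticsearch's `_reindex` API; the equivalent raw call would be something like the following (an assumption about the implementation, not taken from the script):

```
curl -X POST "http://localhost:9200/_reindex" \
  -H "Content-Type: application/json" \
  -d '{"source": {"index": "data-scores-04"}, "dest": {"index": "data-scores-05"}}'
```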
I used LibreOffice to remove line breaks in the original `rawresults.csv` and exported the file again as `materials/ddg-scrapes.csv`. Then I did more cleaning using the following command:
```
cat ddg-scrapes.csv | grep -v "^NO" | grep -v "Noresults" | cut -d, -f2- | sed -En "s/_(.*)-(.*)_100taps.json/\1 \2/p" | (echo "search_category,title,description,href" && cat) > ddg-scrapes-clean.csv
```
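My reading of each stage, as an annotated equivalent of the same command:

```
cat ddg-scrapes.csv |
  grep -v "^NO" |        # drop failed scrape lines starting with NO
  grep -v "Noresults" |  # drop rows that returned no search results
  cut -d, -f2- |         # remove the first CSV column
  sed -En "s/_(.*)-(.*)_100taps.json/\1 \2/p" |  # turn the scrape file name into the search category, keeping only matching rows
  (echo "search_category,title,description,href" && cat) > ddg-scrapes-clean.csv  # prepend a header
```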