As part of our project ‘Data Scores as Governance’ we have developed a tool to map and investigate the uses of data analytics and algorithms in public services in the UK. Little is known so far about the implementation of data-driven systems and algorithmic processes in public services, and about how citizens are increasingly ‘scored’ based on the collection and combination of data.
This repository handles all aspects of the data collection and analysis.
All development happened on Linux and the code should work on macOS as well; Windows was never tested.
See the installation instructions for boot and the download section of nodejs to install the required tools. With Node.js installed, type the following to install all dependencies:
```
npm install
```
The Sugarcube toolset is used for all data acquisition processes; they can be repeated at any time.
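For orientation, Sugarcube assembles a run from a list of plugins. Below is a minimal sketch of what such an invocation could look like for this project; the plugin names and flags are assumptions, not taken from this repository:

```
# Hypothetical invocation: search DuckDuckGo for every query in the file
# and export the results to Elasticsearch. Plugin names and flags are
# assumptions; see the Sugarcube documentation for the actual API.
$(npm bin)/sugarcube -q queries/queries.txt -p ddg_search,elastic_export
```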
The initial data scrape from DuckDuckGo was done outside of this repository, but still using Sugarcube. This script extracts the contents of the search results of the initial data set and imports them into the database. The imported data can be found in `materials/ddg-scrapes-clean.csv`.
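Judging by the cleaning command at the end of this document, the cleaned CSV has four columns; an illustrative row (the values are invented):

```
search_category,title,description,href
predictive analytics,Example council report,Short result description,https://example.gov.uk/report
```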
Scrape DuckDuckGo for search results on government websites (`site:.gov.uk`) based on the initial set of search queries.
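Each query from the initial set is presumably combined with DuckDuckGo's `site:` operator, so a single search looks something like this (the query term is an invented example):

```
"predictive analytics" site:.gov.uk
```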
Scrape DuckDuckGo for search results for auxiliary websites based on the initial set of search queries. The list of auxiliary domains is maintained in `queries/aux-sites.txt`.
Import the contents of the FOI requests into the database. The requests themselves are found in `materials/foi-requests`.
Scrape DuckDuckGo for articles from media websites. This script works slightly differently because of the number of possible scrapes: they need to run on multiple servers in parallel to reduce the total scraping time.
- Use `./scripts/british_newspapers.clj` to create the list of media domains.
- Split the domains into chunks. On every server run the following (adapt `-n r/4` to the actual number of servers; `r/4` distributes the lines round-robin into four chunks):

  ```
  split -n r/4 --additional-suffix=.txt british-papers-domains.txt papers-
  ```

- Start the scrape on each server:

  ```
  ./bin/search-media-ddg.sh queries/papers-aa.txt
  ```
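With four servers, `split` produces one chunk per server, using its default `aa`, `ab`, … suffixes. A sketch of the resulting files and of the invocation on the second server (assuming the chunks are moved into `queries/`):

```
# Chunks produced by the split invocation above
papers-aa.txt  papers-ab.txt  papers-ac.txt  papers-ad.txt

# On the second server, for example:
./bin/search-media-ddg.sh queries/papers-ab.txt
```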
Post-processing of the data is done using the following collection of scripts. They are idempotent and can be rerun at any time.
Tag all documents in the database mentioning any company that is defined in `queries/companies.txt`.
Tag all documents in the database mentioning any system that is defined in `queries/systems.txt`.
Tag all documents in the database mentioning any authority name in combination with any company or system. The lists of data are defined in `queries/authorities.txt`, `queries/companies.txt` and `queries/systems.txt`. This script also matches authority locations, which are managed in `queries/coordinates.json`. If any location is missing, the script halts; add the missing location to the list of known coordinates to continue.
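A minimal sketch of what an entry in `queries/coordinates.json` could look like; the structure and values are assumptions, only the file name comes from this repository:

```
{
  "Bristol City Council": {"lat": 51.4545, "lng": -2.5879}
}
```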
Tag all documents in the database mentioning any department name in combination with any company or system. The lists of data are defined in `queries/departments.txt`, `queries/companies.txt` and `queries/systems.txt`.
Flag a set of documents as blacklisted. They will be excluded from any further analysis and from the data-scores-map application. The list of blacklisted IDs is collected in `queries/blacklist.txt`.
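The blacklist is presumably a plain-text file with one document ID per line; the IDs below are invented placeholders:

```
5b2f9c0ea1d2
7d41aa12bc3f
```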
Generate statistics about the occurrences of companies in the existing data set. It will print a sorted CSV data set to the screen.
Generate statistics about the occurrences of systems in the existing data set. It will print a sorted CSV data set to the screen.
Generate statistics about the occurrences of departments in the existing data set. It will print a sorted CSV data set to the screen.
Generate statistics about the occurrences of authorities in the existing data set. It will print a sorted CSV data set to the screen.
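All four statistics scripts follow the same pattern; an illustrative output for the companies script (the column names and values are assumptions):

```
name,count
Experian,42
IBM,17
```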
This script scrapes a list of all news media from https://www.britishpapers.co.uk/. The resulting newspaper domains are printed to the screen. Use the script like this:

```
./scripts/british_newspapers.clj | tee ./queries/british-papers-domains.txt
```
This script is a helper to create a new local index and reindex an existing data set. This was helpful during development to experiment on a data set. Run the script like this:

```
./scripts/reindex_data.clj http://localhost:9200/data-scores-04 http://localhost:9200/data-scores-05
```
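Under the hood this presumably drives Elasticsearch's `_reindex` API; the equivalent raw call would be something like the following (an assumption about the implementation, not taken from the script):

```
curl -X POST "http://localhost:9200/_reindex" \
  -H "Content-Type: application/json" \
  -d '{"source": {"index": "data-scores-04"}, "dest": {"index": "data-scores-05"}}'
```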
I used LibreOffice to remove line breaks in the original `rawresults.csv` and exported the file again as `materials/ddg-scrapes.csv`. Then I did more cleaning using the following command:
```
cat ddg-scrapes.csv | grep -v "^NO" | grep -v "Noresults" | cut -d, -f2- | sed -En "s/_(.*)-(.*)_100taps.json/\1 \2/p" | (echo "search_category,title,description,href" && cat) > ddg-scrapes-clean.csv
```
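My reading of each stage, as an annotated equivalent of the same command:

```
cat ddg-scrapes.csv |
  grep -v "^NO" |        # drop failed scrape lines starting with NO
  grep -v "Noresults" |  # drop rows that returned no search results
  cut -d, -f2- |         # remove the first CSV column
  sed -En "s/_(.*)-(.*)_100taps.json/\1 \2/p" |  # turn the scrape file name into the search category, keeping only matching rows
  (echo "search_category,title,description,href" && cat) > ddg-scrapes-clean.csv  # prepend a header
```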