The Enhanced OCOD dataset

This repo provides the pipeline to create the enhanced OCOD dataset. The enhanced OCOD dataset cleans and enhances the publicly available OCOD dataset produced by Land Registry. The OCOD dataset contains the addresses, and additional metadata, of UK property owned by offshore companies. It has several issues that make it difficult to use: addresses are in free-text format, a single title number can cover multiple properties, and there is no indication of property usage type (Residential, Business, etc.).

The enhanced OCOD dataset attempts to resolve these issues. The pipeline tidies the data so that there is a single property per line, parses the address to make it easier to use, and locates the properties within the LSOA/OA system developed by the Office for National Statistics. Finally, the pipeline classifies properties into 5 categories: Residential, Business, Airspace, Land, and Carpark. Any properties which do not have enough information to be put in one of the categories are given the class "unknown".

The enhanced OCOD dataset was demonstrated in the paper 'Inspecting the laundromat: Mapping and characterising offshore owned residential property in London'. The code for the analysis in the paper can be found here.

The Dataset

The compressed dataset from the time the paper was written is included with this repo as 'OCOD_classes.tar.xz'. However, it will become out of date, so it is recommended to rebuild the dataset with current information. The following section describes how to generate the enhanced OCOD dataset.

How to create the Enhanced OCOD dataset

This repo contains the code needed to create the enhanced OCOD dataset. The simplest way is to build and run the Docker image found here. Otherwise the script [full_ocod_parse_process.py] can be run from a Python environment. The process is quite memory intensive, so using a machine with 16 GB of memory is advisable.

Project set-up

clone repo
navigate to repo folder
navigate to empty_homes_data
download the required open-source data and rename the files/folders (see below)
download spaCy model from this OSF data repository and extract in the empty_homes_data directory of the repo

Docker process

See the docker readme for detailed instructions

install docker
build docker
run docker

Python process

In the command line, type the following from the repo root folder:

pip install spacy numpy pandas #installs the required libraries
python -m spacy download en_core_web_lg #the spaCy model uses the large vector language model optimized for CPU
python ./full_ocod_parse_process.py ./data/ OCOD.csv OCOD_enhanced.csv

The python command takes three arguments: the path to the data folder (shown above as ./data/), the name of the file to be parsed (OCOD.csv), and the name of the file created at the end of the parsing process (OCOD_enhanced.csv). Note that all files need to be inside the data folder. The process, or indeed Docker, can be run in a loop using a bash script to parse multiple files.
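The same looping idea can also be sketched in Python rather than bash. The example below is only an illustration: the helper names (build_command, output_name_for, parse_all), the 'OCOD*.csv' glob pattern and the '_enhanced' output naming rule are all assumptions and are not part of this repo. It defaults to a dry run that only builds the commands, so nothing is executed unless dry_run=False is passed.

```python
import subprocess
import sys
from pathlib import Path


def build_command(data_dir, input_name, output_name,
                  script="./full_ocod_parse_process.py"):
    """Return the argument list for one run of the parsing script."""
    # The README calls the script with a trailing slash on the data folder.
    data_arg = str(data_dir).rstrip("/") + "/"
    return [sys.executable, script, data_arg, input_name, output_name]


def output_name_for(input_name):
    """Hypothetical naming rule: OCOD_2022.csv -> OCOD_2022_enhanced.csv."""
    path = Path(input_name)
    return f"{path.stem}_enhanced{path.suffix}"


def parse_all(data_dir, pattern="OCOD*.csv", dry_run=True):
    """Build (and optionally run) one command per matching input file."""
    commands = []
    for csv_path in sorted(Path(data_dir).glob(pattern)):
        if csv_path.stem.endswith("_enhanced"):
            continue  # skip outputs left over from earlier runs
        cmd = build_command(data_dir, csv_path.name,
                            output_name_for(csv_path.name))
        commands.append(cmd)
        if not dry_run:
            # check=True stops the loop if one file fails to parse.
            subprocess.run(cmd, check=True)
    return commands
```

Calling parse_all("./data/") lists the commands that would be run; parse_all("./data/", dry_run=False) runs them one after another. Because each run is memory intensive, the sketch runs the files sequentially rather than in parallel.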

The script itself is full_ocod_parse_process.py

During the parsing and classification process

The process takes about 15 minutes and is broken into several stages. The stages provide information on progress, and at the end a file called 'enhanced_ocod_dataset.csv' is saved into the 'empty_homes_data' folder.

The key stages are

NER labelling using a pre-trained spaCy model
Parsing
Expanding so that there is only one property per row
Locating in the Census Geography system
Classifying into the 5 types and 'unknown'
Contracting, by removing irrelevant duplicated data
Saving the enhanced OCOD dataset
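To give a feel for the expanding stage listed above, here is a toy sketch of turning one row that covers a range of street numbers into one row per property. It is not the repo's implementation: the actual pipeline works from the NER labels, whereas this example uses a simple regular expression, and the function name, the row layout and the title number 'AB123' are made up for illustration. It also ignores real-world complications such as odd-only or even-only numbering.

```python
import re

# Matches a simple numeric range such as "1-3" or "10 to 12".
RANGE_PATTERN = re.compile(r"\b(\d+)\s*(?:-|to)\s*(\d+)\b")


def expand_row(row):
    """Expand one {'title_number', 'address'} row into one row per property."""
    address = row["address"]
    match = RANGE_PATTERN.search(address)
    if match is None:
        return [dict(row)]  # nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    if end < start:
        return [dict(row)]  # not a sensible range, leave untouched
    expanded = []
    for number in range(start, end + 1):
        # Replace the range text with a single number, keep other fields.
        single = address[:match.start()] + str(number) + address[match.end():]
        expanded.append({**row, "address": single})
    return expanded


rows = expand_row({"title_number": "AB123", "address": "1-3 High Street, London"})
for r in rows:
    print(r["title_number"], r["address"])
```

Each output row keeps the original title number, which is why a later contracting step is needed to remove duplicated data.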

Required Datasets

In order to re-create or update the Enhanced OCOD dataset, several open-source datasets are required. These datasets should be downloaded into the empty_homes_data folder in this repo and renamed as below. All the data is free and covered by an Open Government Licence (OGL); however, the OCOD dataset requires the user to create an account.

Dataset	Change file/folder name to	Type
OCOD dataset	OCOD.csv	csv
ONSPD*	ONSPD.zip	zip
Price Paid dataset**	price_paid_files	folder
VOA ratings list***	VOA_ratings.csv	csv

Note: * Only the name of the ONSPD zip file needs to be changed; the data inside does not. The script searches for the correct file inside. Future versions of the script may be more flexible with regard to file names.

** The price paid dataset should be downloaded as yearly files and placed inside a folder called 'price_paid_files'. It is advisable to download several years. The paper used 2017-2021. Having more years increases the chances of being able to fill in missing information in OCOD; however, after a few years the benefits reduce and the memory costs become high.

*** There are several files in the dataset. The one with a name similar to 'uk-englandwales-ndr-20xx-listentries-compiled-epoch-00xx-baseline-csv.csv' is the correct one.
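Before starting a run it can be useful to confirm that everything in the table above is in place and correctly named. The following is a minimal sketch of such a check, assuming the names from the table; the function name missing_inputs and the REQUIRED mapping are illustrative and do not exist in this repo.

```python
from pathlib import Path

# Names taken from the table above; "folder" entries must be directories.
REQUIRED = {
    "OCOD.csv": "file",
    "ONSPD.zip": "file",
    "price_paid_files": "folder",
    "VOA_ratings.csv": "file",
}


def missing_inputs(data_dir):
    """Return the required names that are absent or of the wrong kind."""
    root = Path(data_dir)
    missing = []
    for name, kind in REQUIRED.items():
        path = root / name
        found = path.is_dir() if kind == "folder" else path.is_file()
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_inputs("empty_homes_data")
    if problems:
        print("Missing or misnamed:", ", ".join(problems))
    else:
        print("All required datasets found.")
```

The check only looks at names and types; it does not verify the contents of the files or that the price paid folder contains any yearly files.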

Additional code

Several notebooks are used in this repo. These notebooks were used to develop the final script. The notebooks used in this paper are as follows.

In order to run these scripts you must download several open-source datasets produced by the UK government. Please see the data section in the paper's method for details.

Contributing

This dataset pipeline is meant to be used; suggestions, helpful commits, and improvements are welcome.

Citing this dataset

If you use this repository please cite the paper

What's in the laundromat? Mapping and characterising offshore owned residential property in London https://doi.org/10.1177/2399808323115548

OGL notices

Source: Office for National Statistics licensed under the Open Government Licence v.3.0

