Build a DuckDB database for performant querying of publicly available cancer data. The code is in a functional state for use in personal projects: while it works as intended, it includes several quick-and-dirty implementations.
- GDC data via UCSC Xena
- ICGC data via UCSC Xena
- PCAWG data via UCSC Xena
- PDC data via LinkedOmicsKB
Consult the links above for details on data collection, processing, and guidelines on usage.
Raw data is prepared using "models" saved as `.sql` files in `./models`, which (occasionally) follow best practices outlined here.
```shell
make -C /path/to/public_cancer_db
```
With custom configuration:

```shell
make -C /path/to/public_cancer_db DIR="/path/to/large_data_storage" MEMORY_LIMIT=32GB NCORES=16 DOWNLOADER=aria2
```
Adding GDC data to an existing database:

```shell
make gdc -C /path/to/public_cancer_db DB="/path/to/data.db"
```
Quoting from the DuckDB docs on concurrency:

> DuckDB has two configurable options for concurrency:
>
> 1. One process can both read and write to the database.
> 2. Multiple processes can read from the database, but no processes can write (`access_mode = 'READ_ONLY'`).
>
> When using option 1, DuckDB supports multiple writer threads ...
To benefit from `make` parallelism, the database can be built in two steps using phony targets:

```shell
echo "Fetching data ..."
make fetch -C /path/to/public_cancer_db -j 8
echo "Building database ..."
make ingest -C /path/to/public_cancer_db -j 1
```

where `NCORES`, etc. can be configured separately to best utilize available resources for multi-threaded database writes.