Build a DuckDB database for performant querying of publicly available cancer data. The code is in a functional state for use in personal projects: while it works as intended, it includes several quick-and-dirty implementations.
- GDC data via UCSC Xena
- ICGC data via UCSC Xena
- PCAWG data via UCSC Xena
- PDC data via LinkedOmicsKB
Consult the links above for details on data collection, processing, and guidelines on usage.
Raw data is prepared using "models" saved as `.sql` files in `./models`, which (occasionally) follow best practices outlined here.
```shell
make -C /path/to/public_cancer_db
```
With custom configuration:

```shell
make -C /path/to/public_cancer_db DIR="/path/to/large_data_storage" MEMORY_LIMIT=32GB NCORES=16 DOWNLOADER=aria2
```
Adding GDC data to an existing database:

```shell
make gdc -C /path/to/public_cancer_db DB="/path/to/data.db"
```
Quoting from the DuckDB docs on concurrency:

> DuckDB has two configurable options for concurrency:
>
> 1. One process can both read and write to the database.
> 2. Multiple processes can read from the database, but no processes can write (`access_mode = 'READ_ONLY'`).
>
> When using option 1, DuckDB supports multiple writer threads ...
To benefit from `make` parallelism, the database can be built in two steps using phony targets:

```shell
echo "Fetching data ..."
make fetch -C /path/to/public_cancer_db -j 8
echo "Building database ..."
make ingest -C /path/to/public_cancer_db -j 1
```

where `NCORES`, etc. can be configured separately to best utilize available resources for multi-threaded database writes.