t-silvers/public_cancer_db
Create a Persistent DuckDB Database for Public Cancer Data

Description

Build a DuckDB database for performant querying of publicly available cancer data. The code is functional and intended for use in personal projects. It works as intended, but includes several quick-and-dirty implementations.

Data Sources

Consult links for details on data collection, processing, and guidelines on usage.

Data Models

Raw data is prepared using "models" saved as .sql files in ./models, which (occasionally) follow best practices outlined here.

Prerequisites

  • aria2 (optional; fallback: wget) [docs]
  • DuckDB [docs]
  • make
  • wget
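
A minimal sketch for checking these prerequisites before building. The function names `check_prereqs` and `pick_downloader` are hypothetical and not part of the repository. It assumes the aria2 executable is installed as `aria2c` and the DuckDB CLI as `duckdb`.

```shell
#!/usr/bin/env sh
# Hypothetical helpers (not part of the repository).

# Report any required tool that is not on PATH.
check_prereqs() {
  missing=""
  for tool in duckdb make wget; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "missing prerequisites:$missing" >&2
    return 1
  fi
}

# Prefer aria2 when its executable (aria2c) is available, otherwise wget.
pick_downloader() {
  if command -v aria2c >/dev/null 2>&1; then
    echo aria2
  else
    echo wget
  fi
}
```

The result could then be passed along as `DOWNLOADER="$(pick_downloader)"`, assuming the Makefile accepts `wget` as an explicit value as well as `aria2`.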

Usage

make -C /path/to/public_cancer_db

With custom configuration:

make -C /path/to/public_cancer_db DIR="/path/to/large_data_storage" MEMORY_LIMIT=32GB NCORES=16 DOWNLOADER=aria2

Adding GDC data to an existing database:

make gdc -C /path/to/public_cancer_db DB="/path/to/data.db"

Note on Concurrency

Quoting from the DuckDB docs on concurrency,

DuckDB has two configurable options for concurrency:

  1. One process can both read and write to the database.
  2. Multiple processes can read from the database, but no processes can write (access_mode = 'READ_ONLY').

When using option 1, DuckDB supports multiple writer threads ...
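
As an illustration of option 2, once the database is built, several processes can query it at the same time if each opens it read-only. The sketch below uses the DuckDB CLI's `-readonly` flag. The helper name `query_ro` and the database path are placeholders.

```shell
#!/usr/bin/env sh
# Hypothetical helper: run a SQL statement against the database in
# read-only mode, so that other processes can read it concurrently.
query_ro() {
  db="$1"
  shift
  duckdb -readonly "$db" "$@"
}

# Example (path is a placeholder):
#   query_ro /path/to/data.db "SHOW TABLES;"
```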

To benefit from make parallelism while keeping a single writer process, the database can be built in two steps using the phony targets fetch and ingest:

echo "Fetching data ..."
make fetch -C /path/to/public_cancer_db -j 8

echo "Building database ..."
make ingest -C /path/to/public_cancer_db -j 1

where NCORES, etc. can be configured separately to best utilize available resources for multi-threaded database writes.
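
The two steps can also be wrapped in a small script. This is a sketch under assumptions: the names `run`, `build_db`, `REPO`, `FETCH_JOBS`, and `DRY_RUN` are hypothetical, and it assumes the ingest target honours `NCORES` and `MEMORY_LIMIT` when they are passed on the command line. By default it only prints the commands.

```shell
#!/usr/bin/env sh
# Hypothetical wrapper around the fetch and ingest targets.
REPO="${REPO:-/path/to/public_cancer_db}"
FETCH_JOBS="${FETCH_JOBS:-8}"        # parallel make jobs for downloads
NCORES="${NCORES:-16}"               # threads for database writes
MEMORY_LIMIT="${MEMORY_LIMIT:-32GB}"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

build_db() {
  # Step 1: downloads are independent, so many make jobs are safe.
  run make fetch -C "$REPO" -j "$FETCH_JOBS" || return 1
  # Step 2: a single make job, so only one process writes to the database.
  run make ingest -C "$REPO" -j 1 NCORES="$NCORES" MEMORY_LIMIT="$MEMORY_LIMIT"
}
```

Calling `build_db` prints both commands for review. Setting `DRY_RUN=0` before the call runs them.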
