Skip to content

Python interface for accessing Nextstrain SARS-CoV-2 sequence and clade data

License

Notifications You must be signed in to change notification settings

reichlab/cladetime

Repository files navigation

Cladetime

Cladetime is a wrapper around Nextstrain datasets and tools for downloading and working with SARS-CoV-2 virus genome sequence data at a specific point in time.

Usage

This library contains two types of components:

  1. Scripts and CLI tools for use in database pipelines required by the Variant Nowcast Hub (these are in development and not documented here).

  2. Python classes for interactively working with SARS-CoV-2 sequence data, metadata, and clade assignments.

Getting started with Cladetime

To use cladetime interactively, install the package from GitHub:

pip install git+https://github.com/reichlab/cladetime.git

Below are a few examples of using Cladetime in a Python interpreter. See the project documentation for more details.

Time traveling with Cladetime

Cladetime knows where to find past versions of Nextstrain's SARS-CoV-2 sequence data and metadata files. These are intermediate files produced by Nextstrain's daily workflow. Cladetime uses the full set of open Genbank data.

>>> from cladetime import CladeTime

# Create a CladeTime object for any date after May 2023
# (for the most recent data, remove the `sequence_as_of` parameter)
>>> ct = CladeTime(sequence_as_of="2024-10-15")

# Based on the sequence_as_of date (above), Cladetime provides URLs to the corresponding
# SARS-CoV-2 past sequence data and metadata files
>>> ct.url_sequence
'https://nextstrain-data.s3.amazonaws.com/files/ncov/open/sequences.fasta.zst?versionId=8Zszokay3LRP5Zec_cviQ8oXkx8cJlwq'
>>> ct.url_sequence_metadata
'https://nextstrain-data.s3.amazonaws.com/files/ncov/open/metadata.tsv.zst?versionId=U4aIlh5HI1XuDLPW7q9WTZad6gXwARqT'

Interacting with sequence metadata

The metadata provided by Nextstrain contains descriptive information about SARS-CoV-2 sequences. Cladetime provides a Polars-based interface to work with this metadata directly, without downloading it first.

>>> from datetime import datetime
>>> import polars as pl
>>> from cladetime import CladeTime

>>> ct = CladeTime()
>>> metadata = ct.sequence_metadata  # Returns a Polars LazyFrame

# From there, you can use Polars to manipulate the data as needed
>>> filtered_sequence_metadata = (
...     metadata
...     .select(["country", "division", "date", "host", "clade_nextstrain"])
...     .rename({"division": "location"})
...     .cast({"date": pl.Date}, strict=False)
...     .filter(
...         pl.col("country") == "USA",
...         pl.col("date") >= pl.lit(datetime(2024, 9, 1)),
...         pl.col("date") < pl.lit(datetime(2024, 11, 1)),
...     )
...     .group_by(["date", "clade_nextstrain"]).len()
... ).collect(streaming=True)

>>> filtered_sequence_metadata.head(5)
shape: (5, 3)
┌────────────┬──────────────────┬─────┐
│ dateclade_nextstrainlen │
│ --------- │
│ datestru32 │
╞════════════╪══════════════════╪═════╡
│ 2024-09-0524B42  │
│ 2024-09-2624E73  │
│ 2024-09-0324A100 │
│ 2024-09-2424F17  │
│ 2024-09-2524B12  │
└────────────┴──────────────────┴─────┘

# Pandas users can export Polars dataframes
>>> pandas_df = filtered_sequence_metadata.to_pandas()

Reference trees and clade assignments

Cladetime also allows you to access SARS-CoV-2 reference trees from past dates (back to August 1, 2024). This is useful for assigning sequences to clades based on a specific reference tree.

>>> from cladetime import CladeTime, Tree

>>> ct = CladeTime(tree_as_of="2024-09-01")
>>> ref_tree = Tree(ct)

# The tree object provides a URL to the reference tree as
# it existed on the "tree_as_of" date
>>> ref_tree.url
'https://data.clades.nextstrain.org/v3/nextstrain/sars-cov-2/wuhan-hu-1/orfs/2024-07-17--12-57-03Z/tree.json'

# It also contains the tree itself, in the form of a Python dictionary
>>> ref_tree.tree.keys()
dict_keys(['version', 'meta', 'tree', 'root_sequence'])

>>> print(ref_tree.tree['meta']['title'], ref_tree.tree['meta']['updated'])
SARS-CoV-2 phylogeny 2024-07-17

Clade assignments with past sequences and reference trees

Coming soon!

About

Python interface for accessing Nextstrain SARS-CoV-2 sequence and clade data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages