fasql (v0.5.0)

Read and Write FASTX Files w/ DuckDB

Warning This repo will be archived. Please see exon-duckdb, which has similar but expanded functionality.

Reading Overview
- Schemas
  - FASTA
  - FASTQ
- Replacement Scans
- Globs
Writing Overview
- FASTA
- FASTQ
Installation and Usage
- DuckDB Console
- Python
Supported Platforms

Reading Overview

fasql is an open source DuckDB Extension from WHERE TRUE Technologies that adds FASTA and FASTQ file parsing as table functions.

For example, given a FASTA file called ./swissprot.fasta.gz in your local directory, you can query it like so.

SELECT *
FROM read_fasta('./swissprot.fasta.gz')
LIMIT 5
-- ┌──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────┐
-- │          id          │     description      │                                                           sequence                                                           │      file_name       │
-- │       varchar        │       varchar        │                                                           varchar                                                            │       varchar        │
-- ├──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────┤
-- │ sp|A0A023I7E1|ENG1…  │ Glucan endo-1,3-be…  │ MRFQVIVAAATITMITSYIPGVASQSTSDGDDLFVPVSNFDPKSIFPEIKHPFEPMYANTENGKIVPTNSWISNLFYPSADNLAPTTPDPYTLRLLDGYGGNPGLTIRQPSAKVLGSYPPTN…  │ ./swissprot.fasta.gz │
-- │ sp|A0A024B7W1|POLG…  │ Genome polyprotein…  │ MKNPKKKSGGFRIVNMLKRGVARVSPFGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKEKKRRGADTSVGIVGLLLTTAMA…  │ ./swissprot.fasta.gz │
-- │ sp|A0A024SC78|CUTI…  │ Cutinase OS=Hypocr…  │ MRSLAILTTLLAGHAFAYPKPAPQSVNRRDWPSINEFLSELAKVMPIGDTITAACDLISDGEDAAASLFGISETENDPCGDVTVLFARGTCDPGNVGVLVGPWFFDSLQTALGSRTLGVKGV…  │ ./swissprot.fasta.gz │
-- │ sp|A0A024SH76|GUX2…  │ Exoglucanase 2 OS=…  │ MIVGILTTLATLATLAASVPLEERQACSSVWGQCGGQNWSGPTCCASGSTCVYSNDYYSQCLPGAASSSSSTRAASTTSRVSPTTSRSSSATPPPGSTTTRVPPVGSGTATYSGNPFVGVTP…  │ ./swissprot.fasta.gz │
-- │ sp|A0A026W182|ORCO…  │ Odorant receptor c…  │ MMKMKQQGLVADLLPNIRVMKTFGHFVFNYYNDNSSKYLHKVYCCVNLFMLLLQFGLCAVNLIVESADVDDLTANTITLLFFTHSIVKICYFAIRSKYFYRTWAIWNNPNSHPLFAESNARY…  │ ./swissprot.fasta.gz │
-- └──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────┘

Schemas

The schemas for the FASTA and FASTQ table functions are as follows.

FASTA

The read_fasta function returns a table with the following schema.

column_name	column_type	nullable
id	VARCHAR	NO
description	VARCHAR	YES
sequence	VARCHAR	NO
file_name	VARCHAR	NO

FASTQ

The read_fastq function returns a table with the following schema.

column_name	column_type	nullable
id	VARCHAR	NO
description	VARCHAR	YES
sequence	VARCHAR	NO
quality_scores	VARCHAR	NO
file_name	VARCHAR	NO

Replacement Scans

A number of "replacement scans" also work, whereby you just need to have a file reasonably named, and the extension will pick up on it as the appropriate file. E.g. SELECT * FROM 'test.fasta' or SELECT * FROM 'test.fastq.gz'.

The following file endings will trigger the scan for FASTA:

.fasta.gz
.fa.gz
.fasta
.fa

And for FASTQ:

.fastq.gz
.fq.gz
.fastq
.fq

Globs

Globs are supported both within the table function and the replacement scan provided the glob matches the replacement scan in the first place.

For example, SELECT * FROM './path/to/*.fasta' will select all FASTA files in the ./path/to/ directory. This is the same as SELECT * FROM read_fasta('./path/to/*.fasta').

Writing Overview

On MacOS and Linux, you can write FASTA and FASTQ files using COPY TO.

FASTA

For example, given a table called my_fasta with the schema id VARCHAR, description VARCHAR, sequence VARCHAR, you can write it to a file called my_fasta.fasta like so.

For example, combining this with the DuckDB postgres scanner to export a query to a file:

LOAD 'postgres_scanner';
LOAD 'fasql';

-- Your connection string here, can use PGPASSWORD for the password
CALL POSTGRES_ATTACH('postgres://localhost:5432/my_db');

COPY (
  SELECT id_col AS id, desc_col AS description, seq_col AS sequence
  FROM my_table
  WHERE length(seq_col) < 50
) TO 'long_sequences.fasta' (FORMAT 'fasta');

A description column is optional, and if not present will be ignored.

FASTQ

Similar to above, the COPY TO syntax is the same, but the format is fastq and schema is id VARCHAR, description VARCHAR, sequence VARCHAR, quality_scores VARCHAR.

Installation and Usage

You can use this extension as you would other DuckDB extensions. Here's one example of how to do that in a raw DuckDB console and one in Python.

DuckDB Console

To install and use fasql, start a duckdb session:

# unsigned required for non DuckDB Labs distributed extensions (or a large monthly payment)
$ duckdb -unsigned

Now from the session:

D SET custom_extension_repository='dbe.wheretrue.com/fasql/latest';
D INSTALL fasql;
D LOAD fasql;

If you want some easy data to play with you can wget one of the test files.

$ wget https://raw.githubusercontent.com/wheretrue/fasql/main/test/sql/test.fasta

Then back into the console, load the extension and enjoy.

D LOAD fasql;
D SELECT COUNT(*) FROM read_fasta('./test.fasta');
-- ┌──────────────┐
-- │ count_star() │
-- │    int64     │
-- ├──────────────┤
-- │            2 │
-- └──────────────┘

Python

For example, this script installs the extension then counts the number of records at path.

import pathlib

import duckdb

con = duckdb.connect(config={'allow_unsigned_extensions': True})
con.execute("SET custom_extension_repository='dbe.wheretrue.com/fasql/latest';")
con.execute("INSTALL fasql;")
con.execute("LOAD fasql;")

# Assumes this is in your home directory.
path = pathlib.Path("swissprot.fasta.gz")

result = con.execute(f"SELECT * FROM read_fasta('{path}');").fetchall()
print(len(result))
assert len(result) == 569213

# Or create a dataframe.
df = con.execute(f"SELECT id, description, sequence FROM read_fasta('{path}');").df()
# print(df.head())
#                           id                                        description                                           sequence
# 0   sp|A0A023I7E1|ENG1_RHIMI  Glucan endo-1,3-beta-D-glucosidase 1 OS=Rhizom...  MRFQVIVAAATITMITSYIPGVASQSTSDGDDLFVPVSNFDPKSIF...
# 1   sp|A0A024B7W1|POLG_ZIKVF  Genome polyprotein OS=Zika virus (isolate ZIKV...  MKNPKKKSGGFRIVNMLKRGVARVSPFGGLKRLPAGLLLGHGPIRM...
# 2  sp|A0A024SC78|CUTI1_HYPJR  Cutinase OS=Hypocrea jecorina (strain ATCC 567...  MRSLAILTTLLAGHAFAYPKPAPQSVNRRDWPSINEFLSELAKVMP...
# 3   sp|A0A024SH76|GUX2_HYPJR  Exoglucanase 2 OS=Hypocrea jecorina (strain AT...  MIVGILTTLATLATLAASVPLEERQACSSVWGQCGGQNWSGPTCCA...
# 4   sp|A0A026W182|ORCO_OOCBI  Odorant receptor coreceptor OS=Ooceraea biroi ...  MMKMKQQGLVADLLPNIRVMKTFGHFVFNYYNDNSSKYLHKVYCCV...

Supported Platforms

This extension is built for Linux and MacOS.

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
.github/workflows		.github/workflows
duckdb @ e8e4cea		duckdb @ e8e4cea
scripts		scripts
src		src
test		test
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
cz.json		cz.json
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

fasql (v0.5.0)

Reading Overview

Schemas

FASTA

FASTQ

Replacement Scans

Globs

Writing Overview

FASTA

FASTQ

Installation and Usage

DuckDB Console

Python

Supported Platforms

About

Releases

Packages

Languages

License

wheretrue/fasql

Folders and files

Latest commit

History

Repository files navigation

fasql (v0.5.0)

Reading Overview

Schemas

FASTA

FASTQ

Replacement Scans

Globs

Writing Overview

FASTA

FASTQ

Installation and Usage

DuckDB Console

Python

Supported Platforms

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages