Query AnnData Objects with SQL

The Python based AnnSQL package enables SQL-based queries on AnnData objects, returning results as either a Pandas DataFrame, an AnnData object, or a Parquet file that can easily be imported into a variety of data analysis tools. Behind the scenes, AnnSQL converts the layers of an AnnData object into a relational DuckDB database. Each layer is stored as an individual table, allowing for simple or complex SQL queries, including table joins.

Features

Query AnnData with SQL.
Fast for complex queries and aggregative functions.
Return query results as Pandas Dataframes, Parquet files, or AnnData objects.
Create in-memory or on-disk databases directly from AnnData objects.
Open AnnSQL databases in R. No conversions necessary. Learn more

Full Documentation

docs.annsql.com

Quick Setup

pip install annsql

Basic Usage (In-Memory)

Ideal for smaller datasets.

from AnnSQL import AnnSQL
import scanpy as sc

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#instantiate the AnnData object (you may also pass a h5ad file to the adata parameter)
asql = AnnSQL(adata=adata)

#query the expression table. Returns Pandas Dataframe by Default
asql.query("SELECT * FROM adata LIMIT 10")

Basic Usage (On-Disk)

For larger datasets, AnnSQL can create a local database (asql) from the AnnData object. This database is stored on-disk, can be queried, and is persistent.

import scanpy as sc
from AnnSQL import AnnSQL
from AnnSQL.MakeDb import MakeDb

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#build the AnnSQL database
MakeDb(adata=adata, db_name="pbmc3k_reduced", db_path="db/")

#open the AnnSQL database
asql = AnnSQL(db="db/pbmc3k_reduced.asql")

#query the expression table
asql.query("SELECT * FROM adata LIMIT 5")

Advanced Queries and Usage

from AnnSQL import AnnSQL
import scanpy as sc

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#pass the AnnData object to the AnnSQL class
asql = AnnSQL(adata=adata)

#group and count all labels
asql.query("SELECT obs.bulk_labels, COUNT(*) FROM obs GROUP BY obs.bulk_labels")

#take the log10 of a value
asql.query("SELECT LOG10(HES4) FROM X WHERE HES4 > 0")

#sum all gene counts | Memory intensive | See method calculate_gene_counts for chunked approach.
asql.query("SELECT SUM(COLUMNS(*)) FROM (SELECT * EXCLUDE (cell_id) FROM X)")

#taking the correlation of genes ITGB2 and SSU72 in dendritic cells that express either gene > 0
asql.query("SELECT corr(ITGB2,SSU72) as correlation FROM adata WHERE bulk_labels = 'Dendritic' AND (ITGB2 > 0 OR SSU72 >0)")

Reference

AnnSQL: A Python SQL-based package for large-scale single-cell genomics analysis on a laptop
Kenny Pavan, Arpiar Saunders
bioRxiv 2024.11.02.621676; doi: https://doi.org/10.1101/2024.11.02.621676

Name		Name	Last commit message	Last commit date
Latest commit History 96 Commits
docs		docs
examples		examples
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Query AnnData Objects with SQL

Features

Full Documentation

docs.annsql.com

Quick Setup

Basic Usage (In-Memory)

Basic Usage (On-Disk)

Advanced Queries and Usage

Reference

About

Releases 11

Packages

Languages

License

ArpiarSaundersLab/annsql

Folders and files

Latest commit

History

Repository files navigation

Query AnnData Objects with SQL

Features

Full Documentation

docs.annsql.com

Quick Setup

Basic Usage (In-Memory)

Basic Usage (On-Disk)

Advanced Queries and Usage

Reference

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 11

Packages 0

Languages

Packages