Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.

single-cell-data/TileDB-SOMA

TileDB-SOMA

SOMA – for “Stack Of Matrices, Annotated” – is a flexible, extensible, and open-source API enabling access to data in a variety of formats. The driving use case of SOMA is for single-cell data in the form of annotated matrices where observations are frequently cells and features are genes, proteins, or genomic regions.

The TileDB-SOMA package is a C++ library with APIs in Python and R, using TileDB Embedded to implement the SOMA specification.

Get started on using TileDB-SOMA:

What Can TileDB-SOMA Do?

Designed for single-cell data, TileDB-SOMA provides Python and R APIs for storing and accessing data at scale, including larger-than-memory operations:

  • Create and write large volumes of data.
  • Open and read data at low latency, locally and from the cloud.
  • Query and access interconnected arrays efficiently and at low latency.

TileDB-SOMA provides interoperability with existing single-cell toolkits:

  • Load and create AnnData objects.
  • Load and create Seurat objects.

TileDB-SOMA provides interoperability with existing Python or R data structures:

  • From Python create PyArrow objects, SciPy sparse matrices, NumPy arrays, and pandas data frames.
  • From R create R Arrow objects, sparse matrices (via the Matrix package), and standard data frames and (dense) matrices.

Community

APIs Installation and Quick Start

API Documentation

The TileDB-SOMA doc-site (Python|R) contains the reference documentation and tutorials.

Reference documentation can also be accessed directly from Python with help(tiledbsoma) or from R with help(package = "tiledbsoma").
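As a minimal sketch of the Python route, the snippet below checks whether the package is importable before calling the built-in help(). The helper name show_soma_help is hypothetical and not part of TileDB-SOMA; only the standard library is used.

```python
import importlib.util


def show_soma_help(package: str = "tiledbsoma") -> bool:
    """Print a package's built-in reference documentation if it is installed.

    Hypothetical helper for illustration; returns True when the package
    was found and help() was called, False otherwise.
    """
    if importlib.util.find_spec(package) is None:
        print(f"{package} is not installed in this environment")
        return False
    module = importlib.import_module(package)
    help(module)  # same text as typing help(tiledbsoma) in a Python session
    return True
```

Calling show_soma_help() with no arguments targets tiledbsoma; in an interactive session, help() may open a pager.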

Main SOMA Objects

The capabilities of TileDB-SOMA lie in the different read, access, and query patterns that each of the main SOMA object implementations provides:

  • DenseNDArray is a dense, N-dimensional array, with offset (zero-based) integer indexing on each dimension.
  • SparseNDArray is the same as DenseNDArray but sparse, and supports point indexing (disjoint index access).
  • DataFrame is a multi-column table with user-defined column names and value types, with support for point indexing.
  • Collection is a persistent container of named SOMA objects.
  • Experiment is a class that represents a single-cell experiment. It always contains two objects:
    • obs: a DataFrame with primary annotations on the observation axis.
    • ms: a Collection of measurements, each composed of X matrices and axis annotation matrices or data frames (e.g. var, varm, obsm, etc).
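To make the relationships between these objects concrete, here is a toy model in plain Python. It is not the tiledbsoma API: every class, field, and value below (ToyDenseNDArray, ToySparseNDArray, ToyExperiment, cell_id, gene_id, "RNA", "data") is a hypothetical stand-in chosen for illustration, and nothing is persisted to disk.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDenseNDArray:
    """Dense 2-D array: every cell is stored and addressed by zero-based offsets."""
    rows: list

    def read(self, i: int, j: int) -> float:
        return self.rows[i][j]


@dataclass
class ToySparseNDArray:
    """Sparse 2-D array: only non-zero cells are stored, keyed by coordinates."""
    shape: tuple
    cells: dict = field(default_factory=dict)

    def read_points(self, coords):
        """Point indexing: look up an arbitrary, possibly disjoint, set of coordinates."""
        return {c: self.cells.get(c, 0.0) for c in coords}


@dataclass
class ToyExperiment:
    """An experiment holds obs plus a collection of named measurements (ms)."""
    obs: list   # stands in for the obs DataFrame (one record per observation)
    ms: dict    # stands in for the Collection of measurements


dense = ToyDenseNDArray(rows=[[1.0, 2.0], [3.0, 4.0]])
x = ToySparseNDArray(shape=(3, 4), cells={(0, 1): 5.0, (2, 3): 1.0})

exp = ToyExperiment(
    obs=[{"cell_id": 0}, {"cell_id": 1}, {"cell_id": 2}],
    ms={"RNA": {"var": [{"gene_id": g} for g in range(4)], "X": {"data": x}}},
)

# Offset indexing on the dense array versus disjoint point lookups on the sparse one.
print(dense.read(1, 0))
print(exp.ms["RNA"]["X"]["data"].read_points([(0, 1), (1, 1)]))
```

The sketch only mirrors the containment described above: an experiment has one obs table and a named collection of measurements, each carrying its own var annotations and X matrices.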

Who Is Using SOMA?

  • CZ CELLxGENE Discover uses SOMA to build its Census, which provides efficient access to and querying of a corpus containing nearly 50 million cells, compiled from 700+ datasets.

If you are interested in listing a project here, please contact us at soma@chanzuckerberg.com.

Issues and Contacts

Branches

This branch, main, implements the updated specification. Please also see the main-old branch, which implements the original specification.

Developer Information

Code of Conduct

All participants in TileDB spaces are expected to adhere to high standards of professionalism in all interactions. This repository is governed by the standards and reporting procedures detailed in the TileDB core repository Code of Conduct.