merfish cell data download
oliche committed Feb 8, 2024
1 parent 3bcec73 commit 9af8a28
Showing 7 changed files with 505 additions and 5 deletions.
92 changes: 92 additions & 0 deletions examples/atlas_genomics_load_merfish.ipynb

Large diffs are not rendered by default.

79 changes: 79 additions & 0 deletions iblatlas/genomics/__init__.py
@@ -0,0 +1,79 @@
"""A package for working with Allen genomics datasets: AGEA and MERFISH.
AGEA
----
This package provides a way to load the Allen Gene Expression volumes.
The 4345 volumes have been registered and formatted into a binary file.
agea/
├── gene-expression.bin
├── gene-expression.pqt
├── image.npy
└── label.npy
- gene-expression.bin is a float16 binary file containing the gene expression volumes.
In C-order, the dimensions are (4345, 58, 41, 67), corresponding to (nexperiments, ml, dv, ap) at a resolution of 200 um.
- gene-expression.pqt is a parquet file describing the 4345 genes corresponding to the
gene expression volumes.
- image.npy: the Allen atlas diffusion imaging volume downsampled at the gene expression resolution
- label.npy: the Allen atlas region label volume downsampled at the gene expression resolution
See the building scripts in ./genomics/gene_expression_scrapping/05-generate-atlas-templates.py
[1] E. S. Lein et al., “Genome-wide atlas of gene expression in the adult mouse brain,”
Nature, vol. 445, no. 7124, Art. no. 7124, Jan. 2007, doi: 10.1038/nature05453.
[2] L. Ng et al., “An anatomic gene expression atlas of the adult mouse brain,”
Nat Neurosci, vol. 12, no. 3, Art. no. 3, Mar. 2009, doi: 10.1038/nn.2281.
MERFISH
-------
Spatially registered cell types from single cell transcriptomics data.
This package provides a way to load the MERFISH data from the Allen Brain Cell Atlas.
We formatted the original CSV files from the 2023-12-15 release into parquet files for faster loading and a smaller
disk footprint.
merfish/
├── genes.pqt
├── neurotransmitters.pqt
├── classes.pqt
├── subclasses.pqt
├── supertypes.pqt
├── clusters.pqt
├── C57BL6J-638850_cells.pqt
├── Zhuang-ABCA-1_cells.pqt
├── Zhuang-ABCA-2_cells.pqt
├── Zhuang-ABCA-3_cells.pqt
└── Zhuang-ABCA-4_cells.pqt
- *_cells.pqt: each dataframe corresponds to a given subject. The concatenation of those 5 dataframes yields a
dataframe of shape (8_879_868, 11), one record per cell, with the following columns:
- 'brain_section_label': the label of the brain section (subject and section): "Zhuang-ABCA-1.085"
- 'donor_label': the label of the subject
- 'neurotransmitter': neurotransmitter label {<NA>, 'Glut', 'Chol', 'GABA-Glyc', 'GABA','Dopa',
'Glut-GABA', 'Hist', 'Sero', 'Nora'}
- 'class': direct index of the class record in df_classes
- 'subclass': direct index of the subclass record in df_subclasses
- 'supertype': direct index of the supertype record in df_supertypes
- 'cluster': direct index of the cluster record in df_clusters
- 'x', 'y', 'z': coordinates of the cell in IBL space (see: iblatlas.atlas.AllenAtlas)
- 'Allen_id': allen region unique identifier
The cells are classified hierarchically, from high level to low level: classes, subclasses, supertypes and clusters.
- df_classes: a dataframe of classes (35, 3), where each record corresponds to a single class
- df_subclasses: a dataframe of subclasses (339, 4), where each record corresponds to a single subclass
- df_supertypes: a dataframe of supertypes (1202, 4), where each record corresponds to a single supertype
- df_clusters: a dataframe of clusters (5323, 5), where each record corresponds to a single cluster
Additional metadata:
- df_neurotransmitters: a dataframe of neurotransmitters (9, 2), index is the neurotransmitter label
- df_genes: a dataframe of genes (1122); it could be used in conjunction with raw gene expression data (not implemented)
[1] Z. Yao et al., “A high-resolution transcriptomic and spatial atlas of cell types in the whole mouse brain,”
Nature, vol. 624, no. 7991, Art. no. 7991, Dec. 2023, doi: 10.1038/s41586-023-06812-z.
[2] M. Zhang et al., “Molecularly defined and spatially resolved cell atlas of the whole mouse brain,”
Nature, vol. 624, no. 7991, Art. no. 7991, Dec. 2023, doi: 10.1038/s41586-023-06808-9.
"""
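The AGEA layout described in the docstring above (float16, C-order, `(nexperiments, ml, dv, ap)`) lends itself to memory mapping. The following is a minimal sketch under stated assumptions: it writes a small synthetic file with a reduced first dimension instead of using the real `gene-expression.bin`, and it is not the package's own loader.

```python
import tempfile
from pathlib import Path

import numpy as np

# The documented shape is (4345, 58, 41, 67); the first dimension is reduced
# here (hypothetical value) so that the sketch runs quickly on synthetic data.
shape = (3, 58, 41, 67)  # (nexperiments, ml, dv, ap)

with tempfile.TemporaryDirectory() as tmp:
    file_bin = Path(tmp) / 'gene-expression.bin'
    # write a synthetic float16 array in C-order, mimicking the documented layout
    rng = np.random.default_rng(0)
    rng.random(shape).astype(np.float16).tofile(file_bin)
    # memory-map the file: only the volumes that are accessed are read from disk
    expression = np.memmap(file_bin, dtype=np.float16, mode='r', shape=shape)
    # copy the volume of the second experiment into memory
    volume = np.array(expression[1])
    # release the memory map before the temporary folder is removed
    del expression

print(volume.shape, volume.dtype)
```

With the real file, the same call would take the full documented shape; the first axis then indexes the records of `gene-expression.pqt`.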
6 changes: 1 addition & 5 deletions iblatlas/genomics/agea.py
@@ -1,8 +1,4 @@
-"""
-[1] E. S. Lein et al., “Genome-wide atlas of gene expression in the adult mouse brain,”
-Nature, vol. 445, no. 7124, Art. no. 7124, Jan. 2007, doi: 10.1038/nature05453.
-[2] L. Ng et al., “An anatomic gene expression atlas of the adult mouse brain,”
-Nat Neurosci, vol. 12, no. 3, Art. no. 3, Mar. 2009, doi: 10.1038/nn.2281.
+"""A package for loading 4345 formatted and registered gene expression volumes
 """
 import logging
 from pathlib import Path
63 changes: 63 additions & 0 deletions iblatlas/genomics/merfish.py
@@ -0,0 +1,63 @@
import logging
from pathlib import Path

import pandas as pd
import numpy as np

import one.remote.aws as aws

from iblatlas import atlas

_logger = logging.getLogger(__name__)


def load(folder_cache=None):
    """
    Reads in the MERFISH cell tables and the cell type metadata tables
    :param folder_cache: optional, folder containing the parquet files. Defaults to the 'merfish' folder of the
     AllenAtlas cache directory. The files are downloaded from AWS if no up-to-date version flag is found.
    :return:
        df_cells: a dataframe of cells (8_879_868, 11), where each record corresponds to a single cell
        df_classes: a dataframe of classes (35, 3), where each record corresponds to a single class
        df_subclasses: a dataframe of subclasses (339, 4), where each record corresponds to a single subclass
        df_supertypes: a dataframe of supertypes (1202, 4), where each record corresponds to a single supertype
        df_clusters: a dataframe of clusters (5323, 5), where each record corresponds to a single cluster
        df_genes: a dataframe of genes (1672, 4), where each record corresponds to a single gene
        df_neurotransmitters: a dataframe of neurotransmitters (9, 2), where each record corresponds to a single
         neurotransmitter
    """
    OLD_VERSIONS = ['2023-06-12']
    folder_cache = Path(folder_cache or atlas.AllenAtlas._get_cache_dir().joinpath('merfish'))
    # check the AWS version and download the files if needed
    version_flag = next(folder_cache.glob('*.version'), None)
    if version_flag is None or version_flag.stem in OLD_VERSIONS:
        _logger.info(f'downloading MERFISH data from {aws.S3_BUCKET_IBL} s3 bucket...')
        aws.s3_download_folder('atlas/merfish', folder_cache)
    # it is faster and more memory efficient to read the parquet files with dask, but we do
    # not want to require dask as a dependency so we provide the pandas alternative
    try:
        import dask.dataframe as dd
        df_cells = dd.read_parquet(list(folder_cache.rglob('*_cells.pqt')))
        df_cells = df_cells.compute()
    except ImportError:
        df_cells = pd.concat([pd.read_parquet(f) for f in folder_cache.rglob('*_cells.pqt')])
    # reads in the other tables
    df_classes = pd.read_parquet(folder_cache.joinpath('classes.pqt'))
    df_subclasses = pd.read_parquet(folder_cache.joinpath('subclasses.pqt'))
    df_supertypes = pd.read_parquet(folder_cache.joinpath('supertypes.pqt'))
    df_clusters = pd.read_parquet(folder_cache.joinpath('clusters.pqt'))
    df_genes = pd.read_parquet(folder_cache.joinpath('genes.pqt'))
    df_neurotransmitters = pd.read_parquet(folder_cache.joinpath('neurotransmitters.pqt'))
    return df_cells, df_classes, df_subclasses, df_supertypes, df_clusters, df_genes, df_neurotransmitters
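The package docstring describes the `class`, `subclass`, `supertype` and `cluster` columns of the cells table as direct indices into the corresponding metadata tables. The sketch below illustrates such a lookup on small synthetic dataframes. The column name `name` and all values are hypothetical, and the metadata tables are assumed to be indexed by the integers stored in the cells table.

```python
import pandas as pd

# Synthetic stand-ins for two of the tables returned by load();
# the 'name' column and all values are hypothetical.
df_classes = pd.DataFrame({'name': ['class A', 'class B', 'class C']})
df_cells = pd.DataFrame({
    'x': [0.001, 0.002, -0.001, 0.0],
    'class': [2, 0, 0, 1],
})

# the 'class' column is used as a label into the index of df_classes
class_names = df_classes.loc[df_cells['class'], 'name'].to_numpy()
# number of cells per class name
cells_per_class = pd.Series(class_names).value_counts()

print(class_names)
print(cells_per_class)
```

The same pattern would apply to the other three levels of the hierarchy, provided their tables are indexed the same way.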


def int2rgb(array, dtype=None):
    """
    One liner to convert rgba values stored as integer in dataframes
    :param array: rgba column of a dataframe or slice of the column (the reshape assumes 4 bytes per value)
    :param dtype: optional, if int will return the uint8 view from 0-255 else will return floats from 0-1
    :return: array of shape (n, 4) containing the four bytes of each value
    """
    # view each integer as its 4 constituent bytes
    rgba = np.array(array).view('uint8').reshape(array.shape[0], 4)
    if dtype in (int, np.int8, np.uint8):
        return rgba
    else:
        return rgba.astype(float) / 255
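To make the byte-view trick in `int2rgb` concrete, here is a self-contained round trip on synthetic colours. It is a sketch under stated assumptions: `int2rgb_sketch` is a hypothetical stand-in for the function above, and the packed values are assumed to be 32-bit unsigned integers, which the source does not state explicitly.

```python
import numpy as np


def int2rgb_sketch(array, as_uint8=False):
    """Unpack 32-bit integers into 4 bytes each (hypothetical stand-in for int2rgb)."""
    array = np.ascontiguousarray(array, dtype=np.uint32)
    # each uint32 is reinterpreted as 4 consecutive uint8 values, without copying
    rgba = array.view(np.uint8).reshape(array.shape[0], 4)
    return rgba if as_uint8 else rgba.astype(float) / 255


# two synthetic RGBA colours, one row per colour
rgba_in = np.array([[255, 0, 0, 255], [0, 128, 255, 64]], dtype=np.uint8)
# pack each row of 4 bytes into a single 32-bit integer (one value per row)
packed = rgba_in.view(np.uint32).ravel()

rgba_out = int2rgb_sketch(packed, as_uint8=True)  # values in 0-255
rgba_float = int2rgb_sketch(packed)  # values in 0-1
```

Because packing and unpacking both rely on the native byte order, the round trip returns the original bytes regardless of endianness.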
44 changes: 44 additions & 0 deletions iblatlas/genomics/merfish_scrapping/00_download_data.py
@@ -0,0 +1,44 @@
"""
Downloads the data from the Allen Brain Cell Atlas
Some useful resources as of 2024-01-31:
# https://alleninstitute.github.io/abc_atlas_access/notebooks/zhuang_merfish_tutorial.html
# https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html
# https://allen-brain-cell-atlas.s3.amazonaws.com/index.html
# https://ibl-brain-wide-map-public.s3.amazonaws.com/index.html
# the explorer setup is described by AWS here: https://github.com/awslabs/aws-js-s3-explorer/tree/master
"""
from pathlib import Path
import os
import json
import requests
from one.remote import aws

version = '20231215'
version = '20230830'  # NB: this second assignment overrides the one above and is the value used below
download_base = '/datadisk/Data/merfish_atlas/cache'

use_local_cache = False
manifest_path = 'releases/%s/manifest.json' % version

if not use_local_cache:
    url = 'https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/' + manifest_path
    manifest = json.loads(requests.get(url).text)
else:
    file = os.path.join(download_base, manifest_path)
    with open(file, 'rb') as f:
        manifest = json.load(f)

s3_allen, bucket_name = aws.get_s3_allen()
for r in manifest['directory_listing']:
    r_dict = manifest['directory_listing'][r]
    for d in r_dict['directories']:
        # only the metadata directories are downloaded
        if d != 'metadata':
            continue
        d_dict = r_dict['directories'][d]
        local_path = Path(download_base).joinpath(d_dict['relative_path'])
        print(local_path)
        # !aws s3 ls s3://allen-brain-cell-atlas/metadata/Zhuang-ABCA-1/20231215/
        aws.s3_download_folder(d_dict['relative_path'], local_path, s3_allen, bucket_name)
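The manifest traversal in the script above can be tried offline on a hand-written manifest. In the sketch below, the manifest content is hypothetical and reproduces only the keys the script reads (`directory_listing`, `directories`, `relative_path`); `list_metadata_paths` is a helper introduced for illustration, and no download takes place.

```python
from pathlib import Path

# Hypothetical manifest: only the keys read by the script are reproduced,
# and the relative paths are illustrative.
manifest = {
    'directory_listing': {
        'Zhuang-ABCA-1': {
            'directories': {
                'metadata': {'relative_path': 'metadata/Zhuang-ABCA-1/20231215'},
                'expression_matrices': {'relative_path': 'expression_matrices/Zhuang-ABCA-1/20230830'},
            }
        },
        'Zhuang-ABCA-2': {
            'directories': {
                'metadata': {'relative_path': 'metadata/Zhuang-ABCA-2/20231215'},
            }
        },
    }
}


def list_metadata_paths(manifest, download_base):
    """Return (remote relative path, local path) pairs for the metadata directories only."""
    pairs = []
    for r_dict in manifest['directory_listing'].values():
        d_dict = r_dict['directories'].get('metadata')
        if d_dict is None:
            continue  # this entry has no metadata directory
        relative_path = d_dict['relative_path']
        pairs.append((relative_path, Path(download_base).joinpath(relative_path)))
    return pairs


pairs = list_metadata_paths(manifest, '/datadisk/Data/merfish_atlas/cache')
for remote, local in pairs:
    print(remote, '->', local)
```

Separating the path listing from the download makes it possible to inspect what would be fetched before any call to `aws.s3_download_folder`.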