Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/cells argument #170

Open
wants to merge 84 commits into
base: master
Choose a base branch
from
Open

Feature/cells argument #170

wants to merge 84 commits into from

Conversation

Hoohm
Copy link
Owner

@Hoohm Hoohm commented Jul 3, 2022

  • Rewrite data chunking
  • Rewrite loading of CSVs with polars
  • Rewrite Mapping in polars
  • Rewrite barcode correction in polars
  • Rewrite UMI correction in polars
  • Rewrite fastq reading in polars
  • Disambiguation of whitelist and reference.
  • Generate parquet outputs
  • Deprecate csv outputs

Tasks details

Rewrite UMI correction in polars

Current version of CSC uses umi_tools.network.UMIClusterer() to go through each list of UMIs per cell per feature and handles the potential UMI corrections needed. The simple implementation on polars is to use map_elements but this is not optimized as it's not using the polars infrastructure. There is a big potential for improvement if this step can be rewritten entirely in polars.

Status on branch

UMI correction is skipped at the moment, no function available.

Rewrite fastq reading in polars

Current version of CSC reads in the fastq files and then spits out a big csv which we read using polars. Fastq files are basically text files with 4 lines per read. We can rewrite the input intake to read fastq files directly and store them into a dataframe. This reduces io operations and should be faster as well. It also would allow to extend CSC to use quality to filter reads.

Status on branch

io.write_mapping_input is the function that reads the fastqs and writes the csv to be read later. Then preprocessing.split_data_input reads the csv file and generates the dataframes necessary for processing. The idea would be to skip the intermediate step by just reading the fastqs directly into the necessary dataframes.

Disambiguation of whitelist and reference

Currently CSC uses terms such as whitelist and reference to distinguish a short handpicked list from users and the whole world of barcodes. But historically, reference files also have been called whitelist and this makes it confusing. I would like to change the language to reference_subset and reference to make it clearer that the first one is a subset of the second one.

Status on branch

Delete any mention of whitelist and replace it by subset.

The two last tasks I'm going to deal with.

@lldelisle
Copy link

I was trying to make the code of this branch work and I observed an unexpected behaviour with your choice of 'join_asof' to correct cell barcodes:

In [1]: import polars as pl

In [2]: barcodes_df = pl.DataFrame({'barcode': ["AACATATTCTTTACTG", "TAAAGGGAAGTCAAGC", "TAAATATTCTTTACTG", "TACATATTCTTTACTG", "
   ...: TAGAGCGAAGTCAAGC", "TAGAGGGAAGTCAAGC"], 'count': [1, 1, 1, 98, 1, 98]})

In [3]: barcode_subset_df = pl.DataFrame({'whitelist': ['TACATATTCTTTACTG', 'TAGAGGGAAGTCAAGC']})

In [4]: barcode_subset_df = barcode_subset_df.with_columns(
   ...:             reference=pl.col("whitelist"))

In [5]: barcode_subset_df
Out[5]: 
shape: (2, 2)
┌──────────────────┬──────────────────┐
│ whitelistreference        │
│ ------              │
│ strstr              │
╞══════════════════╪══════════════════╡
│ TACATATTCTTTACTGTACATATTCTTTACTG │
│ TAGAGGGAAGTCAAGCTAGAGGGAAGTCAAGC │
└──────────────────┴──────────────────┘

In [6]: BARCODE_COLUMN = 'barcode'

In [7]: WHITELIST_COLUMN = 'whitelist'

In [8]: temp1 = barcodes_df.sort(BARCODE_COLUMN).join_asof(
   ...:             barcode_subset_df.sort(WHITELIST_COLUMN),
   ...:             left_on=BARCODE_COLUMN,
   ...:             right_on=WHITELIST_COLUMN,
   ...:         )

In [9]: temp1
Out[9]: 
shape: (6, 4)
┌──────────────────┬───────┬──────────────────┬──────────────────┐
│ barcodecountwhitelistreference        │
│ ------------              │
│ stri64strstr              │
╞══════════════════╪═══════╪══════════════════╪══════════════════╡
│ AACATATTCTTTACTG1nullnull             │
│ TAAAGGGAAGTCAAGC1nullnull             │
│ TAAATATTCTTTACTG1nullnull             │
│ TACATATTCTTTACTG98TACATATTCTTTACTGTACATATTCTTTACTG │
│ TAGAGCGAAGTCAAGC1TACATATTCTTTACTGTACATATTCTTTACTG │
│ TAGAGGGAAGTCAAGC98TAGAGGGAAGTCAAGCTAGAGGGAAGTCAAGC │
└──────────────────┴───────┴──────────────────┴──────────────────┘

What I would expect:

┌──────────────────┬───────┬──────────────────┬──────────────────┐
│ barcodecountwhitelistreference        │
│ ------------              │
│ stri64strstr              │
╞══════════════════╪═══════╪══════════════════╪══════════════════╡
│ AACATATTCTTTACTG1TACATATTCTTTACTGTACATATTCTTTACTG │
│ TAAAGGGAAGTCAAGC1TAGAGGGAAGTCAAGCTAGAGGGAAGTCAAGC │
│ TAAATATTCTTTACTG1TACATATTCTTTACTGTACATATTCTTTACTG │
│ TACATATTCTTTACTG98TACATATTCTTTACTGTACATATTCTTTACTG │
│ TAGAGCGAAGTCAAGC1TAGAGGGAAGTCAAGCTAGAGGGAAGTCAAGC │
│ TAGAGGGAAGTCAAGC98TAGAGGGAAGTCAAGCTAGAGGGAAGTCAAGC │
└──────────────────┴───────┴──────────────────┴──────────────────┘

@Hoohm
Copy link
Owner Author

Hoohm commented Nov 22, 2023

Good catch! Should have started with writing tests!

I can go back to a simpler but slightly slower implementation or find a better way using the asof join.

@Hoohm
Copy link
Owner Author

Hoohm commented Dec 28, 2023

@lldelisle I rewrote correct_barcodes_pl and I added some tests including the ones you ran. I think this time it works. Basically run the asof join twice with both strategies and only keep the ones that have a close enough hamming distance.

Again, thank you for catching this so early!

@Hoohm
Copy link
Owner Author

Hoohm commented Dec 28, 2023

I've turned off UMI correction for now just to see runs go through completely. The path without cell reference and whitelist should be working. I've still to write tests for the MTX outputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants