Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for sex chromosomes to per-sample summary stats #626

Merged
merged 12 commits into from
Jul 30, 2024

Conversation

jkgoodrich
Copy link
Contributor

@jkgoodrich jkgoodrich commented Jul 8, 2024

Tested with the following:

Create filter group HT for all tests:

Command:

hailctl dataproc submit jg2 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --create-filter-group-ht \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

Filter group HT: gs://gnomad/v4.1/assessment/exomes/gnomad.exomes.v4.1.per_sample_filtering_groups.ht

Job ID: 55c5bbe6a3ff4abdab4581edf44dcad4

No intermediate:

Full pipeline with autosomes and sex chromosomes processed in one run:
Command:

hailctl dataproc submit jg2 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-per-sample-counts-ht \
    --aggregate-sample-stats \
    --custom-suffix no_intermediate_all_in_one \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_all_in_one.per_sample_variant_counts.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_all_in_one.per_sample_variant_counts.aggregated.ht

Job ID: 84e68a8a723045449d348e8dd131578c

Full pipeline on autosomes only:
Command:

hailctl dataproc submit jg2 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-per-sample-counts-ht \
    --aggregate-sample-stats \
    --autosomes-only-stats \
    --custom-suffix no_intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.autosomes.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.autosomes.aggregated.ht

Job ID: e9e5619a71154b4b9ced0ae5b6f04bce

Full pipeline on sex chromosomes only:
Command:

hailctl dataproc submit jg2 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-per-sample-counts-ht \
    --aggregate-sample-stats \
    --sex-chr-only-stats \
    --custom-suffix no_intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.sex_chr.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.sex_chr.aggregated.ht

Job ID: a04667e986224188a5cbfd6fdf8e3740

Combine autosomes and sex chromosomes stats and aggregate:
Command:

hailctl dataproc submit jg2 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --aggregate-sample-stats \
    --combine-autosome-and-sex-chr-stats \
    --custom-suffix no_intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.aggregated.ht

Job ID: b7c2e9eaea7549128cb30af010807f52

Using intermediate:

Full pipeline with autosomes and sex chromosomes processed in one run:
Command:

hailctl dataproc submit jg2 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-intermediate-mt-for-sample-counts \
    --create-per-sample-counts-ht \
    --use-intermediate-mt-for-sample-counts \
    --aggregate-sample-stats \
    --custom-suffix intermediate_all_in_one \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

intermediate HT: gs://gnomad-tmp/gnomad.exomes.v4.1.qc_data/per_sample_summary_stats_intermediate.test.ht
per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_all_in_one.per_sample_variant_counts.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_all_in_one.per_sample_variant_counts.aggregated.ht

Job ID: 3ec7d2b378b24fdaaec691d483293585

Full pipeline on autosomes only:
Command:

hailctl dataproc submit jg2 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-intermediate-mt-for-sample-counts \
    --create-per-sample-counts-ht \
    --use-intermediate-mt-for-sample-counts \
    --aggregate-sample-stats \
    --autosomes-only-stats \
    --custom-suffix intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

intermediate HT: gs://gnomad-tmp/gnomad.exomes.v4.1.qc_data/per_sample_summary_stats_intermediate.test.autosomes.ht
per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.autosomes.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.autosomes.aggregated.ht

Job ID: 7284b9db36de4909be6bcb6be6b9cb71

Full pipeline on sex chromosomes only:
Command:

hailctl dataproc submit jg2 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-intermediate-mt-for-sample-counts \
    --create-per-sample-counts-ht \
    --use-intermediate-mt-for-sample-counts \
    --aggregate-sample-stats \
    --sex-chr-only-stats \
    --custom-suffix intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

intermediate HT: gs://gnomad-tmp/gnomad.exomes.v4.1.qc_data/per_sample_summary_stats_intermediate.test.sex_chr.ht
per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.sex_chr.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.sex_chr.aggregated.ht

Job ID: dc1b64766e454f88897fe8d9b5dddc03

Combine autosomes and sex chromosomes stats and aggregate:
Command:

hailctl dataproc submit jg2 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --aggregate-sample-stats \
    --combine-autosome-and-sex-chr-stats \
    --custom-suffix intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.aggregated.ht

Job ID: d5e4fb6a5a9445bba52e6f16dcd38ff4

@mike-w-wilson
Copy link
Contributor

This is a great rewrite, it is so clear. I totally forgot about the pipeline class and how much that cleans the script up -- as does the the test arg handling. As far as data out looks, the y_non_par stats are different when comparing the all in one run, (gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_all_in_one.per_sample_variant_counts.aggregated.ht from Job ID: 84e68a8a723045449d348e8dd131578c) and the intermediate+combine run (gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.aggregated.ht from Job ID: d5e4fb6a5a9445bba52e6f16dcd38ff4).
Screenshot 2024-07-11 at 3 31 31 PM
The first column is a matching boolean, the second is the full run, the third is the intermediate+combine. The intermediate+combine appears to always be half the full run value so maybe something is happening regarding ploidy in the full? Here is my notebook (gs://gnomad-mwilson/v4.1/sum_stat_testing.ipynb), last cell is the fails. Other than that, this looks good, all my comments are for comments/docs.

gnomad_qc/v3/resources/basics.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
**{k: hl.sum([x[s] for s in v]) for k, v in sums.items()},
**{k: x[d1] - x[d2] for k, (d1, d2) in diffs.items()},
**{
k: divide_null(hl.float64(x[n]), x[d])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure why I'm seeing NaNs in the ratio fields instead of missing, e.g. r_ti_tv, for your test tables, considering this is in here. Not a big deal since we can see its the result of divide by zero but unexpected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you take a look again? I only see NaNs now in the agg Tables, not the count Tables

@jkgoodrich
Copy link
Contributor Author

I still need to fix the main issue and figure out the NaN, but in the meantime, can you take a look at the rest of the changes?

… that this also gets done on the combined HT
@jkgoodrich
Copy link
Contributor Author

OK, I found the problem, it's a very simple fix. Just needed to move the setting of y_nonpar in XX samples to missing so that it's also done on the combined HT.

Here are the new tests:

No intermediate:

Full pipeline with autosomes and sex chromosomes processed in one run:
Command:

hailctl dataproc submit jg1 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-per-sample-counts-ht \
    --aggregate-sample-stats \
    --custom-suffix no_intermediate_all_in_one \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_all_in_one.per_sample_variant_counts.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_all_in_one.per_sample_variant_counts.aggregated.ht

Job ID: cf9d6f6773d0466789324c7512fea122

Full pipeline on autosomes only:
Command:

hailctl dataproc submit jg1 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-per-sample-counts-ht \
    --aggregate-sample-stats \
    --autosomes-only-stats \
    --custom-suffix no_intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.autosomes.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.autosomes.aggregated.ht

Job ID: 1d3c35cad8704f85be6927734d19e565

Full pipeline on sex chromosomes only:
Command:

hailctl dataproc submit jg1 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-per-sample-counts-ht \
    --aggregate-sample-stats \
    --sex-chr-only-stats \
    --custom-suffix no_intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.sex_chr.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.sex_chr.aggregated.ht

Job ID: 41a2325bf5804cd2ac6277b9ae6be5d6

Combine autosomes and sex chromosomes stats and aggregate:
Command:

hailctl dataproc submit jg1 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --aggregate-sample-stats \
    --combine-autosome-and-sex-chr-stats \
    --custom-suffix no_intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.no_intermediate_combine_chr.per_sample_variant_counts.aggregated.ht

Job ID: b012882d71894e3ba4fd7aac02bf225d

Using intermediate:

Full pipeline with autosomes and sex chromosomes processed in one run:
Command:

hailctl dataproc submit jg1 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-intermediate-mt-for-sample-counts \
    --create-per-sample-counts-ht \
    --use-intermediate-mt-for-sample-counts \
    --aggregate-sample-stats \
    --custom-suffix intermediate_all_in_one \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

intermediate HT: gs://gnomad-tmp/gnomad.exomes.v4.1.qc_data/per_sample_summary_stats_intermediate.test.ht
per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_all_in_one.per_sample_variant_counts.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_all_in_one.per_sample_variant_counts.aggregated.ht

Job ID: ed4da37139e04ba09b8eb34f163a0693

Full pipeline on autosomes only:
Command:

hailctl dataproc submit jg1 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-intermediate-mt-for-sample-counts \
    --create-per-sample-counts-ht \
    --use-intermediate-mt-for-sample-counts \
    --aggregate-sample-stats \
    --autosomes-only-stats \
    --custom-suffix intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

intermediate HT: gs://gnomad-tmp/gnomad.exomes.v4.1.qc_data/per_sample_summary_stats_intermediate.test.autosomes.ht
per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.autosomes.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.autosomes.aggregated.ht

Job ID: b85117e5f62f434ca30480ccf3b3f94f

Full pipeline on sex chromosomes only:
Command:

hailctl dataproc submit jg1 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --create-intermediate-mt-for-sample-counts \
    --create-per-sample-counts-ht \
    --use-intermediate-mt-for-sample-counts \
    --aggregate-sample-stats \
    --sex-chr-only-stats \
    --custom-suffix intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

intermediate HT: gs://gnomad-tmp/gnomad.exomes.v4.1.qc_data/per_sample_summary_stats_intermediate.test.sex_chr.ht
per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.sex_chr.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.sex_chr.aggregated.ht

Job ID: 65d226b70bdd4ca6baa279e70dd6a841

Combine autosomes and sex chromosomes stats and aggregate:
Command:

hailctl dataproc submit jg1 gnomad_qc/v4/assessment/calculate_per_sample_stats.py \
    --data-type exomes \
    --test-dataset \
    --aggregate-sample-stats \
    --combine-autosome-and-sex-chr-stats \
    --custom-suffix intermediate_combine_chr \
    --overwrite \
    --pyfiles ../gnomad_methods/gnomad,gnomad_qc

Output:

per-sample counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.ht
aggregate counts: gs://gnomad-tmp/gnomad_v4.1_testing/assessment/exomes/gnomad.exomes.v4.1.intermediate_combine_chr.per_sample_variant_counts.aggregated.ht

Job ID: 00fe80d228bf4682a3fa0a0ba5b58065

Updated the notebooks:

http://localhost:8123/notebooks/gnomad-julia/gnomad_v4/test_per_sample_stats_add_sex_chr.ipynb#
http://localhost:8123/notebooks/gnomad-mwilson/v4.1/sum_stat_testing.ipynb#

Copy link
Contributor

@mike-w-wilson mike-w-wilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@jkgoodrich jkgoodrich merged commit 152c556 into main Jul 30, 2024
4 checks passed
@jkgoodrich jkgoodrich deleted the jg/summary_stats_sex_chr branch July 30, 2024 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants