
feat, wip: compound key on spectrum #3

Open
wants to merge 6 commits into
base: feature/auto_pin_handling2

Conversation

jspaezp
Owner

@jspaezp jspaezp commented Dec 20, 2024

This PR follows from this discussion: wfondrie#132 (comment)

The main purpose is two-fold:

  1. Replace the hard-coded columns that previously defined a spectrum, so that "compound keys" on the spectra are allowed.
  2. Restore semantic meaning to the datasets, since the meaning of the "columns" had diverged between the on-disk dataset and the linear PSM dataset (which we should rename to in-mem-dataset or similar ...).

Notably:

  1. Consolidated the datasets under an abstract base class, and made brew take either of them. (mokapot/dataset/__init__.py)
  2. Moved column organization/definition to mokapot/column_defs.py. I would like to extend this module a bit in the future to include the column resolution required for multiple search engines (i.e., detect the engine and propagate the column names accordingly).
  3. There are some refactors here and there in the sections of the code that I was not able to understand.
  4. Bumped numpy to v2, since triqler 0.8 was released!
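Points 1 and 2 above could look something like the following sketch. All names here (`ColumnGroups`, `BaseDataset`, `InMemoryDataset`) are hypothetical illustrations of the idea, not the actual mokapot classes:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Hypothetical sketch of the column_defs idea: the columns that jointly
# identify a spectrum are carried as a tuple (the "compound key") rather
# than hard-coded names scattered through the code.
@dataclass(frozen=True)
class ColumnGroups:
    spectrum_columns: tuple  # compound key, e.g. ("file", "scan")
    target_column: str


class BaseDataset(ABC):
    """Shared interface so brew() can accept either dataset flavor."""

    def __init__(self, columns: ColumnGroups) -> None:
        self.columns = columns

    @property
    def spectrum_columns(self) -> tuple:
        return self.columns.spectrum_columns

    @abstractmethod
    def read_chunk(self):
        """Yield PSM rows; in-memory and on-disk subclasses differ here."""


class InMemoryDataset(BaseDataset):
    def __init__(self, rows, columns: ColumnGroups) -> None:
        super().__init__(columns)
        self._rows = rows

    def read_chunk(self):
        yield from self._rows


cols = ColumnGroups(spectrum_columns=("file", "scan"), target_column="label")
ds = InMemoryDataset([{"file": "a.mzML", "scan": 1, "label": 1}], cols)
assert ds.spectrum_columns == ("file", "scan")
```

With this shape, brew only needs to know the `BaseDataset` interface, and the on-disk implementation can stream its chunks instead of holding them in memory.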

I think this should be the last major change; after it we should have a feature freeze and a pre-release. I really want a new release by February.

@jspaezp
Owner Author

jspaezp commented Jan 8, 2025

What is broken right now?

  1. tests/system_tests/test_sqlite.py

    • The current implementation relies on hard-coded names, one of which is a single primary key. I can think of two easy ways around this:
      1. Concatenate all compound keys into a single one and use it as-is as the primary key.
      2. Pass the real column definitions through to the column names. I am not a huge fan of concatenating strings to build SQL tables, so I would suggest adding a SQL builder as a dependency (https://github.com/kayak/pypika seems like a good candidate).
  2. tests/system_tests/test_brew_rollup.py

    • Streaming q-values differ from the batch-calculated ones: the streaming ones start at 0 instead of a number very close to 0 ... it feels like an off-by-one error, or a missing +1 on the number of decoys somewhere.
  3. tests/unit_tests/test_parser_parquet.py::test_parquet_parsing

    • The test has labels with the values [0, 1, -1]; should the 0 be promoted to 1, or should the -1 be moved to 0? (Are 0s targets or decoys?)
    • I am unsure what the expected behavior here should be (IMO the expected behavior is to error out).
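For option 1 under test_sqlite.py, a minimal sketch of collapsing the compound key into a single string primary key. The table and column names here are hypothetical, not the actual schema:

```python
import sqlite3

# Hypothetical compound-key columns for this sketch.
key_cols = ("file", "scan")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE psms (spectrum_key TEXT PRIMARY KEY, score REAL)")


def make_key(row: dict) -> str:
    # Join key values with a separator unlikely to appear in the data
    # (ASCII unit separator), so distinct compound keys stay distinct.
    return "\x1f".join(str(row[c]) for c in key_cols)


rows = [
    {"file": "a.mzML", "scan": 1, "score": 0.9},
    {"file": "a.mzML", "scan": 2, "score": 0.1},
]
con.executemany(
    "INSERT INTO psms VALUES (?, ?)",
    [(make_key(r), r["score"]) for r in rows],
)
assert con.execute("SELECT COUNT(*) FROM psms").fetchone()[0] == 2
```

The trade-off versus option 2 is that the key is opaque to SQL (you cannot index or filter on a single component), which is where a builder like pypika and a real composite `PRIMARY KEY (file, scan)` would be cleaner.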
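On the test_brew_rollup.py point, a simplified version of the usual target-decoy q-value estimate (not mokapot's actual code) shows where the suspected "+1" lives; dropping it is exactly what makes the first q-value come out as 0:

```python
import numpy as np


def qvalues(scores: np.ndarray, is_target: np.ndarray) -> np.ndarray:
    """Simplified target-decoy q-values; illustrative only."""
    order = np.argsort(-scores)             # best scores first
    targets = np.cumsum(is_target[order])
    decoys = np.cumsum(~is_target[order])
    # The "+ 1" is the key bit: without it, the best-scoring PSMs get an
    # FDR estimate of exactly 0 instead of a small positive number.
    fdr = (decoys + 1) / np.maximum(targets, 1)
    # q-value: minimum FDR at this threshold or any more permissive one.
    qvals = np.minimum.accumulate(fdr[::-1])[::-1]
    out = np.empty_like(qvals)
    out[order] = qvals
    return out


scores = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
is_target = np.array([True, True, False, True, True])
q = qvalues(scores, is_target)
assert (q > 0).all()  # the +1 keeps even the best PSM above exactly 0
```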
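And for the test_parquet_parsing question, a sketch of the "error out" behavior (a hypothetical helper, not the actual parser), which refuses labels outside {1, -1} instead of silently remapping them:

```python
import numpy as np


def validate_labels(labels: np.ndarray) -> np.ndarray:
    """Reject any label that is not 1 (target) or -1 (decoy)."""
    allowed = {1, -1}
    bad = set(np.unique(labels).tolist()) - allowed
    if bad:
        raise ValueError(f"unexpected label values: {sorted(bad)}")
    return labels


validate_labels(np.array([1, -1, 1]))  # fine
try:
    validate_labels(np.array([0, 1, -1]))
except ValueError as err:
    print(err)  # unexpected label values: [0]
```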

```diff
@@ -285,7 +307,9 @@ def drop_missing_values_and_fill_spectra_dataframe(
         chunk_size=CHUNK_SIZE_ROWS_FOR_DROP_COLUMNS, columns=column
     )
     for i, feature in enumerate(file_iterator):
-        if set(spectra) <= set(column):
+        if set(spectra) <= set(
```
jspaezp (Owner Author):

I think it is ... I should make sure it is and move the check out of the loop
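Moving the check out of the loop could be sketched like this; the names mirror the diff above, but the loop body is a placeholder stand-in, not the real function:

```python
def fill_spectra(file_iterator, spectra, column):
    # `spectra` and `column` do not change between iterations, so the
    # subset test is loop-invariant and can be evaluated once, up front.
    spectra_in_columns = set(spectra) <= set(column)
    results = []
    for i, feature in enumerate(file_iterator):
        if spectra_in_columns:
            results.append((i, feature))  # placeholder for the real work
    return results


rows = fill_spectra(iter(["f1", "f2"]), spectra=["scan"], column=["scan", "rt"])
assert rows == [(0, "f1"), (1, "f2")]
```

If the check can ever be False, it may be cleaner still to return (or raise) before the loop starts, so the loop body only handles the happy path.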

A reviewer replied:

Yes. The function has, IMHO, some issues anyway: the name is practically a whole sentence and you still don't know what it does exactly. Also, the handling of NaNs/missing values is debatable.
