
feat, wip: compound key on spectrum #3

Open
wants to merge 6 commits into
base: feature/auto_pin_handling2

Conversation

jspaezp
Owner

@jspaezp jspaezp commented Dec 20, 2024

This PR follows from this discussion: wfondrie#132 (comment)

The main purpose is two-fold:

  1. Replace the hard-coded columns that previously defined a spectrum, so that "compound keys" on the spectra are allowed.
  2. Restore semantic meaning to the datasets, since the meaning of the "columns" had diverged between the on-disk dataset and the linear PSM dataset (which we should rename to in-mem-dataset or similar ...).

Notably:

  1. Consolidated the datasets under an abstract base class, and made brew take either of them. (mokapot/dataset/__init__.py)
  2. Moved column organization/definition to mokapot/column_defs.py. I would like to extend this module a bit in the future to include the column resolution required for multiple search engines (i.e., detect the engine and propagate the column names accordingly).
  3. There are some refactors here and there in the sections of the code that I was not able to understand.
  4. Bumped numpy to v2, since triqler 0.8 was released!
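Points 1 and 2 above could look something like the following sketch. All names here (`ColumnGroups`, `BaseDataset`, `InMemoryDataset`) are hypothetical illustrations of the idea, not the actual mokapot classes:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Hypothetical sketch of the column_defs idea: the columns that jointly
# identify a spectrum are carried as a tuple (the "compound key") rather
# than hard-coded names scattered through the code.
@dataclass(frozen=True)
class ColumnGroups:
    spectrum_columns: tuple  # compound key, e.g. ("file", "scan")
    target_column: str


class BaseDataset(ABC):
    """Shared interface so brew() can accept either dataset flavor."""

    def __init__(self, columns: ColumnGroups) -> None:
        self.columns = columns

    @property
    def spectrum_columns(self) -> tuple:
        return self.columns.spectrum_columns

    @abstractmethod
    def read_chunk(self):
        """Yield PSM rows; in-memory and on-disk subclasses differ here."""


class InMemoryDataset(BaseDataset):
    def __init__(self, rows, columns: ColumnGroups) -> None:
        super().__init__(columns)
        self._rows = rows

    def read_chunk(self):
        yield from self._rows


cols = ColumnGroups(spectrum_columns=("file", "scan"), target_column="label")
ds = InMemoryDataset([{"file": "a.mzML", "scan": 1, "label": 1}], cols)
assert ds.spectrum_columns == ("file", "scan")
```

With this shape, brew only needs to know the `BaseDataset` interface, and the on-disk implementation can stream its chunks instead of holding them in memory.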

I think this should be the last major change; after it we should have a feature freeze and a pre-release. I really want a new release by February.

@jspaezp
Owner Author

jspaezp commented Jan 8, 2025

What is broken right now?

  1. tests/system_tests/test_sqlite.py

    • The current implementation relies on hard-coded names, one of which is a single primary key. I can think of two easy ways around this:
      1. Concatenate all compound keys into a single one and use it as-is as the primary key.
      2. Pass the real column definitions through to the column names. I am not a huge fan of concatenating strings to build SQL tables, so I would suggest adding a SQL builder as a dependency (https://github.com/kayak/pypika seems like a good candidate).
  2. tests/system_tests/test_brew_rollup.py

    • Streaming q-values differ from the batch-calculated ones: the streaming ones start at 0 instead of a number very close to 0 ... it feels like an off-by-one error, or a missing +1 on the number of decoys somewhere.
  3. tests/unit_tests/test_parser_parquet.py::test_parquet_parsing

    • The test has labels with the values [0, 1, -1]; should the 0 be promoted to 1, or should the -1 be moved to 0? (Are 0s targets or decoys?)
    • I am unsure what the expected behavior here should be (IMO the expected behavior is to error out).
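For option 1 under test_sqlite.py, a minimal sketch of collapsing the compound key into a single string primary key. The table and column names here are hypothetical, not the actual schema:

```python
import sqlite3

# Hypothetical compound-key columns for this sketch.
key_cols = ("file", "scan")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE psms (spectrum_key TEXT PRIMARY KEY, score REAL)")


def make_key(row: dict) -> str:
    # Join key values with a separator unlikely to appear in the data
    # (ASCII unit separator), so distinct compound keys stay distinct.
    return "\x1f".join(str(row[c]) for c in key_cols)


rows = [
    {"file": "a.mzML", "scan": 1, "score": 0.9},
    {"file": "a.mzML", "scan": 2, "score": 0.1},
]
con.executemany(
    "INSERT INTO psms VALUES (?, ?)",
    [(make_key(r), r["score"]) for r in rows],
)
assert con.execute("SELECT COUNT(*) FROM psms").fetchone()[0] == 2
```

The trade-off versus option 2 is that the key is opaque to SQL (you cannot index or filter on a single component), which is where a builder like pypika and a real composite `PRIMARY KEY (file, scan)` would be cleaner.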
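On the test_brew_rollup.py point, a simplified version of the usual target-decoy q-value estimate (not mokapot's actual code) shows where the suspected "+1" lives; dropping it is exactly what makes the first q-value come out as 0:

```python
import numpy as np


def qvalues(scores: np.ndarray, is_target: np.ndarray) -> np.ndarray:
    """Simplified target-decoy q-values; illustrative only."""
    order = np.argsort(-scores)             # best scores first
    targets = np.cumsum(is_target[order])
    decoys = np.cumsum(~is_target[order])
    # The "+ 1" is the key bit: without it, the best-scoring PSMs get an
    # FDR estimate of exactly 0 instead of a small positive number.
    fdr = (decoys + 1) / np.maximum(targets, 1)
    # q-value: minimum FDR at this threshold or any more permissive one.
    qvals = np.minimum.accumulate(fdr[::-1])[::-1]
    out = np.empty_like(qvals)
    out[order] = qvals
    return out


scores = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
is_target = np.array([True, True, False, True, True])
q = qvalues(scores, is_target)
assert (q > 0).all()  # the +1 keeps even the best PSM above exactly 0
```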
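And for the test_parquet_parsing question, a sketch of the "error out" behavior (a hypothetical helper, not the actual parser), which refuses labels outside {1, -1} instead of silently remapping them:

```python
import numpy as np


def validate_labels(labels: np.ndarray) -> np.ndarray:
    """Reject any label that is not 1 (target) or -1 (decoy)."""
    allowed = {1, -1}
    bad = set(np.unique(labels).tolist()) - allowed
    if bad:
        raise ValueError(f"unexpected label values: {sorted(bad)}")
    return labels


validate_labels(np.array([1, -1, 1]))  # fine
try:
    validate_labels(np.array([0, 1, -1]))
except ValueError as err:
    print(err)  # unexpected label values: [0]
```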

```diff
@@ -285,7 +307,9 @@ def drop_missing_values_and_fill_spectra_dataframe(
         chunk_size=CHUNK_SIZE_ROWS_FOR_DROP_COLUMNS, columns=column
     )
     for i, feature in enumerate(file_iterator):
-        if set(spectra) <= set(column):
+        if set(spectra) <= set(
```
jspaezp (Owner Author):

I think it is ... I should make sure it is and move the check out of the loop
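Moving the check out of the loop could be sketched like this; the names mirror the diff above, but the loop body is a placeholder stand-in, not the real function:

```python
def fill_spectra(file_iterator, spectra, column):
    # `spectra` and `column` do not change between iterations, so the
    # subset test is loop-invariant and can be evaluated once, up front.
    spectra_in_columns = set(spectra) <= set(column)
    results = []
    for i, feature in enumerate(file_iterator):
        if spectra_in_columns:
            results.append((i, feature))  # placeholder for the real work
    return results


rows = fill_spectra(iter(["f1", "f2"]), spectra=["scan"], column=["scan", "rt"])
assert rows == [(0, "f1"), (1, "f2")]
```

If the check can ever be False, it may be cleaner still to return (or raise) before the loop starts, so the loop body only handles the happy path.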

A reviewer replied:

Yes. The function has, IMHO, some issues anyway: the name is practically a whole sentence and you still don't know what it does exactly. Also, the handling of NaNs/missing values is debatable.
