
(feat) added auto handling of traditional pin and testing #126

Closed
wants to merge 7 commits

Conversation

@jspaezp (Collaborator) commented Sep 6, 2024

This PR restores backwards compatibility with 'traditional' pin files (file extension == ".pin", tab-delimited, tabs as protein separators, last column matching "protein(s)?" case-insensitively).

Basically, when a 'traditional' pin is detected, the separators in the protein column are fixed in memory and the fixed dataframe is used from then on. In practice this means that pin files with ragged edges don't get the benefits of the streaming dataset, but they are handled transparently in the background. In other words: if your data is small, there is no need to convert it; if it is large, it will break only if it has tabs as separators in the protein column, which I believe is a good compromise. A sketch of the idea is below.
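For illustration, a minimal sketch of the detect-then-repair flow described above. `is_traditional_pin` is the check referenced later in this conversation; `fix_ragged_pin_lines` and the `":"` protein separator are hypothetical names chosen for the sketch, not necessarily what `mokapot/tabular_data.py` actually does:

```python
import io
import re


def is_traditional_pin(path: str) -> bool:
    """Heuristic check for a 'traditional' pin: .pin extension, tab-delimited,
    last header column named like 'protein(s)' (case-insensitive)."""
    if not path.lower().endswith(".pin"):
        return False
    with open(path) as f:
        header = f.readline().rstrip("\n").split("\t")
    return re.fullmatch(r"proteins?", header[-1], flags=re.IGNORECASE) is not None


def fix_ragged_pin_lines(path: str) -> io.StringIO:
    """Rejoin tab-split protein fields into the last column, in memory,
    so every row has as many columns as the header."""
    out_lines = []
    with open(path) as f:
        header = f.readline().rstrip("\n").split("\t")
        num_cols = len(header)
        out_lines.append("\t".join(header))
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Everything past the last expected column is a protein that was
            # split on a tab; glue the extras back with a non-tab separator.
            fixed = fields[: num_cols - 1] + [":".join(fields[num_cols - 1 :])]
            out_lines.append("\t".join(fixed))
    return io.StringIO("\n".join(out_lines))
```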

This PR also incorporates the changes in #125 (sorry for the large auto-formatted changes).

codecov bot commented Sep 7, 2024

Codecov Report

Attention: Patch coverage is 87.30159% with 8 lines in your changes missing coverage. Please review.

Project coverage is 85.68%. Comparing base (6b8fbfd) to head (3214c24).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| mokapot/tabular_data.py | 85.71% | 8 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
+ Coverage   84.60%   85.68%   +1.08%     
==========================================
  Files          26       25       -1     
  Lines        2949     2928      -21     
==========================================
+ Hits         2495     2509      +14     
+ Misses        454      419      -35     


@jspaezp jspaezp marked this pull request as ready for review September 7, 2024 05:19
@jspaezp jspaezp requested a review from wfondrie September 7, 2024 06:00
```python
header_names = None
num_cols = None

out_lines = []
```
Contributor

This in-memory accumulation breaks the assumption that we can stream independently of input file size.
For example, running mokapot on a PIN file from 10x Astral files would consume a lot of memory.

@jspaezp (Collaborator, Author) commented Sep 18, 2024

I agree, but my general thought was: if you care about that, you can reformat the file to any of the formats that support streaming (fix the separator and rename to .tsv/.csv in the simplest of cases). I could add a check so that, if more than 50k lines are parsed, the user is notified that the file is large and that a conversion might be a good idea (or that they should use Comet with its separator parameter; I cannot think of many other search engines right now that keep tabs as the protein separator). A sketch of such a notice is below.
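A minimal sketch of that notification, assuming a hypothetical `maybe_warn_large_pin` helper and the 50k-line threshold mentioned above (neither exists in mokapot; both are illustrative):

```python
import logging

LOGGER = logging.getLogger(__name__)
LARGE_PIN_LINES = 50_000  # illustrative threshold from the discussion above


def maybe_warn_large_pin(lines_parsed: int, path: str) -> None:
    """Emit a one-time notice once the in-memory fix has parsed many lines."""
    if lines_parsed == LARGE_PIN_LINES:
        LOGGER.warning(
            "%s has more than %d lines; converting it to .tsv/.csv would let "
            "mokapot stream it instead of fixing it in memory.",
            path,
            LARGE_PIN_LINES,
        )
```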

The alternatives I thought of were to:

  1. Dump the lines to a temp file and then read it with any of the streaming alternatives. But since this only really matters for large files, I didn't like the idea of duplicating a massive file that might or might not get cleaned up correctly.
  2. Generate a non-temp file in the same directory (it felt a bit intrusive to clutter the user's directory).
  3. Error out and point the user to a sub-command that handles the conversion (very non-backwards-compatible, while the main purpose of this PR is to restore backwards compatibility).

Also note that this will only run if is_traditional_pin is true, so if the file has no "ragged edges" (a consistent number of tabs), it will use the normal streaming version.
Do you guys use the streaming features with plain-text .pin files?

Contributor

We do not use PIN files as inputs internally anymore, so this should be fine. I was just concerned that the claim that mokapot is fully streamed could then no longer be made for traditional pins. My pin_to_tsv conversion is streamed (a sketch of what that looks like is below). I am unclear on how the streamed solution breaks backwards compatibility; that was definitely not intended.
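For illustration, a minimal sketch of a streamed pin-to-tsv conversion. `pin_to_tsv` is named after the conversion mentioned above, but the signature and the `":"` protein separator are assumptions, not the actual implementation:

```python
import csv


def pin_to_tsv(pin_path: str, tsv_path: str, protein_sep: str = ":") -> None:
    """Stream a traditional pin to a well-formed TSV one line at a time,
    rejoining ragged protein columns without loading the whole file."""
    with open(pin_path) as src, open(tsv_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        header = src.readline().rstrip("\n").split("\t")
        num_cols = len(header)
        writer.writerow(header)
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Memory use stays constant in the file size: each fixed row is
            # written out immediately instead of being accumulated.
            writer.writerow(
                fields[: num_cols - 1] + [protein_sep.join(fields[num_cols - 1 :])]
            )
```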

The aim of my implementation was to decouple historic data formats (such as traditional_pin) in preprocessing and move to a single input specification, so that internally mokapot only needs to handle one format. In this MR there are a few places where we have this internal handling between .tsv and .pin.
I have no hard feelings about either approach, but this might also conflict with the confidence-streaming MR that is upcoming from Elmar. We are aiming for that to be up at the end of the week.

Maybe we can discuss how to best move forward in a call?

@jspaezp jspaezp requested a review from gessulat September 18, 2024 19:29
@jspaezp (Collaborator, Author) commented Dec 5, 2024

Closing in favor of #132

@jspaezp jspaezp closed this Dec 5, 2024