
(feat) added auto handling of traditional pin and testing #126

Closed
wants to merge 7 commits

Conversation

@jspaezp (Collaborator) commented Sep 6, 2024

This PR restores backwards compatibility with 'traditional' pin files (file extension == ".pin", tab-delimited, tabs as protein separators, last column matching "protein(s)?" case-insensitively).

Basically, when a 'traditional' pin is detected, the separators in the protein column are fixed in memory and the fixed dataframe is used from then on. In practice this means that pin files with ragged edges don't get the benefits of the streaming dataset, but they are handled transparently in the background. In other words: if your data is small, there is no need to convert it; if it is large, it will break only if it has tabs as separators in the protein column, which I believe is a good compromise. A sketch of the idea is below.
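For illustration, a minimal sketch of the detect-then-repair flow described above. `is_traditional_pin` is the check referenced later in this conversation; `fix_ragged_pin_lines` and the `":"` protein separator are hypothetical names chosen for the sketch, not necessarily what `mokapot/tabular_data.py` actually does:

```python
import io
import re


def is_traditional_pin(path: str) -> bool:
    """Heuristic check for a 'traditional' pin: .pin extension, tab-delimited,
    last header column named like 'protein(s)' (case-insensitive)."""
    if not path.lower().endswith(".pin"):
        return False
    with open(path) as f:
        header = f.readline().rstrip("\n").split("\t")
    return re.fullmatch(r"proteins?", header[-1], flags=re.IGNORECASE) is not None


def fix_ragged_pin_lines(path: str) -> io.StringIO:
    """Rejoin tab-split protein fields into the last column, in memory,
    so every row has as many columns as the header."""
    out_lines = []
    with open(path) as f:
        header = f.readline().rstrip("\n").split("\t")
        num_cols = len(header)
        out_lines.append("\t".join(header))
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Everything past the last expected column is a protein that was
            # split on a tab; glue the extras back with a non-tab separator.
            fixed = fields[: num_cols - 1] + [":".join(fields[num_cols - 1 :])]
            out_lines.append("\t".join(fixed))
    return io.StringIO("\n".join(out_lines))
```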

This PR also incorporates the changes in #125 (sorry for the large auto-formatted changes).

codecov bot commented Sep 7, 2024

Codecov Report

Attention: Patch coverage is 87.30159% with 8 lines in your changes missing coverage. Please review.

Project coverage is 85.68%. Comparing base (6b8fbfd) to head (3214c24).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| mokapot/tabular_data.py | 85.71% | 8 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
+ Coverage   84.60%   85.68%   +1.08%     
==========================================
  Files          26       25       -1     
  Lines        2949     2928      -21     
==========================================
+ Hits         2495     2509      +14     
+ Misses        454      419      -35     


@jspaezp jspaezp marked this pull request as ready for review September 7, 2024 05:19
@jspaezp jspaezp requested a review from wfondrie September 7, 2024 06:00
```python
header_names = None
num_cols = None

out_lines = []
```
Contributor

This in-memory accumulation breaks the assumption that we can stream independently of input file size.
For example, running mokapot on a PIN file from 10x Astral files would consume a lot of memory.

@jspaezp (Collaborator, Author) commented Sep 18, 2024

I agree, but my general thought was: if you care about that, you can reformat the file to any of the formats that support streaming (fix the separator and rename to .tsv/.csv in the simplest of cases). I could add a check so that, if more than 50k lines are parsed, the user is notified that the file is large and that a conversion might be a good idea (or that they should use Comet with its separator parameter; I cannot think of many other search engines right now that keep tabs as the protein separator). A sketch of such a notice is below.
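A minimal sketch of that notification, assuming a hypothetical `maybe_warn_large_pin` helper and the 50k-line threshold mentioned above (neither exists in mokapot; both are illustrative):

```python
import logging

LOGGER = logging.getLogger(__name__)
LARGE_PIN_LINES = 50_000  # illustrative threshold from the discussion above


def maybe_warn_large_pin(lines_parsed: int, path: str) -> None:
    """Emit a one-time notice once the in-memory fix has parsed many lines."""
    if lines_parsed == LARGE_PIN_LINES:
        LOGGER.warning(
            "%s has more than %d lines; converting it to .tsv/.csv would let "
            "mokapot stream it instead of fixing it in memory.",
            path,
            LARGE_PIN_LINES,
        )
```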

The alternatives I thought of were to:

  1. Dump the lines to a temp file and then read it with any of the streaming alternatives. But since this only really matters for large files, I didn't like the idea of duplicating a massive file that might or might not get cleaned up correctly.
  2. Generate a non-temp file in the same directory (it felt a bit intrusive to clutter the user's directory).
  3. Error out and point the user to a sub-command that handles the conversion (very non-backwards-compatible, while the main purpose of this PR is to restore backwards compatibility).

Also note that this will only run if is_traditional_pin is true, so if the file has no "ragged edges" (a consistent number of tabs), it will use the normal streaming version.
Do you guys use the streaming features with plain-text .pin files?

Contributor

We do not use PIN files as inputs internally anymore, so this should be fine. I was just concerned that the claim that mokapot is fully streamed could then no longer be made for traditional pins. My pin_to_tsv conversion is streamed (a sketch of what that looks like is below). I am unclear on how the streamed solution breaks backwards compatibility; that was definitely not intended.
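For illustration, a minimal sketch of a streamed pin-to-tsv conversion. `pin_to_tsv` is named after the conversion mentioned above, but the signature and the `":"` protein separator are assumptions, not the actual implementation:

```python
import csv


def pin_to_tsv(pin_path: str, tsv_path: str, protein_sep: str = ":") -> None:
    """Stream a traditional pin to a well-formed TSV one line at a time,
    rejoining ragged protein columns without loading the whole file."""
    with open(pin_path) as src, open(tsv_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        header = src.readline().rstrip("\n").split("\t")
        num_cols = len(header)
        writer.writerow(header)
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Memory use stays constant in the file size: each fixed row is
            # written out immediately instead of being accumulated.
            writer.writerow(
                fields[: num_cols - 1] + [protein_sep.join(fields[num_cols - 1 :])]
            )
```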

The aim of my implementation was to decouple historic data formats (such as traditional_pin) in preprocessing and move to a single input specification, so that internally mokapot only needs to handle one format. In this MR there are a few places where we have this internal handling between .tsv and .pin.
I have no hard feelings about either approach, but this might also conflict with the confidence-streaming MR that is upcoming from Elmar. We are aiming for that to be up at the end of the week.

Maybe we can discuss how to best move forward in a call?

@jspaezp jspaezp requested a review from gessulat September 18, 2024 19:29
@jspaezp (Collaborator, Author) commented Dec 5, 2024

Closing in favor of #132

@jspaezp jspaezp closed this Dec 5, 2024