(feat) added auto handling of traditional pin and testing #126
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
+ Coverage   84.60%   85.68%   +1.08%
==========================================
  Files          26       25       -1
  Lines        2949     2928      -21
==========================================
+ Hits         2495     2509      +14
+ Misses        454      419      -35

☔ View full report in Codecov by Sentry.
header_names = None
num_cols = None

out_lines = []
This in-memory accumulation breaks the assumption that we can stream independently of the input file size.
For example, running Mokapot on a PIN file from 10x Astral files would consume a lot of memory.
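To make the concern concrete, here is a minimal sketch (hypothetical names, not Mokapot's actual API) of a generator-based alternative that repairs the separators lazily, so memory use stays constant in the number of rows instead of growing with the file:

```python
from typing import Iterator, TextIO


def fix_pin_lines(handle: TextIO, num_cols: int, sep: str = ":") -> Iterator[str]:
    """Yield each line with overflow tab-separated protein fields merged
    back into the final column. `sep` as the join character is an
    assumption for illustration."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Merge any fields beyond the expected column count into the last one.
        fixed = fields[:num_cols - 1] + [sep.join(fields[num_cols - 1:])]
        yield "\t".join(fixed) + "\n"
```

Because it yields one repaired line at a time, a downstream streaming reader can consume it without ever materializing the whole file.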
I agree, but my general thought was: if you care about that, you can reformat the file to any of the formats that support streaming (fix the separator and rename to .tsv/.csv in the simplest cases). I could add a notice so that, if more than 50k lines are parsed, the user is told the file is large and a conversion might be a good idea (or to use Comet with the separator parameter; I cannot think of many other search engines right now that keep tabs as the protein separator).
The alternatives I considered were to:
- Have a temp file where the lines are dumped and then read with any of the streaming alternatives. But since this only really matters for large files, I didn't like the idea of duplicating a massive file that might or might not get cleaned up correctly.
- Have a non-temp file generated in the same directory (it felt a bit intrusive to clutter the user's directory).
- Error out and point the user to a sub-command that handles the conversion (very non-backwards-compatible, and the main purpose of this PR is to restore backwards compatibility).
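For reference, the temp-file alternative from the first bullet could look roughly like this (a hypothetical helper, not part of this PR; the ":" join character is an assumption):

```python
import os
import tempfile


def pin_to_temp_tsv(pin_path: str) -> str:
    """Write a separator-fixed copy of `pin_path` to a temporary .tsv and
    return its path. The caller must delete it, which is exactly the
    cleanup concern mentioned above."""
    fd, tmp_path = tempfile.mkstemp(suffix=".tsv")
    with open(pin_path) as src, os.fdopen(fd, "w") as dst:
        header = src.readline()
        num_cols = header.count("\t") + 1
        dst.write(header)
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Merge overflow fields back into the final protein column.
            fields = fields[:num_cols - 1] + [":".join(fields[num_cols - 1:])]
            dst.write("\t".join(fields) + "\n")
    return tmp_path
```

The downside stated above applies as written: for a large input this duplicates the file on disk, and nothing guarantees the copy is removed if the process dies.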
Also note that this will only run if is_traditional_pin
is true, so if the file has no "ragged edges" (i.e. a consistent number of tabs per line), it will use the normal streaming version.
Do you guys use the streaming features with plain-text .pin files?
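A rough sketch of what such an is_traditional_pin check could look like (the heuristics and names here are assumptions for illustration, not Mokapot's implementation): a ".pin" file whose last header column matches "protein(s)" case-insensitively and whose data rows have ragged edges.

```python
import re


def is_traditional_pin(path: str) -> bool:
    """Guess whether `path` is a 'traditional' PIN file needing the
    separator fix."""
    if not path.lower().endswith(".pin"):
        return False
    with open(path) as handle:
        header = handle.readline().rstrip("\n").split("\t")
        # Last header column must be "protein" or "proteins", any case.
        if not re.fullmatch(r"proteins?", header[-1], flags=re.IGNORECASE):
            return False
        num_cols = len(header)
        # Ragged edge: a data row with more tab-separated fields than the header.
        return any(len(line.rstrip("\n").split("\t")) > num_cols for line in handle)
```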
We do not use PIN files as inputs internally anymore, so this should be fine. I was just concerned that the claim that Mokapot is fully streamed could then not be made for traditional PINs. My pin_to_tsv conversion is streamed. I think I am unclear on how the streamed solution breaks backwards compatibility; that was definitely not intended.
The aim of my implementation was to decouple historic data formats (such as traditional_pin) in preprocessing and move to a single input specification, so that internally Mokapot only needs to handle one format. In this MR there are a few places where we have this internal handling of both .tsv and .pin.
I have no hard feelings about either approach, but this might also conflict with the confidence-streaming MR that is upcoming from Elmar. We are aiming for that to be up at the end of the week.
Maybe we can discuss how best to move forward in a call?
Closing in favor of #132
This PR fixes backwards compatibility with 'traditional' PIN files (file extension == ".pin", tab-delimited, tabs as protein separators, last column matching "protein(s)?", case-insensitive).
Basically, when it detects a 'traditional' PIN, it 'fixes' the separators in the protein column in memory and moves forward with the fixed dataframe. In practice this means that PIN files with ragged edges don't get the benefits of the streaming dataset, but they are handled transparently in the background (in other words: if your data is small, there is no need to convert it; if it is large, it will break IF it has tabs as separators in the protein column, which I believe is a good compromise).
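A minimal sketch of the in-memory fix described above (hypothetical names, not the PR's exact code; ":" as the protein-join character is an assumption):

```python
def fix_traditional_pin(path: str) -> list[str]:
    """Read a traditional PIN file and return its lines with extra
    tab-separated protein fields merged back into the last column."""
    with open(path) as handle:
        header = handle.readline()
        num_cols = header.count("\t") + 1
        out_lines = [header]
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Merge overflow fields into the final protein column.
            fields = fields[:num_cols - 1] + [":".join(fields[num_cols - 1:])]
            out_lines.append("\t".join(fields) + "\n")
    return out_lines
```

As discussed in the review above, the accumulated list keeps the whole file in memory, which is the trade-off this approach accepts for small inputs.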
This PR also incorporates the changes in #125 (sorry for the large auto-formatted changes).