-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ap/sppid protid delim #84
base: main
Are you sure you want to change the base?
Conversation
modules/local/annotate_uniprot.nf
Outdated
def spp = "${meta.id}" | ||
def is_uniprot = "${meta.uniprot}" | ||
def project_dir = "${projectDir}" | ||
def spp_prot_delim = "${params.sppid_protid_delim}"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be confusing later on to have these arg names be slightly different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call - probably better to be consistent throughout!
"sppid_protid_delim": { | ||
"type": "string", | ||
"default": "_", | ||
"description": "Option specifying the character that delimits the unique species and protein IDs. This delimiter MUST NOT occur within either the species ID or any protein ID." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth throwing an error if the delimiter is already in the string (e.g., if we find it twice in the string)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, it could be - alternatively, would it be worth just internally rename things in some consistent way if we catch these exceptions?
…ap/sppid_protid_delim
…, fail and print error if not
Okay, so I've made a number of changes, and this now works as anticipated.
I've not yet added in a check at the onset of the workflow to make sure that the sequence headers are named properly, though I have included a check to make sure it's actually in the sequence IDs, and stop the workflow if it's not, printing a useful error message to output in this case. I think we can make these checks a fair bit more extensive, but doing something like this could be part of a larger effort to build in checks throughout the workflow. |
…ap/sppid_protid_delim
Signed-off-by: Austin Patton <austin.patton@arcadiascience.com>
Okay, so as we briefly discussed, this is a (relatively) simple change to use an updated naming convention for protein IDs, made to be consistent with the snakemake preprocessing workflow.
Old convention was:
Genus_species:proteinID
The colon got replaced by an underscore by orthofinder, which made splitting the species and protein ID more challenging.
Now, the convention is:
Genus-species_proteinID
The changes I implemented here basically just parameterize the delimiter, making
_
the default, but splitting the two identifiers using the parameter value within the annotation module.I haven't actually tested it yet (hence the draft PR), but will make an updated version of the test dataset that follows this convention so that I can do so.