Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ap/sppid protid delim #84

Draft
wants to merge 23 commits into
base: main
Choose a base branch
from
Draft

Ap/sppid protid delim #84

wants to merge 23 commits into from

Conversation

austinhpatton
Copy link
Collaborator

Okay, so as we briefly discussed, this is a (relatively) simple change to use an updated naming convention for protein IDs, made to be consistent with the snakemake preprocessing workflow.

Old convention was: Genus_species:proteinID

The colon got replaced by an underscore by orthofinder, which made splitting the species and protein ID more challenging.

Now, the convention is: Genus-species_proteinID

The changes I implemented here basically just parameterize the delimiter, making _ the default, but splitting the two identifiers using the parameter value within the annotation module.

I haven't actually tested it yet (hence the draft PR), but will make an updated version of the test dataset that follows this convention so that I can do so.

def spp = "${meta.id}"
def is_uniprot = "${meta.uniprot}"
def project_dir = "${projectDir}"
def spp_prot_delim = "${params.sppid_protid_delim}""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be confusing later on to have these arg names be slightly different?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call - probably better to be consistent throughout!

"sppid_protid_delim": {
"type": "string",
"default": "_",
"description": "Option specifying the character that delimits the unique species and protein IDs. This delimiter MUST NOT occur within either the species ID or any protein ID."
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth throwing an error if the delimiter is already in the string (e.g., if we find it twice in the string)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, it could be - alternatively, would it be worth just internally rename things in some consistent way if we catch these exceptions?

@austinhpatton
Copy link
Collaborator Author

Okay, so I've made a number of changes, and this now works as anticipated.

  1. I've made the naming of the sppid_protid_delim parameter consistent throughout
  2. The delimiter is provided as input to the cogeqc R script, which is then used to split the sequence headers - this works using either naming convention.

I've not yet added in a check at the onset of the workflow to make sure that the sequence headers are named properly, though I have included a check to make sure it's actually in the sequence IDs, and stop the workflow if it's not, printing a useful error message to output in this case. I think we can make these checks a fair bit more extensive, but doing something like this could be part of a larger effort to build in checks throughout the workflow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants