Ap/sppid protid delim #84

austinhpatton · 2023-09-26T20:21:55Z

Okay, so as we briefly discussed, this is a (relatively) simple change to use an updated naming convention for protein IDs, made to be consistent with the snakemake preprocessing workflow.

Old convention was: Genus_species:proteinID

OrthoFinder replaced the colon with an underscore, which made it harder to split the species ID from the protein ID.

Now, the convention is: Genus-species_proteinID

The changes here parameterize the delimiter, with _ as the default, and split the two identifiers on the parameter value within the annotation module.
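To illustrate why the old convention became ambiguous and how a parameterized delimiter helps, here is a minimal Python sketch. It is not the pipeline's actual Nextflow or R code; the function name `split_seq_id` and the example IDs are hypothetical.

```python
from typing import Tuple


def split_seq_id(seq_id: str, delim: str = "_") -> Tuple[str, str]:
    """Split a sequence ID into (species ID, protein ID) at the first delimiter."""
    species, sep, protein = seq_id.partition(delim)
    if not sep or not species or not protein:
        raise ValueError(f"Cannot split {seq_id!r} on delimiter {delim!r}")
    return species, protein


# New convention: Genus-species_proteinID splits cleanly on "_".
new_style = split_seq_id("Homo-sapiens_P12345")

# Old convention after the colon has been rewritten to an underscore:
# the first "_" now falls inside the species name, so the split is wrong.
old_style_mangled = split_seq_id("Homo_sapiens_P12345")

print(new_style)
print(old_style_mangled)
```

The second call shows the failure mode described above: once the colon becomes an underscore, the species name and the protein ID can no longer be separated reliably.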

I haven't actually tested it yet (hence the draft PR), but will make an updated version of the test dataset that follows this convention so that I can do so.

bhatman · 2023-09-26T21:13:59Z

modules/local/annotate_uniprot.nf

+    def spp            = "${meta.id}"
+    def is_uniprot     = "${meta.uniprot}"
+    def project_dir    = "${projectDir}"
+    def spp_prot_delim = "${params.sppid_protid_delim}"


Would it be confusing later on to have these arg names be slightly different?

austinhpatton · 2023-09-27

Good call. It's probably better to be consistent throughout!

bhatman · 2023-09-26T21:15:17Z

nextflow_schema.json

+                "sppid_protid_delim": {
+                    "type": "string",
+                    "default": "_",
+                    "description": "Option specifying the character that delimits the unique species and protein IDs. This delimiter MUST NOT occur within either the species ID or any protein ID."


Is it worth throwing an error if the delimiter already occurs in the string (e.g., if we find it twice)?

austinhpatton · 2023-09-27

Hmmm, it could be. Alternatively, would it be worth just renaming things internally in some consistent way if we catch these exceptions?
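The stricter check suggested here could look something like the following Python sketch. It is illustrative only and not what the PR implements; the function name `check_delimiter` and the example IDs are hypothetical.

```python
def check_delimiter(seq_id: str, delim: str = "_") -> None:
    """Raise ValueError unless the delimiter occurs exactly once in the ID."""
    count = seq_id.count(delim)
    if count != 1:
        raise ValueError(
            f"Expected delimiter {delim!r} exactly once in {seq_id!r}, "
            f"found it {count} time(s)"
        )


# Hypothetical IDs: one valid, one with a repeated delimiter, one with none.
for candidate in ("Homo-sapiens_P12345", "Homo_sapiens_P12345", "Homo-sapiensP12345"):
    try:
        check_delimiter(candidate)
        print(f"{candidate}: ok")
    except ValueError as err:
        print(f"{candidate}: {err}")
```

Failing loudly keeps the behaviour predictable; silently renaming IDs, the alternative raised above, would need a rule that is itself unambiguous.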


austinhpatton · 2023-10-02T21:09:27Z

Okay, so I've made a number of changes, and this now works as anticipated.

- I've made the naming of the sppid_protid_delim parameter consistent throughout.
- The delimiter is provided as input to the cogeqc R script, which then uses it to split the sequence headers. This works with either naming convention.

I haven't yet added a check at the outset of the workflow to confirm that the sequence headers are named properly. I have, however, included a check that the delimiter is actually present in the sequence IDs; if it isn't, the workflow stops and prints a useful error message. I think we can make these checks a fair bit more extensive, but that could be part of a larger effort to build in checks throughout the workflow.
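A presence check of this kind, run over FASTA headers before anything else, might be sketched in Python as below. This is an assumption-laden illustration and not the check added in this PR: the function name `headers_missing_delim`, the in-memory example records, and the error wording are all hypothetical.

```python
import io
from typing import Iterable, List


def headers_missing_delim(fasta_lines: Iterable[str], delim: str = "_") -> List[str]:
    """Return the sequence IDs whose FASTA header lacks the delimiter."""
    missing = []
    for line in fasta_lines:
        if line.startswith(">"):
            # Treat the first whitespace-separated token after '>' as the ID.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if delim not in seq_id:
                missing.append(seq_id)
    return missing


# Hypothetical two-record FASTA; the second header uses the old colon style.
example = io.StringIO(
    ">Homo-sapiens_P12345 some description\nMKT\n"
    ">Musmusculus:Q9XYZ1\nMAA\n"
)
bad = headers_missing_delim(example)
if bad:
    print(f"ERROR: delimiter '_' not found in sequence IDs: {', '.join(bad)}")
```

In a real workflow the same function could be pointed at an open file handle, and the caller would stop the run instead of only printing.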


Signed-off-by: Austin Patton <austin.patton@arcadiascience.com>

austinhpatton added 3 commits September 25, 2023 22:37

add spp/protein id delimiter parameter

24fbfcb

add species/protein ID delimiter param to schema

9c75943

use param to split spp and prot ids

eb5cd62

austinhpatton requested a review from mertcelebi September 26, 2023 20:21

bhatman reviewed Sep 26, 2023


austinhpatton added 12 commits September 27, 2023 21:58

make var names consistent

d9e898e

conditionally rename species based on delimiter

e8a280f

conditionally rename species based on delimiter

1f431e2

meoved trailing quotation

63624bc

update to flexibly use spp/prot id delimiter

a289ee0

fix typo

10631e1

fix typo

c98d07d

move treefile creation out of if statement

17d0c41

Merge branch 'main' of github.com:Arcadia-Science/phylorthology into ap/sppid_protid_delim

4d16536

delimit spp/prot ids using user-specified delimiter

922b45d

provide user-specified spp/prot ID delimiter param to cogeqc Rscript

52dd39c

add check to see if the sppid/protid delimiter is in the sequence IDs, fail and print error if not

bb4cb52

austinhpatton added 8 commits October 6, 2023 18:12

fix to schema file for nf-core compatibility

d75d945

Merge branch 'main' of github.com:Arcadia-Science/phylorthology into ap/sppid_protid_delim

7abf125

pull out UniProt accessions from snakemake headers

c85db4a

skip OMA annotations without error as needed

7e51613

downscale mafft threaduse on retry

ceca2d3

slight update to process specs

205075e

fixed path specification

702e262

update iqtree process specs

38e263e
