Integrate other databases #6

Open
jacobwindsor opened this issue Mar 12, 2017 · 12 comments

@jacobwindsor
Owner

jacobwindsor commented Mar 12, 2017

Currently, this only ranks compounds through PubChem's API. It would be nice to use other APIs to rank compounds as well, and then we should probably rename this project too. We would have to discuss how other databases are combined: for example, simply rank by the total number of "hits" across all databases, or allow filtering of search parameters. The algorithm probably needs to be a bit more complex to give an accurate indication of the amount of data available for each compound in the dataset.
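
To make the simplest option concrete, here is a minimal sketch of ranking by total hits across databases. The function name `rank_by_total_hits` and the input shape (a dict of per-database hit counts) are assumptions for illustration, not anything the ranker currently implements.

```python
from collections import Counter


def rank_by_total_hits(hits_per_database):
    """Rank compounds by the sum of their hit counts over all databases.

    hits_per_database is a hypothetical structure:
    {database_name: {compound_id: hit_count}}
    """
    totals = Counter()
    for hits in hits_per_database.values():
        # Counter.update adds counts rather than replacing them
        totals.update(hits)
    # Highest total first; ties are broken by compound ID for a stable order
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

A compound missing from one database simply contributes nothing for that database, so this treats "no record" and "zero hits" the same way, which may or may not be what we want.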

Other databases (please add):

@jacobwindsor
Owner Author

After some preliminary research it seems MetaCyc is the easiest to add since they have a REST API. They even have a nice service to search for foreign keys (e.g. PubChem or KEGG), see here.

However, the only issue is that you have to search on an organism-specific basis. The URL to search is something like:

http://websvc.biocyc.org/[ORGID]/foreignid?ids=[DATABASE-NAME]:[FOREIGNID] 

Where ORGID is the organism ID.
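
As a rough sketch, the URL pattern above could be built like this. The helper name `foreignid_url` and the example values are assumptions; only the URL shape comes from the pattern above.

```python
from urllib.parse import urlencode

BIOCYC_BASE = "http://websvc.biocyc.org"


def foreignid_url(org_id, database, foreign_id):
    """Build a foreign-ID lookup URL following the pattern above.

    org_id, database and foreign_id are supplied by the caller; the
    values used in any example call are placeholders, not verified IDs.
    """
    # Keep the colon literal so the query reads DATABASE-NAME:FOREIGNID
    query = urlencode({"ids": f"{database}:{foreign_id}"}, safe=":")
    return f"{BIOCYC_BASE}/{org_id}/foreignid?{query}"
```

For example, `foreignid_url("HUMAN", "PUBCHEM", "5793")` would target a human-only lookup, assuming `HUMAN` is the right organism ID for that.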

@DeniseSl22 Is it okay to make the ranker usable only for human datasets for now? It should be easy to add other organisms in the future. However, bear in mind that the more databases we add, the harder it will be to keep the organism restriction broad, since some databases may support fewer organisms.

@DeniseSl22
Collaborator

Yeah sure. Is PubChem then searched for humans only as well?
Perhaps we can add an option in the future where people can say which organism they want to filter on ;)

@DeniseSl22
Collaborator

Oh btw, Egon just told me there is a new service (I will get the details by mail) which allows automated searching through articles (for a lot of publishers, though not Elsevier). Perhaps we can do something with that as well (I remember you told me that a specific literature search was really missing when you were looking at the VOCs dataset).

@DeniseSl22
Collaborator

Here's the info from Egon:
CrossRef API (citation counts): https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md
EuroPubMedCentral API: http://europepmc.org/RestfulWebService#cites
Initiative for Open Citations: https://i4oc.org/
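
For the CrossRef option, a minimal sketch of reading a citation count could look like the following. It assumes we already have a DOI for each article (how to get from a compound to its DOIs is not covered here), and the helper names are hypothetical. The `is-referenced-by-count` field is part of the work metadata returned by the CrossRef REST API.

```python
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_work_url(doi):
    """Build the CrossRef REST API URL for a single work."""
    # quote() leaves "/" intact by default, so the DOI stays readable
    return f"{CROSSREF_WORKS}/{quote(doi)}"


def citation_count(payload):
    """Extract the citation count from a decoded CrossRef work response.

    payload is the parsed JSON body of a /works/{doi} request.
    """
    return int(payload["message"]["is-referenced-by-count"])
```

Fetching and JSON-decoding the response is left to the caller, so the same parsing can be reused with whatever HTTP client the ranker ends up using.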

@jacobwindsor
Owner Author

Hmm, cool! CrossRef is, I guess, the best known, so we can integrate that first.

@jacobwindsor
Owner Author

jacobwindsor commented Apr 24, 2017

Using MetaCyc, the flow is:

  1. Get the MetaCyc ID using the PubChem ID with `https://metacyc.org/META/foreignid?ids=PUBCHEM:&fmt=json`
  2. Retrieve the set of MetaCyc objects concerning that compound with `http://websvc.biocyc.org/apixml?fn=[API-FUNCTION]&id=[ORGID]:[OBJECT-ID]&detail=[none|low|full]`
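
A minimal sketch of building the step 2 URL, assuming the parameter names shown in the pattern above. The helper name `apixml_url` and the object ID in the example are hypothetical.

```python
from urllib.parse import urlencode

APIXML_URL = "http://websvc.biocyc.org/apixml"
DETAIL_LEVELS = ("none", "low", "full")


def apixml_url(function, org_id, object_id, detail="none"):
    """Build an apixml query URL following the pattern in step 2."""
    if detail not in DETAIL_LEVELS:
        raise ValueError(f"detail must be one of {DETAIL_LEVELS}, got {detail!r}")
    params = {"fn": function, "id": f"{org_id}:{object_id}", "detail": detail}
    # Keep the colon literal so the id reads ORGID:OBJECT-ID
    return f"{APIXML_URL}?{urlencode(params, safe=':')}"
```

For example, `apixml_url("pathways-of-compound", "META", "CPD-1")` builds the query for a placeholder object ID `CPD-1`.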

The second step is what needs to be discussed. What information do we actually want to retrieve from MetaCyc? If you see here, there is quite a lot we can do.

The obvious ones are:

  • pathways-of-compound
  • reactions-of-compound

But there are some others in this list that could be interesting. Potentially, you can go as deep as you like, taking the ID required for the next query from the result of the previous query.

  • all-products-of-gene
  • binding-site-transcription-factors
  • chromosome-of-gene
  • compounds-of-pathway
  • containers-of
  • containing-tus
  • direct-activators
  • direct-inhibitors
  • enzymes-of-gene
  • enzymes-of-pathway
  • enzymes-of-reaction
  • genes-of-pathway
  • genes-of-protein
  • genes-of-reaction
  • genes-regulated-by-gene
  • genes-regulating-gene
  • modified-containers
  • modified-forms
  • monomers-of-protein
  • pathways-of-compound
  • pathways-of-gene
  • reactions-of-compound
  • reactions-of-enzyme
  • reactions-of-gene
  • regulator-proteins-of-transcription-unit
  • regulon-of-protein
  • substrates-of-reaction
  • top-containers
  • transcription-unit-activators
  • transcription-unit-binding-sites
  • transcription-unit-genes
  • transcription-unit-inhibitors
  • transcription-unit-mrna-binding-sites
  • transcription-unit-promoter
  • transcription-unit-terminators
  • transcription-unit-transcription-factors
  • transcription-units-of-gene
  • transcription-units-of-protein
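
The idea of going deeper by feeding IDs from one query into the next could be sketched roughly as below. The `query` callable is an assumption: it stands in for whatever code calls the API and returns the object IDs found, so nothing here depends on the real response format.

```python
def follow_chain(start_id, functions, query):
    """Apply each API function in turn to every ID from the previous step.

    query(function_name, object_id) is a caller-supplied (hypothetical)
    callable returning an iterable of object IDs.
    """
    current = [start_id]
    for function_name in functions:
        next_ids = []
        for object_id in current:
            for found in query(function_name, object_id):
                # De-duplicate while keeping first-seen order
                if found not in next_ids:
                    next_ids.append(found)
        current = next_ids
    return current
```

For example, `follow_chain(compound_id, ["reactions-of-compound", "enzymes-of-reaction"], query)` would go from a compound to its reactions and then to the enzymes of those reactions. Each extra level multiplies the number of requests, so the depth probably needs a limit in practice.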

@egonw and @DeniseSl22 could you provide some input?

@egonw

egonw commented Apr 26, 2017

I would go for the number of pathways and the number of substrates...
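
A rough sketch of turning those two counts into a score, assuming the XML responses list each result as a direct child of the root element. The tag names `Pathway` and `Compound` and the plain sum are assumptions to check against real responses, not confirmed details of the API.

```python
import xml.etree.ElementTree as ET


def count_results(xml_text, tag):
    """Count direct children of the root element with the given tag."""
    root = ET.fromstring(xml_text)
    return len(root.findall(tag))


def compound_score(pathways_xml, substrates_xml):
    """Sum the pathway and substrate counts (tag names are assumed)."""
    return (count_results(pathways_xml, "Pathway")
            + count_results(substrates_xml, "Compound"))
```

A plain sum weights pathways and substrates equally; whether that is a fair measure is part of what still needs discussing.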

@DeniseSl22
Collaborator

Hi Jacob,

I just found out on the ChEBI website that they have an API.
Perhaps it would be useful to add this to the ranker program?

https://www.ebi.ac.uk/chebi/libchebi.do

@jacobwindsor
Owner Author

Oh wow! How did I not see that?

For my reference: here's the API library for Python

@DeniseSl22
Collaborator

DeniseSl22 commented May 16, 2017 via email

@DeniseSl22
Collaborator

Oh, and another one I came across (HMDB API):
https://github.com/mzmine/mzmine2/issues/195

I think you didn't look at this because Egon already checked whether the compounds were in HMDB and ChEBI (which a lot of them weren't). So this could help other people find out which compounds they do not have to investigate any further :)
