flavoenzymes

⚠️ Important:

  • Run all commands from the root of the repo.
  • On Windows, replace / with \ when specifying paths.
  • If you normally run python3 instead of python, use python3 in the commands below, especially if you have trouble setting up the virtual environment.

Getting Started

Prerequisites

  1. You must have Python 3.3 or above.
    • Check by running python3 --version or python --version.
  2. You must have pip installed.
    • Check by running pip3 --version or pip --version.
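
The same two checks can also be run from inside Python. This is a minimal sketch, not part of the repository: the helper names python_ok and pip_available are hypothetical, and the minimum version is taken from the prerequisite above.

```python
import sys

MIN_PYTHON = (3, 3)  # minimum version stated in the prerequisites above


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


def pip_available():
    """Return True if pip can be imported by the current interpreter."""
    try:
        import pip  # noqa: F401  (imported only to test availability)
    except ImportError:
        return False
    return True


print("Python version OK:", python_ok())
print("pip available:", pip_available())
```

If either line prints False, fix that prerequisite before continuing with the quick start.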

Quick start

python env_setup.py
# IMPORTANT: activate your virtual environment using instructions printed from the command above
pip install -r requirements.txt
python scrape_flavoenzymes.py
If you get stuck, follow these instructions:

Virtual environment setup

  1. Create virtual environment.
    • python modules/helpers/env_setup.py
  2. Activate the virtual environment

    If you skip this step, all packages will be installed into your global environment. If you are OK with that, move on to step 3.

    • On MacOS or Linux run:
      • source flav_env/bin/activate
    • On Windows run:
      • flav_env\Scripts\activate.bat
  3. Install dependencies within the environment.
    • pip install -r requirements.txt
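
For orientation, steps 1 and 2 can be sketched with Python's standard venv module. This is an illustration under assumptions, not the contents of env_setup.py: the function names are hypothetical, and only the flav_env directory name and the two activation commands come from the steps above.

```python
import os
import venv

ENV_DIR = "flav_env"  # environment name used in the activation commands above


def activation_command(env_dir=ENV_DIR, os_name=os.name):
    """Return the command that activates the environment on the given OS."""
    if os_name == "nt":  # Windows
        return env_dir + r"\Scripts\activate.bat"
    return "source " + env_dir + "/bin/activate"  # macOS or Linux


def create_env(env_dir=ENV_DIR):
    """Create the virtual environment with pip installed inside it."""
    venv.create(env_dir, with_pip=True)
    return activation_command(env_dir)


# Only prints the command; call create_env() to actually build the environment.
print("To activate, run:", activation_command())
```

Activation has to happen in your own shell, so a setup script can only print the command for you to run.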

Run the pipeline

Scraping all the data

python scrape_flavoenzymes.py

More information:
  • This tries to scrape all the information from every website that has been configured.
  • If an existing file is found at ./export/scraped_flavoenzymes.json, the program only updates it when new entries are found.
  • Inside modules/scrapers you can find blacklist.csv and whitelist.csv. These files let you list enzymes that should always be skipped or always be fetched. Try this approach before hardcoding anything in the code.
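
The behaviour described above can be pictured with a short sketch. It is not the scraper's actual code: the function names and sample data are hypothetical, the CSV parsing is left out, and it assumes that the blacklist wins when a name appears in both lists.

```python
def select_enzymes(candidates, blacklist, whitelist):
    """Drop blacklisted names, then append whitelisted names not yet present.

    Assumption: a name on both lists is skipped (blacklist takes precedence).
    """
    blocked = set(blacklist)
    selected = [name for name in candidates if name not in blocked]
    for name in whitelist:
        if name not in blocked and name not in selected:
            selected.append(name)
    return selected


def merge_new_entries(existing, scraped):
    """Keep existing entries untouched and add only names not seen before.

    Returns the merged dict and the sorted list of newly added names.
    """
    added = {name: data for name, data in scraped.items() if name not in existing}
    merged = dict(existing)
    merged.update(added)
    return merged, sorted(added)


# Hypothetical data, for illustration only.
existing = {"enzyme_a": {"SUBSTRATE": ["x"]}}
scraped = {"enzyme_a": {"SUBSTRATE": ["changed"]}, "enzyme_b": {"SUBSTRATE": ["y"]}}
merged, added = merge_new_entries(existing, scraped)
print("newly added:", added)
print("selected:", select_enzymes(["enzyme_a", "enzyme_c"], ["enzyme_c"], ["enzyme_d"]))
```

In this sketch the export file would only be rewritten when the list of added names is non-empty.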

Loading data into Neo4j

Here is a list of useful commands to run.

Importing files

Create from URL

WITH "https://raw.githubusercontent.com/supervanya/flavoenzymes/master/export/kegg.json" AS url

Create from local file

WITH "kegg.json" AS url

Create from JSON

If creating from a local file, replace the link with the file name and place the file in Neo4j's import folder.

WITH "https://raw.githubusercontent.com/supervanya/flavoenzymes/master/export/kegg.json" AS url
CALL apoc.load.json(url) YIELD value AS enzymes
UNWIND keys(enzymes) AS enzName
MERGE (e:Enzyme {name: enzName})

FOREACH (subsName IN enzymes[enzName].SUBSTRATE |
    MERGE (s:Substrate {name: subsName})
    MERGE (s)<-[:binds]-(e)
)

FOREACH (prodName IN enzymes[enzName].PRODUCT |
    MERGE (p:Product {name: prodName})
    MERGE (p)<-[:releases]-(e)
)
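
The import query above implies a JSON layout keyed by enzyme name, with SUBSTRATE and PRODUCT lists under each key. The following sketch, assuming that layout, lists the relationships the query would create, which can help when checking an export before loading it. The function name and the sample data are hypothetical.

```python
def enzyme_edges(enzymes):
    """Yield (enzyme, relationship, target_label, target_name) tuples.

    Mirrors the two FOREACH clauses: one 'binds' edge per substrate and
    one 'releases' edge per product.
    """
    for enz_name, record in enzymes.items():
        for subs_name in record.get("SUBSTRATE", []):
            yield (enz_name, "binds", "Substrate", subs_name)
        for prod_name in record.get("PRODUCT", []):
            yield (enz_name, "releases", "Product", prod_name)


# Hypothetical sample in the assumed layout, for illustration only.
sample = {
    "enzyme_a": {"SUBSTRATE": ["s1", "s2"], "PRODUCT": ["p1"]},
    "enzyme_b": {},
}

for edge in enzyme_edges(sample):
    print(edge)
```

To check a real export, load it with json.load and pass the resulting dict to enzyme_edges.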

Queries

Show all nodes (the display is limited to 300 nodes, or to the limit in your settings)

MATCH (n) RETURN n

25 enzymes with anything they bind

MATCH (n:Enzyme) 
RETURN (n)-[:binds]->()
LIMIT 25

25 enzymes with anything they bind and release

MATCH (n:Enzyme)
RETURN ()<-[:releases]-(n)-[:binds]->()
LIMIT 25

Specific enzyme with all links

MATCH p=(e:Enzyme)-->()
WHERE e.ec="ec:1.2.99.7"
RETURN p

All enzymes with their substrates

MATCH (e:Enzyme)
MATCH path = (e)-[]->(s:Substrate)
RETURN path;

Other Modules

BruceSorter: A CLI to help with sorting flavoenzymes and filtering out false positives