- Run all the commands from the root of the repo.
- On Windows, replace / with \ when specifying paths.
- If you normally run python3 instead of python, try it if you have difficulty setting up the virtual environment.
Prerequisites
- You must have Python 3.3 or above.
  - Check whether you do by running python3 --version or python --version.
- You must have pip installed.
  - Check whether you do by running pip3 --version or pip --version.
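Both prerequisite checks can also be run from Python itself. The following is a minimal sketch using only the standard library; the helper names python_ok and pip_available are illustrative and are not part of this repo.

```python
import sys

MIN_PYTHON = (3, 3)  # minimum version stated in the prerequisites


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def pip_available():
    """Return True if pip can be imported by this interpreter."""
    try:
        import pip  # noqa: F401
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("pip available:", pip_available())
```

If either check reports False, the interpreter you launched is probably not the one you meant to use; try the other command name (python3 or python).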
python env_setup.py
# IMPORTANT: activate your virtual environment using instructions printed from the command above
pip install -r requirements.txt
python scrape_flavoenzymes.py
If you get stuck, follow these instructions:
Virtual environment setup
- Create virtual environment.
python modules/helpers/env_setup.py
- Activate the virtual environment. If you don't, all packages will be installed to your global environment; if you are OK with that, skip this step.
- On MacOS or Linux run:
source flav_env/bin/activate
- On Windows run:
flav_env\Scripts\activate.bat
- Install dependencies within the environment.
pip install -r requirements.txt
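The contents of env_setup.py are not shown here, so the following is only a sketch of what creating the environment and printing the activation command could look like with the standard-library venv module. The environment name flav_env comes from the activation commands above; the function names are hypothetical.

```python
import os
import venv

ENV_DIR = "flav_env"  # name taken from the activation commands above


def create_env(env_dir=ENV_DIR):
    """Create a virtual environment with pip installed into it.

    Assumption: this mirrors what env_setup.py does; the real script may differ.
    """
    venv.create(env_dir, with_pip=True)


def activation_command(env_dir=ENV_DIR, os_name=os.name):
    """Return the activation command for the given platform."""
    if os_name == "nt":  # Windows
        return env_dir + r"\Scripts\activate.bat"
    return "source " + env_dir + "/bin/activate"  # macOS or Linux
```

Calling create_env() and then printing activation_command() would reproduce the two manual steps above. Keeping the path logic in a separate function makes it possible to check the Windows and POSIX commands without creating an environment.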
Scraping all the data
python scrape_flavoenzymes.py
More information:
- This will try to scrape all the information from all the websites that have been configured.
- If an existing file is found at ./export/scraped_flavoenzymes.json, the program will only update it if new entries are found.
- Inside modules/scrapers you can find blacklist.csv and whitelist.csv. These files let you list enzymes that should always be skipped or always fetched. Try this approach before hardcoding anything in the code.
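The update and filtering behaviour described above can be sketched as follows. This is not the scraper's actual code: the function names, the one-name-per-row CSV layout, and the choice to let the blacklist win over the whitelist are all assumptions made for illustration.

```python
import csv
import json
import os


def read_names(csv_path):
    """Read enzyme names from the first column of a CSV file (assumed layout)."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="") as f:
        return {row[0].strip() for row in csv.reader(f) if row and row[0].strip()}


def should_fetch(name, is_candidate, blacklist, whitelist):
    """Blacklisted names are always skipped; whitelisted names are always fetched."""
    if name in blacklist:  # assumption: blacklist takes precedence
        return False
    if name in whitelist:
        return True
    return is_candidate


def merge_new_entries(existing, scraped):
    """Add only entries that are not already present; return how many were added."""
    added = 0
    for name, data in scraped.items():
        if name not in existing:
            existing[name] = data
            added += 1
    return added


def update_export(path, scraped):
    """Rewrite the export file only when new entries were found."""
    existing = {}
    if os.path.exists(path):
        with open(path) as f:
            existing = json.load(f)
    added = merge_new_entries(existing, scraped)
    if added:
        with open(path, "w") as f:
            json.dump(existing, f, indent=2)
    return added
```

In this sketch an entry that already exists in the export is never overwritten, which matches "only update it if new entries are found" but may be stricter than what the real program does.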
Here is the list of useful commands to run
If creating from a local file, place the file within the import folder of Neo4j and replace the link with the file name, i.e. change
WITH "https://raw.githubusercontent.com/supervanya/flavoenzymes/master/export/kegg.json" AS url
to
WITH "kegg.json" AS url
WITH "https://raw.githubusercontent.com/supervanya/flavoenzymes/master/export/kegg.json" AS url
CALL apoc.load.json(url) YIELD value AS enzymes
UNWIND keys(enzymes) AS enzName
MERGE (e:Enzyme {name: enzName})
FOREACH (subsName IN enzymes[enzName].SUBSTRATE |
  MERGE (s:Substrate {name: subsName})
  MERGE (s)<-[:binds]-(e)
)
FOREACH (prodName IN enzymes[enzName].PRODUCT |
  MERGE (p:Product {name: prodName})
  MERGE (p)<-[:releases]-(e)
)
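To make the shape of the import clearer, here is a Python sketch of the same traversal. It assumes, as the query above implies, that kegg.json is an object keyed by enzyme name whose values may carry SUBSTRATE and PRODUCT lists. The function name and the sample data are hypothetical, not real entries from the file.

```python
def graph_edges(enzymes):
    """Yield (enzyme, relationship, target_label, target_name) tuples.

    Mirrors the two FOREACH clauses: one binds edge per substrate and
    one releases edge per product. A missing list yields no edges.
    """
    for enz_name, record in enzymes.items():
        for subs_name in record.get("SUBSTRATE") or []:
            yield (enz_name, "binds", "Substrate", subs_name)
        for prod_name in record.get("PRODUCT") or []:
            yield (enz_name, "releases", "Product", prod_name)


# Hypothetical sample data for illustration only.
sample = {
    "enzyme A": {"SUBSTRATE": ["s1", "s2"], "PRODUCT": ["p1"]},
    "enzyme B": {"SUBSTRATE": ["s1"]},
}
edges = list(graph_edges(sample))
```

Because the query uses MERGE, a substrate shared by several enzymes (s1 in the sample) becomes a single Substrate node with one incoming binds relationship per enzyme.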
MATCH (n) RETURN n
MATCH (n:Enzyme)
RETURN (n)-[:binds]->()
LIMIT 25
MATCH (n)
RETURN ()<-[:releases]-(n)-[:binds]->()
LIMIT 25
MATCH p=(e:Enzyme)-->()
WHERE e.ec="ec:1.2.99.7"
RETURN p
MATCH (e:Enzyme)
MATCH path = (e)-[]->(s:Substrate)
RETURN path;
BruceSorter: A CLI to help with sorting flavoenzymes and filtering out false positives