This project is a community effort to build a Neo4j Knowledge Graph (KG) that integrates heterogeneous biomedical and environmental datasets to help researchers analyze the interplay between host, pathogen, the environment, and COVID-19.
Location Subgraph: This subgraph represents the geographic hierarchy from the world to the city level (population > 1000), as well as the PostalCode (US ZIP) and US Census Tract levels. Each geographic node also has a Location label (not shown), so that locations can be found without specifying a level in the geographic hierarchy.
Epidemiology Subgraph: This subgraph represents COVID-19 cases including information about viral strains, and the pathogen and host organism. Cases and Strains are linked to the locations where they were reported and found, respectively.
Biology Subgraph: This subgraph represents organism, genome, chromosome, gene, variant, protein, protein structure, protein domain, protein family, pathogen-host protein-protein interactions, and links to publications.
Population Characteristics Subgraph: This subgraph represents data from the American Community Survey 2018 5-year estimates. Selected population characteristics that may be risk factors for COVID-19 infections have been included. These data are currently available at three geographic levels: US Counties (Admin2), US ZIP Codes (PostalCode), and US Census Tract (Tract).
Note: this KG is a work in progress and changes frequently.
The Knowledge Graph is updated daily approximately between 07:00 and 09:00 UTC.
View of Neo4j Browser showing the result of a query about interactions of the Spike glycoprotein with human host proteins and related publications in PubMedCentral.
You can browse the Knowledge Graph here (click the launch button and follow the instructions below)
Full-text queries enable a wide range of search options including exact phrase queries, fuzzy and wildcard queries, range queries, regular expression queries, and use of boolean expressions (see tutorial on FulltextQuery).
The KG can be searched by the following full-text indices:
bioentities
Organism, Genome, Chromosome, Gene, GeneName, Protein, ProteinName, ProteinDomain, ProteinFamily, Structure, Chain, Outbreak, Strain, Variant, Publication
bioids
keyword (exact) query for bioentity identifiers (e.g., id, taxonomyId, accession, proId, genomeAccession, doi, variantType, variantConsequence)
sequences
full-text and regular expression query for protein sequences
locations
UNRegion, UNSubRegion, UNIntermediateRegion, Country, Admin1, Admin2, USRegion, USDivision, City, PostalCode, Tract, CruiseShip
geoids
keyword (exact) query for geographic identifiers (e.g., zip codes, fips codes, country iso codes)
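As a rough aid for choosing an index, the two label-based indices listed above can be written down as a lookup table. This is only a sketch: the names `FULLTEXT_INDEXES` and `index_for_label` are hypothetical and not part of the project, and the `bioids`, `geoids`, and `sequences` indices are left out because the list above describes them by what they search, not by node label.

```python
# Node labels covered by each label-based full-text index, copied from the list above.
# FULLTEXT_INDEXES and index_for_label are hypothetical helper names.
FULLTEXT_INDEXES = {
    "bioentities": [
        "Organism", "Genome", "Chromosome", "Gene", "GeneName", "Protein",
        "ProteinName", "ProteinDomain", "ProteinFamily", "Structure", "Chain",
        "Outbreak", "Strain", "Variant", "Publication",
    ],
    "locations": [
        "UNRegion", "UNSubRegion", "UNIntermediateRegion", "Country", "Admin1",
        "Admin2", "USRegion", "USDivision", "City", "PostalCode", "Tract",
        "CruiseShip",
    ],
}


def index_for_label(label):
    """Return the name of the full-text index that covers a node label."""
    for index_name, labels in FULLTEXT_INDEXES.items():
        if label in labels:
            return index_name
    raise KeyError(f"No full-text index listed for label {label!r}")


print(index_for_label("Protein"))  # bioentities
print(index_for_label("City"))     # locations
```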
Full-text queries have the following format:
CALL db.index.fulltext.queryNodes('<type of entity>', '<text query>') YIELD node, score
The queries return the node and score for each match (higher scores indicate closer matches).
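When running such queries from Python, it can help to build the Cypher text once and pass the index name and search text as parameters. The sketch below assumes nothing about the project's own notebooks: `fulltext_query` and `LUCENE_EXAMPLES` are hypothetical names, and the search strings only illustrate standard Lucene query syntax for the query types mentioned above.

```python
# Illustrative Lucene search strings for the query types mentioned above.
LUCENE_EXAMPLES = {
    "exact phrase": '"spike glycoprotein"',
    "fuzzy": "spike~",
    "wildcard": "spik*",
    "boolean": "spike AND glycoprotein",
}


def fulltext_query(index, text, labels=None, limit=None):
    """Build a parameterized full-text query and its parameter dictionary.

    index:  name of a full-text index, e.g. "bioentities"
    text:   Lucene search string
    labels: optional list of node labels to keep (e.g. ["Protein"])
    limit:  optional maximum number of rows
    """
    lines = ["CALL db.index.fulltext.queryNodes($index, $text) YIELD node, score"]
    params = {"index": index, "text": text}
    if labels:
        # Keep only nodes that carry at least one of the requested labels.
        lines.append("WHERE any(label IN labels(node) WHERE label IN $labels)")
        params["labels"] = list(labels)
    lines.append("RETURN node.name AS name, labels(node) AS labels, score")
    lines.append("ORDER BY score DESC")
    if limit is not None:
        lines.append("LIMIT $limit")
        params["limit"] = limit
    return "\n".join(lines), params


cypher, params = fulltext_query("bioentities", LUCENE_EXAMPLES["fuzzy"], labels=["Protein"], limit=10)
print(cypher)
print(params)
```

The returned pair can be handed to whatever client is in use; with the official Neo4j Python driver, for example, this would typically be `session.run(cypher, params)`.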
Query: (copy and paste into Neo4j browser)
CALL db.index.fulltext.queryNodes("bioentities", "spike") YIELD node
WHERE 'Protein' IN labels(node) // only return nodes with the label Protein
RETURN node
Result:
The full-text query matches several Spike proteins from several coronaviruses. The SARS-CoV-2 Spike glycoprotein (uniprot:P0DTC2) is highlighted in the center with its four cleavage products: Spike glycoprotein without signal peptide (uniprot.chain:PRO_0000449646), Spike protein S1 (uniprot.chain:PRO_0000449647), Spike protein S2 (uniprot.chain:PRO_0000449648), and Spike protein S2' (uniprot.chain:PRO_0000449649), linked by a CLEAVED_BY relationship.
The following query returns the names of the matched bioentities and the labels of the nodes (e.g., Protein, ProteinName) sorted by the match score in descending order.
Query:
CALL db.index.fulltext.queryNodes("bioentities", "spike") YIELD node, score
RETURN node.name, labels(node), score
Result:
Specific Nodes and Relationships in the KG can be searched using the Cypher query language.
Query: (limited to 10 hits)
MATCH (s:Strain)-[:FOUND_IN]->(l:Location{name: 'Houston'}) RETURN s, l LIMIT 10
This subgraph shows viral strains (green) of the SARS-CoV-2 virus carried by human hosts in Houston (organisms in gray). The strains have several variants (e.g., mutations) (red) in common. Details of the highlighted variant are shown at the bottom. This variant is a missense mutation in the S gene (S:c.1841gAt>gGt): the base "A" (adenine) found in the Wuhan-Hu-1 reference genome NC_045512 was mutated to a "G" (guanine) at position 23403, changing the amino acid at position 614 of the encoded Spike glycoprotein (QHD43416) from "D" (aspartic acid) to "G" (glycine) (QHD43416.1:p.614D>G).
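The numbers in this variant description are related by simple arithmetic, which the sketch below works through. It assumes that the S coding sequence starts at position 21563 of the reference genome (a value taken from the public genome annotation, not from this README). The helper `locate_in_cds` and the two-entry codon table are hypothetical and cover only this example.

```python
# Assumption: first base of the S coding sequence in the reference genome (1-based).
S_GENE_START = 21563

# Only the two codons needed for this example: GAT = Asp (D), GGT = Gly (G).
CODONS = {"GAT": "D", "GGT": "G"}


def locate_in_cds(genome_pos, cds_start):
    """Map a 1-based genome position to (CDS position, codon number, base within codon)."""
    cds_pos = genome_pos - cds_start + 1
    codon_number = (cds_pos - 1) // 3 + 1
    base_in_codon = (cds_pos - 1) % 3 + 1
    return cds_pos, codon_number, base_in_codon


cds_pos, codon_number, base_in_codon = locate_in_cds(23403, S_GENE_START)
print(cds_pos, codon_number, base_in_codon)  # 1841 614 2
print(CODONS["GAT"], "->", CODONS["GGT"])    # D -> G
```

Genome position 23403 is therefore base 1841 of the S gene, the second base of codon 614, which matches the capitalized middle letter in gAt>gGt and the D-to-G change at residue 614.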
Query:
MATCH (o:Outbreak{id: "COVID-19"})<-[:RELATED_TO]-(c:Cases{date: date("2020-08-31"), source: 'JHU'})-[:REPORTED_IN]->(a:Admin2)-[:IN]->(a1:Admin1)
RETURN a1.name as state, sum(c.cases) as cases, sum(c.deaths) as deaths
ORDER BY cases DESC;
Result:
Note, some cases in the COVID-19 Data Repository by Johns Hopkins University cannot be mapped to a county or state location (e.g., correctional facilities, missing location data). Therefore, the results of this query will underreport the actual number of cases.
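To make the aggregation and the underreporting caveat concrete, here is a small Python sketch of what the query computes: county-level records are summed per state, and records without a mapped state drop out of the totals. The records are made-up toy values, not data from the Knowledge Graph, and `sum_by_state` is a hypothetical helper.

```python
from collections import defaultdict


def sum_by_state(records):
    """Sum cases and deaths per state; count cases that have no mapped state."""
    totals = defaultdict(lambda: {"cases": 0, "deaths": 0})
    unmapped_cases = 0
    for rec in records:
        state = rec.get("state")
        if state is None:
            # Mirrors the query: a case record without a county/state path is not matched.
            unmapped_cases += rec["cases"]
            continue
        totals[state]["cases"] += rec["cases"]
        totals[state]["deaths"] += rec["deaths"]
    # Sort states by case count, highest first (ORDER BY cases DESC).
    ranked = sorted(totals.items(), key=lambda item: item[1]["cases"], reverse=True)
    return ranked, unmapped_cases


# Toy records for illustration only.
records = [
    {"county": "A", "state": "X", "cases": 10, "deaths": 1},
    {"county": "B", "state": "X", "cases": 5, "deaths": 0},
    {"county": "C", "state": "Y", "cases": 20, "deaths": 2},
    {"county": None, "state": None, "cases": 7, "deaths": 0},
]

ranked, unmapped_cases = sum_by_state(records)
for state, total in ranked:
    print(state, total["cases"], total["deaths"])
print("cases not mapped to a state:", unmapped_cases)
```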
Cypher queries can be run in Jupyter Notebooks to enable reproducible data analyses and visualizations.
You can run the following Jupyter Notebooks in your web browser:
NOTE: Authentication is now required to launch Binder! Sign in to GitHub from your browser, then click the launch binder badge below to launch Jupyter Lab.
**Pangeo Binder appears to be unsupported and is currently down. The Binder launch may not work.**
Once Jupyter Lab launches, navigate to the notebooks/queries and notebooks/analyses directories and run the following notebooks:
Notebook | Description |
---|---|
FulltextQuery | Runs example fulltext queries |
CaseCounts | Runs example queries for case counts |
Locations | Runs example queries for locations |
Demographics | Runs example queries for demographics data from the American Community Survey |
SocialCharacteristics | Runs example queries for social characteristics from the American Community Survey |
EconomicCharacteristics | Runs example queries for economic characteristics from the American Community Survey |
Housing | Runs example queries for housing characteristics from the American Community Survey |
Bioentities | Runs example queries for bioentities |
EmergingStrains | Analyze emerging SARS-CoV-2 Strains |
EmergingStrainsInLiterature | Analyze emerging SARS-CoV-2 Strains based on mentions in the literature |
StrainB.1.1.7 | Analyze B.1.1.7 Strain |
AnalyzeVariantsSpikeGlycoprotein | Analyze SARS-CoV-2 Spike Glycoprotein Variants |
Coronavirus3DStructures | Inventory of coronavirus 3D protein structures |
GraphVisualization | Demo of graph visualization with Cytoscape |
MapMutationsTo3D | Map mutations from SARS-CoV-2 strains to 3D Structures |
RiskFactorsByStateCounty | Explore Risk Factors for COVID-19 for Counties in US States |
RiskFactorsSanDiegoCounty | Explore Risk Factors for COVID-19 for San Diego County |
CovidRatesByStates | Explore COVID-19 confirmed cases and death rates for states in a selected country |
... | add examples here ... |
The COVID-19-Net Knowledge Graph is created from publicly available resources, including databases, files, and web services. A reproducible workflow, defined in this repository, is used to run a daily update of the knowledge graph. The Jupyter notebooks listed in the table below download, clean, standardize, and integrate data in the form of .csv files for ingestion into the Knowledge Graph. The prepared data files are saved in the NEO4J_HOME/import directory and cached intermediate files are saved in the NEO4J_HOME/import/cache directory. These notebooks are run daily at 07:00 UTC in batch using Papermill with the update script to download the latest data and update the Knowledge Graph.
Notebook | Description |
---|---|
00b-NCBITaxonomy | Downloads the NCBI taxonomy for a subset of organisms |
00b-PANGOLineage | Downloads the PANGO lineage designations for SARS-CoV-2 |
00e-GeoNamesCountry | Downloads country information from GeoNames.org |
00f-GeoNamesAdmin1 | Downloads first administrative divisions (State, Province, Municipality) information from GeoNames.org |
00g-GeoNamesAdmin2 | Downloads second administrative divisions (Counties in the US) information from GeoNames.org |
00h-GeoNamesCity | Downloads city information (population > 1000) from GeoNames.org |
00i-USCensusRegionDivisionState2017 | Downloads US regions, divisions, and assigns state FIPS codes from the US Census Bureau |
00j-USCensusCountyCity2017 | Downloads US County FIPS codes from the US Census Bureau |
00k-UNRegion | Downloads UN geographic regions, subregions, and intermediate region information from United Nations |
00m-USHUDCrosswalk | Downloads mappings of US Census tracts to US Postal Service ZIP codes and US Counties |
00n-GeoNamesData | Downloads longitude, latitude, elevation, and population data from GeoNames.org |
00o-GeoNamesPostalCode | Downloads US zip code, place name, latitude, longitude data from GeoNames.org |
01a-UniProtGene | Downloads chromosome and gene information from UniProt |
01a-UniProtProtein | Downloads protein information from UniProt |
01b-NCBIGeneProtein | Downloads gene and protein information from NCBI |
01c-CNCBStrain | Downloads SARS-CoV-2 viral strain metadata from CNCB (China National Center for Bioinformation) |
01c-CNCBVariation | Downloads variant data from CNCB (China National Center for Bioinformation) |
01d-Nextstrain | Downloads the SARS-CoV-2 strain metadata from Nextstrain |
01e-ProteinProteinInteraction | Downloads SARS-CoV-2 - human protein interaction data from IntAct |
01f-PDBStructure | Downloads 3D protein structures from the Protein Data Bank |
01g-PfamDomain | Downloads mappings between PDB protein chains and Pfam domains |
01h-CORDLineages | Maps publications and preprints in the CORD-19 data set to PANGO lineages |
01h-PublicationLink | Downloads mappings between datasets and publications indexed by PubMed Central (PMC) and Preprints (PPR) and PubMed (PM) |
02a-JHUCases | Downloads cumulative confirmed cases and deaths from the COVID-19 Data Repository by Johns Hopkins University |
02a-JHUCasesLocation | Standardizes location data for the COVID-19 Data Repository by Johns Hopkins University |
02c-SDHHSACases | Downloads cumulative confirmed COVID-19 cases from the County of San Diego, Health and Human Services Agency |
03a-USCensusDP02Education | Downloads social characteristics (DP02) from the American Community Survey 5-Year Data 2018 |
03a-USCensusDP02Computers | Downloads social characteristics (DP02) from the American Community Survey 5-Year Data 2018 |
03a-USCensusDP03Commuting | Downloads economic characteristics (DP03) from the American Community Survey 5-Year Data 2018 |
03a-USCensusDP03Employment | Downloads economic characteristics (DP03) from the American Community Survey 5-Year Data 2018 |
03a-USCensusDP03HealthInsurance | Downloads economic characteristics (DP03) from the American Community Survey 5-Year Data 2018 |
03a-USCensusDP03Income | Downloads economic characteristics (DP03) from the American Community Survey 5-Year Data 2018 |
03a-USCensusDP03Occupation | Downloads economic characteristics (DP03) from the American Community Survey 5-Year Data 2018 |
03a-USCensusDP03Poverty | Downloads economic characteristics (DP03) from the American Community Survey 5-Year Data 2018 |
03a-USCensusDP04 | Downloads housing (DP04) from the American Community Survey 5-Year Data 2018 |
03a-USCensusDP05 | Downloads demographic data estimates (DP05) from the American Community Survey 5-Year Data 2018 |
... | Future notebooks that add new data to the knowledge graph |
1. Fork this project
A fork is a copy of a repository in your GitHub account. Forking a repository allows you to freely experiment with changes without affecting the original project.
In the top-right corner of this GitHub page, click Fork.
Then download all materials to your computer by cloning your copy of the repository. To clone the repository from a Terminal window or the Anaconda prompt (Windows), run the following commands, replacing your-user-name with your GitHub user name:
git clone https://github.com/your-user-name/covid-19-community.git
cd covid-19-community
2. Create a conda environment
The file environment.yml specifies the Python version and all packages required by the tutorial.
conda env create -f environment.yml
Activate the conda environment:
conda activate covid-19-community
3. Launch Jupyter Lab
jupyter lab
Navigate to the notebooks/queries directory to run the example Jupyter Notebooks and to the notebooks/analyses directory to run analyses.
Note: the following steps have been implemented for macOS and Linux only.
Some steps will take a very long time, e.g., notebook 01c-CNCBStrain may take more than 12 hours to run the first time.
Follow steps 1-3 above.
4. Install Neo4j Desktop
Then, launch the Neo4j Browser, create an empty database, set the password to "neo4jbinder", and close the database.
5. Set Environment Variable
Add the environment variable NEO4J_HOME, set to the path of the Neo4j database installation, to your .bash_profile file, e.g.:
export NEO4J_HOME="/Users/username/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-.../installation-4.0.3"
Add the environment variable NEO4J_IMPORT, set to the path of the Neo4j database import directory, to your .bash_profile file, e.g.:
export NEO4J_IMPORT="/Users/username/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-.../installation-4.0.3/import"
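Before running the data preparation notebooks, it may be worth confirming that both variables are visible to Python and point to existing directories. The following is a minimal sketch; `check_neo4j_env` is a hypothetical helper and not part of this repository.

```python
import os
from pathlib import Path


def check_neo4j_env(env=None):
    """Return the NEO4J_HOME and NEO4J_IMPORT paths, or raise if one is unusable.

    env: mapping to read the variables from (defaults to os.environ).
    """
    env = os.environ if env is None else env
    paths = {}
    for name in ("NEO4J_HOME", "NEO4J_IMPORT"):
        value = env.get(name)
        if not value:
            raise RuntimeError(f"{name} is not set")
        path = Path(value)
        if not path.is_dir():
            raise FileNotFoundError(f"{name} points to a missing directory: {path}")
        paths[name] = path
    return paths
```

Calling `check_neo4j_env()` in a new terminal session, after reloading .bash_profile, should return both paths. An exception suggests that a variable was not exported or that a path contains a typo.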
6. Run Data Download Notebooks
Start Jupyter Lab.
jupyter lab
Navigate to the notebooks/dataprep/ directory and run all notebooks in alphabetical order to download, clean, standardize, and save the data in the NEO4J_HOME/import directory for ingestion into the Neo4j database.
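Running the notebooks one by one in Jupyter Lab works, but the same alphabetical pass can also be scripted. The sketch below is one possible way to do this with Papermill's `execute_notebook` function. It is not the repository's update script; `run_notebooks` and the `execute` parameter are hypothetical names introduced here.

```python
from pathlib import Path


def run_notebooks(notebook_dir, output_dir, execute=None):
    """Execute every .ipynb file in notebook_dir in alphabetical order.

    execute: callable taking (input_path, output_path); defaults to
             papermill.execute_notebook.
    Returns the list of notebook file names in the order they were run.
    """
    if execute is None:
        import papermill  # imported lazily so the function can be tried without it
        execute = papermill.execute_notebook
    notebook_dir, output_dir = Path(notebook_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for notebook in sorted(notebook_dir.glob("*.ipynb")):
        # Each executed copy is written to output_dir under the same file name.
        execute(str(notebook), str(output_dir / notebook.name))
        executed.append(notebook.name)
    return executed
```

A call such as `run_notebooks("notebooks/dataprep", "executed")` would process the notebooks in the order given by their numeric prefixes. The output directory name `executed` is only an example.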
7. Upload Data into a Local Neo4j Database
After all data files have been created in step 6, run notebooks/local/2-CreateKGLocal.ipynb to import the data into your local Neo4j database. Make sure the Neo4j Browser is closed before running the database import!
8. Browse local KG in Neo4j Browser
After step 7 has completed, start the database in the Neo4j Browser to interactively explore the KG or run local queries.
- File an issue to discuss your idea so we can coordinate efforts
- Help with specific issues
- Suggest publicly accessible data sets
- Add Jupyter Notebooks with data analyses, maps, and visualizations
- Report bugs or issues
Peter W. Rose, David Valentine, Ilya Zaslavsky, COVID-19-Net: Integrating Health, Pathogen and Environmental Data into a Knowledge Graph for Case Tracking, Analysis, and Forecasting. Available online: https://github.com/covid-19-net/covid-19-community (2020).
Please also cite the data providers.
The schema below shows how data sources are integrated into the nodes of the Knowledge Graph.
Neo4j provided technical support and organized the community development: "GraphHackers, Let’s Unite to Help Save the World — Graphs4Good 2020".
Students of the UCSD Spatial Data Science course DSC-198: EXPLORING COVID-19 PANDEMIC WITH DATA SCIENCE
Contributors: Kaushik Ganapathy, Braden Riggs, Eric Yu
Alexander Din, U.S. Department of Housing and Urban Development, for help with HUD Crosswalk Files.
Project KONQUER team members at UC San Diego and UTHealth at Houston.
Project Pangeo hosts a Binder instance used to launch Jupyter Notebooks on the web. Pangeo is supported, in part, by the National Science Foundation (NSF) and the National Aeronautics and Space Administration (NASA). Google provided compute credits on Google Compute Engine.
Development of this prototype is in part supported by the National Science Foundation under Award Numbers:
NSF Convergence Accelerator Phase I (RAISE): Knowledge Open Network Queries for Research (KONQUER) (1937136)
NSF RAPID: COVID-19-Net: Integrating Health, Pathogen and Environmental Data into a Knowledge Graph for Case Tracking, Analysis, and Forecasting (2028411)