CoronaWhy.org is a global volunteer organization dedicated to driving actionable insights into significant world issues using industry-leading data science, artificial intelligence and knowledge sharing. CoronaWhy was founded during the 2020 COVID-19 crisis, following a White House call to help extract valuable data from more than 50,000 coronavirus-related scholarly articles, dating back decades. Currently at over 1000 volunteers, CoronaWhy is composed of data scientists, doctors, epidemiologists, students, and various subject matter experts on everything from technology and engineering to communications and program management.
Read about our creations before you start.
The infrastructure can be set up locally and exposed as a number of CoronaWhy services using the Traefik reverse proxy.
You need to specify the value of "traefikhost" before you start deploying the infrastructure:
export traefikhost=lab.coronawhy.org
or export traefikhost=localhost
Download all CoronaWhy notebooks:
./build-coronawhy-infra.sh
Then simply run:
docker-compose up
After that, the following CoronaWhy services will be exposed:
- Airflow http://airflow.apps.coronawhy.org (takes some time to launch)
- Whoami http://whoami.apps.coronawhy.org (simple webserver returning host stats)
- CoronaWhy API http://api.apps.coronawhy.org (FastAPI with Swagger)
- Elasticsearch http://es.apps.coronawhy.org
- SPARQL http://sparql.apps.coronawhy.org (Virtuoso as a service)
- INDRA http://indra.apps.coronawhy.org (INDRA REST API https://indra.readthedocs.io/en/latest/rest_api.html)
- Grlc http://grlc.apps.coronawhy.org (SPARQL queries into RESTful APIs convertor)
- Doccano http://doccano.apps.coronawhy.org
- Jupyter http://jupyter.apps.coronawhy.org (look for token in the logs)
- Portainer http://portainer.apps.coronawhy.org
- Traefik dashboard http://apps.coronawhy.org:8080 (insecure setup)
- Kibana http://kibana.apps.coronawhy.org
Warning: in this example all infrastructure components are deployed on *.apps.coronawhy.org; for a local deployment you can use *.localhost (doccano.localhost, etc.) or *.lab.coronawhy.org.
The CoronaWhy community is building an Infrastructure for Open Science that can be distributed, scaled up in the future, and reused for other important tasks such as cancer research. The vision of the community is to build it entirely from open-source components, publish all data in a FAIR way, and keep all available provenance information.
We use Harvard Data Commons as a foundation that allows all CoronaWhy members to work together. We build different services and run experimental Labs, and our data infrastructure is a common, reusable place where all research groups share the same resources. It is built on top of the Dataverse data repository developed by Harvard University and is available at datasets.coronawhy.org.
CoronaWhy also maintains various APIs that produce aggregated COVID-19 datasets. You can access the data by querying the CoronaWhy Data API with country codes, for example FRA for France: http://api.apps.coronawhy.org/country/FRA
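For example, the aggregated data for France can be pulled with a few lines of Python. This is a minimal sketch; since the response schema is not documented here, it only inspects the raw JSON instead of assuming field names:

```python
# Minimal sketch: fetch aggregated COVID-19 data for France (ISO code FRA)
# from the CoronaWhy Data API. The response schema is not documented here,
# so this snippet only inspects the raw JSON instead of assuming field names.
import requests

response = requests.get("http://api.apps.coronawhy.org/country/FRA", timeout=30)
response.raise_for_status()

data = response.json()
print(type(data))        # inspect the top-level structure first
print(str(data)[:500])   # preview the payload
```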
- Task-Risk helps to identify risk factors that can increase the chance of being infected, or affect the severity or survival outcome of the infection
- Task-Ties explores transmission, incubation, and environmental stability
- Named Entity Recognition across the entire corpus of CORD-19 papers with full text
- Match Clinical Trials allows exploration of the results from the COVID-19 International Clinical Trials dataset
- COVID-19 Literature Visualization helps to explore the data behind the AI-powered literature review
More detailed information about every dashboard is published on Kaggle.
Download the COVID-19 Open Research Dataset Challenge (CORD-19) dataset from Kaggle:
bash ./download_dataset.sh
Start the NLP pipeline manually by executing:
docker run -v /data/distrib/covid-19-infrastructure/data/original:/data -it coronawhy/pipeline /bin/bash
or automatically with
docker-compose -f ./docker-compose-pipeline.yml up
Follow all updates on our YouTube channel and the CoronaWhy GitHub.
How to access Elasticsearch and Dataverse, notebook
CoronaWhy Elasticsearch Tutorial notebook
How to Create Knowledge Graph, notebook
Dataverse Colab Connect, notebook
GitHub dataset sync with Dataverse, notebook
You can connect your notebooks to any of the services listed below; all services coming from CoronaWhy Labs have experimental status. Join the fight against COVID-19 if you want to help us!
Dataverse is deployed as a data service at https://datasets.coronawhy.org. Dataverse is an open-source web application to share, preserve, cite, explore, and analyze research data. It facilitates making data available to others.
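Because it is a standard Dataverse instance, the regular Dataverse Search API should work against it. A minimal sketch, where the search term and printed fields are only illustrative:

```python
# Minimal sketch: search the CoronaWhy Dataverse instance through the standard
# Dataverse Search API (https://guides.dataverse.org/en/latest/api/search.html).
import requests

resp = requests.get(
    "https://datasets.coronawhy.org/api/search",
    params={"q": "CORD-19", "type": "dataset", "per_page": 5},
    timeout=30,
)
resp.raise_for_status()

for item in resp.json()["data"]["items"]:
    # "name" and "global_id" are standard fields for dataset results
    print(item.get("name"), "-", item.get("global_id"))
```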
CoronaWhy Elasticsearch has sentence-level CORD-19 indexes and is available at CoronaWhy Search.
Available indexes:
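As a minimal sketch, a sentence-level index can be queried from Python with a plain _search request. The host URL, index name, and field name below are placeholders rather than confirmed values, so check the actual index layout first (for example via GET /_cat/indices):

```python
# Minimal sketch of a sentence-level full-text query against CoronaWhy
# Elasticsearch. ES_HOST, INDEX, and the "sentence" field are placeholders.
import requests

ES_HOST = "http://es.apps.coronawhy.org"   # replace with the actual endpoint
INDEX = "cord19-sentences"                 # hypothetical index name

query = {
    "size": 5,
    "query": {"match": {"sentence": "incubation period"}},  # illustrative field name
}

resp = requests.post(f"{ES_HOST}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], str(hit["_source"])[:200])
```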
The MongoDB service is deployed on mongodb.coronawhy.org and is available from CoronaWhy Labs Virtual Machines. Please contact our administrators if you want to use it.
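Once access has been granted, a connection from a Labs machine with pymongo might look like the sketch below; the credentials and the default port are placeholders, not actual values:

```python
# Minimal sketch: connect to the CoronaWhy MongoDB service with pymongo.
# USERNAME/PASSWORD and the default port 27017 are placeholders; real
# connection details come from the CoronaWhy administrators.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://USERNAME:PASSWORD@mongodb.coronawhy.org:27017/",
    serverSelectionTimeoutMS=10000,
)
print(client.list_database_names())  # databases your account can see
```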
Our Hypothesis annotation service is running on hypothesis.labs.coronawhy.org and allows you to manually annotate CORD-19 papers. Please try our Hypothesis Demo if you're interested.
We provide Virtuoso as a service with a public SPARQL endpoint: an HTTP-based query service that operates on entity relationship types (relations) represented as RDF sentence collections, using the SPARQL Query Language. https://virtuoso.openlinksw.com
You can run a simple SPARQL query to get an overview of the triples in the CoronaWhy Knowledge Graph.
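As a sketch, such a query can also be sent over HTTP from Python; the /sparql endpoint path below is the common Virtuoso default and is assumed here rather than taken from the deployment:

```python
# Minimal sketch: sample a handful of triples from the CoronaWhy Knowledge
# Graph via the Virtuoso SPARQL endpoint. The endpoint path is an assumption.
import requests

ENDPOINT = "http://sparql.apps.coronawhy.org/sparql"  # assumed endpoint path
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```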
Kibana is deployed as a community service connected to CoronaWhy Elasticsearch at https://kibana.labs.coronawhy.org. It allows you to visualize Elasticsearch data and navigate the Elastic Stack, so you can do anything from tracking query load to understanding the way requests flow through your apps. https://www.elastic.co/kibana
BEL Commons 3.0 is available as a service at https://bel.labs.coronawhy.org
An environment for curating, validating, and exploring knowledge assemblies encoded in Biological Expression Language (BEL) to support elucidating disease-specific, mechanistic insight.
You can watch the introduction video and read Corona BEL Tutorial if you want to know more.
INDRA is deployed as a service at https://indra.labs.coronawhy.org/indra.
INDRA (Integrated Network and Dynamical Reasoning Assembler) generates executable models of pathway dynamics from natural language (using the TRIPS and REACH parsers) and from BioPAX and BEL sources (including the Pathway Commons database and NDEx).
You can quickly test the service by running:
curl -X POST "https://indra.labs.coronawhy.org/bel/process_pybel_neighborhood" -H "accept: application/json" -H "content-type: application/json" -d "{ \"genes\": [ \"MAP2K1\" ]}" -l -o test_coronawhy_map2k1.json
Geoparser is available as a service at https://geoparser.labs.coronawhy.org
The Geoparser is a software tool that can process information from any type of file, extract geographic coordinates, and visualize locations on a map. Users who are interested in seeing a geographical representation of information or data can choose to search for locations using the Geoparser, through a search index or by uploading files from their computer. https://github.com/nasa-jpl-memex/GeoParser
Tabula allows you to extract data from PDF files into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. We deployed it as a CoronaWhy service available to all community members. More information is available on the Tabula website.
We use Teamchatviz to explore how communication works in our distributed team and to learn how communication shapes culture in the CoronaWhy community. https://moovel.github.io/teamchatviz/
We are working on deploying a Neo4j graph database.
I’m an AI researcher and here’s how I fight corona by Artur Kiulian
Exploration of Document Clustering with SPECTER Embeddings by Brandon Eychaner
COVID-19 Research Papers Geolocation by Ishan Sharma