NAR-Data-Discovery

Nucleic Acids Research Data Discovery

Note: This Project has continued in a new form in another repository, known as NARTA (https://github.com/SeanFlannery/NARTA).

The work presented here was a preliminary foray to see if this would be a feasible project.

Purpose

This is an exploratory project interested in using an archive of papers from a major biology research database (NAR) as well as the Natural Language Processing model of Doc2Vec to identify trends, topics, and similarities in publications over time. We are hopeful that this analysis could generate interesting conclusions about the direction the field of Biology is headed, and that several novel visualizations may be produced.

Notebooks

Below you will find a list of notebooks that serve as the building blocks of this project. We are always looking for feedback or suggestions on how to improve any of the steps in the pipeline, and anybody is welcome to reproduce and modify all of this work if it could help you in any way.

Part 1: Web Scraping and Downloading Files

Crawls URLs from nardb.txt of article year pages contained in the NAR Issue Archive
Produces list of all article links to download
Downloads all of relevant article html files and stores them in articles directory

Part 2: Article Data Extraction and Removal of Irrelevant Pages

Processes title, abstract, introduction, and author name into original-article-data.csv
Remove articles lacking useful information (editorial editions, advertisements, or lacking needed parameters)
Places all data into complete-article-data.csv for later processing

Part 3: Data Pre-Processing

Perform input sanitization, lemmatization, removal of stopwords
Use TF-IDF weighting to determine words to remove from corpus
Output results from documents into preprocessed_data.csv

Part 4: Train our Model

Train a Doc2Vec instance and save the model's resultant embeddings in doc_embeddings.npy
Perform self-similarity "sanity checks" to ensure embedding is useful

Part 5: Clustering Techniques

Using various clustering techniques, establish clusters of different documents
All prior data, cluster numbers, along with PCA components are all stored in cluster_results.csv

Part 6: Visualizing our Data

Use cluster assignments and PCA components of document embeddings to visualize cluster groupings

Current Data (feel free to use)

The aggregate data we've accrued over time can be found within the file cluster_results.csv consisiting of:

year: year of article publication
article-link: original link to the article
local-path: local path in the articles folder where one can find page source
title: the paper or article's title
abstract: the paper's abstract (editorial articles, by their nature, lack an abstract)
authors: a list of all contributing authors to the papers
introduction: all text contained within the introduction of the article (or a similar area, we occasionally didn't have labels for the introductions)
preprocessed_data: all text after Part 3's preprocessing with words separated by spaces
pca_feature*: Principle Component Analysis values (note total explained variance is only around ~10% even when accounting for all 3 features). These features are used to represent individual articles in 2D and 3D space in Part 6.
xmeans_cluster: cluster assignment based on the XMeans algorithm
kmeans_cluster: cluster assignment based on the KMeans algorithm

Interact with Current 3D Vizualization Here

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
3D_XMeans_Plot.png		3D_XMeans_Plot.png
Article Data Extraction and Removal of Irrelevant Pages.ipynb		Article Data Extraction and Removal of Irrelevant Pages.ipynb
Clustering Techniques.ipynb		Clustering Techniques.ipynb
Data Pre-processing.ipynb		Data Pre-processing.ipynb
README.md		README.md
Train Our Model.ipynb		Train Our Model.ipynb
Visualizing our Data.ipynb		Visualizing our Data.ipynb
Web Scraping and Downloading Files.ipynb		Web Scraping and Downloading Files.ipynb
article-path-data.csv		article-path-data.csv
cluster_results.csv		cluster_results.csv
complete-article-data.csv		complete-article-data.csv
doc_embeddings.npy		doc_embeddings.npy
nardb.txt		nardb.txt
preprocessed-data.csv		preprocessed-data.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NAR-Data-Discovery

Purpose

Notebooks

Part 1: Web Scraping and Downloading Files

Part 2: Article Data Extraction and Removal of Irrelevant Pages

Part 3: Data Pre-Processing

Part 4: Train our Model

Part 5: Clustering Techniques

Part 6: Visualizing our Data

Current Data (feel free to use)

About

Releases

Packages

Languages

SeanFlannery/NAR-Data-Discovery

Folders and files

Latest commit

History

Repository files navigation

NAR-Data-Discovery

Purpose

Notebooks

Part 1: Web Scraping and Downloading Files

Part 2: Article Data Extraction and Removal of Irrelevant Pages

Part 3: Data Pre-Processing

Part 4: Train our Model

Part 5: Clustering Techniques

Part 6: Visualizing our Data

Current Data (feel free to use)

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages