Nucleic Acids Research Data Discovery
Note: This project has continued in a new form in another repository, NARTA (https://github.com/SeanFlannery/NARTA).
The work presented here was a preliminary foray to test whether the project was feasible.
This is an exploratory project that uses an archive of papers from a major biology research database (NAR), together with the Doc2Vec natural language processing model, to identify trends, topics, and similarities in publications over time. We hope this analysis can generate interesting conclusions about the direction in which the field of biology is headed, and that it may produce several novel visualizations.
Below you will find a list of notebooks that serve as the building blocks of this project. We are always looking for feedback or suggestions on how to improve any of the steps in the pipeline, and anybody is welcome to reproduce and modify all of this work if it could help you in any way.
- Crawls the URLs listed in nardb.txt, which point to the article year pages in the NAR Issue Archive
- Produces a list of all article links to download
- Downloads all relevant article HTML files and stores them in the articles directory
- Processes title, abstract, introduction, and author names into original-article-data.csv
- Removes articles lacking useful information (editorial editions, advertisements, or articles missing required fields)
- Places all data into complete-article-data.csv for later processing
- Perform input sanitization, lemmatization, removal of stopwords
- Use TF-IDF weighting to determine words to remove from corpus
- Output results from documents into preprocessed_data.csv
- Train a Doc2Vec instance and save the model's resulting embeddings in doc_embeddings.npy
- Perform self-similarity "sanity checks" to ensure embedding is useful
- Using various clustering techniques, establish clusters of different documents
- All prior data, cluster numbers, and PCA components are stored in cluster_results.csv
- Use cluster assignments and PCA components of document embeddings to visualize cluster groupings
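As an illustration of the self-similarity "sanity check" mentioned in the list above, here is a minimal NumPy sketch. It assumes the check works by re-inferring a vector for each training document (for example with the trained Doc2Vec model's inference step) and confirming that the closest stored embedding is the document's own. The function name `self_similarity_ranks` and the toy data are hypothetical and are not taken from the project's notebooks.

```python
import numpy as np


def self_similarity_ranks(stored, inferred):
    """Return, for each document i, how many stored vectors are more
    similar (by cosine similarity) to inferred[i] than stored[i] is.
    A rank of 0 means the document is its own nearest neighbour."""
    stored = np.asarray(stored, dtype=float)
    inferred = np.asarray(inferred, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    stored_n = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    inferred_n = inferred / np.linalg.norm(inferred, axis=1, keepdims=True)
    sims = inferred_n @ stored_n.T      # sims[i, j] = cos(inferred[i], stored[j])
    own = np.diag(sims)                 # similarity of each document to itself
    return (sims > own[:, None]).sum(axis=1)


# Toy stand-in for real embeddings (assumption: the real ones would be
# loaded from doc_embeddings.npy and re-inferred from the trained model).
rng = np.random.default_rng(0)
stored = rng.normal(size=(50, 16))
inferred = stored + 0.05 * rng.normal(size=stored.shape)

ranks = self_similarity_ranks(stored, inferred)
top1_rate = float((ranks == 0).mean())
print(f"fraction of documents ranked most similar to themselves: {top1_rate:.2f}")
```

A high fraction suggests the embedding space is at least self-consistent; a low one would point to undertrained vectors or overly aggressive preprocessing.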
The aggregate data we've accrued over time can be found in the file cluster_results.csv, consisting of:
- year: year of article publication
- article-link: original link to the article
- local-path: local path in the articles folder where one can find the page source
- title: the paper or article's title
- abstract: the paper's abstract (editorial articles, by their nature, lack an abstract)
- authors: a list of all contributing authors to the paper
- introduction: all text contained within the introduction of the article (or a similar area, since we occasionally didn't have labels for the introductions)
- preprocessed_data: all text after Part 3's preprocessing, with words separated by spaces
- pca_feature*: Principal Component Analysis values (note that total explained variance is only around 10%, even when accounting for all 3 features). These features are used to represent individual articles in 2D and 3D space in Part 6.
- xmeans_cluster: cluster assignment based on the XMeans algorithm
- kmeans_cluster: cluster assignment based on the KMeans algorithm
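
The explained-variance figure quoted for the pca_feature* columns can be checked with a short calculation. The sketch below computes PCA through a singular value decomposition in NumPy. It is not necessarily how the project's notebooks compute it, and the helper name `pca_project` and the random stand-in data are assumptions made for illustration.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project rows of X onto the top principal components and report the
    fraction of total variance that each kept component explains."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # and squared singular values are proportional to explained variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = centered @ vt[:n_components].T
    variance = s ** 2
    explained_ratio = variance[:n_components] / variance.sum()
    return components, explained_ratio


# Random stand-in for the document embeddings (assumption: the real array
# would be loaded with np.load("doc_embeddings.npy")).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 50))

features, ratio = pca_project(embeddings, n_components=3)
print("shape of PCA features:", features.shape)
print("total explained variance of 3 components:", ratio.sum())
```

When the summed ratio is small, as reported for this dataset, 2D and 3D plots of the PCA features show only a thin slice of the structure in the embeddings, so apparent overlap between clusters in the plot does not necessarily mean the clusters overlap in the full embedding space.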
Interact with the current 3D visualization here