My foray into text mining data from PubMed.
## To Do:
- MeSH headings and keywords will need to be inspected for run-together words
- Use n-grams (e.g. "stem cell" instead of "stem" and "cell") -- bigram tokenizer initiated (see the sketch after this list)
- Create dictionary of relevant terms
- Fix stemCompletion2 code -- a package update may have broken it
- Add topic model river plots to shiny
- Add Grant-PMID network visualization and analytics
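A common way to get n-grams out of tm is to pass a custom tokenizer to `TermDocumentMatrix`; here is a minimal sketch using the NLP package's `ngrams()` and `words()` helpers. The `corpus` object is assumed to be an already-cleaned tm corpus.

```r
library(tm)
library(NLP)

# Pair adjacent words so phrases like "stem cell" survive as single
# terms instead of being split into "stem" and "cell".
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
         use.names = FALSE)
}

# `corpus` is an assumed, already-preprocessed tm corpus.
tdm_bigrams <- TermDocumentMatrix(corpus,
                                  control = list(tokenize = BigramTokenizer))
```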
## Usage:
The script takes PubMed data in XML form and extracts the abstract for each citation. Abstracts are then processed in what seems to be a fairly standard way (lowercasing, removing numbers and punctuation, dropping stop words, and stemming). Stems are completed, and then some basic term frequencies and associations are computed. Lastly, three graphics are generated: a word cloud, a dendrogram, and a graph of the most frequently occurring words.
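For reference, a minimal sketch of that standard pipeline with the tm package. `abstracts` is assumed to be a character vector of abstract texts (extraction sketched below); the graphics calls are omitted.

```r
library(tm)

# `abstracts` is an assumed character vector of abstract texts
# pulled from the PubMed XML.
corpus <- Corpus(VectorSource(abstracts))

# Standard cleanup: lowercase, strip numbers and punctuation,
# drop English stop words, collapse whitespace, then stem.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)

# Basic frequency and association measures.
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 20)                                   # most frequent terms
findAssocs(tdm, names(freq)[1], corlimit = 0.3)  # terms correlated with the top term
```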
XML reading and traversal seem memory-efficient and fast. Whatever problem I encountered previously has been resolved by using better functions.
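A sketch of the XPath-based extraction with the XML package, assuming a standard PubMed XML export; the filename is a placeholder.

```r
library(XML)

# Parse the PubMed export once, then pull fields with XPath.
# "pubmed_result.xml" is a placeholder filename.
doc <- xmlParse("pubmed_result.xml")
abstracts <- xpathSApply(doc,
                         "//MedlineCitation/Article/Abstract/AbstractText",
                         xmlValue)
pmids <- xpathSApply(doc, "//MedlineCitation/PMID", xmlValue)
free(doc)

# Note: citations without an abstract are simply skipped by the XPath,
# so `abstracts` and `pmids` can differ in length.
```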
tm_map calls seem relatively speedy; stop word removal and stemming are by far the slowest, compared to lowercasing and number removal. Stem completion is very slow: distributing the task helps, but a large corpus may need to be moved to a larger machine. However, memory usage has been reasonable throughout the transformation process.
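One way to distribute the stem-completion step is to split the vocabulary across cores with the parallel package. A sketch, assuming a pre-stemming copy of the corpus (`dictCorpus`) was saved to serve as the completion dictionary; note that `mclapply` only forks on Unix-alikes.

```r
library(tm)
library(parallel)

# Split the stemmed vocabulary across cores, since stemCompletion()
# is the bottleneck. `dictCorpus` is assumed to be a copy of the
# corpus taken before stemDocument() was applied.
completeStems <- function(stems, dictionary, cores = detectCores()) {
  chunks <- split(stems, cut(seq_along(stems), cores, labels = FALSE))
  done <- mclapply(chunks, stemCompletion, dictionary = dictionary,
                   mc.cores = cores)
  unlist(done, use.names = FALSE)
}

stems <- Terms(TermDocumentMatrix(corpus))
completed <- completeStems(stems, dictCorpus)
```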