
NLP on scientific articles: a vector space model is built from the articles, and information retrieval is implemented using cosine similarity ranking.


zhaoxin1124ds/NLP-on-scientific-articles-for-information-retrieval


NLP-on-scientific-articles-for-information-retrieval

This experiment uses language modeling to give each research article an identifying representation, which facilitates recommendation and search.

The key part of this experiment is word embedding. I will test several word embedding methods, such as TF-IDF and word2vec, and then evaluate classification performance on the vectors produced by each method. The best word embedding should separate the articles reasonably well and give the best classification predictions. I will use the best word embedding method to implement information retrieval and build a web app for scientific article search.

https://github.com/zhaoxin1124ds/NLP-on-scientific-articles-for-information-retrieval/blob/main/NLP_information_retrieval.pdf

Data

The data is from Kaggle and contains the abstract and title for a set of scientific papers. All the papers have been labeled according to 6 topics: computer science, physics, mathematics, statistics, quantitative biology, and quantitative finance.

Experiment procedure

Data pre-processing

I removed numbers, punctuation, extra white space, new lines, etc.
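
The cleanup described above can be sketched with the Python standard library. This is a minimal illustration, not the repository's actual code: the function name `clean_text` is hypothetical, and it covers only the steps named explicitly (numbers, punctuation, extra white space, new lines), not whatever "etc." includes.

```python
import re
import string

# Map every ASCII punctuation character to a space so that words joined
# by punctuation ("learning,in") do not get glued together.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text: str) -> str:
    """Remove digits and punctuation, then collapse all whitespace runs."""
    text = re.sub(r"\d+", " ", text)        # numbers
    text = text.translate(_PUNCT_TO_SPACE)  # punctuation
    text = re.sub(r"\s+", " ", text)        # new lines, tabs, repeated spaces
    return text.strip()

cleaned = clean_text("Deep  learning,\nin 2020: 3 results!")
print(cleaned)
```

Replacing punctuation with a space rather than deleting it is a design choice in this sketch; it keeps tokens separated when the source text has no space after a comma or colon.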

Word embedding and classification

I will test the word embedding algorithms below:

  • TF-IDF
  • word2vec

Since the documents are labeled, I will use the classification score to evaluate the different vector space models.
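
To make the evaluation idea concrete, here is a small sketch of scoring a TF-IDF vector space by classification accuracy. Everything in it is an assumption for illustration: the toy documents, the smoothed IDF formula, and the nearest-centroid classifier are stand-ins, since the classifier and library used in this experiment are not stated here. It uses only the standard library.

```python
import math
from collections import Counter, defaultdict

def fit_idf(token_lists):
    """Smoothed inverse document frequency learned from training documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def normalize(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def tfidf(tokens, idf):
    """Unit-length TF-IDF vector; terms unseen during fit_idf are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return normalize({t: c * idf[t] for t, c in counts.items()})

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def fit_centroids(vectors, labels):
    """One unit-length centroid per label (hypothetical stand-in classifier)."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            sums[label][term] += weight
    return {label: normalize(dict(total)) for label, total in sums.items()}

def predict(vec, centroids):
    """Pick the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: dot(vec, centroids[label]))

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data, invented for this example only.
train_docs = [
    "neural network training for image classification",
    "quantum field theory and particle scattering",
    "graph algorithms and neural network complexity",
    "black hole thermodynamics and quantum gravity",
]
train_labels = ["computer science", "physics", "computer science", "physics"]
test_docs = ["neural network pruning", "quantum particle dynamics"]
test_labels = ["computer science", "physics"]

train_tokens = [doc.split() for doc in train_docs]
idf = fit_idf(train_tokens)
centroids = fit_centroids([tfidf(t, idf) for t in train_tokens], train_labels)
predictions = [predict(tfidf(doc.split(), idf), centroids) for doc in test_docs]
print(predictions, accuracy(test_labels, predictions))
```

Swapping `tfidf` for another embedding function while keeping the classifier fixed gives a like-for-like comparison of vector space models, which is the point of the evaluation step.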

Information retrieval

I will use cosine similarity ranking to implement the information retrieval.
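
Cosine similarity ranking scores each document by the cosine of the angle between its vector and the query vector, then returns the highest-scoring documents. The sketch below is illustrative only: the function names and the three titles are made up, and plain term counts stand in for the TF-IDF or word2vec vectors that would be used in practice.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine between two sparse vectors stored as {term: weight} mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (document index, score) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by index
    return [(i, score) for score, i in scored[:top_k]]

# Hypothetical titles, used only to exercise the ranking.
titles = [
    "Bayesian inference for stochastic volatility models",
    "Deep learning for protein structure prediction",
    "Variational Bayesian methods in statistics",
]
doc_vecs = [Counter(title.lower().split()) for title in titles]
query_vec = Counter("bayesian inference".split())

results = rank_documents(query_vec, doc_vecs, top_k=2)
for index, score in results:
    print(f"{score:.3f}  {titles[index]}")
```

Because cosine similarity depends only on direction, a long abstract and a short query can still match well as long as they share the same weighted terms.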

Results

  • The classification accuracy score is 0.84
  • The information retrieval product returns reasonable recommendations for any query

Ideas for further improvement

  • A larger document set to improve the class balance
  • Experimenting with the relative weight of the title and abstract
  • BERT as a more advanced word embedding method
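
One possible way to experiment with the title and abstract weighting is to embed the two fields separately and blend the vectors. This is only a sketch of the idea: `combine_fields` and `title_weight` are hypothetical names, and the experiment as described does not say how the two fields are currently combined.

```python
def combine_fields(title_vec, abstract_vec, title_weight=0.5):
    """Weighted sum of two sparse {term: weight} vectors.

    title_weight is the share given to the title; the abstract gets the rest.
    """
    if not 0.0 <= title_weight <= 1.0:
        raise ValueError("title_weight must be between 0 and 1")
    combined = {}
    for term, weight in title_vec.items():
        combined[term] = combined.get(term, 0.0) + title_weight * weight
    for term, weight in abstract_vec.items():
        combined[term] = combined.get(term, 0.0) + (1.0 - title_weight) * weight
    return combined

# Toy vectors, invented for this example.
blended = combine_fields({"graph": 1.0}, {"graph": 1.0, "kernel": 2.0}, title_weight=0.75)
print(blended)
```

Sweeping `title_weight` over a few values and comparing the resulting classification score would show whether titles deserve more weight than abstracts.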
