NLP-on-scientific-articles-for-information-retrieval

This experiment applies language modeling to give each research article an identifying representation, which facilitates recommendation and search.

The key work in this experiment is word embedding. I will test several word embedding methods, such as TF-IDF and word2vec, and then evaluate classification performance on the vectors each method produces. The best word embedding should separate the articles reasonably well and give the best classification predictions. I will use the best word embedding method to implement information retrieval and build a web app for scientific article search.

Presentation

https://github.com/zhaoxin1124ds/NLP-on-scientific-articles-for-information-retrieval/blob/main/NLP_information_retrieval.pdf

Data

The data is from Kaggle and contains the title and abstract of a set of scientific papers. All the papers have been labeled according to 6 topics: computer science, physics, mathematics, statistics, quantitative biology, and quantitative finance.

Experiment procedure

Data pre-processing

I removed numbers, punctuation, extra white space, new lines, etc.
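The cleaning steps listed above could look roughly like the sketch below. It is an illustration only, not the notebook's code: the function name `clean_text` is hypothetical, and replacing each removed character with a space (rather than deleting it) is an assumption made so that neighbouring words do not get glued together.

```python
import re
import string

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Hypothetical helper: strip numbers, punctuation, and surplus whitespace."""
    text = re.sub(r"\d+", " ", text)        # remove numbers
    text = text.translate(_PUNCT_TO_SPACE)  # remove punctuation
    text = re.sub(r"\s+", " ", text)        # collapse new lines, tabs, repeated spaces
    return text.strip()


cleaned = clean_text("Deep  learning, in 2020:\nA survey!")
print(cleaned)
```

The same function would be applied to both the title and the abstract before any embedding is computed.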

Word embedding and classification

I will test the following word embedding algorithms:

TF-IDF
word2vec

Since the documents are labeled, I will use the classification score to evaluate the different vector space models.
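To make the evaluation idea concrete, here is a dependency-free sketch of TF-IDF vectors scored through a classifier. Several things in it are assumptions: the README does not name the classifier, so a simple nearest-centroid rule stands in for it; the smoothed IDF formula is one common variant; and the toy documents and every function name are made up for illustration. word2vec is left out because it needs a trained model.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the training corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(doc, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def fit_centroids(vectors, labels):
    """Mean TF-IDF vector per label (stand-in classifier, not the author's)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for vec, label in zip(vectors, labels):
        for t, w in vec.items():
            sums[label][t] += w
    return {lab: {t: w / counts[lab] for t, w in terms.items()} for lab, terms in sums.items()}


def predict(vec, centroids):
    """Pick the label whose mean training vector has the largest dot product."""
    def dot(a, b):
        return sum(w * b.get(t, 0.0) for t, w in a.items())
    return max(centroids, key=lambda lab: dot(vec, centroids[lab]))


# Toy data, invented for this sketch.
train_docs = ["neural network training on image data",
              "graph algorithm complexity analysis",
              "quantum field theory of particles",
              "particle collisions in quantum physics"]
train_labels = ["computer science", "computer science", "physics", "physics"]
test_docs = ["complexity of a graph algorithm", "quantum particles"]
test_labels = ["computer science", "physics"]

idf = fit_idf(train_docs)
centroids = fit_centroids([tfidf_vector(d, idf) for d in train_docs], train_labels)
predictions = [predict(tfidf_vector(d, idf), centroids) for d in test_docs]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

Swapping `tfidf_vector` for another embedding while keeping the classifier and the test split fixed is what makes the accuracy scores comparable across vector space models.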

Information retrieval

I will use cosine similarity ranking to implement the information retrieval.
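Cosine similarity ranking can be sketched as follows: embed the query in the same vector space as the articles, score every article by the cosine of the angle between the two vectors, and return the highest-scoring ones. This is a minimal illustration under stated assumptions, not the app's code: it uses TF-IDF weights with a smoothed IDF, and the names `search` and `top_k` as well as the sample articles are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, top_k=3):
    """Hypothetical helper: rank docs by cosine similarity to the query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def embed(tokens):
        # Query terms that never occur in the corpus are ignored.
        return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

    query_vec = embed(query.lower().split())
    scored = [(cosine_similarity(query_vec, embed(tokens)), i)
              for i, tokens in enumerate(tokenized)]
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored[:top_k]]


# Sample articles, invented for this sketch.
articles = ["bayesian inference for stock returns",
            "deep learning for protein folding",
            "topology of manifolds"]
results = search("protein folding", articles, top_k=2)
for title, score in results:
    print(f"{score:.3f}  {title}")
```

Because cosine similarity ignores vector length, a short query can still match a long abstract as long as they share heavily weighted terms.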

Results

The classification accuracy score is 0.84.
The information retrieval product returns reasonable recommendations for any query.

More thoughts for further improvement

A larger set of documents to improve the balance
Tuning the weight between the title and abstract
BERT as a more advanced word embedding method
