This project explores the fundamental aspects of natural language processing (NLP) using the PubMed200kRCT_medical_abstracts dataset, comprising approximately 200,000 labeled Randomized Controlled Trial (RCT) abstracts.

spoluan/PubMed200kRCT_medical_abstracts_classification


Project description

This project delves into the fundamental aspects of natural language processing (NLP): processing text data to make predictions. I used the PubMed 200k RCT dataset, which consists of approximately 200,000 labeled Randomized Controlled Trial (RCT) abstracts, introduced in the 2017 paper "PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts". For in-depth information about the dataset, please refer to the original paper.
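As I understand the released files, each abstract is a block of LABEL&lt;tab&gt;sentence lines, with a ### header line marking the start of a new abstract; a minimal parsing sketch under that assumption (the sample lines below are invented for illustration):

```python
def parse_pubmed_rct(lines):
    """Parse PubMed-RCT-style lines into lists of (label, sentence) pairs, one list per abstract."""
    abstracts, current = [], []
    for line in lines:
        line = line.strip()
        if line.startswith("###"):  # header marking a new abstract, e.g. "###12345"
            if current:
                abstracts.append(current)
            current = []
        elif line:  # data line: "LABEL\tsentence text"
            label, _, sentence = line.partition("\t")
            current.append((label, sentence))
    if current:  # flush the last abstract
        abstracts.append(current)
    return abstracts

# Invented sample in the assumed format:
sample = [
    "###12345",
    "OBJECTIVE\tTo assess the effect of drug X.",
    "METHODS\tPatients were randomized into two groups.",
    "RESULTS\tDrug X reduced symptoms significantly.",
    "",
]
print(parse_pubmed_rct(sample))
```

Each sentence then becomes one training example whose target is its section label, which is what "sequential sentence classification" refers to.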

In NLP, there are several ways to represent text data before training, such as word-level or character-level tokenization. Word-level tokenization treats each word as a token with a unique ID, whereas character-level tokenization assigns a unique ID to each character and trains the model on those sequences. In this project, I have covered both approaches with a simple model architecture for demonstration purposes. Despite its simplicity, the model's accuracy is noteworthy. Feel free to check out the project for more details!
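The difference between the two schemes can be sketched in plain Python; this toy encoder (not the project's actual pipeline, which would typically use a library tokenizer) just assigns IDs in order of first appearance, reserving 0 for padding:

```python
def word_level_ids(sentences):
    """Assign a unique integer ID to each distinct word (0 reserved for padding)."""
    vocab, encoded = {}, []
    for s in sentences:
        ids = []
        for w in s.lower().split():
            if w not in vocab:
                vocab[w] = len(vocab) + 1  # new word gets the next free ID
            ids.append(vocab[w])
        encoded.append(ids)
    return vocab, encoded

def char_level_ids(sentences):
    """Assign a unique integer ID to each distinct character instead of each word."""
    vocab, encoded = {}, []
    for s in sentences:
        ids = []
        for c in s.lower():
            if c not in vocab:
                vocab[c] = len(vocab) + 1
            ids.append(vocab[c])
        encoded.append(ids)
    return vocab, encoded

sents = ["Patients were randomized", "Patients improved"]
w_vocab, w_enc = word_level_ids(sents)
c_vocab, c_enc = char_level_ids(sents)
print(w_enc)  # → [[1, 2, 3], [1, 4]] — repeated words share an ID
print(len(c_vocab))
```

Note the trade-off this illustrates: the word-level vocabulary grows with every new word, while the character-level vocabulary stays small but produces much longer input sequences.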
