This project explores the fundamental aspects of natural language processing (NLP) using the PubMed200kRCT_medical_abstracts dataset, comprising approximately 200,000 labeled Randomized Controlled Trial (RCT) abstracts.

spoluan/PubMed200kRCT_medical_abstracts_classification


Project description

This project delves into the fundamental aspects of natural language processing (NLP): processing text data to make predictions. I used the PubMed 200k RCT dataset, which consists of approximately 200,000 labeled Randomized Controlled Trial (RCT) abstracts, introduced in the 2017 paper "PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts". For in-depth information about the dataset, please refer to the original paper.
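As I understand the released files, each abstract is a block of LABEL&lt;tab&gt;sentence lines, with a ### header line marking the start of a new abstract; a minimal parsing sketch under that assumption (the sample lines below are invented for illustration):

```python
def parse_pubmed_rct(lines):
    """Parse PubMed-RCT-style lines into lists of (label, sentence) pairs, one list per abstract."""
    abstracts, current = [], []
    for line in lines:
        line = line.strip()
        if line.startswith("###"):  # header marking a new abstract, e.g. "###12345"
            if current:
                abstracts.append(current)
            current = []
        elif line:  # data line: "LABEL\tsentence text"
            label, _, sentence = line.partition("\t")
            current.append((label, sentence))
    if current:  # flush the last abstract
        abstracts.append(current)
    return abstracts

# Invented sample in the assumed format:
sample = [
    "###12345",
    "OBJECTIVE\tTo assess the effect of drug X.",
    "METHODS\tPatients were randomized into two groups.",
    "RESULTS\tDrug X reduced symptoms significantly.",
    "",
]
print(parse_pubmed_rct(sample))
```

Each sentence then becomes one training example whose target is its section label, which is what "sequential sentence classification" refers to.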

In NLP, there are several ways to represent text data before training, such as word-level or character-level tokenization. Word-level tokenization treats each word as a token with a unique ID, whereas character-level tokenization assigns a unique ID to each character and trains the model on those sequences. In this project, I have covered both approaches with a simple model architecture for demonstration purposes. Despite its simplicity, the model's accuracy is noteworthy. Feel free to check out the project for more details!
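The difference between the two schemes can be sketched in plain Python; this toy encoder (not the project's actual pipeline, which would typically use a library tokenizer) just assigns IDs in order of first appearance, reserving 0 for padding:

```python
def word_level_ids(sentences):
    """Assign a unique integer ID to each distinct word (0 reserved for padding)."""
    vocab, encoded = {}, []
    for s in sentences:
        ids = []
        for w in s.lower().split():
            if w not in vocab:
                vocab[w] = len(vocab) + 1  # new word gets the next free ID
            ids.append(vocab[w])
        encoded.append(ids)
    return vocab, encoded

def char_level_ids(sentences):
    """Assign a unique integer ID to each distinct character instead of each word."""
    vocab, encoded = {}, []
    for s in sentences:
        ids = []
        for c in s.lower():
            if c not in vocab:
                vocab[c] = len(vocab) + 1
            ids.append(vocab[c])
        encoded.append(ids)
    return vocab, encoded

sents = ["Patients were randomized", "Patients improved"]
w_vocab, w_enc = word_level_ids(sents)
c_vocab, c_enc = char_level_ids(sents)
print(w_enc)  # → [[1, 2, 3], [1, 4]] — repeated words share an ID
print(len(c_vocab))
```

Note the trade-off this illustrates: the word-level vocabulary grows with every new word, while the character-level vocabulary stays small but produces much longer input sequences.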
