
This project uses logistic regression to build a model that can efficiently identify fake news. An approach for extending the work to the remaining categories of fake news is also proposed.


srihitha2005/Fake-News-Detector


PROBLEM: FAKE NEWS DETECTOR

Fake news is false information that is published and promoted as if it were true. Nowadays it has become easier to spread false news due to social media which can reach a wider audience at a faster pace. Many people use the internet to stay informed and share millions of posts, articles, and videos across platforms such as Facebook, Twitter, and YouTube.

Fake news can be damaging, malicious, and even dangerous, with serious consequences including character assassination, increased violence, and eroded trust in legitimate news.

Fake news can be of various forms such as false headlines, manipulated images or videos, and misleading quotes. Fake news can be divided into three major categories.

  1. Political Fake News
  2. Clickbait
  3. Fake News Articles

Political Fake News: Fake news in this category often involves a statement made by a public figure. The statement, on its own, may seem outrageous or untrue, but without proper context or background information its accuracy cannot be determined. For example, in 2019, during an election campaign rally, a prominent Indian politician made a statement about a particular community. Taken out of context, the statement appeared critical and derogatory towards that community. Considered in the full context of the politician's overall stance on the issue and their track record, however, it was actually a call for unity and harmony between different religious groups. Unfortunately, the damage had already been done: the statement was widely criticized and condemned, harming the politician's reputation and public image.

Clickbait: In this category, headlines are designed to be sensational, using misleading language to make readers click the link. This can lead to disappointment and mistrust when users realize the content does not live up to the headline's promises, and it can also hurt search engine rankings. People often read only the headlines to catch up with news on a busy schedule and may share articles without reading them in full, spreading misleading information and making those articles more popular. In addition, the popularity of a news item can affect its acceptance even when the content itself is unverified: people may be more likely to believe and share news that appears widely accepted, without taking the time to assess its credibility or accuracy. For example, headlines about the severity of the storms that hit the United Kingdom in 2023 were published by a news channel in Andhra Pradesh, leaving people worried about family, relatives, and friends living in the United Kingdom. The storm proved to be less serious once the full story behind the headline was telecast.

Fake News Articles: This kind of news uses terms intended to inspire outrage or anger in readers or viewers. The writing in these articles is often subpar, with grammatical errors and a lack of logical flow. For example, during the COVID-19 pandemic in 2020, a claim circulated that consuming cow urine could cure the virus. The claim was shared widely on social media platforms such as WhatsApp and Facebook, and many people believed it to be true, even though no scientific evidence was offered to support it.

HOW IS AI USED TO SOLVE THIS PROBLEM?

The main focus of this project is to identify clickbait news.
Often headlines are released before the actual news in news channels and headlines direct the user to the actual news in most of the apps. Artificial Intelligence (AI) can be used to identify clickbait in the headlines by being trained to recognize patterns and characteristics of typical clickbait headlines involving various elements such as the use of determiners, pronouns, numbers, and question phrasing, among others. Additionally, AI can be trained to distinguish between clickbait headlines written by humans and those generated by machines. The AI model can then learn from this data and make predictions about new, unseen headlines.
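As a toy illustration of the headline cues mentioned above (numbers, pronouns, question phrasing), a few hand-written checks might look like the following. The cue names and word lists here are illustrative assumptions; a trained model learns such patterns from labelled data rather than from fixed rules.

```python
import re

def clickbait_cues(headline: str) -> dict[str, bool]:
    """Check a headline for a few surface cues often seen in clickbait."""
    words = headline.lower().split()
    return {
        # Clickbait often leads with a number: "10 Things ..."
        "starts_with_number": bool(re.match(r"^\d", headline)),
        # Direct address of the reader: "You", "Your"
        "second_person_pronoun": any(w in {"you", "your", "you'll"} for w in words),
        # Question-style phrasing
        "question_phrasing": headline.rstrip().endswith("?")
                             or words[0] in {"what", "why", "how", "which"},
    }

print(clickbait_cues("10 Things You Need To Know Today"))
```

A real classifier would not rely on such brittle rules, but they convey which surface features the model can pick up on.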

In this project, a working AI model for identifying clickbait is presented, along with a proposed approach for identifying fake news in the other two categories, namely political fake news and fake news articles.

HOW IS AI USED IN DETECTING FAKE NEWS?

Contents overview:

  1. Introduction

  2. Data Collection
     2.1. Collecting Non-Clickbait Headlines
     2.2. Collecting Clickbait Headlines

  3. Data Preprocessing

  4. Data Labeling

  5. Feature Extraction

  6. Data Vectorization

  7. Model Training

  8. Evaluation of the model

  9. Development of Frontend Environment using Tkinter

  10. Proposal for Predicting Political Fake News and Fake News Articles

  11. Summary and Conclusion

  12. References

  1. Introduction :
    In Report 1, fake news is divided into three types. This report describes how AI is used to identify clickbait: the features that differentiate clickbait from non-clickbait headlines are analyzed using a logistic regression model. The report also discusses how the other two types of fake news might be detected.

  2. Data Collection :
    Data collection is done in a way that ensures an equal distribution of data among all categories. It is divided into two main components: non-clickbait headlines and clickbait headlines.

2.1. Collecting Non-Clickbait Headlines: Non-clickbait headlines are collected from various reputable online sources, including NBC News, The Indian Express, NDTV, and The Wire. The headlines are extracted using web scraping with the BeautifulSoup library, which allowed a diverse range of non-clickbait headlines to be gathered efficiently. The extracted data is stored as a CSV file.

2.2. Collecting Clickbait Headlines: Clickbait headlines are manually collected from various sources, including Buzzfeed, Reddit, and NewsPunch, and saved as a CSV file. The final datasets have the following dimensions:

Non-clickbait training dataset: 108 rows * 1 column
Non-clickbait test dataset: 103 rows * 1 column
Clickbait training dataset: 103 rows * 1 column
Clickbait test dataset: 105 rows * 1 column

Share of each class per split:
Clickbait: training 48.815 %, testing 50.481 %
Non-clickbait: training 51.185 %, testing 49.519 %
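A minimal sketch of the BeautifulSoup scraping step described above: the CSS selector `h2.headline` is a hypothetical placeholder (each real source needs its own selector), and an inline HTML snippet stands in for the page that `requests.get(url).text` would fetch in the real pipeline.

```python
import csv
from bs4 import BeautifulSoup

# Inline stand-in for a fetched news page; keeps the sketch self-contained.
html = """
<html><body>
  <h2 class="headline">Parliament passes new data protection bill</h2>
  <h2 class="headline">Monsoon arrives early in Kerala, says IMD</h2>
</body></html>
"""

def extract_headlines(page_html: str) -> list[str]:
    """Pull headline text out of the page using a (hypothetical) selector."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("h2.headline")]

headlines = extract_headlines(html)

# Store the scraped headlines as a one-column CSV, as described above.
with open("non_clickbait.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["headline"])
    writer.writerows([h] for h in headlines)
```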

  3. Data Preprocessing :
    The preprocessing for the clickbait data is minimal since that data is collected and stored manually.
    The non-clickbait data contained duplicates, unneeded rows, and stray characters caused by HTML differences during web scraping. Rows consisting of single words such as "Follow" or "Subscribe" are deleted, followed by rows containing the words "subscribers only", and finally rows containing unwanted characters.
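The cleaning steps above could be sketched with pandas as follows; the column name "headline" and the example rows are assumptions standing in for the real scraped data.

```python
import pandas as pd

# Hypothetical raw scraped data illustrating each kind of unwanted row.
raw = pd.DataFrame({"headline": [
    "Parliament passes new data protection bill",
    "Parliament passes new data protection bill",   # duplicate
    "Subscribe",                                    # single-word junk row
    "Exclusive story (subscribers only)",           # paywall row
    "Budget session \x96 key takeaways",            # stray non-printable character
    "Monsoon arrives early in Kerala, says IMD",
]})

cleaned = (
    raw.drop_duplicates()
       # drop single-word rows such as "Follow" / "Subscribe"
       .loc[lambda df: df["headline"].str.split().str.len() > 1]
       # drop "subscribers only" rows
       .loc[lambda df: ~df["headline"].str.contains("subscribers only", case=False)]
       # keep only rows made of printable ASCII (drops stray characters)
       .loc[lambda df: df["headline"].str.match(r"^[\x20-\x7E]+$")]
       .reset_index(drop=True)
)
```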

  4. Data Labelling :
    An extra column named Label is added to all the CSV files after loading them. Clickbait headlines are labelled as 1 whereas non-clickbait headlines are labelled as 0.
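The labelling step might look like the following; the single-row frames are hypothetical stand-ins for the CSV files that `pd.read_csv(...)` would load in the real pipeline.

```python
import pandas as pd

# Stand-ins for the loaded clickbait / non-clickbait CSV files.
clickbait = pd.DataFrame({"headline": ["You Won't Believe What Happened Next"]})
non_clickbait = pd.DataFrame({"headline": ["RBI keeps repo rate unchanged"]})

# Add the extra Label column described above.
clickbait["Label"] = 1      # clickbait headlines -> 1
non_clickbait["Label"] = 0  # non-clickbait headlines -> 0
```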

  5. Feature Extraction :
    The labelled data is concatenated into test and train sets and then split by column: headlines serve as the model input and labels as the prediction target.
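The concatenation and split might be sketched as follows, with tiny hypothetical frames standing in for the four labelled CSV files.

```python
import pandas as pd

# Hypothetical labelled frames (real ones come from the labelling step).
cb_train = pd.DataFrame({"headline": ["10 Tricks Doctors Hate"], "Label": [1]})
ncb_train = pd.DataFrame({"headline": ["Sensex closes flat"], "Label": [0]})

# Concatenate clickbait and non-clickbait rows into one training set.
train = pd.concat([cb_train, ncb_train], ignore_index=True)

X_train = train["headline"]  # model input: the headline text
y_train = train["Label"]     # prediction target: 1 = clickbait, 0 = not
```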

  6. Data Vectorization: Machine learning models cannot directly process raw text. Vectorization converts text into numerical vectors, enabling the models to process the data, and also helps extract relevant features from the text.
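The report does not name the vectorizer used; a bag-of-words CountVectorizer from scikit-learn is one common choice and serves as a sketch of the idea.

```python
from sklearn.feature_extraction.text import CountVectorizer

headlines = [
    "You Won't Believe What Happened Next",
    "Sensex closes flat after volatile session",
]

# Each headline becomes a vector of word counts over the learned vocabulary.
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(headlines)  # sparse matrix: rows = headlines, cols = terms
```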

  7. Model Training : We chose the logistic regression model to train on the data. Logistic regression models the probability of a binary outcome from one or more independent variables.
    Our dependent variable is the class, i.e. clickbait or not, and our independent variables are the vectorized headline features.
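The training step can be sketched end-to-end with scikit-learn; the four-headline training set here is a hypothetical miniature of the real one (roughly 200 headlines).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set.
headlines = [
    "You Won't Believe What This Dog Did",
    "10 Secrets Celebrities Don't Want You To Know",
    "Parliament passes data protection bill",
    "RBI keeps repo rate unchanged",
]
labels = [1, 1, 0, 0]  # 1 = clickbait, 0 = non-clickbait

# Vectorize the headlines, then fit the logistic regression classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(headlines)
model = LogisticRegression()
model.fit(X, labels)
```

New headlines would be passed through the same fitted vectorizer (`vectorizer.transform`) before calling `model.predict`.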

  8. Evaluation of the model :
    The evaluation metric is accuracy. An accuracy of 94.14% was obtained on the test dataset.
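Accuracy is the fraction of test headlines classified correctly; a small illustration with hypothetical labels and predictions:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)  # 7 of 8 correct -> 0.875
```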

  10. Proposal for Predicting Political Fake News and Fake News Articles: For predicting the above-mentioned types of fake news, pre-trained models like BERT can be used. BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing (NLP) model developed by Google. Pretrained BERT models are versions of BERT that have been trained on large amounts of text data.

  11. Summary and Conclusion : This report discusses an approach to detecting fake news, focusing mainly on identifying clickbait headlines with AI techniques. We collected data, preprocessed it, and trained a logistic regression model, achieving 94.14% accuracy on the test set. We also propose using advanced NLP models like BERT to detect other types of fake news, such as political misinformation.

  12. References :
    https://www.nbcnews.com/
    https://indianexpress.com/
    https://www.ndtv.com/
    https://thewire.in/
    https://www.peoplesbanknet.com/the-dangers-of-fake-news/
    https://github.com/IITGuwahati-AI/Fake-News-Detection/
    https://monkeylearn.com/blog/text-classification-machine-learning/
    https://www.netreputation.com/clickbait-examples/
    https://www.buzzfeednews.com/section/national
    https://www.pewresearch.org/religion/2021/06/29/religion-in-india-toleranceand-segregation/

