tttonyalpha/news_monitoring


News monitoring service

Prototype for news monitoring service

About The Project

The main task of news monitoring is to process an incoming stream of news and identify events that are of interest to users. In the banking sector, this can help predict defaults of major borrowers, such as large companies. This project builds a model that detects news about one such event: a delay in putting a facility into operation. Simpler models can then be applied to the selected texts to search for mentions of the bank's borrowers.

Data collection and preparation

For this project, a pre-selected set of training and testing data was used. More details about the data analysis can be found below.

Train

The training dataset consists of 1.6k samples, 19 percent of which are target texts. The remaining texts were chosen so that it is hard to tell at first glance whether they are target texts or not.

Test

The testing dataset is a set of 10k samples collected from various news sources over the course of one week.


Topic modeling

To get sentence embeddings, I used the cointegrated/rubert-tiny2 model, which was trained to produce high-quality sentence embeddings. I then reduced the dimensionality of the embeddings with UMAP and clustered them with HDBSCAN. To tune the hyperparameters and score the resulting clusters, I used Bayesian optimization with Hyperopt.
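The pipeline described above might be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the search ranges, the seed, and the cost function (the share of points that HDBSCAN assigns with low confidence) are all assumptions. It also assumes that `sentence-transformers`, `umap-learn`, `hdbscan`, and `hyperopt` are installed; they are imported lazily inside the functions that need them.

```python
def embed(texts):
    """Encode texts with cointegrated/rubert-tiny2 (downloads the model on first use)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("cointegrated/rubert-tiny2")
    return model.encode(texts)


def cluster(embeddings, n_neighbors, n_components, min_cluster_size, seed=42):
    """Reduce dimensionality with UMAP, then cluster the result with HDBSCAN."""
    import hdbscan
    import umap

    reduced = umap.UMAP(
        n_neighbors=n_neighbors,
        n_components=n_components,
        metric="cosine",
        random_state=seed,
    ).fit_transform(embeddings)
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit(reduced)


def clustering_cost(probabilities, threshold=0.05):
    """Assumed cost: fraction of points whose cluster membership
    probability is below `threshold` (lower is better)."""
    probabilities = list(probabilities)
    if not probabilities:
        raise ValueError("no points to score")
    return sum(1 for p in probabilities if p < threshold) / len(probabilities)


def tune(embeddings, max_evals=50):
    """Search UMAP/HDBSCAN hyperparameters with Hyperopt's TPE algorithm."""
    from hyperopt import STATUS_OK, Trials, fmin, hp, space_eval, tpe

    # Hypothetical search ranges, chosen only for illustration.
    space = {
        "n_neighbors": hp.choice("n_neighbors", list(range(3, 16))),
        "n_components": hp.choice("n_components", list(range(3, 16))),
        "min_cluster_size": hp.choice("min_cluster_size", list(range(5, 31))),
    }

    def objective(params):
        clusterer = cluster(embeddings, **params)
        return {"loss": clustering_cost(clusterer.probabilities_), "status": STATUS_OK}

    best = fmin(objective, space, algo=tpe.suggest, max_evals=max_evals, trials=Trials())
    return space_eval(space, best)  # map hp.choice indices back to actual values
```

A typical call would be `best_params = tune(embed(texts))`, followed by `cluster(embeddings, **best_params)` to obtain the final labels. Other cost functions, for example one that also penalizes an implausible number of clusters, would fit the same structure.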

Train

[Figure: topic modeling result for the training set]

Test

[Figure: topic modeling result for the test set]

Summarization

Texts longer than 512 tokens are summarized so that they fit into the classifier.
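The routing rule can be sketched in a few lines. This is only an illustration: `fit_to_classifier`, `count_tokens`, and `summarize` are hypothetical names, and the toy stand-ins below count whitespace-separated words instead of real tokenizer output.

```python
def fit_to_classifier(text, count_tokens, summarize, max_tokens=512):
    """Return `text` unchanged if it fits the classifier, otherwise its summary."""
    if count_tokens(text) <= max_tokens:
        return text
    return summarize(text)


# Toy stand-ins for illustration only: whitespace "tokens" and a
# "summary" that simply keeps the first 512 words.
def toy_count(text):
    return len(text.split())


def toy_summarize(text):
    return " ".join(text.split()[:512])


short_text = "delay in commissioning the plant"
long_text = "word " * 1000

kept = fit_to_classifier(short_text, toy_count, toy_summarize)
shortened = fit_to_classifier(long_text, toy_count, toy_summarize)
```

In practice, `count_tokens` would presumably wrap the classifier's own tokenizer, so that the length check matches what the classifier actually sees.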

Choosing a method: extractive vs abstractive summarization

I decided to use an abstractive model, even though extractive models are faster. The texts in the test set are quite long and may cover several different topics, so an extractive model may fail to pull out the part that matters for the classifier.

Model selection

I chose the mbart_ru_sum_gazeta model because it is trained to summarize news in Russian, which matches the domain of our data. Additionally, the model author's paper on the training dataset (arXiv:2006.11063) shows that the distribution of the number of tokens per sentence in the test set and in the model's output is suitable for our task.
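A minimal sketch of calling this model through the Hugging Face `transformers` library is shown below. The model id `IlyaGusev/mbart_ru_sum_gazeta`, the 600-token input limit, and the `no_repeat_ngram_size` value are assumptions based on the model's public usage example, not settings taken from this repository; `load_summarizer` and `summarize` are hypothetical helper names.

```python
MODEL_NAME = "IlyaGusev/mbart_ru_sum_gazeta"  # assumed Hugging Face model id


def load_summarizer(model_name=MODEL_NAME):
    """Load the tokenizer and model (requires `transformers` and PyTorch)."""
    from transformers import MBartForConditionalGeneration, MBartTokenizer

    tokenizer = MBartTokenizer.from_pretrained(model_name)
    model = MBartForConditionalGeneration.from_pretrained(model_name)
    return tokenizer, model


def summarize(text, tokenizer, model, max_input_tokens=600):
    """Generate an abstractive summary of a single text."""
    # Truncate over-long inputs; the limit is an assumed value.
    input_ids = tokenizer(
        [text],
        max_length=max_input_tokens,
        truncation=True,
        return_tensors="pt",
    )["input_ids"]
    # Take the first (and only) generated sequence in the batch.
    output_ids = model.generate(input_ids=input_ids, no_repeat_ngram_size=4)[0]
    return tokenizer.decode(output_ids, skip_special_tokens=True)
```

Usage would look like `tokenizer, model = load_summarizer()` followed by `summarize(article_text, tokenizer, model)`. Loading the model once and reusing it matters here, since the test set contains about 10k texts.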

Classification

Results

Data         Accuracy  F1-score
val_dataset  0.95      0.87
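Because only about 19 percent of the training texts are targets, accuracy alone can look good even for a weak classifier, which is why F1 is reported alongside it. The sketch below computes both metrics on made-up labels; it assumes F1 is measured for the positive (target) class, which the table above does not state explicitly.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the `positive` class) for two label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Made-up labels: 2 targets out of 10 texts.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # finds one of the two targets
baseline_pred = [0] * 10                      # always predicts "not a target"

model_scores = accuracy_and_f1(y_true, model_pred)
baseline_scores = accuracy_and_f1(y_true, baseline_pred)
```

On these toy labels, the always-negative baseline still reaches 80 percent accuracy while its F1 is zero, so the two numbers together say more than either one alone.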


Project structure

The project has the following structure:

  • news_monitoring/eda: clustering scripts
  • news_monitoring/models: .py scripts with summarization and classification models
  • news_monitoring/preprocessing: .py scripts with text preprocessing
  • news_monitoring/preprocessing/news_monitoring.ipynb: inference notebook

Roadmap

  • Topic modeling
  • News summarization
  • News classifier
  • News deduplication
  • App for news scraping


Contacts

Telegram: @my_name_is_nikita_hey
Mail: tttonyalpha@gmail.com

License

Distributed under the MIT License. See LICENSE.txt for more information.
