Prototype for news monitoring service
Explore the docs »
View Demo · Report Bug · Request Feature
The main task of news monitoring is to process an incoming stream of news and identify events that are of interest to users. In the banking sector this can be useful for predicting defaults of major borrowers, such as large companies. Here the goal is to build a model that detects news describing a delay in putting a facility into operation. The selected texts can then be passed to simpler models that search for mentions of the bank's borrowers.
For this project, a pre-selected set of training and testing data was used. More details about the data analysis can be found below.
The training dataset consists of 1.6k samples, 19% of which are target texts. The remaining texts were selected so that it is hard to tell at first glance whether they are target texts or not.
The testing dataset is a set of 10k samples collected from various news sources over the course of one week.
To obtain sentence embeddings I used the cointegrated/rubert-tiny2 model, which was trained to produce high-quality sentence embeddings. I then reduced the dimensionality of the embeddings with UMAP and clustered them with HDBSCAN. To tune hyperparameters and score the clusters I used Bayesian optimization with Hyperopt.
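A minimal sketch of this pipeline is shown below. The function names, hyperparameter ranges, and the DBCV-based objective are illustrative assumptions rather than the exact project configuration.

```python
import numpy as np
import torch
import umap
import hdbscan
from hyperopt import fmin, tpe, hp, Trials
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "cointegrated/rubert-tiny2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts, batch_size=64):
    """CLS-pooled, L2-normalized sentence embeddings."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            out = encoder(**batch)
        emb = out.last_hidden_state[:, 0, :]               # [CLS] token embedding
        emb = torch.nn.functional.normalize(emb, dim=1)
        vectors.append(emb.cpu().numpy())
    return np.vstack(vectors)

def cluster(embeddings, params):
    """Reduce dimensionality with UMAP, then cluster with HDBSCAN."""
    reduced = umap.UMAP(
        n_neighbors=int(params["n_neighbors"]),
        n_components=int(params["n_components"]),
        metric="cosine",
        random_state=42,
    ).fit_transform(embeddings)
    return hdbscan.HDBSCAN(
        min_cluster_size=int(params["min_cluster_size"]),
        metric="euclidean",
        gen_min_span_tree=True,  # required for the relative validity score
    ).fit(reduced)

def objective(params, embeddings):
    # Illustrative objective: maximize HDBSCAN's relative validity
    # (a DBCV-based score); Hyperopt minimizes, so negate it.
    clusterer = cluster(embeddings, params)
    return -clusterer.relative_validity_

space = {
    "n_neighbors": hp.quniform("n_neighbors", 5, 50, 1),
    "n_components": hp.quniform("n_components", 2, 20, 1),
    "min_cluster_size": hp.quniform("min_cluster_size", 5, 100, 5),
}

# texts = [...]  # list of news texts
# X = embed(texts)
# best = fmin(fn=lambda p: objective(p, X), space=space,
#             algo=tpe.suggest, max_evals=50, trials=Trials())
```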
Texts longer than 512 tokens are summarized so that they fit into the classifier's input (see the sketch after the next paragraphs).
I decided to use an abstractive model, even though extractive models are faster in this case. The texts in the test set are quite long and may cover several different topics, so an extractive model might not pull out exactly the passages we need.
I chose the mbart_ru_sum_gazeta model because it is trained to summarize Russian news and matches the domain of our data. Moreover, the model author's paper on the training dataset (arXiv:2006.11063) shows that the distribution of the number of tokens per sentence in its test set and in the model's output is suitable for our task.
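A rough sketch of the summarization step is below. The checkpoint id (IlyaGusev/mbart_ru_sum_gazeta on Hugging Face), the generation parameters, and the `clf_tokenizer` argument (the classifier's tokenizer used to count tokens) are assumptions for illustration, not the project's exact code.

```python
from transformers import MBartTokenizer, MBartForConditionalGeneration

SUM_MODEL = "IlyaGusev/mbart_ru_sum_gazeta"
sum_tokenizer = MBartTokenizer.from_pretrained(SUM_MODEL)
sum_model = MBartForConditionalGeneration.from_pretrained(SUM_MODEL)

def shorten_for_classifier(text, clf_tokenizer, max_tokens=512):
    """Return the text unchanged if it fits, otherwise its abstractive summary."""
    if len(clf_tokenizer.encode(text)) <= max_tokens:
        return text
    input_ids = sum_tokenizer(
        [text], max_length=600, truncation=True, return_tensors="pt"
    )["input_ids"]
    output_ids = sum_model.generate(input_ids=input_ids, no_repeat_ngram_size=4)[0]
    return sum_tokenizer.decode(output_ids, skip_special_tokens=True)
```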
| data | Accuracy | F1-score |
|---|---|---|
| val_dataset | 0.95 | 0.87 |
The project has the following structure:
- news_monitoring/eda: clustering scripts
- news_monitoring/models: .py scripts with summarization and classification models
- news_monitoring/preprocessing: .py scripts with text preprocessing
- news_monitoring/preprocessing/news_monitoring.ipynb: inference notebook
- Topic modeling
- News summarization
- News classifier
- News deduplication
- App for news scraping
Telegram: @my_name_is_nikita_hey
Mail: tttonyalpha@gmail.com
Distributed under the MIT License. See LICENSE.txt for more information.