Welcome to the repository for our solutions to the final round of the Irancell Labs Artificial Intelligence Hackathon 2023! Our team, consisting of Alireza Mirrokni and myself, participated in this challenging competition and tackled the problems presented with great effort and dedication.
Due to the Contest Code of Conduct, we are unable to share the datasets publicly. However, we provide detailed explanations of the problems and our solutions to give you insights into our approach. Enjoy exploring!
The contest featured two main problems judged by an automated system. Here, we describe each problem and outline our solutions.
Overview:
Data Structure Example:
├── 2023-07-27
│   ├── isna
│   │   ├── 0.html
│   │   └── 1.html
│   └── afkarnews
│       ├── 0.html
│       └── 1.html
└── 2023-07-28
    ├── borna
    │   ├── 0.html
    │   └── 1.html
    └── digiato
        ├── 0.html
        └── 1.html
The dataset provided consists of news articles collected over several days from various newspapers and blogs. The data is organized into folders named by date, and within each date folder, there are subfolders for each news source. Each news source folder contains HTML files representing individual news articles. Given this dataset, we were tasked with the following queries:
Queries:
- Unique News Sources: How many unique news sources are in the dataset?
- Most News Records: Which news source has the most news records in the dataset?
- Word Count in `varzesh3`: For all HTML files related to `varzesh3`, count the occurrences of the words کشتی (wrestling), والیبال (volleyball), and فوتبال (football) in `p` tags.
- Most Repeated Word in `h2` Tags: Find the most repeated word in `h2` tags among all news collected on 2023-08-01, excluding the provided stopwords.
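The four queries can be sketched with the Python standard library alone. This is a minimal illustration, not the code from our notebook: the helper names (`iter_articles`, `source_counts`, `word_counts`, `tag_words`) are hypothetical, the folder layout is assumed to be exactly `<root>/<date>/<source>/*.html` as in the example above, and a "word" is assumed to be a whole `\w+` token (so a half-space inside a Persian word would split it).

```python
from collections import Counter
from html.parser import HTMLParser
from pathlib import Path
import re


class TagTextCollector(HTMLParser):
    """Collect the text that appears inside one kind of tag (e.g. p or h2).

    Assumes the target tags are explicitly closed in the HTML.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0       # > 0 while the parser is inside the target tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def tag_words(html, tag):
    """Return the word tokens found inside every `tag` element of `html`."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return re.findall(r"\w+", " ".join(parser.chunks))


def iter_articles(root):
    """Yield (date, source, path) for each <root>/<date>/<source>/*.html file."""
    for path in sorted(Path(root).glob("*/*/*.html")):
        yield path.parent.parent.name, path.parent.name, path


def source_counts(root):
    """Number of articles per news source, summed over all dates."""
    return Counter(source for _, source, _ in iter_articles(root))


def word_counts(root, tag, source=None, date=None, stopwords=frozenset()):
    """Count words inside `tag`, optionally restricted to one source or date."""
    counts = Counter()
    for d, s, path in iter_articles(root):
        if (source is None or s == source) and (date is None or d == date):
            html = path.read_text(encoding="utf-8", errors="ignore")
            counts.update(w for w in tag_words(html, tag) if w not in stopwords)
    return counts
```

With these helpers, `len(source_counts(root))` gives the number of unique sources and `source_counts(root).most_common(1)` the source with the most records. `word_counts(root, "p", source="varzesh3")` can be indexed by each of the three words, and `word_counts(root, "h2", date="2023-08-01", stopwords=...).most_common(1)` gives the most repeated `h2` word once the provided stopword list is loaded into a set.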
Overview:
The second problem is to build a model that predicts the topic of a news item. Each row in the dataset corresponds to one news item, with its topic given in the `tags` column.
Topics and Labels:
- Social (Label: 0)
- Economic (Label: 1)
- Iran Provinces (Label: 2)
- International (Label: 3)
- Political (Label: 4)
- Scientific/Cultural/Sports (Label: 5)
Objectives:
- Modify the `tags` column to fit one of the six main categories.
- Tag each news item with one main category.
- Evaluate the model using the F1 Score.
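The objectives above can be illustrated with a small sketch of the label mapping and the metric. Several things here are assumptions rather than facts from the contest: the raw tag strings in `RAW_TO_MAIN` are made-up placeholders (the real `tags` values are in the private dataset), the helper names are hypothetical, and the F1 Score is computed as a macro average because the averaging mode is not stated here.

```python
# Label ids as listed under "Topics and Labels".
LABELS = {
    "Social": 0,
    "Economic": 1,
    "Iran Provinces": 2,
    "International": 3,
    "Political": 4,
    "Scientific/Cultural/Sports": 5,
}

# Hypothetical raw-tag -> main-category table; the real raw tags differ.
RAW_TO_MAIN = {
    "society": "Social",
    "economy": "Economic",
    "provinces": "Iran Provinces",
    "world": "International",
    "politics": "Political",
    "sports": "Scientific/Cultural/Sports",
    "culture": "Scientific/Cultural/Sports",
    "science": "Scientific/Cultural/Sports",
}


def to_label(raw_tag, default=None):
    """Map one raw tag string to a label id, or `default` if it is unknown."""
    main = RAW_TO_MAIN.get(raw_tag.strip().lower())
    return LABELS[main] if main is not None else default


def macro_f1(y_true, y_pred, labels=range(6)):
    """Unweighted mean of per-class F1 (assumed averaging mode).

    A class with no true and no predicted samples contributes an F1 of 0.
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because F1 is computed per class before averaging, it can differ noticeably from plain accuracy when the six categories are unevenly represented.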
Our Model:
We achieved an accuracy of approximately 8% with our predictive model.
- `problems/`: Contains PDF files detailing the problems.
- `/q1.ipynb`: Code and explanations for our solution to the first problem.
- `/q2.ipynb`: Code and explanations for our solution to the second problem.
Note: Due to contest rules, the datasets used in this hackathon cannot be shared publicly. Please refer to the explanations provided for a comprehensive understanding of our approach.