Skip to content

Commit

Permalink
Merge pull request #579 from ghousiya47/top-foreign-language-analysis…
Browse files Browse the repository at this point in the history
…-#519

Top foreign language analysis #519
  • Loading branch information
abhisheks008 authored Feb 12, 2024
2 parents 503f46d + dc2082c commit ea285df
Show file tree
Hide file tree
Showing 7 changed files with 3,841 additions and 0 deletions.
31 changes: 31 additions & 0 deletions Top Foreign Languages Analysis/Dataset/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# Top Foreign Languages Dataset

The Dataset used here is taken from the Kaggle database website. You can download the file from the link given here, Top Foreign Languagess Analysis and Prediction.( https://www.kaggle.com/datasets/timmofeyy/top-foreign-languages-preply-tutors)

## About the dataset

The data includes the main languages and the most popular among students. The datasets have 8 csv files for 8 different top foreign languages.

- **columns_description**: Each CSV File contains the description of all the features.

name: The name of the tutor.

badge: Any badge or certification associated with the tutor.

rating: The overall rating of the tutor.

reviews_number: The number of reviews the tutor has received.

usd_price: The price charged by the tutor for their services.

language: The languages spoken by the tutor.

active_students: The number of active students the tutor is currently teaching.

lessons_number: The total number of lessons conducted by the tutor.

speak: The languages spoken by the tutor.

description: A brief description or snippet provided by the tutor.

link: The link or URL to the tutor's profile.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
3,696 changes: 3,696 additions & 0 deletions Top Foreign Languages Analysis/Model/Top_Foreign_Language_Analysis.ipynb

Large diffs are not rendered by default.

103 changes: 103 additions & 0 deletions Top Foreign Languages Analysis/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
<h1>Top Foreign Language Analysis</h1>

**GOAL**

To build a machine learning model for predicting the usd_price per lesson for a given language and other feature.

**DATASET**

[https://www.kaggle.com/datasets/timmofeyy/top-foreign-languages-preply-tutors]

**DESCRIPTION**

To analyze the dataset of Top Foreign Language and build and train the model on the basis of different features and variables.

The data includes the main languages and the most popular among students. The datasets have 8 csv files for 8 different top foreign languages.

- **columns_description**: Each CSV File contains the description of all the features.

name: The name of the tutor.

badge: Any badge or certification associated with the tutor.

rating: The overall rating of the tutor.

reviews_number: The number of reviews the tutor has received.

usd_price: The price charged by the tutor for their services.

language: The languages spoken by the tutor.

active_students: The number of active students the tutor is currently teaching.

lessons_number: The total number of lessons conducted by the tutor.

speak: The languages spoken by the tutor.

description: A brief description or snippet provided by the tutor.

link: The link or URL to the tutor's profile.


### Visualization and EDA of different attributes:

<img alt="correlation" src="./Images/correlation between amount of lessons and price per lesson.png">

<img alt="graph" src="./Images/count of tutors.png">

<img alt="graph" src="./Images/distribution plot.png">




**MODELS USED**

| Model | MSE_train | R2_train | MSE_test | R2_test |
|---------------------------|-----------|----------|-----------|-----------|
| Random Forest Regression | 7.03 | 0.93 | 57.90 | 0.51 |
| Linear Regression | 17.09 | 0.84 | 60.72 | 0.49 |
| Ridge Regression | 85.65 | 0.22 | 96.16 | 0.20 |
| Elastic Net Regression | 105.0 | 0.04 | 114.7 | 0.047 |
| Decision Tree Regression | 0.00 | 1.00 | 61.30 | 0.49 |
| Deep NN | 34.29 | 0.04 | 114.7 | 0.0471 |


**WHAT I HAD DONE**

* Load the dataset which is in zip file format unzipped it and than concatenated all 8 CSV files.
* After Concatenation it contains 34442 entries in it and having 47 columns in it.
* Checked for missing values and cleaned the data accordingly.
* Analyzed the data, found insights and visualized them accordingly.
* Plotting heatmap using correlation and checking the relation between different features.
* Found detailed insights of different columns with target variable using plotting libraries.
* Train the datasets by different models and saves their accuracies into a dataframe.


**LIBRARIES NEEDED**

1. Pandas
2. Matplotlib
3. Sklearn
4. NumPy
5. XGBoost
6. Tensorflow
7. Keras
8. Sci-py
9. Seaborn
10. missingno
11. plotly


**CONCLUSION**

- Random Forest and Linear Regression models show promising performance with lower MSE and higher R2 values.
- Decision Tree Regression achieved perfect R2 on the training set but performed poorly on the test set, indicating overfitting.
- Deep Neural Network (NN) has a high MSE and approximately zero R2, suggesting poor performance on both training and test sets.


**YOUR NAME**

*Ghousiya Begum*

[![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/ghousiya-begum-a9b634258/) [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/ghousiya47)

11 changes: 11 additions & 0 deletions Top Foreign Languages Analysis/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
numpy==1.19.2
pandas==1.4.3
matplotlib==3.7.1
scikit-learn~=1.0.2
scipy==1.5.0
seaborn==0.10.1
xgboost~=1.5.2
tensorflow==2.4.1
keras==2.4.0
missingno==0.5.2
plotly==5.15.0

0 comments on commit ea285df

Please sign in to comment.