Top foreign language analysis #519 #579

Merged
31 changes: 31 additions & 0 deletions Top Foreign Languages Analysis/Dataset/README.md
@@ -0,0 +1,31 @@
# Top Foreign Languages Dataset

The dataset used here is taken from Kaggle. You can download the files for Top Foreign Languages Analysis and Prediction from: https://www.kaggle.com/datasets/timmofeyy/top-foreign-languages-preply-tutors

## About the dataset

The data covers the main foreign languages that are most popular among students. The dataset consists of 8 CSV files, one for each of 8 top foreign languages.

- **columns_description**: Description of the features contained in each CSV file.

name: The name of the tutor.

badge: Any badge or certification associated with the tutor.

rating: The overall rating of the tutor.

reviews_number: The number of reviews the tutor has received.

usd_price: The price charged by the tutor for their services.

language: The language taught by the tutor.

active_students: The number of active students the tutor is currently teaching.

lessons_number: The total number of lessons conducted by the tutor.

speak: The languages spoken by the tutor.

description: A brief description or snippet provided by the tutor.

link: The link or URL to the tutor's profile.
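
As a minimal sketch of how the column list above could be checked after loading a file, the snippet below compares a DataFrame against the expected column names. The helper `missing_columns` and the sample frame are hypothetical and exist only for illustration; they are not part of the notebook.

```python
from typing import List

import pandas as pd

# Column names as listed in the description above.
EXPECTED_COLUMNS = [
    "name", "badge", "rating", "reviews_number", "usd_price", "language",
    "active_students", "lessons_number", "speak", "description", "link",
]


def missing_columns(df: pd.DataFrame) -> List[str]:
    """Return the expected columns that are absent from df, in listed order."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Tiny made-up frame for illustration only; these are not rows from the dataset.
sample = pd.DataFrame({"name": ["Tutor A"], "rating": [4.9], "usd_price": [20]})
print(missing_columns(sample))
```

An empty list means every documented column is present.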
3,696 changes: 3,696 additions & 0 deletions Top Foreign Languages Analysis/Model/Top_Foreign_Language_Analysis.ipynb

Large diffs are not rendered by default.

103 changes: 103 additions & 0 deletions Top Foreign Languages Analysis/README.md
@@ -0,0 +1,103 @@
<h1>Top Foreign Language Analysis</h1>

**GOAL**

To build a machine learning model that predicts the price per lesson (`usd_price`) for a given language and other features.

**DATASET**

<https://www.kaggle.com/datasets/timmofeyy/top-foreign-languages-preply-tutors>

**DESCRIPTION**

To analyze the Top Foreign Languages dataset, then build and train models on its features.

The data covers the main foreign languages that are most popular among students. The dataset consists of 8 CSV files, one for each of 8 top foreign languages.

- **columns_description**: Description of the features contained in each CSV file.

name: The name of the tutor.

badge: Any badge or certification associated with the tutor.

rating: The overall rating of the tutor.

reviews_number: The number of reviews the tutor has received.

usd_price: The price charged by the tutor for their services.

language: The language taught by the tutor.

active_students: The number of active students the tutor is currently teaching.

lessons_number: The total number of lessons conducted by the tutor.

speak: The languages spoken by the tutor.

description: A brief description or snippet provided by the tutor.

link: The link or URL to the tutor's profile.


### Visualization and EDA of different attributes:

<img alt="correlation" src="./Images/correlation between amount of lessons and price per lesson.png">

<img alt="graph" src="./Images/count of tutors.png">

<img alt="graph" src="./Images/distribution plot.png">




**MODELS USED**

| Model | MSE_train | R2_train | MSE_test | R2_test |
|---------------------------|-----------|----------|-----------|-----------|
| Random Forest Regression | 7.03 | 0.93 | 57.90 | 0.51 |
| Linear Regression | 17.09 | 0.84 | 60.72 | 0.49 |
| Ridge Regression | 85.65 | 0.22 | 96.16 | 0.20 |
| Elastic Net Regression | 105.0 | 0.04 | 114.7 | 0.047 |
| Decision Tree Regression | 0.00 | 1.00 | 61.30 | 0.49 |
| Deep NN | 34.29 | 0.04 | 114.7 | 0.0471 |
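
The four metric columns in the table above can be computed along the following lines. This is a sketch under stated assumptions rather than the notebook's actual code: the data is synthetic, only two of the models are shown, and the hyperparameters and the 80/20 split are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the real features come from the tutor CSVs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Random Forest Regression": RandomForestRegressor(n_estimators=50, random_state=0),
    "Linear Regression": LinearRegression(),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    # One row per model, matching the table's column names.
    rows.append({
        "Model": name,
        "MSE_train": mean_squared_error(y_train, pred_train),
        "R2_train": r2_score(y_train, pred_train),
        "MSE_test": mean_squared_error(y_test, pred_test),
        "R2_test": r2_score(y_test, pred_test),
    })

results = pd.DataFrame(rows)
print(results.round(2))
```

Reporting train and test metrics side by side makes a gap between the two, such as the one in the Decision Tree row, easy to spot.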


**WHAT I HAVE DONE**

* Loaded the dataset, which comes as a zip file, unzipped it, and then concatenated all 8 CSV files.
* After concatenation, the data contains 34442 entries and 47 columns.
* Checked for missing values and cleaned the data accordingly.
* Analyzed the data, found insights and visualized them accordingly.
* Plotted a correlation heatmap to check the relationships between different features.
* Explored in detail how different columns relate to the target variable using plotting libraries.
* Trained different models on the dataset and saved their metrics (MSE and R2) into a dataframe.
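
The loading, concatenation, and missing-value steps listed above might look roughly like this. It is a minimal sketch, not the notebook's code: the helper `load_zipped_csvs`, the in-memory archive, and the file names `english.csv` and `spanish.csv` are hypothetical stand-ins for the downloaded Kaggle zip.

```python
import io
import zipfile

import pandas as pd


def load_zipped_csvs(zip_source) -> pd.DataFrame:
    """Read every .csv member of a zip archive and stack them into one DataFrame."""
    frames = []
    with zipfile.ZipFile(zip_source) as archive:
        for member in sorted(archive.namelist()):
            if member.lower().endswith(".csv"):
                with archive.open(member) as handle:
                    frames.append(pd.read_csv(handle))
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Build a tiny in-memory archive so the example runs without the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("english.csv", "name,usd_price\nTutor A,20\n")
    archive.writestr("spanish.csv", "name,usd_price\nTutor B,15\nTutor C,\n")
buffer.seek(0)

combined = load_zipped_csvs(buffer)
print(combined.shape)         # rows and columns after concatenation
print(combined.isna().sum())  # missing values per column
```

With the real data, `zip_source` would be the path to the downloaded archive instead of the in-memory buffer.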


**LIBRARIES NEEDED**

1. Pandas
2. Matplotlib
3. scikit-learn
4. NumPy
5. XGBoost
6. TensorFlow
7. Keras
8. SciPy
9. Seaborn
10. missingno
11. plotly


**CONCLUSION**

- Random Forest and Linear Regression perform best among the models tried, with the lowest test MSE (57.90 and 60.72) and the highest test R2 (0.51 and 0.49), although both score considerably better on the training set.
- Decision Tree Regression achieved a perfect R2 of 1.00 on the training set but only 0.49 on the test set, indicating overfitting.
- The Deep Neural Network (NN) has a high test MSE (114.7) and an R2 close to zero on both the training and test sets, indicating poor performance.


**YOUR NAME**

*Ghousiya Begum*

[![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/ghousiya-begum-a9b634258/) [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/ghousiya47)

11 changes: 11 additions & 0 deletions Top Foreign Languages Analysis/requirements.txt
@@ -0,0 +1,11 @@
numpy==1.19.2
pandas==1.4.3
matplotlib==3.7.1
scikit-learn~=1.0.2
scipy==1.5.0
seaborn==0.10.1
xgboost~=1.5.2
tensorflow==2.4.1
keras==2.4.0
missingno==0.5.2
plotly==5.15.0