Data Science Project
Extracted population data from the United Nations website. The data set contains several population measures for countries around the world from 1950 to 2019, with many null values. We applied various algorithms (Decision Tree, Random Forest, SVM, etc.) to see which one fits the data most accurately. We also explored the data and provided insights and forecasts about the population of various countries.
My Team Members:
- Abhay Panwar
- Chetan Kumar
- Gaikwad Chandan Mahadeo
- Devansh Mishra
- Mallempadi Yashaswini
The project has 6 tasks:
a. Data collection
b. Data Preprocessing and Cleaning
c. Data Visualization
d. Data Statistics (Summary of Statistics)
e. Hypothesis Statement
f. Prediction Task (Using Machine Learning Model)
Data Collection
The data was extracted from https://population.un.org/wpp/Download/Standard/CSV/
Information about the usage of the various classifiers was taken from http://www.Datacamp.org
Data Preprocessing and Cleaning
- Kept only the medium variant of the population data.
- Removed the columns varID, variant and MidPeriod.
- Used scikit-learn's built-in train_test_split with a 70/30 split, as it gave good accuracy.
- Source_Code_link
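The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`VarID`, `Variant`, `MidPeriod`, `LocID`, `Time`, `PopTotal`) are assumed from the UN WPP CSV layout, the `preprocess` helper is hypothetical, and the sample rows are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows in the assumed WPP column layout; the numbers are illustrative only.
raw = pd.DataFrame({
    "LocID": [1] * 6 + [2] * 6,
    "Location": ["Country A"] * 6 + ["Country B"] * 6,
    "VarID": [2, 2, 2, 2, 2, 3] * 2,
    "Variant": (["Medium"] * 5 + ["High"]) * 2,
    "Time": [1950, 1960, 1970, 1980, 1990, 1990] * 2,
    "MidPeriod": [1950.5, 1960.5, 1970.5, 1980.5, 1990.5, 1990.5] * 2,
    "PopTotal": [100.0, 120.0, 140.0, 160.0, 180.0, 190.0,
                 500.0, 520.0, 540.0, 560.0, 580.0, 600.0],
})


def preprocess(df):
    """Keep the medium variant and drop the columns that are no longer needed."""
    medium = df[df["Variant"] == "Medium"]
    return medium.drop(columns=["VarID", "Variant", "MidPeriod"]).reset_index(drop=True)


cleaned = preprocess(raw)

# Assumed feature/target choice for illustration: predict PopTotal from LocID and Time.
X = cleaned[["LocID", "Time"]]
y = cleaned["PopTotal"]

# 70/30 split; random_state is fixed only to make the sketch reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```

Fixing `random_state` is optional; without it, each run produces a different 70/30 partition.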
Data Visualization
- Plotted scatter plots of all 4 attributes for 10 different countries
- Also calculated R^2 for each case using Plotly
- Used the plots to select one of the four attributes for hypothesis testing:
- Male Population
- Female Population
- Total Population
- Population Density
- Source_Code_link
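To make the R^2 figure concrete, here is a small sketch of how it can be computed for a straight-line fit of an attribute against the year. The project obtained R^2 through Plotly; this example uses NumPy instead, purely to show what the number measures. The `r_squared` helper and the sample values are hypothetical.

```python
import numpy as np


def r_squared(x, y):
    """R^2 of a least-squares straight-line fit of y against x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # degree-1 (linear) fit
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)    # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up values: a perfectly linear trend gives R^2 very close to 1.
years = [1950, 1960, 1970, 1980, 1990]
population = [100.0, 120.0, 140.0, 160.0, 180.0]
print(round(r_squared(years, population), 4))
```

An R^2 close to 1 means the attribute follows a nearly linear trend over time, which is one way to compare the four attributes.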
Data Statistics
- Selected one of the four attributes to build a prediction model, based on observations from the graphs plotted during visualization.
- Used statistics to identify the countries whose total population is likely to be in the range 5000 to 15000 in 2011.
- Source_File
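The range query described above could look something like this in pandas. It is a sketch under assumptions: the column names `Location`, `Time` and `PopTotal` are taken from the UN WPP CSV layout, the `countries_in_range` helper is hypothetical, and the rows are made up. The bounds are in whatever unit the data set uses for total population.

```python
import pandas as pd


def countries_in_range(df, year, low, high):
    """Names of locations whose PopTotal lies in [low, high] for the given year."""
    mask = (df["Time"] == year) & df["PopTotal"].between(low, high)
    return sorted(df.loc[mask, "Location"].unique())


# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    "Location": ["Country A", "Country B", "Country C", "Country A"],
    "Time": [2011, 2011, 2011, 2010],
    "PopTotal": [9000.0, 20000.0, 5000.0, 8800.0],
})

print(countries_in_range(sample, 2011, 5000, 15000))
```

`Series.between` includes both endpoints by default, so a value of exactly 5000 or 15000 counts as inside the range.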
Hypothesis Statement
Based on the statistics, we formulated a hypothesis and then tested it by building a model.
Prediction Task (Using Machine Learning Model)
- Compared the accuracy of various classifiers and selected one for building the model.
- Experimentally, our model showed that our hypothesis, which was based on using “Total Population” as the attribute for the model, was wrong.
- Created models using various algorithms:
- Decision Tree Classification
- Random Forest Classification
- Support Vector Machines(SVM)
- Logistic Regression
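A comparison of the four classifiers listed above might be set up along these lines with scikit-learn. This is only a sketch: the README does not say which features and labels were used, so the example runs on synthetic data from `make_classification`, and the `compare_classifiers` helper and all hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, random_state=0):
    """Fit each classifier on a 70/30 split and return its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )
    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Random Forest": RandomForestClassifier(
            n_estimators=50, random_state=random_state
        ),
        "SVM": SVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data; the real project used the UN population data instead.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
scores = compare_classifiers(X, y)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

All four models are scored on the same held-out 30%, so the accuracies are directly comparable.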
My Experience and Learnings
Coded in Python using Google Colab.
Learned to use various Python modules: NumPy, pandas, Plotly, csv and scikit-learn (sklearn).
Problems faced
- Many classifiers require numeric input and output, so using the attribute “Location” posed a problem.
Solution:
The attribute “LocID” was used instead, along with the other numeric attributes.
- Void or NULL values created problems during pre-processing.
Solution:
The dropna() method of pandas was used to remove the null values.
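Both fixes can be illustrated with a short pandas sketch. The `to_numeric_features` helper is hypothetical and the rows are made up; only `LocID`, `Location` and `dropna()` come from the description above.

```python
import numpy as np
import pandas as pd

# Made-up rows: one PopTotal value is missing.
sample = pd.DataFrame({
    "LocID": [1, 1, 2, 2],
    "Location": ["Country A", "Country A", "Country B", "Country B"],
    "Time": [2000, 2010, 2000, 2010],
    "PopTotal": [100.0, np.nan, 250.0, 300.0],
})


def to_numeric_features(df):
    """Drop the text column (LocID already identifies the country) and remove nulls."""
    numeric = df.drop(columns=["Location"])
    return numeric.dropna().reset_index(drop=True)


features = to_numeric_features(sample)
print(features)
```

Note that `dropna()` removes every row containing at least one null value, so it reduces the amount of data available for training.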