PROGRAMMING FOR DATA ANALYSIS PROJECT

This repository contains Jupyter notebooks and other relevant files relating to the module assessment for Programming For Data Analysis. This README.md file contains the documentation for technologies and libraries used for the project.

All notebooks and all other relevant files can be found at: https://github.com/kmcd14/ProgrammingDA-project.

Description

Data Simulation

The notebook data_sim.ipynb details the research that went into the simulation, as well as the simulated data set itself.

The aim of this notebook is to create a simulated data set by simulating a real-world phenomenon. I chose to investigate the sporting phenomenon of "home advantage".

Objectives:

Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
Investigate the types of variables involved, their likely distributions, and their relationships with each other.
Synthesise/simulate a data set as closely matching their properties as possible.
Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.
A clear and informative README file.
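
As a rough illustration of the objectives above, the sketch below simulates a small data set with four variables using NumPy. The variable names, the Poisson model for goals and the rate values are assumptions made for this example; they are not taken from data_sim.ipynb.

```python
import numpy as np

# All names and parameter values here are hypothetical, chosen only for
# illustration; they are not taken from data_sim.ipynb.
N_MATCHES = 200        # at least one hundred data points
HOME_GOAL_RATE = 1.5   # assumed mean goals scored at home
AWAY_GOAL_RATE = 1.1   # assumed mean goals scored away

rng = np.random.default_rng(seed=42)  # seeded so the run is repeatable

# Variable 1: where each match was played, from one team's perspective.
venue = rng.choice(["home", "away"], size=N_MATCHES)
is_home = venue == "home"

# Variables 2 and 3: goals scored and conceded, drawn from Poisson
# distributions whose means depend on the venue.
goals_for = rng.poisson(lam=np.where(is_home, HOME_GOAL_RATE, AWAY_GOAL_RATE))
goals_against = rng.poisson(lam=np.where(is_home, AWAY_GOAL_RATE, HOME_GOAL_RATE))

# Variable 4: the match result, derived from the two goal counts.
result = np.where(
    goals_for > goals_against,
    "win",
    np.where(goals_for == goals_against, "draw", "loss"),
)

# Compare the share of wins at home and away.
print("Home win rate:", np.mean(result[is_home] == "win"))
print("Away win rate:", np.mean(result[~is_home] == "win"))
```

Changing the two rate constants is a quick way to see how strongly the assumed home advantage shows up in the simulated results.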

I found this project interesting as it was the first time that I have generated a simulated data set. I can definitely see the benefits of generating your own data. It is much more time efficient than collecting and cleaning real-world data, it avoids having to consider GDPR regulations, it gets around the problem of there not being enough data on certain topics and, finally, it is much more cost effective!

While there are a lot of positives, there are also downsides. It is difficult to truly recreate and mimic a real-world data set that has been collected. Models which use real-world data will generally be more accurate and reliable than models which use simulated data. There are so many sporting events which simply could not be simulated because they are so surprising and magical: Greece beating Portugal in the final to be crowned European champions in 2004, when the competition was held in Portugal!

To build on this project, I would like to try and simulate a football match from both teams’ perspectives. In the current simulated data set you only see the stats of one team per match played.

How To Get The Repository on Your Machine

Using your browser navigate to the repository:
https://github.com/kmcd14/ProgrammingDA-project.
Under clone, copy the repository address, as seen in the above picture, using either SSH or HTTPS
Open your terminal.
Navigate to the location where you want to store the cloned directory.

In the terminal type the command:

$ git clone git@github.com:kmcd14/ProgrammingDA-project.git

Press enter. The cloned repository is now on your machine.

Running Jupyter Notebook

The easiest way to run the notebooks is with Python installed via the Anaconda distribution. Anaconda is one of the most widely used Python distributions in data science, as it comes preloaded with many of the most popular packages and tools. You can find out more about Anaconda and how to install it at https://docs.anaconda.com/.

You can forgo downloading Anaconda and install each package individually with pip from the command line. A full list of requirements for each notebook can be found in the requirements.txt file in this repository. Full details and links to each package used can be found further down in this README.
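
Before launching the notebook, it can be useful to check which packages are already available. The snippet below is a minimal sketch using only the Python standard library; the list of package names is assumed from the Libraries Used section of this README rather than read from requirements.txt.

```python
from importlib.util import find_spec

# Assumed list, based on the libraries named in this README.
REQUIRED = ["numpy", "pandas", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are installed.")
```

Any name reported as missing can then be installed with pip.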

Additionally, if you wish to view the notebook without having to install additional requirements, please click on the following badges to be redirected in your browser.

data_sim.ipynb

Opening and Running The Notebook

From the command line navigate to the folder you have cloned the repository to.
Type jupyter lab or jupyter notebook into the command line and press enter to launch the Jupyter interface.

In the side panel you will see all files in the repository as seen in the above image.
Click on data_sim to open the notebook.
To run the code in a cell, hold down the shift key and press enter. To run every cell, click Kernel in the top toolbar and choose the option to restart and run all cells.
Press the Esc key to switch from edit mode to command mode, and press Enter to switch back to edit mode.
When you have finished, shut down the kernel via File > Shut Down in the browser, close the browser and press Ctrl + C on the command line to terminate the programme.

Note:

If the Jupyter interface doesn't automatically open in your browser, try specifying the browser, e.g.:

  jupyter lab --browser=chrome

Jupyter Notebook has a full troubleshooting guide which can be found at:

https://jupyter-notebook.readthedocs.io/en/stable/troubleshooting.html

Technologies Used:

Google Docs: an online word processor used to write my documentation before transferring into this README file.

https://www.google.com/docs/about/

Anaconda: the easiest way to perform Python data science and machine learning on Windows, Linux and macOS. This script was created using Version 4.9.2. https://www.anaconda.com/distribution/

Python: an interpreted, object-oriented, high-level programming language with dynamic semantics. This script was created using Version 3.8.5. https://www.python.org/

GitHub: a code hosting platform for collaboration and version control. https://github.com/

Jupyter Lab/Jupyter Notebook: a web-based interactive development environment for Jupyter notebooks, code, and data. https://jupyter.org/

NBViewer: a web application which lets you enter the URL of a Jupyter Notebook file, renders that notebook as a static HTML web page, and gives you a stable link to that page which you can share with others. https://nbviewer.org/

Libraries Used:

Python has a vast and continuously growing collection of libraries to choose from, such as NumPy and Pandas, which makes it well suited to data analysis. It is a robust, flexible and efficient language which provides many solutions and avenues to approach and solve problems.

A full list of each notebook’s requirements can be found in the requirements.txt file in the project repository.

NumPy is a Python library used for working with arrays. Its core object is the ndarray. NumPy arrays are faster and more efficient than Python lists for numerical work, because each array is stored in one contiguous block of memory and so can be accessed and manipulated quickly. http://www.numpy.org/
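
To make the difference from plain lists concrete, here is a short sketch. The goal counts are made-up values used purely for illustration.

```python
import numpy as np

# Made-up goal counts, used purely for illustration.
goals = [2, 0, 1, 3, 1]

# With a plain Python list, each element is handled in an explicit loop.
doubled_list = [g * 2 for g in goals]

# With an ndarray, the same operation is applied to every element at once.
goals_array = np.array(goals)
doubled_array = goals_array * 2

print(type(goals_array))     # the ndarray type
print(doubled_array)
print(goals_array.mean())    # summary statistics are built in
```

Note that multiplying a plain list by 2 would repeat the list rather than double its values, which is one reason arrays are more convenient for numerical work.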

Matplotlib is a Python library used to create plots, graphs, charts etc. https://matplotlib.org/

Pandas is a data manipulation tool built on NumPy. Its key structure is the DataFrame. You can think of a DataFrame as a spreadsheet or table, but DataFrames are more efficient and powerful and integrate closely with Python and NumPy. Pandas allows us to select specific rows and columns within the DataFrame. https://pandas.pydata.org/
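
The sketch below shows a few common ways of selecting rows and columns. The column names and values are hypothetical and are not taken from the project's data set.

```python
import pandas as pd

# Hypothetical match data, made up for this example.
matches = pd.DataFrame(
    {
        "venue": ["home", "away", "home", "away"],
        "goals_for": [2, 0, 1, 1],
        "goals_against": [1, 1, 1, 3],
    }
)

# Select a single column (returns a Series).
goals_for = matches["goals_for"]

# Select rows with a boolean condition.
home_matches = matches[matches["venue"] == "home"]

# Select rows and columns by label; .loc slices include the end label.
first_two = matches.loc[0:1, ["venue", "goals_for"]]

print(home_matches)
print(first_two)
```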

If your system does not have these libraries installed, enter the command below from the command line:

    $ pip install <library name>

Credits:

The study Home advantage during the COVID-19 pandemic: Analyses of European football leagues by Dane McCarrick, Merim Bilalic, Nick Neave and Sandy Wolfson. If it weren't for their investigation into matches played during Covid-19 restrictions, and after supporters were allowed back into stadiums, there wouldn't have been data on the phenomenon of home advantage in football which could support the Home Advantage theory at the highest levels.

Contact:

katieisanimdom@gmail.com
