For years, Olympics have been held around the world, and participants from around the world compete to succeed in different sports. We wanted to work with this data because we want to look for answers to the questions about which countries are superior in which branches, which countries are leading women or men in which sports and bringing medals to their country. We choose a dataset which is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. Dataset contains 271116 observations and 15 variables. ID, number, gender, age, height, weight, Team, NOC, games, Year, season, city, sport, event and medal parameters are used in this dataset as variables.
This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. The athlete events dataset contains information about Olympic Athletes and events that they have competed in, including biological data (Age, Sex, Height, Weight etc.) and event data (Year, Season, City, Sport etc.). The NOC (National Organizing Committee) regions dataset possesses information about the countries that compete in the Olympics, including the country name and any notes about said country.
The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event.
The Athletes dataset contains 15 variables:
- ID: A number used as a unique identifier for each athlete
- Name: The athlete’s name(s) in the form of First Middle Last where available
- Sex: The athlete’s gender; one of M or F 4 Age: The athlete’s age in years
- Height: The athlete’s height in centimeters (cm)
- Weight: The athlete’s weight in kilograms (kg)
- Team: The name of the team that the athlete competed for
- NOC: The National Organizing Committee’s 3-letter code
- Games: The year and season of the Olympics the athlete competed in in the format YYYY Season
- Year: The year of the Olympics that the athlete competed in
- Season: The season of the Olympics that the athlete competed in
- City: The city that hosted the Olympics that the athlete competed in
- Sport: The sport that the athlete competed in
- Event: The event that the athlete competed in
- Medal: The medal won by the athlete; one of Gold, Silver, or Bronze. NA if no medal was won.
Dataset Link: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results#athlete_events.csv
getwd()
## [1] "/home/mertdgn/Desktop/R_project"
setwd("/home/mertdgn/Desktop/R_project")
a <- read.csv("athlete_events.csv")
summary(a)
## ID Name Sex Age
## Min. : 1 Robert Tait McKenzie : 58 F: 74522 Min. :10.00
## 1st Qu.: 34643 Heikki Ilmari Savolainen: 39 M:196594 1st Qu.:21.00
## Median : 68205 Joseph "Josy" Stoffel : 38 Median :24.00
## Mean : 68249 Ioannis Theofilakis : 36 Mean :25.56
## 3rd Qu.:102097 Takashi Ono : 33 3rd Qu.:28.00
## Max. :135571 Alexandros Theofilakis : 32 Max. :97.00
## (Other) :270880 NA's :9474
## Height Weight Team NOC
## Min. :127.0 Min. : 25.0 United States: 17847 USA : 18853
## 1st Qu.:168.0 1st Qu.: 60.0 France : 11988 FRA : 12758
## Median :175.0 Median : 70.0 Great Britain: 11404 GBR : 12256
## Mean :175.3 Mean : 70.7 Italy : 10260 ITA : 10715
## 3rd Qu.:183.0 3rd Qu.: 79.0 Germany : 9326 GER : 9830
## Max. :226.0 Max. :214.0 Canada : 9279 CAN : 9733
## NA's :60171 NA's :62875 (Other) :201012 (Other):196971
## Games Year Season City
## 2000 Summer: 13821 Min. :1896 Summer:222552 London : 22426
## 1996 Summer: 13780 1st Qu.:1960 Winter: 48564 Athina : 15556
## 2016 Summer: 13688 Median :1988 Sydney : 13821
## 2008 Summer: 13602 Mean :1978 Atlanta : 13780
## 2004 Summer: 13443 3rd Qu.:2002 Rio de Janeiro: 13688
## 1992 Summer: 12977 Max. :2016 Beijing : 13602
## (Other) :189805 (Other) :178243
## Sport Event
## Athletics : 38624 Football Men's Football : 5733
## Gymnastics: 26707 Ice Hockey Men's Ice Hockey : 4762
## Swimming : 23195 Hockey Men's Hockey : 3958
## Shooting : 11448 Water Polo Men's Water Polo : 3358
## Cycling : 10859 Basketball Men's Basketball : 3280
## Fencing : 10735 Cycling Men's Road Race, Individual: 2947
## (Other) :149548 (Other) :247078
## Medal
## Bronze: 13295
## Gold : 13372
## Silver: 13116
## NA's :231333
##
##
##
library(ggplot2)
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
- Medal
- Event
- Sport
- Season
- Games
- NOC
- Team
- Sex
- Name
So, there are 9 categorical variables
Handle all NA values except not medal column variables
na <- subset(a, !is.na(Height+Weight+Age)) #Medal haricinde NA'siz.
Change Na variables in the Medal to ‘No Medal’
na$Medal <- as.character(na$Medal)
na$Medal[is.na(na$Medal)] <- "No Medal"
In this section, we plot one variable plots by using information from the dataset
qplot(x=Age, data=na,
xlab="Age",
ylab="Number of People",
main = ("Distribution of Ages"),
geom="histogram",
binwidth=1,
fill=I("pink"),
col=I("blue"))+
scale_x_continuous(limits = c(10,71), breaks = seq(10,71,10))
## Warning: Removed 2 rows containing missing values (geom_bar).
In this plot, we want to see the distribution of ages by number of people. We choose histogram because we want to see clearly the age distribution from the plot by looking the columns. We can see that there is the highest number of athletes between the ages of 20-30. The 23-year-old athletes participate in the Olympics. There is no participation of athletes over the age of 56. In addition, the youngest athletes participating in the Olympics are 10 years old.
p <- ggplot(na, aes(x = `Sport`))+
geom_bar(color="darkblue", fill="lightblue")+
ggtitle("Distribution of Sport Branchs") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
In this plot, we want to see the distribution of sport branchs by number of the branch count. We choose histogram because we want to see clearly the sport branch distribution from the plot by looking the columns. In addition we use theme to change angle of the branches so we can see the branches clearly on the x-axis. We can see easily ‘Art Competitions’, ‘Gymnastics’ and ‘Swimming’ has the higher rate than others. In addition, basketball was expected to be higher, but less athletes participated in many sports. The reason why this is a different detail is that in many countries the ‘Biathlon’ sport is less well known than Basketball.
qplot(x=Sex, data=na,
xlab="Gender",
ylab="Number of People",
main = ("Distribution of Gender"),
fill=I("orange"),
col=I("purple"))+
scale_y_continuous(limits = c(0,145000), breaks = seq(0,145000,10000))
In this plot, we can observe the sex distribution of the participations. We choose histogram because we want to see clearly the sex distribution from the plot by looking the columns. As shown in the plot, the male competitor ratio is twice the female competitor ratio. It is very unfortunate that even in the field of sports branches such as the Olympics, we find women and men inequality.
qplot(x=Height, data=na,
xlab="Height",
ylab="Number of People",
main = ("Distribution of Height"),
geom="histogram",
binwidth=1,
fill=I("black"),
col=I("green"))+
scale_x_continuous(limits = c(127,226), breaks = seq(127,226,10))
## Warning: Removed 2 rows containing missing values (geom_bar).
We just want to for observe the distribution of the heights. The best way to do this is to plot a histogram
qplot(x=Weight, data=na,
xlab="Weight",
ylab="Number of People",
main = ("Distribution of Weight"),
geom="histogram",
binwidth=1,
fill=I("grey"),
col=I("red"))+
scale_x_continuous(limits = c(25,214), breaks = seq(25,214,10))
## Warning: Removed 2 rows containing missing values (geom_bar).
We just want to for observe the distribution of the weights. The best way to do this is to plot a histogram
In this section, we plot two variable plots by using information from the dataset
qplot(x=Medal, y=Age, data=na, main=('Distribution of Medal by Age'), geom='boxplot', color='purple')
In this plot, we see the medal types of the people who were entering the Olympics. We handle the NA values by changing its value to the ‘No Medal’. We can see that there is the highest mean of medal awards between the ages of 22-28. Absolutely it is normal, because younger people are be able to win medals than the older people. In addition, we can see there are outliers after 38 age. We use boxplot because we want to see median and outliers to analyse and easily see the medal distribution statistics by age.
ggplot(data = na) +
aes(x = Year) +
aes(y = `Sport`) +
geom_point() +
geom_smooth(method = "lm", se = T) +
scale_x_continuous(limits= c(1896,2016))+
scale_color_manual(values = c("black", "yellow")) +
labs(col = "") +
labs(title = "Distribution of Sports Types by Years") +
theme_bw()
In this plot, we want to observe sports branchs and their played rate distributed by years. As we can see, there are huge gap between 1936-1948. In these years there were no olympics games played. Because that, there are World War II in these years. In addition, there are also a gap between 1912-1920. In these years there were mostly no olympics games played. Because that, there are World War I in these years. Lacrosse game branch have played at once in the 1905. It is so interesting, when we observe the olympics games. And the tennis branch was not played for 46 years in the period of 1924-1980. In the end, we can see the olympics games play in every 4 years.
ftbl <- a %>%
filter(Sport == "Football") %>%
select(Name, Sex, Age, Team, NOC, Year, City, Event, Medal)
# Count Events, Nations, and Football competitions each year
counts_ftbl <- ftbl %>% filter(Team != "Unknown") %>%
group_by(Year) %>%
summarize(
Events = length(unique(Event)),
Nations = length(unique(Team)),
Footballs = length(unique(Name))
)
##############continues to plot################
# count number of medals awarded to each Team
medal_counts_ftbl <- ftbl %>% filter(!is.na(Medal))%>%
group_by(Team, Medal) %>%
summarize(Count=length(Medal))
#plot
ggplot(medal_counts_ftbl, aes(x=Team, y=Count, fill=Medal)) +
geom_col() +
coord_flip() +
scale_fill_manual(values=c("#CD7F32","#FFD700","#C0C0C0")) +
ggtitle("Historical medal counts from Football Competitions") +
theme(plot.title = element_text(hjust = 0.5))
In this plot, we can observe that the medal counts and types for each teams. As we can see, the most medal winners are USA, Germany, Brazil, SSCB, Yugoslavia. As a fact that the governments in these countries support. Olympic games especially football and as a natural result; these countries has more medals than other countries by the help of some politic reasons in their country. We choose histogram because we want to clearly see the medal distribution and also determine which medal type won by which country. We also flipped the histogram coordinates. Thus, we can easily analyze the medal distribution over the countries.
na %>%
group_by(Year, Season) %>%
summarise(NoOfCountries = length(unique(NOC))) %>%
ggplot(aes(x = Year, y = NoOfCountries, group = Season)) +
geom_line(aes(color = Season)) +
geom_point(aes(color = Season)) +
labs(x = "Year", y = "#of countries that participated", title = "#of countries that participated in the Olympics") +
theme_minimal()
In this plot, we can observe that the number of participants in the olympics have grown overtime. It is also obvious that the number of participants in the summer olympics are more than that of the winter olympics. As a natural result, in summer olympics there are a lot of participants than winter olympics. We choose scatter plot and lines because we want to clearly see the weather distribution with rates and also determine which years and the count of the countries decreasing or increasing by looking the graph. Thus, we can easily analyze the weather rate over the countries that participated.