
Commit

replacing templates for projects
On branch main
modified:   Projects/project2.qmd
modified:   Projects/project3.qmd
modified:   Projects/project4.qmd
modified:   Projects/project5.qmd
1Ramirez7 committed Mar 30, 2024
1 parent 130957d commit efea14a
Showing 4 changed files with 976 additions and 12 deletions.
277 changes: 273 additions & 4 deletions Projects/project2.qmd
@@ -1,14 +1,14 @@
---
-title: "Client Report - [Insert Project Title]"
+title: "Client Report - JSON & Missing"
subtitle: "Course DS 250"
-author: "[STUDENT NAME]"
+author: "Eduardo Ramirez"
format:
  html:
    self-contained: true
    page-layout: full
    title-block-banner: true
    toc: true
-    toc-depth: 3
+    toc-depth: 5
    toc-location: body
    number-sections: false
    html-math-method: katex
@@ -25,4 +25,273 @@ execute:

---

-### Paste in a template

## Airport Delays Analysis: Metrics and Weather Impact

_paste your elevator pitch here_
_A SHORT (4-5 SENTENCES) PARAGRAPH THAT `DESCRIBES KEY INSIGHTS` TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS._

```{python}
#| label: project data
#| code-summary: Read and format project data
# Include and execute your code here
# df = pd.read_csv("C://Users//eduar//OneDrive - BYU-Idaho//2024 Winter//250 DS//flights_missing.json")
# print(df.columns)
```


## 1. Standardizing Data Types: Handling Missing

__Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”).__ _In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value._

_type your results and analysis here_

```{python}
#| label: Q1
#| code-summary: Read and format data
# Include and execute your code here
import pandas as pd
import numpy as np
import plotly.express as px
# Load data
file_path = "C://Users//eduar//OneDrive - BYU-Idaho//2024 Winter//250 DS//flights_missing.csv"
flights_df = pd.read_csv(file_path)
# Replace missing values with NaN
flights_df.fillna(value=np.nan, inplace=True) # Null values
flights_df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
flights_df.replace(-999, np.nan, inplace=True)
# Row with a NaN value
row_with_nan = flights_df[flights_df.isna().any(axis=1)].head(1)
# Display row in JSON format
json_output = row_with_nan.to_json(orient='records', default_handler=str)
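# to_json writes missing values as null; matching ':null' rewrites only those
# values and leaves any text that merely contains "null" untouched
json_output = json_output.replace(':null', ':NaN')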
print(json_output)
```
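The clean-up above can be checked on a small in-memory sample. The sketch below is only an illustration: the three rows are made up, and only the column names come from the report. It converts blank strings and the -999 placeholder to NaN, then prints one record through the standard `json` module, which writes `NaN` for float NaN values without any string replacement.

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical sample: column names follow the report, the values are invented.
raw_csv = io.StringIO(
    "airport_code,month,num_of_delays_late_aircraft\n"
    "ATL,January,1109\n"
    "DEN,,-999\n"
    "IAD, ,384\n"
)
sample_df = pd.read_csv(raw_csv)

# Blank or whitespace-only strings and the -999 placeholder all become NaN.
sample_df = sample_df.replace(r"^\s*$", np.nan, regex=True)
sample_df = sample_df.replace(-999, np.nan)

# Take the first row with a missing value and round-trip it through JSON.
first_missing = sample_df[sample_df.isna().any(axis=1)].head(1)
parsed = json.loads(first_missing.to_json(orient="records"))[0]

# to_json wrote null (parsed back as None); swap in float NaN before dumping.
record = {key: (float("nan") if value is None else value) for key, value in parsed.items()}
record_json = json.dumps(record)
print(record_json)
```

Note that `NaN` is accepted by Python's `json` module but is not part of strict JSON, so other parsers may reject the printed record.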



## 2. Analyzing Airport Delay Performance

__Which airport has the worst delays?__ _Discuss the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours._

_To determine which airport has the worst delays, this analysis uses the proportion of delayed flights as the primary metric because it indicates how likely a flight is to be delayed regardless of airport size. An airport with a higher proportion of delays is generally more problematic for travelers since a larger percentage of flights do not depart on time. Here's a summary table:_

```{python}
#| label: Q2
#| code-summary: Read and format data
# Include and execute your code here
import pandas as pd
import numpy as np
# file path
file_path = "C://Users//eduar//OneDrive - BYU-Idaho//2024 Winter//250 DS//flights_missing.csv"
flights_df = pd.read_csv(file_path)
# Replace missing values with NaN
flights_df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
flights_df.replace(-999, np.nan, inplace=True)
flights_df = flights_df.applymap(lambda x: np.nan if isinstance(x, str) and '+' in x else x)
# Convert delay columns to numeric
delay_columns = ['num_of_delays_carrier', 'num_of_delays_late_aircraft',
                 'num_of_delays_nas', 'num_of_delays_security', 'num_of_delays_weather']
flights_df[delay_columns] = flights_df[delay_columns].apply(pd.to_numeric, errors='coerce')
# Total delayed flights per row (all delay categories combined)
flights_df = flights_df.assign(total_delays_per_flight = flights_df[delay_columns].sum(axis=1, min_count=1))
# Proportion and average delay time calculations
summary_df = (flights_df.groupby('airport_code')
              .agg(total_flights=('num_of_flights_total', 'sum'),
                   total_delayed_flights=('total_delays_per_flight', 'sum'),
                   total_minutes_delayed=('minutes_delayed_total', 'sum'))
              .assign(proportion_delayed=lambda x: x.total_delayed_flights / x.total_flights,
                      average_delay_time_hours=lambda x: x.total_minutes_delayed / x.total_delayed_flights / 60)
              .sort_values(by='average_delay_time_hours', ascending=False)
              .reset_index())
# Display
summary_df.index = summary_df.index + 1
summary_df
```

_Considering this data, SFO has the highest proportion of delayed flights at 26.09%, indicating that it has the worst delays among these airports. Despite not having the longest average delay time, the higher chance of any given flight being delayed at SFO makes it more likely for travelers to experience a delay there._
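The two metrics in the summary table can rank airports differently, which is why the choice of metric matters. The sketch below uses two hypothetical airports (`AAA` and `BBB`, with invented numbers) and the same aggregation as the Q2 cell to show one airport leading on proportion of delayed flights while the other leads on average delay time.

```python
import pandas as pd

# Hypothetical monthly records for two made-up airports.
toy_df = pd.DataFrame({
    "airport_code": ["AAA", "AAA", "BBB", "BBB"],
    "num_of_flights_total": [1000, 1000, 200, 200],
    "total_delays_per_flight": [200, 300, 20, 20],
    "minutes_delayed_total": [12000, 18000, 3600, 3600],
})

summary = (
    toy_df.groupby("airport_code")
    .agg(
        total_flights=("num_of_flights_total", "sum"),
        total_delayed_flights=("total_delays_per_flight", "sum"),
        total_minutes_delayed=("minutes_delayed_total", "sum"),
    )
    .assign(
        # Share of all flights that were delayed.
        proportion_delayed=lambda x: x.total_delayed_flights / x.total_flights,
        # Mean length of a delay, converted from minutes to hours.
        average_delay_time_hours=lambda x: x.total_minutes_delayed / x.total_delayed_flights / 60,
    )
)

worst_by_proportion = summary["proportion_delayed"].idxmax()
worst_by_hours = summary["average_delay_time_hours"].idxmax()
print(summary)
print(worst_by_proportion, worst_by_hours)
```

Here `AAA` delays 25% of its flights for an average of 1 hour, while `BBB` delays only 10% but for an average of 3 hours, so each metric names a different "worst" airport.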


## 3. Optimal Flight Months: Delay Analysis

__What is the best month to fly if you want to avoid delays of any length?__ _Discuss the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)_

_To identify the best month to fly to avoid delays, the metric used is the "Average Delay per Flight," calculated as the total number of delayed flights divided by the total number of flights in each month. It measures how likely a flight is to be delayed in a given month rather than how long the delay lasts, which fits a question about delays of any length. In the chart below, September has the lowest value, making it the best month to fly to minimize the risk of experiencing a delay._




```{python}
#| label: Q3
#| code-summary: Read and format data
# Include and execute your code here
import pandas as pd
import numpy as np
import plotly.express as px
# File path
file_path = "C://Users//eduar//OneDrive - BYU-Idaho//2024 Winter//250 DS//flights_missing.csv"
flights_df = pd.read_csv(file_path)
# Replace missing values with NaN (repeated here because this cell reloads the CSV)
flights_df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
flights_df.replace(-999, np.nan, inplace=True)
# Replace '1500+' with NaN
for column in flights_df.columns:
    if flights_df[column].dtype == object:
        flights_df[column] = flights_df[column].map(lambda x: np.nan if isinstance(x, str) and '+' in x else x)
# define
month_column = 'month'
# Remove rows that are missing the month
flights_df.dropna(subset=[month_column], inplace=True)
# Convert delay columns to numeric
delay_columns = ['num_of_delays_carrier', 'num_of_delays_late_aircraft',
                 'num_of_delays_nas', 'num_of_delays_security', 'num_of_delays_weather']
flights_df[delay_columns] = flights_df[delay_columns].apply(pd.to_numeric, errors='coerce')
# Sum delays for each row = total delayed flights in that row
flights_df['total_delays_per_flight'] = flights_df[delay_columns].sum(axis=1, min_count=1)
# Delayed flights divided by total flights for each month
monthly_delay = (flights_df.groupby(month_column)
                 .agg(total_flights=('num_of_flights_total', 'sum'),
                      total_delayed_flights=('total_delays_per_flight', 'sum'))
                 .assign(average_delay_per_flight=lambda x: x.total_delayed_flights / x.total_flights)
                 .reset_index())
# Sort by month (this is alphabetical if the column holds month names)
monthly_delay.sort_values(month_column, inplace=True)
# Plotting
fig = px.bar(monthly_delay, x=month_column, y='average_delay_per_flight',
             labels={'average_delay_per_flight': 'Average Delay per Flight'},
             title='Average Flight Delay per Month')
fig.show()
```
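The prompt asks for the x-axis to be ordered by month. If the `month` column holds month names rather than numbers (an assumption, since the data itself is not shown here), a plain sort is alphabetical. One possible approach, sketched below on made-up rows, is an ordered categorical; `MONTH_ORDER` and the toy values are hypothetical.

```python
import pandas as pd

# Calendar order used for the categorical; spelled out to avoid locale surprises.
MONTH_ORDER = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Hypothetical rows, including one with a missing month.
toy = pd.DataFrame({
    "month": ["March", "January", None, "February", "January"],
    "num_of_flights_total": [100, 200, 50, 100, 200],
    "total_delays_per_flight": [30, 20, 10, 10, 40],
})

# Drop rows without a month, then give the column an explicit calendar order.
toy = toy.dropna(subset=["month"]).assign(
    month=lambda d: pd.Categorical(d["month"], categories=MONTH_ORDER, ordered=True)
)

monthly = (
    toy.groupby("month", observed=True)
    .agg(
        total_flights=("num_of_flights_total", "sum"),
        total_delayed_flights=("total_delays_per_flight", "sum"),
    )
    .assign(proportion_delayed=lambda x: x.total_delayed_flights / x.total_flights)
    .reset_index()
    .sort_values("month")
)

best_month = monthly.loc[monthly["proportion_delayed"].idxmin(), "month"]
print(monthly)
print(best_month)
```

Any spelling that is not in `MONTH_ORDER` becomes NaN in the categorical, so it is worth checking the column's unique values first. When plotting with Plotly Express, passing `category_orders={"month": MONTH_ORDER}` to `px.bar` is another way to fix the axis order.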




## 4. Total Flight Weather Delay Analysis

_According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories._ __Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild).__ _You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations: a. 100% of delayed flights in the Weather category are due to weather. b. 30% of all delayed flights in the Late-Arriving category are due to weather. c. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%._


_The table presents data on flight delays in four columns: 'num_of_delays_late_aircraft', 'num_of_delays_weather', 'num_of_delays_nas', and 'weather_delays'. The 'num_of_delays_weather' column represents severe weather delays, as per BTS's categorization. These are delays strictly due to significant weather events. The 'num_of_delays_late_aircraft' and 'num_of_delays_nas' columns include some delays caused by mild weather, as these are not categorized under 'Weather' by BTS. The 'num_of_delays_late_aircraft' column likely includes delays where an arriving aircraft was late due to mild weather conditions at either the previous airport or en route. The 'weather_delays' column is the new total: all severe weather delays plus the weather-related share of the other two columns._

```{python}
#| label: Q4
#| code-summary: Read and format data
# Include and execute your code here
import pandas as pd
import numpy as np
# data path
file_path = "C://Users//eduar//OneDrive - BYU-Idaho//2024 Winter//250 DS//flights_missing.csv"
flights_df = pd.read_csv(file_path)
# Replace missing values with NaN
flights_df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
flights_df.replace(-999, np.nan, inplace=True)
# Replace '1500+' with NaN
for column in flights_df.columns:
    if flights_df[column].dtype == object:
        flights_df[column] = flights_df[column].map(lambda x: np.nan if isinstance(x, str) and '+' in x else x)
# Convert 'month' column to numeric (non-numeric values such as month names
# become NaN here and then fall into the 65% branch below)
flights_df['month'] = pd.to_numeric(flights_df['month'], errors='coerce')
# Replace missing values in Late-Arriving Aircraft with the mean
num_of_delays_late_aircraft_col = 'num_of_delays_late_aircraft'
late_aircraft_mean = flights_df[num_of_delays_late_aircraft_col].mean()
flights_df[num_of_delays_late_aircraft_col] = flights_df[num_of_delays_late_aircraft_col].fillna(late_aircraft_mean)
# Calculate weather-related delays
flights_df['weather_delays'] = flights_df['num_of_delays_weather'] # 100% of delays in Weather category
# 30% of Late-Arriving Aircraft delays due to weather
flights_df['weather_delays'] += flights_df[num_of_delays_late_aircraft_col] * 0.3
# NAS category weather delay calculation
nas_col = 'num_of_delays_nas'
flights_df['weather_delays'] += np.where(
    flights_df['month'].between(4, 8), # April to August
    flights_df[nas_col] * 0.4, # 40% of NAS delays due to weather
    flights_df[nas_col] * 0.65 # Other months, 65% of NAS delays due to weather
)
# Display the first 5 rows of the columns used in the calculation
flights_df[[num_of_delays_late_aircraft_col, 'num_of_delays_weather', nas_col, 'weather_delays']].head()
```
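The three rules can be verified by hand on two made-up rows. The sketch below assumes, hypothetically, that `month` holds month names; in that case `pd.to_numeric(..., errors='coerce')` turns every value into NaN and the April to August test is never true, so the names are mapped to numbers first. `MONTH_NUMBER` and the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Hypothetical lookup from month name to month number (1-12).
MONTH_NUMBER = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Two invented rows with identical counts, one inside and one outside April-August.
toy = pd.DataFrame({
    "month": ["May", "December"],
    "num_of_delays_weather": [10, 10],
    "num_of_delays_late_aircraft": [100, 100],
    "num_of_delays_nas": [50, 50],
})

# Coercing names directly to numbers loses the month entirely.
coerced = pd.to_numeric(toy["month"], errors="coerce")

# Mapping names to numbers keeps the seasonal rule usable.
month_num = toy["month"].map(MONTH_NUMBER)
nas_share = np.where(month_num.between(4, 8), 0.40, 0.65)

toy["weather_delays"] = (
    toy["num_of_delays_weather"]                 # rule a: 100% of Weather delays
    + 0.30 * toy["num_of_delays_late_aircraft"]  # rule b: 30% of Late-Arriving delays
    + nas_share * toy["num_of_delays_nas"]       # rule c: 40% or 65% of NAS delays
)
print(toy)
```

For May the total is 10 + 30 + 20 = 60, and for December it is 10 + 30 + 32.5 = 72.5. If the column already holds month numbers, the mapping step is unnecessary.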




## 5. Flight Delay Weather Analysis

__Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Discuss what you learn from this graph.__



```{python}
#| label: Q5
#| code-summary: Read and format data
# Include and execute your code here
import pandas as pd
import numpy as np
import plotly.express as px
# Calculate the total number of flights and weather-related delays for each airport
airport_delays = flights_df.groupby('airport_code').agg(
    total_flights=('num_of_flights_total', 'sum'),
    weather_delays=('weather_delays', 'sum')
)
# Calculate the proportion of weather-related delays
airport_delays['weather_delay_proportion'] = airport_delays['weather_delays'] / airport_delays['total_flights']
# Sort by proportion of weather-related delays
airport_delays = airport_delays.sort_values(by='weather_delay_proportion', ascending=False).reset_index()
# Create a barplot
fig = px.bar(airport_delays, x='airport_code', y='weather_delay_proportion',
             title='Proportion of All Flights Delayed by Weather at Each Airport',
             labels={'weather_delay_proportion': 'Proportion of Weather Delays', 'airport_code': 'Airport Code'})
fig.show()
```

1 change: 0 additions & 1 deletion Projects/project3.qmd
@@ -25,4 +25,3 @@ execute:

---

-### Paste in a template