replacing templates for projects

On branch main modified: Projects/project2.qmd modified: Projects/project3.qmd modified: Projects/project4.qmd modified: Projects/project5.qmd
1Ramirez7 · Mar 30, 2024 · efea14a
1 parent 130957d
commit efea14a
Showing 4 changed files with 976 additions and 12 deletions.
diff --git a/Projects/project2.qmd b/Projects/project2.qmd
@@ -1,14 +1,14 @@
 ---
-title: "Client Report - [Insert Project Title]"
+title: "Client Report - JSON & Missing"
 subtitle: "Course DS 250"
-author: "[STUDENT NAME]"
+author: "Eduardo Ramirez"
 format:
   html:
     self-contained: true
     page-layout: full
     title-block-banner: true
     toc: true
-    toc-depth: 3
+    toc-depth: 5
     toc-location: body
     number-sections: false
     html-math-method: katex
@@ -25,4 +25,273 @@ execute:
 
 ---
 
-### Paste in a template
+
+## Airport Delays Analysis: Metrics and Weather Impact
+
+_paste your elevator pitch here_
+_A SHORT (4-5 SENTENCES) PARAGRAPH THAT `DESCRIBES KEY INSIGHTS` TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS._
+
+```{python}
+#| label: project data
+#| code-summary: Read and format project data
+# Include and execute your code here
+
+# df = pd.read_csv("C://Users//eduar//OneDrive - BYU-Idaho//2024 Winter//250 DS//flights_missing.json")
+# print(df.columns)
+
+
+```
+
+
+## 1. Standardizing Data Types: Handling Missing
+
+__Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”).__ _In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value._
+
+_type your results and analysis here_
+
+```{python}
+#| label: Q1
+#| code-summary: Read and format data
+# Include and execute your code here
+
+import pandas as pd
+import numpy as np
+import plotly.express as px
+
+# Load data 
+file_path = "C://Users//eduar//OneDrive - BYU-Idaho//2024 Winter//250 DS//flights_missing.csv"
+flights_df = pd.read_csv(file_path)
+
+# Replace the missing-value markers with NaN
+flights_df.replace(r'^\s*$', np.nan, regex=True, inplace=True)  # empty or whitespace-only strings
+flights_df.replace(-999, np.nan, inplace=True)  # -999 sentinel values
+
+# Row with a NaN value
+row_with_nan = flights_df[flights_df.isna().any(axis=1)].head(1)
+
+# Display row in JSON format
+json_output = row_with_nan.to_json(orient='records', default_handler=str)
+json_output = json_output.replace('null', 'NaN')  # to_json writes NaN as null; this also rewrites any 'null' inside string values
+print(json_output)
+
+
+
+```
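The string replacement in the Q1 cell works on the serialized text, so it would also change the word "null" inside any string value. A minimal sketch of an alternative, assuming the standard-library `json` module is acceptable here: `json.dumps` writes float NaN as the bare token `NaN` by default. The two-row `sample` frame and its values are hypothetical stand-ins for the flights data.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical two-row sample standing in for the flights data.
sample = pd.DataFrame({
    "airport_code": ["ATL", "DEN"],
    "minutes_delayed_carrier": [np.nan, 120.0],
})

# Pick the first row that contains at least one missing value.
row_with_nan = sample[sample.isna().any(axis=1)].head(1)

# json.dumps emits float NaN as the token NaN (allow_nan=True is the default).
json_output = json.dumps(row_with_nan.to_dict(orient="records"))
print(json_output)
```

Note that `NaN` is not valid strict JSON, so this output suits display in the report rather than exchange with strict parsers.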
+
+
+
+## 2. Analyzing Airport Delay Performance
+
+__Which airport has the worst delays?__ _Discuss the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours._
+
+_To determine which airport has the worst delays, this analysis uses the proportion of delayed flights as the primary metric because it indicates the likelihood of a flight being delayed regardless of airport size. An airport with a higher proportion of delays is generally more problematic for travelers since a larger percentage of flights do not depart on time. Here's a summary table:_
+
+```{python}
+#| label: Q2
+#| code-summary: Read and format data
+# Include and execute your code here
+
+import pandas as pd
+import numpy as np
+
+# file path 
+file_path = "C://Users//eduar//OneDrive - BYU-Idaho//2024 Winter//250 DS//flights_missing.csv"
+flights_df = pd.read_csv(file_path)
+
+# Replace missing values with NaN
+flights_df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
+flights_df.replace(-999, np.nan, inplace=True)
+flights_df = flights_df.applymap(lambda x: np.nan if isinstance(x, str) and '+' in x else x)
+
+# Convert delay columns to numeric
+delay_columns = ['num_of_delays_carrier', 'num_of_delays_late_aircraft', 
+                 'num_of_delays_nas', 'num_of_delays_security', 'num_of_delays_weather']
+flights_df[delay_columns] = flights_df[delay_columns].apply(pd.to_numeric, errors='coerce')
+
+# Sum the delay counts across categories for each row (each row aggregates many flights)
+flights_df = flights_df.assign(total_delays_per_flight = flights_df[delay_columns].sum(axis=1, min_count=1))
+
+# Proportion and average delay time calculations
+summary_df = (flights_df.groupby('airport_code')
+              .agg(total_flights=('num_of_flights_total', 'sum'),
+                   total_delayed_flights=('total_delays_per_flight', 'sum'),
+                   total_minutes_delayed=('minutes_delayed_total', 'sum'))
+              .assign(proportion_delayed=lambda x: x.total_delayed_flights / x.total_flights,
+                      average_delay_time_hours=lambda x: x.total_minutes_delayed / x.total_delayed_flights / 60)
+              .sort_values(by='average_delay_time_hours', ascending=False)
+              .reset_index())
+
+# Display
+summary_df.index = summary_df.index + 1
+summary_df
+
+```
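The same cleaning steps are repeated in several cells. One way to keep them consistent, sketched here under the assumption that blank strings, `-999`, and strings containing `+` (such as `1500+`) are the only missing-value markers, is a single helper. The name `clean_missing` and the `sample` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with blank strings, -999, and values like '1500+' set to NaN."""
    out = df.replace(r"^\s*$", np.nan, regex=True)  # empty or whitespace-only strings
    out = out.replace(-999, np.nan)                 # -999 sentinel values
    for col in out.columns:
        # Only text columns can hold markers such as '1500+'.
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].map(
                lambda x: np.nan if isinstance(x, str) and "+" in x else x
            )
    return out


# Hypothetical sample with one marker of each kind.
sample = pd.DataFrame({
    "airport_code": ["ATL", "", "SFO"],
    "num_of_delays_carrier": ["1500+", "12", "7"],
    "num_of_delays_late_aircraft": [-999, 30, 45],
})
cleaned = clean_missing(sample)
cleaned["num_of_delays_carrier"] = pd.to_numeric(
    cleaned["num_of_delays_carrier"], errors="coerce"
)
print(cleaned)
```

Calling the helper once right after loading the file would let every later cell start from the same cleaned frame.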
+
+_Considering this data, SFO has the highest proportion of delayed flights at 26.09%, indicating that it has the worst delays among these airports. Despite not having the longest average delay time, the higher chance of any given flight being delayed at SFO makes it more likely for travelers to experience a delay there._
+
+
+## 3. Optimal Flight Months: Delay Analysis
+
+__What is the best month to fly if you want to avoid delays of any length?__ _Discuss the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)_
+
+_To identify the best month to fly to avoid delays, the metric used is the proportion of flights delayed in each month (total delayed flights divided by total flights), labeled "Average Delay per Flight" in the chart. Because the question concerns delays of any length, this metric measures how likely a flight is to be delayed rather than how long the delay lasts. In the chart below, September has the lowest value, making it the best month to fly to minimize the risk of experiencing a delay._
+
+
+
+
+```{python}
+#| label: Q3
+#| code-summary: Read and format data
+# Include and execute your code here
+
+import pandas as pd
+import numpy as np
+import plotly.express as px
+
+# File path
+file_path = "C://Users//eduar//OneDrive - BYU-Idaho//2024 Winter//250 DS//flights_missing.csv"
+flights_df = pd.read_csv(file_path)
+
+# Replace missing values with NaN. Reusing the frame from the earlier cell gave different numbers, so the cleaning is repeated here.
+flights_df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
+flights_df.replace(-999, np.nan, inplace=True)
+
+# Replace '1500+' with NaN 
+for column in flights_df.columns:
+    if flights_df[column].dtype == object:
+        flights_df[column] = flights_df[column].map(lambda x: np.nan if isinstance(x, str) and '+' in x else x)
+
+# Name of the month column; drop rows where the month is missing, as the question requires
+month_column = 'month'
+flights_df.dropna(subset=[month_column], inplace=True)
+
+# Convert delay columns to numeric
+delay_columns = ['num_of_delays_carrier', 'num_of_delays_late_aircraft', 
+                 'num_of_delays_nas', 'num_of_delays_security', 'num_of_delays_weather']
+flights_df[delay_columns] = flights_df[delay_columns].apply(pd.to_numeric, errors='coerce')
+
+# Sum the delay counts across categories for each row (each row aggregates many flights)
+flights_df['total_delays_per_flight'] = flights_df[delay_columns].sum(axis=1, min_count=1)
+
+# Proportion of flights delayed in each month (delayed flights / total flights)
+monthly_delay = (flights_df.groupby(month_column)
+                 .agg(total_flights=('num_of_flights_total', 'sum'),
+                      total_delayed_flights=('total_delays_per_flight', 'sum'))
+                 .assign(average_delay_per_flight=lambda x: x.total_delayed_flights / x.total_flights)
+                 .reset_index())
+
+# Sort by month
+monthly_delay.sort_values(month_column, inplace=True)
+
+# Plotting
+fig = px.bar(monthly_delay, x=month_column, y='average_delay_per_flight', 
+             labels={'average_delay_per_flight': 'Average Delay per Flight'},
+             title='Average Flight Delay per Month')
+fig.show()
+
+
+
+```
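Sorting a text month column orders it alphabetically, not by calendar. A minimal sketch of calendar ordering, assuming the months are stored as English month names spelled exactly as in `MONTH_ORDER` (any value that does not match becomes NaN, and numeric months need no such step). The frame and its proportions are hypothetical.

```python
import pandas as pd

# Assumption: month values are English month names spelled exactly like this.
MONTH_ORDER = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Hypothetical monthly summary, deliberately out of calendar order.
monthly = pd.DataFrame({
    "month": ["March", "January", "February"],
    "proportion_delayed": [0.21, 0.19, 0.20],
})

# An ordered categorical makes sort_values follow the calendar.
monthly["month"] = pd.Categorical(
    monthly["month"], categories=MONTH_ORDER, ordered=True
)
monthly = monthly.sort_values("month").reset_index(drop=True)
print(monthly)
```

With Plotly Express, passing `category_orders={"month": MONTH_ORDER}` to `px.bar` is another way to fix the x-axis order.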
+
+
+
+
+## 4. Total Flight Weather Delay Analysis
+
+_According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories._ __Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild).__ _You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations: a. 100% of delayed flights in the Weather category are due to weather. b. 30% of all delayed flights in the Late-Arriving category are due to weather. c. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%._
+
+
+_The table presents data on flight delays across four categories: 'num_of_delays_late_aircraft', 'num_of_delays_weather', 'num_of_delays_nas', and 'weather_delays'. The 'num_of_delays_weather' column represents severe weather delays, as per BTS's categorization. These are delays strictly due to significant weather events. The 'num_of_delays_late_aircraft' and 'num_of_delays_nas' columns include some delays caused by mild weather, as these are not categorized under 'Weather' by BTS. The 'num_of_delays_late_aircraft' column likely includes delays where an arriving aircraft was late due to mild weather conditions at either the previous airport or en route._
+
+```{python}
+#| label: Q4
+#| code-summary: Read and format data
+# Include and execute your code here
+
+import pandas as pd
+import numpy as np
+
+# data path
+file_path = "C://Users//eduar//OneDrive - BYU-Idaho//2024 Winter//250 DS//flights_missing.csv"
+flights_df = pd.read_csv(file_path)
+
+
+# Replace missing values with NaN
+flights_df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
+flights_df.replace(-999, np.nan, inplace=True)
+
+# Replace '1500+' with NaN 
+for column in flights_df.columns:
+    if flights_df[column].dtype == object:
+        flights_df[column] = flights_df[column].map(lambda x: np.nan if isinstance(x, str) and '+' in x else x)
+
+# Convert 'month' column to numeric
+flights_df['month'] = pd.to_numeric(flights_df['month'], errors='coerce')
+
+# Replace missing values in Late-Arriving Aircraft with the mean
+num_of_delays_late_aircraft_col = 'num_of_delays_late_aircraft'
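+# Assign the result back instead of filling in place on a column selection, which may not update flights_df
+flights_df[num_of_delays_late_aircraft_col] = flights_df[num_of_delays_late_aircraft_col].fillna(flights_df[num_of_delays_late_aircraft_col].mean())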
+
+# Calculate weather-related delays
+flights_df['weather_delays'] = flights_df['num_of_delays_weather']  # 100% of delays in Weather category
+
+# 30% of Late-Arriving Aircraft delays due to weather
+flights_df['weather_delays'] += flights_df[num_of_delays_late_aircraft_col] * 0.3
+
+# NAS category weather delay calculation
+nas_col = 'num_of_delays_nas'
+flights_df['weather_delays'] += np.where(
+    flights_df['month'].between(4, 8),  # April to August
+    flights_df[nas_col] * 0.4,  # 40% of NAS delays due to weather
+    flights_df[nas_col] * 0.65  # Other months, 65% of NAS delays due to weather
+)
+
+# Display the first 5 rows
+flights_df.head()
+
+
+
+```
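The three rules can be checked by hand on a small case. This is a sketch with hypothetical counts, assuming (as the Q4 cell does) that `month` is numeric: with 10 weather delays, 100 late-aircraft delays, and 200 NAS delays, June gives 10 + 30 + 80 = 120 and December gives 10 + 30 + 130 = 170.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: identical delays in June (month 6) and December (month 12).
sample = pd.DataFrame({
    "month": [6, 12],
    "num_of_delays_weather": [10, 10],
    "num_of_delays_late_aircraft": [100.0, 100.0],
    "num_of_delays_nas": [200, 200],
})

# Rule c: 40% of NAS delays from April to August, 65% in all other months.
nas_share = np.where(sample["month"].between(4, 8), 0.40, 0.65)

sample["weather_delays"] = (
    sample["num_of_delays_weather"]                 # rule a: 100% of Weather
    + 0.30 * sample["num_of_delays_late_aircraft"]  # rule b: 30% of Late-Arriving
    + nas_share * sample["num_of_delays_nas"]       # rule c: seasonal NAS share
)
print(sample[["month", "weather_delays"]])
```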
+
+
+
+
+## 5. Flight Delay Weather Analysis
+
+__Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Discuss what you learn from this graph.__
+
+
+
+```{python}
+#| label: Q5
+#| code-summary: Read and format data
+# Include and execute your code here
+
+import pandas as pd
+import numpy as np
+import plotly.express as px
+
+
+# Calculate the total number of flights and weather-related delays for each airport
+airport_delays = flights_df.groupby('airport_code').agg(
+    total_flights=('num_of_flights_total', 'sum'),
+    weather_delays=('weather_delays', 'sum')
+)
+
+# Calculate the proportion of weather-related delays
+airport_delays['weather_delay_proportion'] = airport_delays['weather_delays'] / airport_delays['total_flights']
+
+# Sort by proportion of weather-related delays
+airport_delays = airport_delays.sort_values(by='weather_delay_proportion', ascending=False).reset_index()
+
+# Create a barplot
+fig = px.bar(airport_delays, x='airport_code', y='weather_delay_proportion',
+             title='Proportion of All Flights Delayed by Weather at Each Airport',
+             labels={'weather_delay_proportion': 'Proportion of Weather Delays', 'airport_code': 'Airport Code'})
+fig.show()
+
+
+
+```
+
diff --git a/Projects/project3.qmd b/Projects/project3.qmd
@@ -25,4 +25,3 @@ execute:
 
 ---
 
-### Paste in a template