From 0ca1bb884a2e3dcf179d31974b7d8988b4c42e18 Mon Sep 17 00:00:00 2001 From: ERamirez Date: Wed, 6 Nov 2024 19:05:43 -0700 Subject: [PATCH] testing --- .github/workflows/publish.yml | 1 + Projects/project3.qmd | 176 +++++++++++++++++++++++++++++++++- 2 files changed, 174 insertions(+), 3 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 1003aca..b4f3c56 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -43,6 +43,7 @@ jobs: any::tidyverse any::knitr any::DT + any::nycflights13 # added this for w2 assignment test - name: Render and Publish uses: quarto-dev/quarto-actions/publish@v2 diff --git a/Projects/project3.qmd b/Projects/project3.qmd index 65bde43..03b78ba 100644 --- a/Projects/project3.qmd +++ b/Projects/project3.qmd @@ -1,5 +1,5 @@ --- -title: "Client Report - Sample Name" +title: "Client Report - Finding Relationships in Baseball" subtitle: "Course DS 250" author: "Eduardo Ramirez" format: @@ -8,7 +8,7 @@ format: page-layout: full title-block-banner: true toc: true - toc-depth: 4 + toc-depth: 3 toc-location: body number-sections: false html-math-method: katex @@ -25,4 +25,174 @@ execute: --- -__Sample project coming soon__ + + +## Baseball Stats SQL Analysis + + +_In my SQL project, I explored three baseball queries. First, I analyzed the salaries of BYU-Idaho alumni in pro baseball, finding only two players with significant salary differences. Then, I examined batting averages, looking at players with varying numbers of at-bats, from just one in a year to over 100 in their careers, revealing insights from perfect averages to legends like Ty Cobb. Finally, I compared player salaries based on stats like home runs and hit-by-pitches, discovering that factors like doubles have a more significant impact on salaries than home runs._ + + + +## BYU-Idaho Players' Salaries Query + +__Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.__ + +_The SQL query I conducted revealed that only two baseball players from BYU-Idaho have made it to professional leagues, as indicated by the salary data. These players, identified as 'lindsma01' and Steph, show varied salary ranges, with Lindsman earning as high as $4,000,000 in 2014 for team CHA. In contrast, 'stephga01' had salaries ranging from $150,000 to $1,025,000. This data highlights the limited number of BYU-Idaho alumni in professional baseball and showcases the considerable salary differences between these two players._ + +```{python} +#| label: Q1 +#| code-summary: Read and format data +# Include and execute your code here + +import pandas as pd +import sqlite3 + + +# https://github.com/1Ramirez7/Portfolio_/blob/main/data/lahmansbaseballdb.sqlite +# https://raw.githubusercontent.com/1Ramirez7/Portfolio_/main/data/starwars1.csv + + +sqlite_file = 'lahmansbaseballdb.sqlite' + + +con = sqlite3.connect(sqlite_file) + +query = ''' + SELECT cp.playerID, cp.schoolID, s.salary, s.yearID, s.teamID + FROM CollegePlaying cp + JOIN Salaries s ON cp.playerID = s.playerID + WHERE cp.schoolID = 'idbyuid' + ORDER BY s.salary DESC; +''' + +df = pd.read_sql_query(query, con) +df + + +``` + + + +## Top Batting Averages Analyzed + +__This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)__ + +__Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.__ + +__Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.__ + +__Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.__ + +_The results presented here are from a series of SQL queries designed to analyze baseball players' batting averages. Initially, we examined players with at least one at-bat in a given year, sorting them by their batting average. Surprisingly, several players had a perfect average of 1.0, though this likely indicates a very low number of at-bats. Next, the focus shifted to those with a minimum of 10 at-bats, revealing more realistic averages with the highest being around 0.64. Finally, we analyzed career batting averages for players with over 100 at-bats, where historical greats like Ty Cobb emerged with an impressive average of around 0.366. This multi-tiered approach effectively highlights different aspects of batting performance across varying levels of player experience and activity._ + +```{python} +#| label: Q2 +#| code-summary: Read and format data +# Include and execute your code here + +query_part1 = """ +SELECT playerID, yearID, CAST(H AS FLOAT) / AB AS batting_average +FROM Batting +WHERE AB > 0 +ORDER BY batting_average DESC, playerID +LIMIT 5; +""" + +df_part1 = pd.read_sql_query(query_part1, con) +print(df_part1) + +query_part2 = """ +SELECT playerID, yearID, CAST(H AS FLOAT) / AB AS batting_average +FROM Batting +WHERE AB >= 10 +ORDER BY batting_average DESC, playerID +LIMIT 5; +""" + +df_part2 = pd.read_sql_query(query_part2, con) +print(df_part2) + + +query_part3 = """ +SELECT playerID, CAST(SUM(H) AS FLOAT) / SUM(AB) AS career_batting_average +FROM Batting +GROUP BY playerID +HAVING SUM(AB) >= 100 +ORDER BY career_batting_average DESC +LIMIT 5; +""" + +df_part3 = pd.read_sql_query(query_part3, con) +print(df_part3) + + +``` + + + + +## Baseball Team Metrics Showdown + +__Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Plotly Express to visualize the comparison. What do you learn?__ + +_In this project, I wanted to research more about the salaries of players. It's quite common for the player with the most home runs to be paid more than others who don’t hit as many home runs. I ran an SQL query to obtain the average salaries of the top 10 players with the most home runs (HR), and the top ten players with the most hit-by-pitches (HBP). The results were not shocking: the average salary for the top ten HR players was $9,911,355.4, and for HBP, it was $4,604,570.79. Why did I decide to use HBP for comparison? Well, players hitting 60-plus home runs are rare, so I wanted to research which baseball statistic correlated more closely with player salaries. I downloaded player stats from the SQL baseball database and prepared the data for regression analysis. First, I used the 2023 consumer price index (CPI) to convert all salary values to 2023 dollars, ensuring players from 1986 and 2015 were evaluated equally, adjusting for inflation. Since salaries are mostly in the millions of dollars, I took the natural log of the salaries and used that as my independent variable, the main variable considered in the regression analysis. I also removed all players with zero salary values, leaving 1811 players in the data set, as having zero values would render the regression analysis meaningless._ +_The regression analysis showed that HBP had the most correlation with a change in player salary. The results indicated that for every unit increase in HBP, the player's salary increased by $130,082. For every HR unit increase, the salary increased by $19,683. For every triple (3B) unit increase, the salary decreased by $8,430. For every double (2B) unit increase, the salary increased by $67,764. For every stolen base (SB), the salary increased by $39,741. The regression could only explain 15% of the dependent variable, and all the independent variables in the regression, besides HBP, were not statistically significant. In other words, the regression analysis suggested that there are more independent variables that dictate player salaries. HBP was still statistically significant, indicating it is a significant variable in explaining changes in player salaries, but my regression analysis could not identify the other 85% of independent variables that influence salary changes._ + +_This research revealed some interesting insights. It showed that home runs are not as significant in determining player salaries as previously thought. It also showed that players with more triples tend to lose salary, while those with more doubles benefit significantly more. Does this suggest that reaching second base is more advantageous than third base? Further statistical research might reveal two classes of 2B players: those who advance to home and those who end up at 3B but don’t make it home. I hypothesize that players who go from 2B to home are of higher quality than those who stop at 3B. It seems that staying in second place or reaching home guarantees a higher salary, while ending up at 3B might lead to a loss in earnings. SB is ranked just below 2B, and HBP is ranked as having the most significant impact on player salaries. Outside of the statistics, I think better players are hit more often by pitches because pitchers recognize their higher likelihood of hitting the ball, especially home runs, so allowing a walk to first base is more strategic. Players who steal bases, in my opinion, are generally better than those who don’t. In conclusion, I believe players who excel at reaching 2B are more successful, which is why 2B has the most impact on a player's salary compared to other stats. However, my regression analysis could only account for up to 15% of the salary variations, so there are many other factors affecting player salaries, but HBP is definitely one of them._ + + + +```{python} +#| label: Q3 +#| code-summary: Read and format data +# Include and execute your code here + +import pandas as pd +import sqlite3 + +# Connect to the SQLite database +sqlite_file = 'lahmansbaseballdb.sqlite' +con = sqlite3.connect(sqlite_file) + +# SQL query for average salary of top 10 players in HBP +query_hbp_avg_salary = """ +SELECT AVG(salary) as average_salary_hbp +FROM Salaries +WHERE playerID IN ( + SELECT playerID + FROM Batting + GROUP BY playerID + ORDER BY SUM(HBP) DESC + LIMIT 10 +); +""" + +# SQL query for average salary of top 10 players in HR +query_hr_avg_salary = """ +SELECT AVG(salary) as average_salary_hr +FROM Salaries +WHERE playerID IN ( + SELECT playerID + FROM Batting + GROUP BY playerID + ORDER BY SUM(HR) DESC + LIMIT 10 +); +""" + +# Execute queries and fetch data +avg_salary_hbp = pd.read_sql_query(query_hbp_avg_salary, con) +avg_salary_hr = pd.read_sql_query(query_hr_avg_salary, con) + +# Close the database connection +con.close() + +# Display the results +print("Average Salary of Top 10 HBP Players:", avg_salary_hbp.iloc[0, 0]) +print("Average Salary of Top 10 HR Players:", avg_salary_hr.iloc[0, 0]) + + +``` + +_Returning to my initial comparison, although the average salary for the top 10 players in HR and 2B clearly shows that hitting more home runs equates to more money, the statistics indicate that higher 2B numbers result in better salaries. While one player in a league might have the most home runs and be paid the most, multiple players can earn more by achieving higher 2B numbers. If we were all baseball players, we might all aspire to be the top HR hitter, but it is financially smarter to aim for more 2B numbers than to seek glory alone._