Skip to content

Commit

Permalink
testing
Browse files Browse the repository at this point in the history
  • Loading branch information
1Ramirez7 committed Nov 7, 2024
1 parent e6d5585 commit 0ca1bb8
Show file tree
Hide file tree
Showing 2 changed files with 174 additions and 3 deletions.
1 change: 1 addition & 0 deletions .github/workflows/publish.yml
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@ jobs:
any::tidyverse
any::knitr
any::DT
any::nycflights13 # added this for w2 assignment test

- name: Render and Publish
uses: quarto-dev/quarto-actions/publish@v2
Expand Down
176 changes: 173 additions & 3 deletions Projects/project3.qmd
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
title: "Client Report - Sample Name"
title: "Client Report - Finding Relationships in Baseball"
subtitle: "Course DS 250"
author: "Eduardo Ramirez"
format:
Expand All @@ -8,7 +8,7 @@ format:
page-layout: full
title-block-banner: true
toc: true
toc-depth: 4
toc-depth: 3
toc-location: body
number-sections: false
html-math-method: katex
Expand All @@ -25,4 +25,174 @@ execute:

---

__Sample project coming soon__


## Baseball Stats SQL Analysis


_In my SQL project, I explored three baseball queries. First, I analyzed the salaries of BYU-Idaho alumni in pro baseball, finding only two players with significant salary differences. Then, I examined batting averages, looking at players with varying numbers of at-bats, from just one in a year to over 100 in their careers, revealing insights from perfect averages to legends like Ty Cobb. Finally, I compared player salaries based on stats like home runs and hit-by-pitches, discovering that factors like doubles have a more significant impact on salaries than home runs._



## BYU-Idaho Players' Salaries Query

__Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.__

_The SQL query I conducted revealed that only two baseball players from BYU-Idaho have made it to professional leagues, as indicated by the salary data. These players, identified as 'lindsma01' and Steph, show varied salary ranges, with Lindsman earning as high as $4,000,000 in 2014 for team CHA. In contrast, 'stephga01' had salaries ranging from $150,000 to $1,025,000. This data highlights the limited number of BYU-Idaho alumni in professional baseball and showcases the considerable salary differences between these two players._

```{python}
#| label: Q1
#| code-summary: Read and format data
# Include and execute your code here
import pandas as pd
import sqlite3
# https://github.com/1Ramirez7/Portfolio_/blob/main/data/lahmansbaseballdb.sqlite
# https://raw.githubusercontent.com/1Ramirez7/Portfolio_/main/data/starwars1.csv
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)
query = '''
SELECT cp.playerID, cp.schoolID, s.salary, s.yearID, s.teamID
FROM CollegePlaying cp
JOIN Salaries s ON cp.playerID = s.playerID
WHERE cp.schoolID = 'idbyuid'
ORDER BY s.salary DESC;
'''
df = pd.read_sql_query(query, con)
df
```



## Top Batting Averages Analyzed

__This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)__

__Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.__

__Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.__

__Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.__

_The results presented here are from a series of SQL queries designed to analyze baseball players' batting averages. Initially, we examined players with at least one at-bat in a given year, sorting them by their batting average. Surprisingly, several players had a perfect average of 1.0, though this likely indicates a very low number of at-bats. Next, the focus shifted to those with a minimum of 10 at-bats, revealing more realistic averages with the highest being around 0.64. Finally, we analyzed career batting averages for players with over 100 at-bats, where historical greats like Ty Cobb emerged with an impressive average of around 0.366. This multi-tiered approach effectively highlights different aspects of batting performance across varying levels of player experience and activity._

```{python}
#| label: Q2
#| code-summary: Read and format data
# Include and execute your code here
query_part1 = """
SELECT playerID, yearID, CAST(H AS FLOAT) / AB AS batting_average
FROM Batting
WHERE AB > 0
ORDER BY batting_average DESC, playerID
LIMIT 5;
"""
df_part1 = pd.read_sql_query(query_part1, con)
print(df_part1)
query_part2 = """
SELECT playerID, yearID, CAST(H AS FLOAT) / AB AS batting_average
FROM Batting
WHERE AB >= 10
ORDER BY batting_average DESC, playerID
LIMIT 5;
"""
df_part2 = pd.read_sql_query(query_part2, con)
print(df_part2)
query_part3 = """
SELECT playerID, CAST(SUM(H) AS FLOAT) / SUM(AB) AS career_batting_average
FROM Batting
GROUP BY playerID
HAVING SUM(AB) >= 100
ORDER BY career_batting_average DESC
LIMIT 5;
"""
df_part3 = pd.read_sql_query(query_part3, con)
print(df_part3)
```




## Baseball Team Metrics Showdown

__Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Plotly Express to visualize the comparison. What do you learn?__

_In this project, I wanted to research more about the salaries of players. It's quite common for the player with the most home runs to be paid more than others who don’t hit as many home runs. I ran an SQL query to obtain the average salaries of the top 10 players with the most home runs (HR), and the top ten players with the most hit-by-pitches (HBP). The results were not shocking: the average salary for the top ten HR players was $9,911,355.4, and for HBP, it was $4,604,570.79. Why did I decide to use HBP for comparison? Well, players hitting 60-plus home runs are rare, so I wanted to research which baseball statistic correlated more closely with player salaries. I downloaded player stats from the SQL baseball database and prepared the data for regression analysis. First, I used the 2023 consumer price index (CPI) to convert all salary values to 2023 dollars, ensuring players from 1986 and 2015 were evaluated equally, adjusting for inflation. Since salaries are mostly in the millions of dollars, I took the natural log of the salaries and used that as my independent variable, the main variable considered in the regression analysis. I also removed all players with zero salary values, leaving 1811 players in the data set, as having zero values would render the regression analysis meaningless._
_The regression analysis showed that HBP had the most correlation with a change in player salary. The results indicated that for every unit increase in HBP, the player's salary increased by $130,082. For every HR unit increase, the salary increased by $19,683. For every triple (3B) unit increase, the salary decreased by $8,430. For every double (2B) unit increase, the salary increased by $67,764. For every stolen base (SB), the salary increased by $39,741. The regression could only explain 15% of the dependent variable, and all the independent variables in the regression, besides HBP, were not statistically significant. In other words, the regression analysis suggested that there are more independent variables that dictate player salaries. HBP was still statistically significant, indicating it is a significant variable in explaining changes in player salaries, but my regression analysis could not identify the other 85% of independent variables that influence salary changes._

_This research revealed some interesting insights. It showed that home runs are not as significant in determining player salaries as previously thought. It also showed that players with more triples tend to lose salary, while those with more doubles benefit significantly more. Does this suggest that reaching second base is more advantageous than third base? Further statistical research might reveal two classes of 2B players: those who advance to home and those who end up at 3B but don’t make it home. I hypothesize that players who go from 2B to home are of higher quality than those who stop at 3B. It seems that staying in second place or reaching home guarantees a higher salary, while ending up at 3B might lead to a loss in earnings. SB is ranked just below 2B, and HBP is ranked as having the most significant impact on player salaries. Outside of the statistics, I think better players are hit more often by pitches because pitchers recognize their higher likelihood of hitting the ball, especially home runs, so allowing a walk to first base is more strategic. Players who steal bases, in my opinion, are generally better than those who don’t. In conclusion, I believe players who excel at reaching 2B are more successful, which is why 2B has the most impact on a player's salary compared to other stats. However, my regression analysis could only account for up to 15% of the salary variations, so there are many other factors affecting player salaries, but HBP is definitely one of them._



```{python}
#| label: Q3
#| code-summary: Read and format data
# Include and execute your code here
import pandas as pd
import sqlite3
# Connect to the SQLite database
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)
# SQL query for average salary of top 10 players in HBP
query_hbp_avg_salary = """
SELECT AVG(salary) as average_salary_hbp
FROM Salaries
WHERE playerID IN (
SELECT playerID
FROM Batting
GROUP BY playerID
ORDER BY SUM(HBP) DESC
LIMIT 10
);
"""
# SQL query for average salary of top 10 players in HR
query_hr_avg_salary = """
SELECT AVG(salary) as average_salary_hr
FROM Salaries
WHERE playerID IN (
SELECT playerID
FROM Batting
GROUP BY playerID
ORDER BY SUM(HR) DESC
LIMIT 10
);
"""
# Execute queries and fetch data
avg_salary_hbp = pd.read_sql_query(query_hbp_avg_salary, con)
avg_salary_hr = pd.read_sql_query(query_hr_avg_salary, con)
# Close the database connection
con.close()
# Display the results
print("Average Salary of Top 10 HBP Players:", avg_salary_hbp.iloc[0, 0])
print("Average Salary of Top 10 HR Players:", avg_salary_hr.iloc[0, 0])
```

_Returning to my initial comparison, although the average salary for the top 10 players in HR and 2B clearly shows that hitting more home runs equates to more money, the statistics indicate that higher 2B numbers result in better salaries. While one player in a league might have the most home runs and be paid the most, multiple players can earn more by achieving higher 2B numbers. If we were all baseball players, we might all aspire to be the top HR hitter, but it is financially smarter to aim for more 2B numbers than to seek glory alone._

0 comments on commit 0ca1bb8

Please sign in to comment.