pyretrosheet
is under active development and is not feature complete.
Load, analyze, and enrich retrosheet.org MLB data using Python representations.
Retrosheet provides play-by-play and other miscellaneous MLB data (at the time of writing, includes all play-by-play data for all AL and NL seasons from 1919 to 2022).
pyretrosheet
provides functionality for:
- downloading Retrosheet play-by-play data
- parsing and loading play-by-play data into Python objects to make the data easier to understand and analyze
- enriching data to include player and summary statistics not encoded directly by Retrosheet
pyretrosheet
does not provide functionality for:
- downloading/using MLB data from sources other than Retrosheet
See Retrosheet Data Resources for other tools that parse Retrosheet event files. At the time of writing, these resources focus on loading/dumping to other data formats like CSV and SQL databases.
pip install pyretrosheet
By default, data downloaded from retrosheet.org is stored at ~/.pyretrosheet/data/
,
but can be overriden via the data_dir
argument.
import pyretrosheet
games = pyretrosheet.load_games(year=2022)
print(games[0])
"""
Game(
id=GameID(home_team_id='SFN', date=datetime.date(2022, 4, 8), game_number=0, raw='id,SFN202204080'),
home_team_id=SFN,
visiting_team_id=MIA,
num_chronological_events=150,
earned_runs={'bleir001': 1, 'alcas001': 2, 'bassa001': 1, 'benda001': 1, 'webbl001': 1, 'dovac001': 3, 'leond003': 1},
)
"""
TODO: Add more examples
Retrosheet's Event File Spec defines the encoding for event files (play-by-play game data). The spec (as of 11/30/2023) can also be found at docs/event_file_spec.txt.
There is a wide amount of data encoded into these files and this package does not cover all encodings.
Contributions are welcome for any encodings not covered!
- Loading all games from a given event file
- Record types
id
info
start
sub
play
data
com
-
Record types
play
's pitching encodingradj
badj
padj
ladj
presadj
-
Miscellaneous data
- replays
- ejections
- umpire changes
- protests
- suspensions
pyretrosheet
provides enriched Retrosheet data to provide:
- TODO
help: Show this help.
setup: Install the package and dev dependencies into a virtualenv.
test: Run pytest on the tests dir.
test_all_data: Run pytest on all Retrosheet data.
format: Run black and isort on package and tests dirs.
lint: Run ruff and mypy on package files.
coverage: Run test coverage and update coverage badge
bump_version: Increment patch version references in the project
publish_to_testpypi: Publish the package to test.pypi.org.
publish_to_pypi: Publish the package to pypi.org.
- ReadTheDocs
- Verify enriched data with alternative sources like Baseball Reference
- Determine top-level interface for querying data
- Implement index of game files to easily lookup games for:
- a specific team within a year
- a specific game
- Stats
- Hits (H)
- Walks (W)
- Hit By Pitches (HBP)
- Sacrifice Flys (SF)
- At Bats (AB)
- Singles (S)
- Doubles (D)
- Triples (T)
- Home Runs (HR)
- Composite
- Batting Average (BA)
- Slugging Percentage (SP)
- On Base Percentage (OBP)
- Difficult and Needs Lots of Validation
- Runs (R)
- Runs Batted In (RBI)
- Aggregate stats
- Mean, Median, Std. Dev, Min, Max
- Parse out 'info' fields into
pyretrosheet.models.Game
properties - Encoding pitches from
play
data - Improve error handling for inability to retrieve Retrosheet data
- Improve README Usage examples
- Add CONTRIBUTING.md
- Add interface to load stats
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at 20 Sunset Rd., Newark, DE 19711.
- Project skeleton generated via
cookiecutter https://github.com/rozelie/Python-Project-Cookiecutter
- Thank you Retrosheet team for making your data free and publicly available!