A project to parse Retrosheet baseball data in Python. All data contained at Retrosheet site is copyright © 1996-2003 by Retrosheet. All Rights Reserved.
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org"
The motivation behind this project is to enhance Python-based baseball analytics, from data collection to advanced predictive modeling techniques.
If you are looking for a complete solution out of the box, check Chadwick Bureau.
If you are looking for a quick way to check stats, see Baseball-Reference.
If you want a web-scraping solution, check pybaseball.
Run the following command to clone the repository and create the folder structure:
git clone https://github.com/calestini/retrosheet.git
Note: This package is a work in progress. The files are not yet fully parsed, and the statistics are not yet fully validated.
The code below saves data from 1921 to 2017 on your machine. Be careful: downloading everything takes some time (about 10 minutes with a decent machine and a decent internet connection). The final datasets add up to roughly 3 GB.
from retrosheet import Retrosheet
rs = Retrosheet()
rs.batch_parse(yearFrom=1921, yearTo=2017, batchsize=10) #10 files at a time
[========================================] 100.0% ... Completed 1921-1930
[========================================] 100.0% ... Completed 1931-1940
[========================================] 100.0% ... Completed 1941-1950
[========================================] 100.0% ... Completed 1951-1960
[========================================] 100.0% ... Completed 1961-1970
[========================================] 100.0% ... Completed 1971-1980
[========================================] 100.0% ... Completed 1981-1990
[========================================] 100.0% ... Completed 1991-2000
[========================================] 100.0% ... Completed 2001-2010
[========================================] 100.0% ... Completed 2011-2017
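The progress lines above suggest that the requested span is split into consecutive ranges of at most `batchsize` years, with a shorter final range. A minimal sketch of that splitting follows, assuming this is roughly how the ranges are formed. `year_batches` is a hypothetical helper written for illustration and is not part of the package.

```python
def year_batches(year_from, year_to, batch_size):
    """Yield inclusive (start, end) year ranges of at most batch_size years.

    Hypothetical helper: it only reproduces the ranges shown in the
    progress output, not the package's internal logic.
    """
    start = year_from
    while start <= year_to:
        # The last batch is cut short so it never runs past year_to.
        end = min(start + batch_size - 1, year_to)
        yield start, end
        start = end + 1


for start, end in year_batches(1921, 2017, 10):
    print(f"{start}-{end}")
```

With the arguments used above, this yields ten ranges, from 1921-1930 through 2011-2017.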
- plays.csv
- teams.csv
- rosters.csv
- lineup.csv
- pitching.csv
- fielding.csv
- batting.csv
- running.csv
- info.csv
- Our own summary of Retrosheet terminology can be found here
- For the events file, the pitches field is sometimes repeated on the following row whenever there was a play (CS, SB, etc.). In these cases, the code needs to remove the duplication.
- Main baseball statistics can be found here
- Hit location diagrams are here
- Link to downloads here
- Glossary of Baseball
- Information about the event files can be found here
- Documentation on the datasets can be found here
- Putouts and Assists rules
- What does 'BF' in '1/BF' stand for? Bunt fly?
- Why are some of the modifier codes 2R / 2RF / 8RM / 8RS / 8RXD / L9Ls / RNT?
- Finish parsing pitches
- Clean-up code and logic
- Test primary stats with game logs
- Test innings ending in 3 outs
- Playoff files
- Parks files
- Player files
- Create SQL export option
- Aggregate more advanced metrics
- Map out location
- Add additional data if possible
- Load game-log data
- Load player / manager / umpire data
- Josh Donaldson (player_id = donaj001)
Source | R | H | HR | SB |
---|---|---|---|---|
Official | 526 | 860 | 174 | 32 |
ThisPackage | 524 | 853 | 173 | 32 |
- Nelson Cruz (player_id = cruzn002)
Source | R | H | HR | SB |
---|---|---|---|---|
Official | 768 | 1447 | 317 | 75 |
ThisPackage | 767 | 1427 | 317 | 75 |
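
To make the remaining gaps easier to see, the short sketch below computes the difference between the parsed and the official totals for the two players above. The numbers are copied from the tables. The `discrepancies` helper and the dictionary layout are illustrative assumptions, not part of the package.

```python
# Totals copied from the tables above, stored as (official, this package).
totals = {
    "donaj001": {"R": (526, 524), "H": (860, 853), "HR": (174, 173), "SB": (32, 32)},
    "cruzn002": {"R": (768, 767), "H": (1447, 1427), "HR": (317, 317), "SB": (75, 75)},
}


def discrepancies(player_totals):
    """Return {stat: parsed - official} for every stat that does not match."""
    return {
        stat: parsed - official
        for stat, (official, parsed) in player_totals.items()
        if parsed != official
    }


for player_id, stats in totals.items():
    print(player_id, discrepancies(stats))
```

A negative value means the package currently counts fewer events than the official record. Stolen bases match for both players, while hits show the largest gap.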