-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add functionality to list StatsBomb open competitions and available matches #363
base: master
Are you sure you want to change the base?
Conversation
Just to add, if the basic functionality is approved I'll add some tests |
A few remarks:
def list_open_data(fmt="dataframe"):
try:
from statsbombpy import sb
except ImportError:
print("Please install the statsbombpy library to use this function.")
return
try:
# Fetch all available competitions as a dictionary
competitions = sb.competitions(fmt="dict")
# Initialize an empty list to store DataFrames for matches
all_matches = []
# Loop through each competition and fetch matches using competition_id and season_id
for competition in competitions.values():
matches = sb.matches(
competition_id=competition['competition_id'],
season_id=competition['season_id'],
fmt=fmt
)
all_matches.append(matches)
# Combine all matches into a single DataFrame
if fmt == "dataframe":
import pandas as pd
combined_matches = pd.concat(all_matches, ignore_index=True)
return combined_matches
elif fmt == "dict":
return all_matches
else:
raise ValueError("Invalid format. Use 'dataframe' or 'dict'.")
except Exception as e:
raise RuntimeError(f"An error occurred while fetching data: {e}")
# Example usage:
df = list_open_data(fmt="dataframe")
print(df.head()) |
This seems fine to me too! I wasn't sure about relying on the sb package initially, but why not. |
@probberechts do you have any suggestions for passing the competition / season filters? It might make sense to have this be a dict with comp / season_id value pairs, but it might also make sense to simply be able to provide a competition_id and or a competition / season_id pair? Also, thinking about the functionality I have built and that you are suggesting, the first step is always going to be a two step process. You will need to identify the competition_ids first to even know what you want. So, would it be an idea to have both as an option? |
Maybe simply allow everything? You can define a @dataclass
class StatsBombPartitionIdentifier:
competition_id: Optional[int] = None
season_id: Optional[int] = None
@classmethod
def from_dict(cls, data: Dict):
"""Create a PartitionIdentifier from a dictionary."""
return cls(
competition_id=data.get('competition_id'),
season_id=data.get('season_id')
)
@classmethod
def from_tuple(cls, data: Tuple[Optional[int], Optional[int]]):
"""Create a PartitionIdentifier from a tuple."""
if len(data) != 2:
raise ValueError("Tuple must have exactly two elements: (competition_id, season_id)")
return cls(
competition_id=data[0],
season_id=data[1]
) Then, the input to Although, I don't think I would use these identifiers a lot. I can see it might be nice to have this option since it will make data retrieval more efficient (you won't have to retrieve the file with matches for each competition-season pair). However, the only ID that I know by hard is the one of the EPL and would probably rather wait an extra second rather than typing another parameter 🙃. So, maybe we don't have to add this option.
Yes, typically it will be. But I prefer to do these two steps directly in Pandas. Below are a few use cases in which I think this functionality would be useful: 🤔 I would like to do some analysis with 360 data, which datasets can I use? df = list_open_data(fmt="dataframe")
df["has_360"] = df["match_status_360"] == "available"
print(df.groupby(["competition", "season"])["has_360"].sum().sort_values(ascending=False)) 🤔 What is the ID of the 2022 World Cup final? df = list_open_data(fmt="dataframe")
print(df[df.match_date == "2022-12-18"]) 🤔 Get the metadata of a particular game. df = list_open_data(fmt="dataframe")
print(df[df.match_id == 7584]) |
I've changed it to your suggested implementation. Where we allow a single combination of Some notes:
@koenvo do we need tests for this? And if so, what would they look like? @probberechts note: I see now that I missed your previous message, not sure this is something we need. I'm also trying to keep it a bit simple, in the end it's just a wrapper. And if people are now getting the full list of data first, then they can do the rest themselves from that. |
import requests as re | ||
|
||
|
||
def get_response(path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this to a more generic place like utils.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. These lines weren't necessary anymore so I've removed them.
@probberechts what do you think of putting It might need an additional "get or fetch" method to either get it directly from (local) storage, or retrieve it first. |
I think adding ingestify might be overkill for our needs. We only need to download ~10 small JSON files, maybe once in a while. The main reason I proposed using the statsbombpy package is that it already parses everything neatly into dataframes. I am not very familiar with ingestify, but I guess it does not do that part. |
I also don't know what ingestify does, but does indeed feel like overkill. The statsbombpy package works fine, plus it allows users to directly grab non open data too if they have access |
Ok. Let’s go for this solution. I do think it would be good to introduce some models for competition, seasons and match etc to prevent having StatsBomb specific structures in through the code. |
How/where do you see this? |
Yes, this is a good point! I guess what you mean is that we should add something like a @dataclass
class Game:
game_id: int
home_team: Team
away_team: Team
date: datetime
@dataclass
class Season:
season_year: str # e.g., "2023-2024"
season_id: int
games: List[Game] = field(default_factory=list)
@dataclass
class Competition:
name: str # e.g., "Premier League"
competition_id: int
seasons: List[Season] = field(default_factory=list)
@dataclass
class DatasetCollection:
"""Represents a dataset containing games from multiple competitions."""
competitions: List[Competition] = field(default_factory=list)
def to_df():
"""Retrieve all games from all competitions and seasons as a dataframe."""
raise NotImplementedException() |
Hi!
While working on the docs I realized it wasn't very straightforward to see which competitions and matches are available in the StatsBomb open data repository, at least not from within Kloppy.
I have added functionality such that we can do:
A couple things to consider:
dataset.to_df()
if pandas has not been installed.detailed
(bool=False) parameter to both functions. Setting it toFalse
will provide all available columns, leaving it to the defaultFalse
will give us only the most relevant columnskloppy.infra.serializers.event.statsbomb.helpers
. I'm not sure if this is desirable.Instead we simply do: