Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functionality to list StatsBomb open competitions and available matches #363

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

UnravelSports
Copy link
Contributor

@UnravelSports UnravelSports commented Nov 12, 2024

Hi!

While working on the docs I realized it wasn't very straightforward to see which competitions and matches are available in the StatsBomb open data repository, at least not from within Kloppy.

I have added functionality such that we can do:

from kloppy import statsbomb

statsbomb.list_open_competitions(detailed=True)
statsbomb.list_open_matches(223, 282, detailed=True)

A couple things to consider:

  • Both these functions return a pandas dataframe, but Kloppy has no explicit dependency on pandas. I have inserted a warning (similar to the one shown when doing dataset.to_df() if pandas has not been installed.
  • I have added a detailed (bool=False) parameter to both functions. Setting it to False will provide all available columns, leaving it to the default False will give us only the most relevant columns
  • I have moved the underlying functions to kloppy.infra.serializers.event.statsbomb.helpers. I'm not sure if this is desirable.
  • Because of point 1, we cannot explicitly state the return type of our functions, ie we cannot do the following:
def list_open_matches(
    competition_id: int, season_id: int, detailed: bool = False
) -> pd.DataFrame:

Instead we simply do:

def list_open_matches(
    competition_id: int, season_id: int, detailed: bool = False
):

@UnravelSports
Copy link
Contributor Author

Just to add, if the basic functionality is approved I'll add some tests

@probberechts
Copy link
Contributor

A few remarks:

  1. I prefer the initial proposal of adding a list_open_data function that immediately retrieves all match ids for all competitions. It can have optional arguments to filter on a competition and / or season. With list_open_competitions and list_open_matches it's again a two-step operation to retrieve a certain match id.
  2. StatsBomb has a great well-maintained Python API. Would it be an option to simply wrap their API?
def list_open_data(fmt="dataframe"):
    try:
        from statsbombpy import sb
    except ImportError:
        print("Please install the statsbombpy library to use this function.")
        return

    try:
        # Fetch all available competitions as a dictionary
        competitions = sb.competitions(fmt="dict")

        # Initialize an empty list to store DataFrames for matches
        all_matches = []

        # Loop through each competition and fetch matches using competition_id and season_id
        for competition in competitions.values():
            matches = sb.matches(
                competition_id=competition['competition_id'], 
                season_id=competition['season_id'],
                fmt=fmt
            )
            all_matches.append(matches)

        # Combine all matches into a single DataFrame
        if fmt == "dataframe":
            import pandas as pd
            combined_matches = pd.concat(all_matches, ignore_index=True)
            return combined_matches
        elif fmt == "dict":
            return all_matches
        else:
            raise ValueError("Invalid format. Use 'dataframe' or 'dict'.")

    except Exception as e:
        raise RuntimeError(f"An error occurred while fetching data: {e}")

# Example usage:
df = list_open_data(fmt="dataframe")
print(df.head())

@UnravelSports
Copy link
Contributor Author

This seems fine to me too! I wasn't sure about relying on the sb package initially, but why not.

@UnravelSports
Copy link
Contributor Author

UnravelSports commented Nov 14, 2024

@probberechts do you have any suggestions for passing the competition / season filters? It might make sense to have this be a dict with comp / season_id value pairs, but it might also make sense to simply be able to provide a competition_id and or a competition / season_id pair?

Also, thinking about the functionality I have built and that you are suggesting, the first step is always going to be a two step process. You will need to identify the competition_ids first to even know what you want. So, would it be an idea to have both as an option?

@probberechts
Copy link
Contributor

do you have any suggestions for passing the competition / season filters?

Maybe simply allow everything? You can define a StatsBombPartitionIdentifier object and then try to parse a dict or tuple into this object.

@dataclass
class StatsBombPartitionIdentifier:
    competition_id: Optional[int] = None
    season_id: Optional[int] = None

    @classmethod
    def from_dict(cls, data: Dict):
        """Create a PartitionIdentifier from a dictionary."""
        return cls(
            competition_id=data.get('competition_id'),
            season_id=data.get('season_id')
        )

    @classmethod
    def from_tuple(cls, data: Tuple[Optional[int], Optional[int]]):
        """Create a PartitionIdentifier from a tuple."""
        if len(data) != 2:
            raise ValueError("Tuple must have exactly two elements: (competition_id, season_id)")
        return cls(
            competition_id=data[0],
            season_id=data[1]
        )

Then, the input to list_open_data can be a StatsBombPartitionIdentifier | list[StatsBombPartitionIdentifier] | dict | list[dict] | tuple | list[tuple].

Although, I don't think I would use these identifiers a lot. I can see it might be nice to have this option since it will make data retrieval more efficient (you won't have to retrieve the file with matches for each competition-season pair). However, the only ID that I know by hard is the one of the EPL and would probably rather wait an extra second rather than typing another parameter 🙃. So, maybe we don't have to add this option.

the first step is always going to be a two step process

Yes, typically it will be. But I prefer to do these two steps directly in Pandas. Below are a few use cases in which I think this functionality would be useful:

🤔 I would like to do some analysis with 360 data, which datasets can I use?

df = list_open_data(fmt="dataframe") 
df["has_360"] = df["match_status_360"] == "available"
print(df.groupby(["competition", "season"])["has_360"].sum().sort_values(ascending=False))  

🤔 What is the ID of the 2022 World Cup final?

df = list_open_data(fmt="dataframe") 
print(df[df.match_date == "2022-12-18"]) 

🤔 Get the metadata of a particular game.

df = list_open_data(fmt="dataframe") 
print(df[df.match_id == 7584]) 

@UnravelSports
Copy link
Contributor Author

UnravelSports commented Nov 18, 2024

I've changed it to your suggested implementation. Where we allow a single combination of competition_id and season_id or we pull everything.

Some notes:

  • Have suppressed the NoAuthWarning in the sb.matches() for-loop because one warning is enough, and we get that warning when we grab the competitions.
  • I had to add the competition_id and season_id to the dataframe implementation, because for some reason statsbomb was not giving that by default.

@koenvo do we need tests for this? And if so, what would they look like?

@probberechts note: I see now that I missed your previous message, not sure this is something we need. I'm also trying to keep it a bit simple, in the end it's just a wrapper. And if people are now getting the full list of data first, then they can do the rest themselves from that.

import requests as re


def get_response(path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to a more generic place like utils.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. These lines weren't necessary anymore so I've removed them.

@koenvo
Copy link
Contributor

koenvo commented Nov 18, 2024

@probberechts what do you think of putting ingestify in between? I might have broken the current statsbomb open data Source, but when it's fixed it provides all necessary parts to work with the open data.

It might need an additional "get or fetch" method to either get it directly from (local) storage, or retrieve it first.

@probberechts
Copy link
Contributor

@probberechts what do you think of putting ingestify in between? I might have broken the current statsbomb open data Source, but when it's fixed it provides all necessary parts to work with the open data.

It might need an additional "get or fetch" method to either get it directly from (local) storage, or retrieve it first.

I think adding ingestify might be overkill for our needs. We only need to download ~10 small JSON files, maybe once in a while. The main reason I proposed using the statsbombpy package is that it already parses everything neatly into dataframes. I am not very familiar with ingestify, but I guess it does not do that part.

@UnravelSports
Copy link
Contributor Author

I also don't know what ingestify does, but does indeed feel like overkill. The statsbombpy package works fine, plus it allows users to directly grab non open data too if they have access

@koenvo
Copy link
Contributor

koenvo commented Nov 18, 2024

Ok. Let’s go for this solution.

I do think it would be good to introduce some models for competition, seasons and match etc to prevent having StatsBomb specific structures in through the code.

@UnravelSports
Copy link
Contributor Author

How/where do you see this?

@probberechts
Copy link
Contributor

I do think it would be good to introduce some models for competition, seasons and match etc to prevent having StatsBomb specific structures in through the code.

Yes, this is a good point! I guess what you mean is that we should add something like a DatasetCollection entity to kloppy?

@dataclass
class Game:
    game_id: int
    home_team: Team
    away_team: Team
    date: datetime

@dataclass
class Season:
    season_year: str  # e.g., "2023-2024"
    season_id: int
    games: List[Game] = field(default_factory=list)

@dataclass
class Competition:
    name: str  # e.g., "Premier League"
    competition_id: int
    seasons: List[Season] = field(default_factory=list)

@dataclass
class DatasetCollection:
    """Represents a dataset containing games from multiple competitions."""
    competitions: List[Competition] = field(default_factory=list)
        
   def to_df():
       """Retrieve all games from all competitions and seasons as a dataframe."""
       raise NotImplementedException()     

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants