Utilized Python tools to delve into IMDb data, uncovering trends in title releases and viewer preferences. Visualized patterns in genres and title runtimes, offering insights into evolving media consumption. Applied regression models like linear, polynomial, and random forest to predict title ratings, revealing factors impacting viewer choices.

JESUSC1/IMDb-Data-Analysis-Exercise-Part-1

IMDb-Data-Analysis-Exercise-Part-1

A deep dive into IMDb data using Python and visualization tools, covering title release patterns and viewer preferences. Regression models are applied to predict title ratings, and the work lays the groundwork for building recommender systems for TV shows and movies, or revenue prediction models, from IMDb data.

Data Source

The primary data source for this analysis is IMDb, an extensive online database that provides detailed information about films, TV series, podcasts, video games, and other media content.

Analysis

  • Initiated the exploratory analysis by identifying the time span of the dataset and categorizing titles by type and genre.
  • Visualized the number of titles released each year, identifying predominant title types like TV episodes, movies, and short films.
  • Explored viewer preferences, finding Drama, Comedy, and Documentary to be the most popular genres.
  • Analyzed title runtime trends over the years, highlighting shifts in movie and TV episode durations.
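
The exploratory steps above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the column names (`titleType`, `startYear`, `runtimeMinutes`, `genres`) are assumed to follow IMDb's title basics table, the sample rows are made up, and counting titles per genre is only one possible proxy for popularity.

```python
import pandas as pd

# Made-up sample rows; column names are assumed to match IMDb's title basics table.
titles = pd.DataFrame({
    "titleType": ["movie", "tvEpisode", "short", "movie", "tvEpisode"],
    "startYear": [1999, 2005, 2005, 2010, 2010],
    "runtimeMinutes": [120, 22, 10, 95, 45],
    "genres": ["Drama,Comedy", "Comedy", "Documentary", "Drama", "Drama,Documentary"],
})


def titles_per_year(df):
    """Count titles by release year (rows) and title type (columns)."""
    return df.groupby(["startYear", "titleType"]).size().unstack(fill_value=0)


def genre_counts(df):
    """Count titles per genre; a title with several genres counts once for each."""
    return df["genres"].str.split(",").explode().value_counts()


def median_runtime_by_year(df):
    """Median runtime in minutes by release year (rows) and title type (columns)."""
    return df.groupby(["startYear", "titleType"])["runtimeMinutes"].median().unstack()


# Time span covered by the sample.
time_span = (titles["startYear"].min(), titles["startYear"].max())
```

Each helper returns a table indexed by year or genre, so calling `.plot()` on the result is one straightforward way to get the corresponding chart.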

Libraries Used

The analysis uses the following Python libraries and standard-library modules:

  • seaborn: For enhanced data visualization.
  • scikit-learn (sklearn): For machine learning and data preprocessing (mean_squared_error, LinearRegression, PolynomialFeatures, RandomForestRegressor, train_test_split, OneHotEncoder).
  • Matplotlib: For data visualization.
  • NumPy: For numerical computations.
  • pandas: For data manipulation and analysis.
  • urllib: For URL handling and web access.
  • os: For interacting with the operating system.
  • io: For handling streams.
  • gzip: For working with gzipped files.
  • zipfile: For extracting and creating zip archives.
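
As a rough sketch of how several of these modules can fit together, the snippet below downloads a gzipped TSV file and parses it into a DataFrame. It rests on assumptions not stated in this README: that the data comes from IMDb's public dataset files, that `BASICS_URL` is the relevant file, and that missing values are marked with `\N`. The helper names `download` and `read_gzipped_tsv` are hypothetical.

```python
import csv
import gzip
import io
import urllib.request

import pandas as pd

# Assumed location of the title basics file; the notebook may use other files.
BASICS_URL = "https://datasets.imdbws.com/title.basics.tsv.gz"


def download(url):
    """Fetch the raw bytes at `url` (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_gzipped_tsv(raw_bytes):
    """Decompress gzipped TSV bytes and parse them into a DataFrame."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as handle:
        return pd.read_csv(
            handle,
            sep="\t",
            na_values="\\N",         # assumed marker for missing values
            quoting=csv.QUOTE_NONE,  # treat quote characters in titles as plain text
            low_memory=False,
        )


# Usage (not executed here because it needs network access):
# basics = read_gzipped_tsv(download(BASICS_URL))
```

Keeping the download separate from the parsing means the parsing step can be tried on a small local file before fetching the full dataset.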

Key Achievements

  • Successfully analyzed and visualized IMDb data, uncovering key trends and patterns in title releases and viewer preferences.
  • Applied regression models, including linear, polynomial, and random forest, to predict title ratings based on runtime, gaining insights into factors influencing viewer ratings.
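
A minimal sketch of comparing the three model families on a single runtime feature is shown below. It uses synthetic data in place of the IMDb ratings, and the polynomial degree, forest size, and 80/20 split are illustrative assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data: runtime in minutes and a noisy rating (not real IMDb values).
rng = np.random.default_rng(0)
runtime = rng.uniform(5, 180, size=300).reshape(-1, 1)
rating = 6.0 + 0.01 * runtime.ravel() + rng.normal(0, 1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    runtime, rating, test_size=0.2, random_state=0
)

# Hyperparameters below are assumptions chosen for illustration.
models = {
    "linear": LinearRegression(),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training split and record its test-set mean squared error.
test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))
```

Comparing the entries of `test_mse` gives a like-for-like view of the models, since all three are scored on the same held-out split.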

Conclusion

IMDb-Data-Analysis-Exercise-Part-1 provides an in-depth look at IMDb data, revealing trends in media consumption, viewer preferences, and title characteristics. This foundational analysis sets the stage for more advanced studies, including the development of recommendation systems.

Future Work

The next phase, "Part 2", will focus on obtaining data directly from the IMDb database through its API. It will also cover building a recommender system for TV shows and movies using the comprehensive IMDb dataset.

Note

To fully understand the conclusions drawn in this analysis, it is recommended to go through the entire notebook, including the code and its outputs. You can view the HTML version of the notebook here.

Author

Jesus Cantu Jr.

Last Updated

June 6, 2023
