Skip to content

CCNU-internship-Dec2020/Scrape-IMDb-By-Searching-Name

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Scrape IMDb Movies By Searching Name

Movie details scraping in Python3.

.
├── LICENSE
├── README.md
├── Scrape_IMDb.txt                Full scrape data writen in line.
├── add_pic_clarity.py      5-     Improve the sharpness of pictures.
├── content.py              3-     The main scrape script.
├── data.txt                       The final data in a list of lists.
├── excess_log.txt                 Data of getting the title_id in IMDb in line.
├── find_lost.py            4-     A script of finding which movie do not have the summary or poster.
├── get_ttid.py             2-     Use the given movie name to search movie_id in IMDb website.
├── get_url.py              1-     A script of encodeURIComponent() like function.
├── lost_id_1.txt                  Result of find_lost.py
├── movie_id_sort.py        6-     A script of mixing data and sorting.
├── movies.dat                     The original data.
├── searchMovUrlList.txt           The encode URLs.
├── searchMovUrlList_byLine.txt    The encode URLs in line.
└── table.html                     bs4

Install

This project uses python3, requests and Beautiful Soup. Go check them out if you don't have them locally installed.

python -m pip3 install requests
pip3 install beautifulsoup4

Usage

[New update]: Begin from the second step

First step: Get the searching URL

We are given the data set which contains the movie names. In the first step we should use the name to search the matching movies preview in the IMDb website.

$ python3 get_url.py

This script convert the string of movie name and its release year to IMDb's searching format in order to make the result more correct, which means the first searching result in list is what we want.

Second step: Get the movie's title id in IMDb

Use the 'hand-made' URL to search movies in the website. Extract the <a> tag which contains a relative file path to find the sole movie-id.

$ python3 get_ttid.py

We save the data in format to excess_log.txt

Step three: Locate the Summary and Poster

Go to the movies detail website to scrape the summary and poster of the film. Then store the data to Scrape_IMDb.txt

$ python3 content.py

They provide picture host for us.

Step four: Find the lost Summary and Poster

In step three, several movie details like poster can be scraped unsuccessfully because of the ttid is not correct, or other reasons. What we should do is compare the movie_id of excess_log.txt and Scrape_IMDb.txt, output the details to lost_id_1.txt in format like excess_log.txt.

$ python3 find_lost.py

Then change the ttid manually, and use the new output file repeat Step three and Step four!

Step five: Improve the sharpness

The poster is a preview and not clearly enough. Use the script to improve the sharpness of images by modify the picture's URL.

$ python3 add_pic_clarity.py

The final step: Sort data

The step four would add many new data which are unordered. In this step, sorting the data and get them in a list of lists.

$ python3 movie_id_sort.py > data.txt

Well done! Enjoy our script!:satisfied:

Input and Output

Step 2: get_ttid.py

input output log network data loss
movies.dat excess_log.txt err_log.txt lost ‼️

Step 3: content.py

input output log network data loss
excess_log.txt Scrape_IMDb.txt scrape_err_log.txt lost ‼️

Step 4: find_lost.py

input output log network data loss
excess_log.txt & Scrape_IMDb.txt lost_id_1.txt NULL do not need maybe ‼️

Step 5: add_pic_clarity.py

input output log network data loss
Scrape_IMDb.txt Scrape_IMDb_pic_clarity.txt NULL do not need no

The last step: movie_id_sort.py

input output log network data loss
Scrape_IMDb_pic_clarity.txt data.txt NULL do not need no

Contributors

Thanks to all the people who contribute. Feel free to dive in! Open an issue or submit PRs. @wwyqianqian @Darren2017

License

MIT © CCNU-internship-Dec2020

About

Movie details scraping in Python3.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published