The script crawls up to 50 URLs found on a page and then crawls further. It then visits the immediate URLs (assuming the usual navigation-bar structure), grabs a full-page screenshot of each, and stores the screenshots in a folder.

kanishk307/Crawler-Selenium-Screenshot

Crawler-Selenium-Screenshot

  • The script crawls the URLs found on a page (at most 50)
  • It follows the internal links to crawl further
  • It then filters the results down to the immediate URLs (assuming the usual navigation-bar structure)
  • It creates a folder dynamically with a unique name
  • It grabs a full-page screenshot (PNG) of each immediate page, names each file dynamically, and stores the files in that folder
  • It saves each webpage locally
  • It saves PDF files locally
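
The link-collection and filtering steps above can be sketched roughly as follows. This is a minimal, standard-library illustration, not the repository's actual code: the names `LinkCollector`, `internal_links`, and `immediate_links` are hypothetical, and reading "immediate" as "exactly one path segment below the base URL" is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

MAX_URLS = 50  # the cap mentioned in the README


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(base_url, html, limit=MAX_URLS):
    """Return up to `limit` unique absolute links on the same host as base_url."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    seen, result = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts before comparing.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https") or parsed.netloc != base_host:
            continue  # skip mailto:, javascript:, and external hosts
        if absolute in seen:
            continue
        seen.add(absolute)
        result.append(absolute)
        if len(result) >= limit:
            break
    return result


def immediate_links(base_url, links):
    """Keep links one path segment below base_url (assumed meaning of 'immediate')."""
    base_path = urlparse(base_url).path.rstrip("/")
    kept = []
    for link in links:
        path = urlparse(link).path.rstrip("/")
        if path.startswith(base_path + "/"):
            remainder = path[len(base_path) + 1:]
            if remainder and "/" not in remainder:
                kept.append(link)
    return kept
```

Given the HTML of a base page as a string, `immediate_links(base_url, internal_links(base_url, html))` would yield the pages to screenshot. How the HTML is fetched (Selenium's page source, `urllib`, or something else) is left open here.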

[Image: Output process]

[Image: Folder containing screenshots]

Required input: the base URL (e.g. https://umd.edu/virusinfo or https://www.ucf.edu/coronavirus)

You can run the Python file as is, provided the required imports are installed. I think the Python notebook is the more interactive way to run it.

Constraint: chromedriver must be available on your PATH. Download the chromedriver build for your OS (https://chromedriver.chromium.org/).
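
A quick way to confirm the PATH requirement, together with one common way of taking a full-page screenshot in headless Chrome, might look like the sketch below. The function names `find_chromedriver` and `full_page_screenshot` are hypothetical, and resizing the window to the page's scroll size is an assumed technique; the repository may do this differently.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver binary if it is on PATH, else None."""
    return shutil.which(name)


def full_page_screenshot(url, out_path):
    """Save a PNG of the whole page at `url` to `out_path` (assumed approach)."""
    # Imported lazily so the PATH check above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # headless mode lets the window grow past the screen size
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Resize the window to the document's full scroll size before capturing.
        width = driver.execute_script("return document.documentElement.scrollWidth")
        height = driver.execute_script("return document.documentElement.scrollHeight")
        driver.set_window_size(width, height)
        driver.save_screenshot(out_path)
    finally:
        driver.quit()
```

If `find_chromedriver()` returns `None`, the driver is not on the PATH and `full_page_screenshot` is unlikely to start Chrome.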

Folder naming scheme: URL_Screenshots

File naming scheme: URL-page_slug
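
One possible reading of these two schemes is sketched below. The exact sanitising rules are not given in the README, so the helper names (`folder_name`, `file_name`), the replacement of non-alphanumeric characters with underscores, and the `index` fallback for an empty slug are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse


def _safe(text):
    """Replace runs of non-alphanumeric characters with '_' (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")


def folder_name(base_url):
    """Build a folder name of the form URL_Screenshots from the base URL."""
    parsed = urlparse(base_url)
    return _safe(parsed.netloc + parsed.path) + "_Screenshots"


def file_name(base_url, page_url):
    """Build a file name of the form URL-page_slug for one crawled page."""
    # The slug is taken to be the last path segment of the page URL.
    slug = urlparse(page_url).path.rstrip("/").rsplit("/", 1)[-1] or "index"
    return f"{_safe(urlparse(base_url).netloc)}-{_safe(slug)}.png"
```

Under these assumptions, `folder_name("https://umd.edu/virusinfo")` gives `umd_edu_virusinfo_Screenshots`, and a page at `https://umd.edu/virusinfo/updates/` would be saved as `umd_edu-updates.png`.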
