PGD is a toolchain that works on any operating system capable of running Chrome and Python. It places no limit on the number of images it can retrieve and download, and it requires no subprocess calls or special configuration.
- Google Chrome
- Download the CRX file from the latest release.
- Type chrome://extensions/ in the Chrome browser top bar.
- Toggle the Developer Mode switch on from the top right corner.
- Drag and drop the CRX file to the middle of the window.
- Clone the process-google-dataset repo.
- Type chrome://extensions/ in the Chrome browser top bar.
- Toggle the Developer Mode switch on from the top right corner.
- Click Load Unpacked and select the cloned repo root directory.
- Navigate to https://images.google.com.
- Search for the dataset keyword (e.g. car).
- To get more data, keep scrolling to the bottom of the search page so that more results load. The tool will retrieve all the data it can see.
- Find the extension logo at the top right corner and click "Parse and Download Metadata".
- A JSON file will be downloaded to the "Downloads" directory.
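
Since the JSON file lands in the browser's download directory, a small helper can locate it before passing it to the downloader. The following is a minimal sketch, not part of PGD: the function name `newest_json` is hypothetical, and it assumes the metadata file is the most recently modified `.json` file in the directory you point it at.

```python
from pathlib import Path
from typing import Optional


def newest_json(directory: Path) -> Optional[Path]:
    """Return the most recently modified .json file in `directory`, or None.

    Hypothetical helper: assumes the extension's metadata file is the
    newest .json file in the given directory.
    """
    candidates = [p for p in directory.glob("*.json") if p.is_file()]
    if not candidates:
        return None
    # Pick the file with the latest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Example (assumes the browser saves to ~/Downloads; adjust if yours differs):
# print(newest_json(Path.home() / "Downloads"))
```

The returned path can then be supplied as the value of --json-path.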
- Python 3
python3 download.py --json-path /path/to/downloaded/json/file/from/extension/ --label cars --output-dir /path/to/output/directory
--label: Name of the subdirectory/label that describes the data.
--json-path: Path to the JSON file downloaded from the extension.
--output-dir: Directory in which a new directory named after the label will be created and the images will be stored.
--timeout: (OPTIONAL) Time in seconds after which the downloader gives up on an image and moves on to the next one.
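
To illustrate how these options could fit together, here is a minimal sketch of a command-line parser and a download loop that honours the timeout. It is not the actual download.py: the function names, the 10-second default, the `.img` file suffix, and the assumption that the image URLs have already been extracted from the JSON into a plain list are all illustrative.

```python
import argparse
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags (a sketch, not PGD's own)."""
    parser = argparse.ArgumentParser(description="Download images listed in a metadata file.")
    parser.add_argument("--json-path", required=True, help="Path to the JSON file from the extension.")
    parser.add_argument("--label", required=True, help="Name of the subdirectory/label.")
    parser.add_argument("--output-dir", required=True, help="Directory that will contain the label directory.")
    # Assumption: the 10-second default is illustrative; the real default is not documented here.
    parser.add_argument("--timeout", type=float, default=10.0, help="Seconds to wait per image.")
    return parser


def fetch(url: str, timeout: float) -> bytes:
    """Fetch one URL, raising an OSError subclass on timeout or network failure."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(
    urls: Iterable[str],
    output_dir: str,
    label: str,
    timeout: float,
    fetcher: Callable[[str, float], bytes] = fetch,
) -> List[Path]:
    """Save each URL under output_dir/label, skipping images that fail or time out."""
    target = Path(output_dir) / label
    target.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    for index, url in enumerate(urls):
        try:
            data = fetcher(url, timeout)
        except OSError:
            # URLError and socket timeouts are OSError subclasses: move on to the next image.
            continue
        # Hypothetical naming scheme: zero-padded index with a placeholder suffix.
        path = target / f"{index:06d}.img"
        path.write_bytes(data)
        saved.append(path)
    return saved
```

Passing the fetch function as a parameter keeps the loop testable without network access, and catching `OSError` per image means one slow or broken URL does not stop the rest of the run.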