A collection of libraries used to quickly download images from various locations based on keywords.
imagedownloader was originally created to collect a large number of unique images for use in machine learning training data. Because of this, the libraries may be a little "overfitted" for our needs, but maybe they will help you, too!
imagedownloader has two pecuilar behaviors that are worth noting.
-
imagedownloader downloads all images as ".jpg" files. This shortcut was out of pure laziness. In theory, the extension can be extracted from the URL of the image Tracked in Issue #1
-
imagedownloader attempts to keep track of the images it has downloaded on a past run, and uses the image's URL to do a preliminary duplicate-check. Thus, when images are saved, they are saved with a filename of their hashed URL.
To dive right into the code, be sure to look at sample.py
. Otherwise, here is
a small API reference
The imagedownloader library uses inheritence to organize its various downloaders.
The base functionality is found in image_downloader.py
, which handles
spawning threads, downloading images, and keeping track of duplicates.
All subclasses of ImageDownloader should implement get_image_urls_from_source
,
which takes the query string, the offset integer, and the count integer in order
to return a list of image URLs.
A BingImageDownloader
instance may be construced using the standard
ImageDownloader
semantics with the addition of a
Bing Search API key.
downloader = BingImageDownloader("images", 8, ["cats", "dogs"], 10000, apiKey)
A DeviantArtImageDownloader
instance may be constructed using the standard
ImageDownloader
semantics.
downloader = DeviantArtImageDownloade("images", 8, ["cats", "dogs"], 10000)