
Introduction

The Wikimedia Commons Image Dataset comprises over 40 million URLs to Wikimedia Commons images.

Requirements

These dependencies are only required if you plan to run the data scraper yourself, which is unnecessary, since the compiled dataset is already available on Kaggle.

These are the requirements for the PyTorch DataLoader:

Data

The data is stored in a simple compressed format, with one URL per line.

URLs to Wikimedia Commons images take one of two forms:
https://upload.wikimedia.org/wikipedia/commons/thumb/<ID-1>/<ID-2>/<FILENAME>
or, without the /thumb/ component:
https://upload.wikimedia.org/wikipedia/commons/<ID-1>/<ID-2>/<FILENAME>

<ID-1> is 1 character in length, and <ID-2> is 2 characters.

Each URL is compressed to <THUMB><ID-1><ID-2><FILENAME>

where <THUMB> is a single binary digit (0 or 1) indicating whether /thumb/ is a component of the path.

There are 41,666,578 URLs in total, amounting to 4.73 GB.
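To make the format concrete, here is a minimal sketch of how one compressed line could be decoded back into a full URL. The decode_url helper is hypothetical (it is not part of this repository) and assumes only the <THUMB><ID-1><ID-2><FILENAME> layout described above:

def decode_url(line: str) -> str:
    # <THUMB>: '1' if the path contains /thumb/, '0' otherwise
    thumb = line[0] == "1"
    # <ID-1> is 1 character; <ID-2> is the next 2 characters
    id1, id2, filename = line[1], line[2:4], line[4:]
    prefix = "https://upload.wikimedia.org/wikipedia/commons/"
    if thumb:
        prefix += "thumb/"
    return f"{prefix}{id1}/{id2}/{filename}"

# e.g. decode_url("0a1bExample.jpg")
# -> https://upload.wikimedia.org/wikipedia/commons/a/1b/Example.jpg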

Usage

Included are:

  • a PyTorch Dataset and DataLoader
  • a TensorFlow Dataset

PyTorch DataLoader Usage

To demo the PyTorch DataLoader, first cd to the main directory. Then, download the dataset:

kaggle datasets download -d ryanrudes/wikimedia --unzip

Then, run the script:

python loaders/pytorch.py

You can use the dataset in your own code simply by importing the loader class, for example:

from loaders.pytorch import WikimediaCommonsLoader

loader = WikimediaCommonsLoader()

for batch in loader:
    print(batch.shape)

>>> torch.Size([32, 3, 256, 256])
>>> torch.Size([32, 3, 256, 256])
>>> torch.Size([32, 3, 256, 256])
>>> torch.Size([32, 3, 256, 256])
>>> torch.Size([32, 3, 256, 256])
...

WikimediaCommonsLoader accepts the following arguments, listed here with their default values:

path        = 'filtered.txt'
verbose     = True
max_retries = None
timeout     = None
shuffle     = True
max_buffer  = 4096
workers     = 8
transform   = None
batch_size  = 32
resize_to   = 512
crop_to     = 256
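Assuming these are ordinary keyword arguments to the constructor, a customized loader might look like the sketch below (the values are illustrative, and the retry/timeout semantics noted in the comments are assumptions):

from loaders.pytorch import WikimediaCommonsLoader

loader = WikimediaCommonsLoader(
    path="filtered.txt",  # the compressed URL file from the Kaggle download
    batch_size=64,        # more images per batch
    max_retries=3,        # assumed: stop retrying a failed download after 3 attempts
    timeout=10,           # per-request timeout, assumed to be in seconds
    shuffle=True,
    resize_to=384,
    crop_to=224,
)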

Alternatively, you can use the underlying WikimediaCommonsDataset class. It yields the raw images one by one without applying any transformations, whereas WikimediaCommonsLoader resizes and randomly crops each image:

from loaders.pytorch import WikimediaCommonsDataset

dataset = WikimediaCommonsDataset()

for image in dataset:
    print(image.shape)

>>> (120, 100, 3)
>>> (80, 120, 3)
>>> (120, 80, 3)
>>> (98, 120, 3)
>>> (120, 97, 3)
>>> (120, 120, 3)
...

The WikimediaCommonsDataset class takes the same arguments as WikimediaCommonsLoader, except for batch_size, resize_to, and crop_to.
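Because WikimediaCommonsDataset yields raw images, any batching or preprocessing is left to you. The shapes printed above suggest (height, width, 3) uint8 NumPy arrays; under that assumption, a minimal manual-conversion sketch might look like this:

import torch

from loaders.pytorch import WikimediaCommonsDataset

dataset = WikimediaCommonsDataset()

for image in dataset:
    # Assumed HWC uint8 NumPy layout, per the shapes shown above
    tensor = torch.from_numpy(image).permute(2, 0, 1).float() / 255.0
    # ...apply your own transforms or feed `tensor` to a model here...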

Links and Further Info

The dataset is available on Kaggle.

This project is licensed under the MIT license. All images linked in this dataset are in the public domain; the only exception would have been links to Wikimedia Foundation logos, which were already filtered out before the data was uploaded.