
Introduction

The Wikimedia Commons Image Dataset comprises over 40 million URLs to Wikimedia Commons images.

Requirements

These dependencies are only required if you plan to run the data scraper yourself, which is unnecessary, since the compiled dataset is already available on Kaggle.

These are the requirements for the PyTorch DataLoader:

Data

The data is stored in a simple compressed format, with one URL per line.

URLs to Wikimedia Commons images take one of two forms:
https://upload.wikimedia.org/wikipedia/commons/thumb/<ID-1>/<ID-2>/<FILENAME>
or, without the /thumb/ component:
https://upload.wikimedia.org/wikipedia/commons/<ID-1>/<ID-2>/<FILENAME>

<ID-1> is 1 character in length, and <ID-2> is 2 characters.

Each URL is compressed to <THUMB><ID-1><ID-2><FILENAME>

where <THUMB> is a single binary digit (0 or 1) indicating whether /thumb/ is a component of the path.

There are 41,666,578 URLs in total, amounting to 4.73 GB.
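To make the format concrete, here is a minimal sketch of how one compressed line could be decoded back into a full URL. The decode_url helper is hypothetical (it is not part of this repository) and assumes only the <THUMB><ID-1><ID-2><FILENAME> layout described above:

def decode_url(line: str) -> str:
    # <THUMB>: '1' if the path contains /thumb/, '0' otherwise
    thumb = line[0] == "1"
    # <ID-1> is 1 character; <ID-2> is the next 2 characters
    id1, id2, filename = line[1], line[2:4], line[4:]
    prefix = "https://upload.wikimedia.org/wikipedia/commons/"
    if thumb:
        prefix += "thumb/"
    return f"{prefix}{id1}/{id2}/{filename}"

# e.g. decode_url("0a1bExample.jpg")
# -> https://upload.wikimedia.org/wikipedia/commons/a/1b/Example.jpg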

Usage

Included are:

  • a PyTorch Dataset and DataLoader
  • a TensorFlow Dataset

PyTorch DataLoader Usage

To demo the PyTorch DataLoader, first cd to the main directory. Then, download the dataset:

kaggle datasets download -d ryanrudes/wikimedia --unzip

Then, run the script:

python loaders/pytorch.py

You can use the dataset in your own code simply by importing the loader class, for example:

from loaders.pytorch import WikimediaCommonsLoader

loader = WikimediaCommonsLoader()

for batch in loader:
    print(batch.shape)

>>> torch.Size([32, 3, 256, 256])
>>> torch.Size([32, 3, 256, 256])
>>> torch.Size([32, 3, 256, 256])
>>> torch.Size([32, 3, 256, 256])
>>> torch.Size([32, 3, 256, 256])
...

WikimediaCommonsLoader accepts the following arguments, listed here with their default values:

path        = 'filtered.txt'
verbose     = True
max_retries = None
timeout     = None
shuffle     = True
max_buffer  = 4096
workers     = 8
transform   = None
batch_size  = 32
resize_to   = 512
crop_to     = 256
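Assuming these are ordinary keyword arguments to the constructor, a customized loader might look like the sketch below (the values are illustrative, and the retry/timeout semantics noted in the comments are assumptions):

from loaders.pytorch import WikimediaCommonsLoader

loader = WikimediaCommonsLoader(
    path="filtered.txt",  # the compressed URL file from the Kaggle download
    batch_size=64,        # more images per batch
    max_retries=3,        # assumed: stop retrying a failed download after 3 attempts
    timeout=10,           # per-request timeout, assumed to be in seconds
    shuffle=True,
    resize_to=384,
    crop_to=224,
)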

Alternatively, you can use the underlying WikimediaCommonsDataset class. It yields the raw images one by one without applying any transformations, whereas WikimediaCommonsLoader resizes and randomly crops each image:

from loaders.pytorch import WikimediaCommonsDataset

dataset = WikimediaCommonsDataset()

for image in dataset:
    print(image.shape)

>>> (120, 100, 3)
>>> (80, 120, 3)
>>> (120, 80, 3)
>>> (98, 120, 3)
>>> (120, 97, 3)
>>> (120, 120, 3)
...

The WikimediaCommonsDataset class takes the same arguments as WikimediaCommonsLoader, except for batch_size, resize_to, and crop_to.
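Because WikimediaCommonsDataset yields raw images, any batching or preprocessing is left to you. The shapes printed above suggest (height, width, 3) uint8 NumPy arrays; under that assumption, a minimal manual-conversion sketch might look like this:

import torch

from loaders.pytorch import WikimediaCommonsDataset

dataset = WikimediaCommonsDataset()

for image in dataset:
    # Assumed HWC uint8 NumPy layout, per the shapes shown above
    tensor = torch.from_numpy(image).permute(2, 0, 1).float() / 255.0
    # ...apply your own transforms or feed `tensor` to a model here...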

Links and Further Info

The dataset is available on Kaggle.

This project is licensed under the MIT license. All images linked in this dataset are in the public domain; the only exception would have been links to Wikimedia Foundation logos, which were already filtered out before the data was uploaded.