GitHub - adiwishnitzer1/fastdup: FastDup is a tool for gaining insights from a large image collection. It can find anomalies, duplicate and near duplicate images, clusters of similaritity, learn the normal behavior and temporal interactions between images. It can be used for smart subsampling of a higher quality dataset, outlier removal, novelty detection of new information to be

Easily Manage, Clean & Curate Visual Data at Scale

fastdup by Visual-Layer is an unsupervised powerful free tool designed to rapidly extract valuable insights from your image & video datasets. Assisting you to increase your dataset quality and reduce your data operations costs at an unparalleled scale.

From the authors of XGBoost, Apache TVM & Turi Create. Danny Bickson, Carlos Guestrin & Amir Alush

Large Image Datasets Today are a Mess Blog | Processing LAION400m Video

Introducing fastdup V1.0::tada:

Clean & simple API: The new API is simpler to use
Native Windows support: Windows now has first-class, full feature support in fastdup
Amazing documentation: New and imporved fasdtdup documentation
Sleek galleries: New and improved galleries to get a better view of your data
Extensive labels support : Improved support for handling image and bounding box labels
Additional image formats support: Apple’s HEIC+HEIF, 16 bit grayscale TIFF
Support for Python3.10
Fully backcompatible to old API

fastdup identifies these data issues:

What makes fastdup unique?

Quality: fastdup can assist you in reaching a high quality dataset by finding and removing anomalies and outliers from your datasets. Finding duplicate and near duplicate of images (&videos) and finding clusters of similarity at a large scale!
Cost : fastdup can also help you in reducing your data operations costs by facilitating the intelligent sampling of high-quality or novel datasets prior to labeling, as well as support the quality assessment of labeled data.
Scale: fastdup graph engine is written in C++ and is highly efficient and works in an incredible scale! Running locally on a CPU only machine and can handle up to 400M images on a single CPU machine!

Get insights on your data with just 3 lines of code:

Installation

# upgrade pip to its latest version
pip install -U pip

# install fastdup
pip install fastdup
    
# Alternatively, use explicit python version (XX)
python3.XX -m pip install fastdup

Supported Python: 3.7, 3.8, 3.9, 3.10
Supported OS: Windows 10, 11 and 2019 Server (Native), Windows WSL, Ubuntu (20.04, 18.04), Mac OSX 10+ (Intel and M1 CPUs), Amazon Linux 2, CentOS 7, RedHat 4.8.
Full installation instructions are here

Running fastdup

import fastdup

fd = fastdup.create(work_dir, images_dir)
fd.run(nearest_neighbors_k=5, cc_threshold=0.96)

fd.vis.duplicates_gallery()     #create a visual gallery of found duplicates
fd.vis.outliers_gallery()       #create a visual gallery of anomalies
fd.vis.component_gallery()     #create visualiaiton of connected components
fd.vis.stats_gallery()          #create visualization of images stastics (for example blur)

Working on the Oxford Pet Dataset. Detecting identical pairs, similar-pairs (search) and outliers

Getting started examples

Full documentation

Support and feature requests

Join our Slack channel

Have a question? Use our discussion forum

🚀 fastdup enterprise early access

Sign up at Visual Layer

What our users think about fastdup:

User community contributions

License questions

Please reach us at info@visual-layer.com

Disclaimer

Usage Tracking

We have added experimental crash report collection, using sentry.io. It does not collect user data other than anonymized IP address data, and it only logs fastdup library's own actions. We do NOT collect folder name, user name, image names, image content only aggregate performance statistics like total number of images, average runtime per image, total free memory, total free disk space, number of cores etc. Collecting fastdup crashes will help us improve stability.

The code for the data collection is found here. On MAC we use Google crashpad.

It is always possible to opt out of the experimental crash report collection via either of the following two options:

Define an environment variable called SENTRY_OPT_OUT
or run() with turi_param='run_sentry=0'

About Visual-Layer

Visual Layer Inc.

Name		Name	Last commit message	Last commit date
Latest commit History 516 Commits
examples		examples
fastdup		fastdup
gallery		gallery
src		src
.gitignore		.gitignore
CLOUD.md		CLOUD.md
Dockerfile		Dockerfile
INSTALL.md		INSTALL.md
LICENSE		LICENSE
README.md		README.md
RUN.md		RUN.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Easily Manage, Clean & Curate Visual Data at Scale

Introducing fastdup V1.0::tada:

fastdup identifies these data issues:

What makes fastdup unique?

Get insights on your data with just 3 lines of code:

Installation

Running fastdup

Getting started examples

Full documentation

Support and feature requests

🚀 fastdup enterprise early access

What our users think about fastdup:

User community contributions

License questions

Disclaimer

About Visual-Layer

About

Releases

Packages

Languages

License

adiwishnitzer1/fastdup

Folders and files

Latest commit

History

Repository files navigation

Easily Manage, Clean & Curate Visual Data at Scale

Introducing fastdup V1.0::tada:

fastdup identifies these data issues:

What makes fastdup unique?

Get insights on your data with just 3 lines of code:

Installation

Running fastdup

Getting started examples

Full documentation

Support and feature requests

🚀 fastdup enterprise early access

What our users think about fastdup:

User community contributions

License questions

Disclaimer

About Visual-Layer

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages