8. Class Reference

Processor Class:

The Processor class is the core component responsible for identifying and grouping duplicate files based on the multi-step processing workflow. It utilizes various algorithms and grouping strategies to efficiently process and classify files into sets of similar files.


Constructors:


Processor(Grouper grouper, Collection<Algorithm<?>> algorithms)

  • Parameters:
    • grouper - A Grouper instance to perform the initial division of files based on a distinction predicate (e.g., CRC32 checksum).
    • algorithms - A collection of Algorithm objects applied to the files during the "Algorithm Application" step. The order of the algorithms matters during processing.
  • Throws:
    • NullPointerException - If grouper or algorithms is null, or if the algorithm collection is empty or contains null elements.
  • Purpose: Initializes the Processor with the provided grouping strategy and set of algorithms for processing the files.

Methods:


Map<File, Set<File>> process(@NotNull Collection<@NotNull File> files) throws IOException

  • Parameters:

    • files - A collection of File objects to be processed. Typically, these files are of the same type (e.g., images) and are grouped based on similarity.
  • Returns:

    • A Map where the key is a file considered the "original" in a group of similar files, and the value is a set of files considered duplicates or similar files.
  • Throws:

    • NullPointerException - If the input collection is null or contains null elements.
    • IOException - If any I/O error occurs during processing.
  • Purpose: This method processes the input collection of files through the following steps:

    1. Initial Division: Files are divided into subsets based on a distinction predicate.
    2. Algorithm Application: A series of algorithms is applied to the subsets to refine the grouping further.
    3. Original File Identification: The first file in each group is identified as the "original", and the groups are reorganized accordingly.

private Set<Set<File>> algorithmsApplication(@NotNull Set<Set<File>> groupedFiles) throws IOException

  • Parameters:

    • groupedFiles - A set of sets of files, where each set represents a group of similar files.
  • Returns:

    • A new set of sets of files after applying all algorithms and consolidating the groups.
  • Throws:

    • IOException - If any error occurs during the algorithm application.
  • Purpose: This method applies each algorithm in the algorithms collection to the grouped files and consolidates the results by merging groups with identical keys and removing groups with only one file.
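
A minimal sketch of this loop, assuming an algorithms field and the two helper methods documented below:

private Set<Set<File>> algorithmsApplication(Set<Set<File>> groupedFiles) throws IOException {
    for (Algorithm<?> algorithm : algorithms) {
        // Apply one algorithm, then merge identical keys and drop single-file groups.
        groupedFiles = postAlgorithmConsolidation(applyAlgorithm(algorithm, groupedFiles));
    }
    return groupedFiles;
}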


private <T> Map<T, Set<File>> applyAlgorithm(@NotNull Algorithm<T> algorithm, @NotNull Set<Set<File>> groupedFiles)

  • Parameters:

    • algorithm - The Algorithm to apply to the grouped files.
    • groupedFiles - A set of sets of files to process with the algorithm.
  • Returns:

    • A Map where the key is the characteristic (e.g., perceptual hash or CRC32 checksum) and the value is a set of files sharing that characteristic.
  • Purpose: This method applies a single algorithm to the grouped files and returns a map of results.


private Set<Set<File>> postAlgorithmConsolidation(@NotNull Map<?, Set<File>> algorithmOutput)

  • Parameters:

    • algorithmOutput - A map containing the results of the algorithm application, where the key is a shared characteristic and the value is a set of files that share that characteristic.
  • Returns:

    • A set of sets of files after consolidating the results by removing groups with only one file and merging groups with identical keys.
  • Purpose: This method consolidates the results of an algorithm by eliminating groups that contain only one file and merging groups with identical keys.


private Map<File, Set<File>> originalDistinction(@NotNull Set<Set<File>> groupedFiles)

  • Parameters:

    • groupedFiles - A set of sets of files representing groups of similar files.
  • Returns:

    • A new Map where:
      • The key is the "original" file (the first file in each group).
      • The value is a Set of files considered duplicates or similar files.
  • Throws:

    • NullPointerException - If groupedFiles contains null.
  • Purpose: This method identifies the "original" file in each group and reorganizes the groups into a map, where each key is the original file and each value is a set of similar files (including the original file itself).
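
A minimal sketch of this reorganization (hypothetical; the "first" file is simply taken from the set's iteration order):

private Map<File, Set<File>> originalDistinction(Set<Set<File>> groupedFiles) {
    Map<File, Set<File>> result = new HashMap<>();
    for (Set<File> group : groupedFiles) {
        File original = group.iterator().next();    // the "first" file by iteration order
        result.put(original, new HashSet<>(group)); // the value set includes the original itself
    }
    return result;
}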


private Set<File> consolidate(@NotNull Set<File> s1, @NotNull Set<File> s2)

  • Parameters:

    • s1 - The first set to merge.
    • s2 - The second set to merge.
  • Returns:

    • A new set containing all elements from both s1 and s2.
  • Purpose: This method merges two sets into one, ensuring that all elements from both sets are included.


Logger:


  • The Processor class uses a Logger instance (logger) from the SLF4J API to log messages during the various stages of file processing. For example, it logs the start of processing, division of files, application of algorithms, and the identification of original files.

Usage Example:


Grouper grouper = new Crc32Grouper();
List<Algorithm<?>> algorithms = List.of(new PerceptualHash(), new PixelByPixel());
Processor processor = new Processor(grouper, algorithms);

Collection<File> files = List.of(new File("image1.jpg"), new File("image2.jpg"));
Map<File, Set<File>> result = processor.process(files);
result.forEach((original, duplicates) -> {
    System.out.println("Original: " + original);
    duplicates.forEach(duplicate -> System.out.println("  Duplicate: " + duplicate));
});

Algorithms Package:

Algorithm Interface:

The Algorithm interface represents a functional abstraction for an algorithm that operates on a set of files, dividing them into smaller subsets based on some shared characteristic, resulting in a map where each key corresponds to a group of files that share that characteristic.


Methods:


Map<K, Set<File>> apply(Set<File> group)

  • Parameters:

    • group - A Set of File objects to be processed by the algorithm. These files are typically of the same type (e.g., images), and the goal is to group them based on some shared characteristic.
  • Returns:

    • A Map where each key (K) corresponds to a set of files that share the same characteristic (e.g., checksum, hash, metadata). The key is computed from the shared property of the files.
  • Purpose:

    • This method applies the algorithm to the given set of files, partitioning them into smaller groups. Each group corresponds to a characteristic shared by all the files in the group. For example, the characteristic could be a checksum, perceptual hash, or file size.
    • The key used to map each set of files should be deterministic. This means that the same group of files will always produce the same output map when the algorithm is applied, ensuring consistency in grouping.

Functional Interface:


  • The Algorithm interface is marked with the @FunctionalInterface annotation, indicating that it is designed to be used with lambda expressions or method references.
  • Purpose of Functional Interface:
    • It can be easily implemented using a lambda expression or a method reference, allowing flexibility in defining various algorithms for grouping files. This allows for the easy application of different strategies, such as comparing files based on checksum, perceptual hash, or other metrics.

Usage Example:


Algorithm<String> checksumAlgorithm = (group) -> {
    // Example algorithm logic to group files by checksum (dummy implementation)
    Map<String, Set<File>> result = new HashMap<>();
    for (File file : group) {
        String checksum = getChecksum(file); // Example method to calculate checksum
        result.computeIfAbsent(checksum, k -> new HashSet<>()).add(file);
    }
    return result;
};

Set<File> files = new HashSet<>(List.of(new File("file1.txt"), new File("file2.txt")));
Map<String, Set<File>> groupedFiles = checksumAlgorithm.apply(files);

PerceptualHash Class

The PerceptualHash class implements the Algorithm interface and is used to compute perceptual hashes for images. This class groups similar images by generating and comparing perceptual hashes, which are unique identifiers derived from an image's content. The process of generating these hashes allows for comparing images based on their visual similarities rather than their exact content, making it useful for image de-duplication or similarity detection.


Methods:


Map<String, Set<File>> apply(@NotNull Set<File> group)

  • Parameters:

    • group - A Set of File objects representing the images to be processed. Each image will be grouped based on its perceptual hash.
  • Returns:

    • A Map where each key is a perceptual hash (a String), and each value is a Set of files that share the same hash, representing images that are visually similar.
  • Purpose:

    • The method processes each image in the input set by resizing it, extracting its pixel values, applying the Discrete Cosine Transform (DCT), and generating a perceptual hash. The images are then grouped based on these hashes, and the result is returned as a map. Images with the same perceptual hash are considered similar and grouped together.

@NotNull private BufferedImage resize(@NotNull File file)

  • Parameters:

    • file - The image File to be resized.
  • Returns:

    • A BufferedImage that is resized to 8x8 pixels and converted to grayscale.
  • Purpose:

    • Resizes the image to a fixed 8x8 pixel size to standardize it for hash generation. The image is also converted to grayscale to simplify the process and reduce the detail that could interfere with the hash calculation.
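
A minimal sketch of this step, assuming ImageIO reads the file and the scaling is done with Graphics2D (the documented method may handle the IOException differently):

private BufferedImage resize(File file) throws IOException {
    BufferedImage source = ImageIO.read(file);
    BufferedImage resized = new BufferedImage(8, 8, BufferedImage.TYPE_BYTE_GRAY);
    Graphics2D g = resized.createGraphics();
    g.drawImage(source, 0, 0, 8, 8, null); // scale to 8x8; TYPE_BYTE_GRAY yields grayscale
    g.dispose();
    return resized;
}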

private double[][] extractSample(BufferedImage image)

  • Parameters:

    • image - A BufferedImage that has already been resized.
  • Returns:

    • A 2D double array representing the pixel values of the image.
  • Purpose:

    • Extracts the pixel values from the resized image and stores them in a matrix (2D array), which will be used in further steps for hash generation.

private String buildHash(double[][] matrix)

  • Parameters:

    • matrix - A 2D double array representing the pixel values of the image.
  • Returns:

    • A String representing the perceptual hash of the image, generated by comparing each pixel with the average value of the matrix.
  • Purpose:

    • Constructs a binary string (the perceptual hash) by comparing each pixel's value with the average value of all pixels in the matrix. If a pixel's value is greater than the average, it is marked as '1'; otherwise, it is marked as '0'.
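
A minimal sketch of the described comparison loop:

private String buildHash(double[][] matrix) {
    double avg = getAvg(matrix);
    StringBuilder hash = new StringBuilder();
    for (double[] row : matrix) {
        for (double value : row) {
            hash.append(value > avg ? '1' : '0'); // above average -> '1', otherwise '0'
        }
    }
    return hash.toString();
}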

private double getAvg(double[][] matrix)

  • Parameters:

    • matrix - A 2D double array representing the pixel values of the image.
  • Returns:

    • The average pixel value of the matrix, computed across all pixels.
  • Purpose:

    • Calculates the average value of the pixel values in the matrix, which is used in the buildHash method to compare each pixel against the average for hash generation.

Description:


The PerceptualHash class generates perceptual hashes for images, allowing for the grouping of similar images. The algorithm works by performing several steps:

  1. Resize: Each image is resized to 8x8 pixels.
  2. Extract Sample: The pixel values of the resized image are extracted into a matrix.
  3. Discrete Cosine Transform (DCT): The DCT is applied (via the DCT::apply method) to reduce high-frequency components and focus on the low-frequency ones.
  4. Generate Hash: A hash is created by comparing the pixel values with the average value of the matrix, where pixels greater than the average are marked as 1, and those below the average are marked as 0.
  5. Group by Hash: Images are then grouped by their perceptual hashes. Images with the same hash are considered visually similar.

The final result is a map of perceptual hashes, where the key is the hash and the value is a set of images that share that hash.


Usage Example:


Set<File> images = new HashSet<>(List.of(new File("image1.jpg"), new File("image2.jpg")));
PerceptualHash perceptualHashAlgorithm = new PerceptualHash();
Map<String, Set<File>> groupedImages = perceptualHashAlgorithm.apply(images);

groupedImages.forEach((hash, files) -> {
    System.out.println("Hash: " + hash);
    files.forEach(file -> System.out.println("  " + file.getName()));
});

This example shows how to apply the PerceptualHash algorithm to a set of image files. The result is a map where images with the same perceptual hash are grouped together, indicating that they are visually similar.


PixelByPixel Class

The PixelByPixel class implements the Algorithm interface and is used for image matching based on pixel-by-pixel comparison. The goal of this algorithm is to group identical images from a set by comparing them at the pixel level. It efficiently handles large datasets using parallel processing and caching to optimize performance.


Methods:


Map<File, Set<File>> apply(Set<File> group)

  • Parameters:

    • group - A Set of File objects representing the images to be processed.
  • Returns:

    • A Map where each key is a file, and the corresponding value is a set of files that are identical to the key file. Images that are pixel-identical are grouped together, and each value set also contains the key file itself.
  • Purpose:

    • This method processes a group of image files, comparing them pixel-by-pixel to identify identical images. The images are grouped by their pixel-level equivalence and stored in a map for the result. It utilizes a queue to manage image files and processes them in parallel.

private void process(@NotNull Map<File, Set<File>> result, @NotNull Queue<File> groupQueue)

  • Parameters:

    • result - A mutable Map that stores the groups of identical images.
    • groupQueue - A mutable Queue containing the files to be processed.
  • Purpose:

    • This method iterates through the queue, selecting a "key" image and comparing it to the other images in the queue. Identical images are removed from the queue and grouped together. The grouping is done in parallel to speed up processing.

private BufferedImage getCachedImage(@NotNull File file)

  • Parameters:

    • file - The image File to retrieve from the cache.
  • Returns:

    • A BufferedImage corresponding to the given file.
  • Purpose:

    • This method retrieves an image from the AdaptiveCache. If the image is not found in the cache, it will be loaded from the disk and added to the cache for future use. This avoids repeatedly reading the same image from disk, optimizing performance.

private boolean compareImages(@NotNull BufferedImage img1, @NotNull BufferedImage img2)

  • Parameters:

    • img1 - The first image to compare.
    • img2 - The second image to compare.
  • Returns:

    • true if the images are identical pixel-by-pixel, otherwise false.
  • Purpose:

    • This method compares two images by first checking if their dimensions match. If the dimensions are the same, it then compares the raw pixel data by examining the byte data of the image's raster. If the byte data matches, the images are considered identical.
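
A minimal sketch of this comparison, assuming the images are backed by byte rasters (consistent with the format assumption noted below):

private boolean compareImages(BufferedImage img1, BufferedImage img2) {
    if (img1.getWidth() != img2.getWidth() || img1.getHeight() != img2.getHeight()) {
        return false; // images with different dimensions can never be pixel-identical
    }
    byte[] data1 = ((DataBufferByte) img1.getRaster().getDataBuffer()).getData();
    byte[] data2 = ((DataBufferByte) img2.getRaster().getDataBuffer()).getData();
    return Arrays.equals(data1, data2); // compare raw raster bytes
}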

Description:


The PixelByPixel class is an image comparison algorithm that uses an exact, pixel-by-pixel method to identify identical images. It operates as follows:

  1. Load Images: It loads images from disk using a cache (via the AdaptiveCache class). If an image is not found in the cache, it is loaded from the disk and added to the cache.
  2. Pixel Comparison: The images are compared pixel by pixel, checking if they are exactly identical.
  3. Group Identical Images: Images that are identical (based on pixel comparison) are grouped together in a Map. The key of the map is the original image, and the value is a set of identical images.
  4. Parallel Processing: The comparison is parallelized to speed up processing, especially when handling large datasets of images.
  5. Cache Usage: The AdaptiveCache is used to optimize the image loading process, reducing the need to reload images repeatedly.

The algorithm assumes that all images in the group have the same resolution and format.


Usage Example:


Set<File> imageFiles = new HashSet<>(Arrays.asList(file1, file2, file3));
PixelByPixel algorithm = new PixelByPixel();
Map<File, Set<File>> result = algorithm.apply(imageFiles);

result.forEach((key, identicalImages) -> {
    System.out.println("Original Image: " + key.getName());
    identicalImages.forEach(file -> System.out.println("  Identical Image: " + file.getName()));
});

In this example, the PixelByPixel algorithm is applied to a set of image files. The result is a map where each key is an image file, and the value is a set of images that are identical to the key image. The images are grouped based on pixel-by-pixel comparison.


Math Package:

DCT Class

The DCT class is responsible for applying the Discrete Cosine Transform (DCT) and quantization to a given matrix of image coefficients. It serves as a pipeline that combines both transformations sequentially to prepare image data for perceptual hashing, compression, or other applications.


Dependencies:


  • pl.magzik.algorithms.math.dct.Transformer - Handles the Discrete Cosine Transform operation.
  • pl.magzik.algorithms.math.dct.Quantifier - Handles quantization of the DCT coefficients.

Constructor:


private DCT(Quantifier quantifier, Transformer transformer)

  • Parameters:
    • quantifier - The Quantifier instance that handles quantization of the DCT coefficients.
    • transformer - The Transformer instance that handles the Discrete Cosine Transform operation.

Methods:


static double[][] apply(double[][] matrix)

  • Parameters:

    • matrix - A 2D array of doubles representing the input matrix (e.g., grayscale pixel values from an image).
  • Returns:

    • A 2D array of doubles representing the quantized DCT coefficients.
  • Purpose:

    • This method performs both DCT and quantization in sequence.
    • It creates new instances of Quantifier and Transformer, initializes the DCT pipeline, and processes the input matrix.
  • How It Works:

    • Step 1: The matrix is passed to the transform method of Transformer, which applies the Discrete Cosine Transform.
    • Step 2: The resulting DCT coefficients are passed to the quantize method of Quantifier, which reduces their precision.
    • Step 3: The final quantized matrix is returned as output.

private double[][] applyInternal(double[][] matrix)

  • Parameters:

    • matrix - A 2D array of doubles representing the input matrix.
  • Returns:

    • A 2D array of doubles representing the quantized DCT coefficients.
  • Purpose:

    • This method is used internally to apply the transformation pipeline.
    • It first calls the transform method of the Transformer instance to compute the DCT.
    • Then, it calls the quantize method of the Quantifier instance to apply quantization.
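
Taken together, the pipeline reduces to two calls; a minimal sketch, assuming quantifier and transformer fields set by the constructor:

private double[][] applyInternal(double[][] matrix) {
    double[][] dctCoefficients = transformer.transform(matrix); // Step 1: Discrete Cosine Transform
    return quantifier.quantize(dctCoefficients);                // Step 2: quantization
}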

DCT package:


Quantifier Class:

The Quantifier class is responsible for performing quantization on a matrix of DCT (Discrete Cosine Transform) coefficients.


Constructors:

Quantifier(int[][] quantizationMatrix)

  • Parameters:
    • quantizationMatrix - A 2D integer array representing the quantization matrix.

Quantifier()

  • Default Matrix: The default matrix follows the JPEG standard for 8x8 blocks:
{ {16, 11, 10, 16, 24, 40, 51, 61},
  {12, 12, 14, 19, 26, 58, 60, 55},
  {14, 13, 16, 24, 40, 57, 69, 56},
  {14, 17, 22, 29, 51, 87, 80, 62},
  {18, 22, 37, 56, 68, 109, 103, 77},
  {24, 35, 55, 64, 81, 104, 113, 92},
  {49, 64, 78, 87, 103, 121, 120, 101},
  {72, 92, 95, 98, 112, 100, 103, 99} };

Methods:

double[][] quantize(double[][] coeffs)

  • Parameters:

    • coeffs - A 2D double array representing the matrix of DCT coefficients.
  • Returns:

    • A 2D double array of quantized coefficients.
  • Throws:

    • IllegalArgumentException - If the dimensions of the input coeffs matrix don't match the quantization matrix dimensions.
  • Purpose:

    • Applies quantization to the given matrix of DCT coefficients using the quantization matrix.
    • Each DCT coefficient is divided by its corresponding quantization value and then rounded.
  • How It Works:

    • Loops through each value in the coefficient matrix.
    • Divides each coefficient by the corresponding value in the quantization matrix.
    • Rounds the result and stores it in a new matrix.
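
A minimal sketch of this loop, assuming a quantizationMatrix field:

public double[][] quantize(double[][] coeffs) {
    if (coeffs.length != quantizationMatrix.length || coeffs[0].length != quantizationMatrix[0].length) {
        throw new IllegalArgumentException("Coefficient matrix must match quantization matrix dimensions.");
    }
    double[][] quantized = new double[coeffs.length][coeffs[0].length];
    for (int i = 0; i < coeffs.length; i++) {
        for (int j = 0; j < coeffs[i].length; j++) {
            quantized[i][j] = Math.round(coeffs[i][j] / quantizationMatrix[i][j]); // divide, then round
        }
    }
    return quantized;
}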

Transformer Class:

The Transformer class provides methods to perform the Discrete Cosine Transform (DCT) on both 1D vectors and 2D matrices. It leverages the efficient JTransforms library to compute DCT operations.


Methods:

double[] transform(double[] vector)

  • Parameters:

    • vector - A 1D array of double values to be transformed.
  • Returns:

    • A new double[] array containing the transformed values.
  • Purpose:

    • Computes the 1D DCT for the given vector using the DoubleDCT_1D class from the JTransforms library.
  • How It Works:

    • Clones the input vector to avoid mutating the original data.
    • Initializes a DoubleDCT_1D object with the vector's length.
    • Calls forward() with scaling = true to normalize the result.

double[][] transform(double[][] matrix)

  • Parameters:

    • matrix - A 2D array of double values representing the input data.
  • Returns:

    • A new 2D double[][] array containing the transformed values.
  • Purpose:

    • Performs a 2D DCT on the input matrix. This is achieved by:
      1. Applying a 1D DCT to each row of the matrix.
      2. Applying a 1D DCT to each column of the intermediate result.
  • How It Works:

    1. Copies the input matrix into a new transformed array.
    2. Transforms each row using the transform(double[] vector) method.
    3. Extracts each column, transforms it, and writes back the result into the matrix.
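
A minimal sketch of both transform methods as described, using DoubleDCT_1D from the JTransforms library:

public double[] transform(double[] vector) {
    double[] copy = vector.clone();                    // avoid mutating the caller's data
    new DoubleDCT_1D(copy.length).forward(copy, true); // scaling = true normalizes the result
    return copy;
}

public double[][] transform(double[][] matrix) {
    int rows = matrix.length, cols = matrix[0].length;
    double[][] transformed = new double[rows][];
    for (int i = 0; i < rows; i++) {
        transformed[i] = transform(matrix[i]);         // 1D DCT on each row
    }
    for (int j = 0; j < cols; j++) {
        double[] column = new double[rows];
        for (int i = 0; i < rows; i++) column[i] = transformed[i][j];
        column = transform(column);                    // 1D DCT on each column
        for (int i = 0; i < rows; i++) transformed[i][j] = column[i];
    }
    return transformed;
}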

Cache Package:

AdaptiveCache Class:

The AdaptiveCache class provides an adaptive memory-based caching solution using the Caffeine caching library. It dynamically adjusts memory usage based on the available JVM heap memory, ensuring efficient memory management and performance for image caching.


Constants:


private static final double MAXIMUM_MEMORY_PERCENTAGE = 0.6

  • Limits the cache size to 60% of the JVM heap memory.

Constructor


private AdaptiveCache(long maximumWeight)

  • Parameters:
    • maximumWeight - the maximum memory the cache can use.
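
A plausible sketch of how this constructor could wire up the Caffeine cache (the cache field and its wiring are assumptions based on the methods below):

private AdaptiveCache(long maximumWeight) {
    this.cache = Caffeine.newBuilder()
            .maximumWeight(maximumWeight)  // cap total weight at the computed byte budget
            .weigher(this::getImageWeight) // weigh entries by estimated image size in bytes
            .recordStats()                 // collect statistics for the monitor task
            .build();
}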

Methods:


static AdaptiveCache getInstance()

  • Returns:

    • The singleton AdaptiveCache instance.
  • Purpose:

    • Provides access to the singleton instance of the cache.

BufferedImage get(@NotNull File key) throws IOException

  • Parameters:

    • key - A File object representing the image file.
  • Returns:

    • The BufferedImage loaded from cache or disk.
  • Throws:

    • IOException - If the image cannot be loaded.
  • Purpose:

    • Retrieves an image from the cache. If the image is not cached, it loads it from disk and stores it in the cache.

void monitor(long period)

  • Parameters:

    • period - Interval (in seconds) between cache logs.
  • Purpose:

    • Starts a periodic task that logs cache statistics at regular intervals.
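
A rough sketch of such a task, assuming a started flag (the AtomicBoolean mentioned in the design notes below) and a stats-recording Caffeine cache:

public void monitor(long period) {
    if (!started.compareAndSet(false, true)) return; // ensure the monitor starts only once
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
            () -> logger.info("Cache stats: {}", cache.stats()),
            period, period, TimeUnit.SECONDS);
}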

private int getImageWeight(File key, @NotNull BufferedImage value)

  • Parameters:

    • key - The image file.
    • value - The BufferedImage whose weight is calculated.
  • Returns:

    • The memory weight of the image in bytes.
  • Purpose:

    • Computes the memory weight of an image in bytes.
    • Assumes each pixel is represented by 4 bytes (RGBA).
  • Formula:

return value.getWidth() * value.getHeight() * 4;

private BufferedImage loadImage(@NotNull File key)

  • Parameters:

    • key - The image file.
  • Returns:

    • The loaded BufferedImage.
  • Throws:

    • UncheckedIOException - If the file cannot be read or the format is unsupported.
  • Purpose:

    • Loads an image from disk using ImageIO.

private static long getMaximumWeight()

  • Returns:

    • Maximum cache weight (in bytes).
  • Purpose:

    • Calculates the maximum cache size based on the available JVM memory (60% of the heap).
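
A minimal sketch of this calculation:

private static long getMaximumWeight() {
    // 60% of the maximum JVM heap, per MAXIMUM_MEMORY_PERCENTAGE
    return (long) (Runtime.getRuntime().maxMemory() * MAXIMUM_MEMORY_PERCENTAGE);
}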

Key Design Notes:


  1. Adaptive Memory Management:
  • The maximumWeight is dynamically calculated based on JVM heap size.
  2. Thread-Safe:
  • The cache itself is thread-safe as Caffeine provides synchronized operations internally.
  • The monitor uses AtomicBoolean to ensure it starts only once.
  3. Error Handling:
  • Uses UncheckedIOException to propagate IOException from ImageIO read operations.
  • Logs detailed error information using SLF4J.

Grouping Package:

Grouper Interface:

The Grouper interface represents a functional interface designed to group files into subsets that share a common characteristic, such as having identical checksum or other similarity criteria. It abstracts the process of grouping files for organizational or comparison purposes.


Methods:


Set<Set<File>> divide(Collection<File> col) throws IOException

  • Parameters:

    • col - A Collection of File objects to be divided into subsets. Typically these files are of the same type (e.g., images).
  • Returns:

    • A set of subsets of files, where each subset (Set<File>) contains files that share a common characteristic.
  • Throws:

    • IOException - If an I/O error occurs while reading or processing the files.
  • Purpose: Divides a collection of files into subsets based on a defined distinction or grouping criterion. Each subset contains files that share a common property, such as identical content, checksum, or other user-defined similarities.


Key Design Notes:


  • Generality: The divide method is intentionally kept generic to accommodate any type of grouping logic.
  • Performance Consideration: Implementations of divide should be optimized for performance, especially when processing large file collections.
  • Immutability of Results: Returning a Set<Set<File>> ensures no duplicate groups exist, and each subset of files can be easily iterated over.

Usage Example:


public class FileSizeGrouper implements Grouper {

    @Override
    public Set<Set<File>> divide(Collection<File> col) throws IOException {
        // Group files by their size using a map
        Map<Long, Set<File>> sizeGroups = new HashMap<>();

        for (File file : col) {
            if (file.isFile()) {
                long size = Files.size(file.toPath());
                sizeGroups.computeIfAbsent(size, k -> new HashSet<>()).add(file);
            }
        }

        return new HashSet<>(sizeGroups.values());
    }
}


public class Main {
    public static void main(String[] args) throws IOException {
        List<File> files = List.of(
                new File("image1.jpg"),
                new File("image2.jpg"),
                new File("duplicate_image1.jpg")
        );

        Grouper grouper = new FileSizeGrouper();
        Set<Set<File>> groupedFiles = grouper.divide(files);

        groupedFiles.forEach(group -> {
            System.out.println("Group:");
            group.forEach(file -> System.out.println(" - " + file.getName()));
        });
    }
}

CRC32Grouper Class:

The CRC32Grouper class implements the Grouper interface to group files based on their CRC32 checksum. Files that share the same checksum are assumed to be identical and grouped together. This approach is useful for detecting duplicate files in a collection.


Methods:


Set<Set<File>> divide(Collection<File> col)

  • Parameters:

    • col - A Collection of File objects to group based on their checksum.
  • Returns:

    • A set of subsets of files, where each subset (Set<File>) contains files that share the same checksum.
  • Purpose:

    • Divides a collection of files into subsets based on their CRC32 checksum values. Files with the same checksum are grouped together.

private long calculateChecksum(File f) throws IOException

  • Parameters:

    • f - The File for which the checksum is to be calculated.
  • Returns:

    • long - The CRC32 checksum value of the file.
  • Throws:

    • IOException - If an I/O error occurs while reading the file.
  • Purpose: Calculates the CRC32 checksum for a given file. This method reads the file in chunks to optimize memory usage and applies the CRC32 algorithm to generate the checksum.
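
A minimal sketch of this chunked read, using java.util.zip.CRC32 (the buffer size is an arbitrary choice):

private long calculateChecksum(File f) throws IOException {
    CRC32 crc = new CRC32();
    try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
        byte[] buffer = new byte[8192]; // read in chunks to keep memory usage low
        int read;
        while ((read = in.read(buffer)) != -1) {
            crc.update(buffer, 0, read);
        }
    }
    return crc.getValue();
}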


Key Design Notes:


  1. Parallel Stream Processing:
  • Files are processed in parallel for performance. This ensures that large file collections are grouped efficiently.
  2. Grouping Logic:
  • Uses a Map to associate checksum values with sets of files.
  • Files with identical checksums are collected into the same group.
  3. Filter Unique Groups:
  • Only file groups with more than one file are retained as potential duplicates.

Usage Example:


import pl.magzik.grouping.CRC32Grouper;
import pl.magzik.grouping.Grouper;

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Set;

public class Main {
    public static void main(String[] args) throws IOException {
        List<File> files = List.of(
                new File("file1.txt"),
                new File("file2.txt"),
                new File("duplicate_file1.txt")
        );

        Grouper grouper = new CRC32Grouper();

        Set<Set<File>> groupedFiles = grouper.divide(files);

        groupedFiles.forEach(group -> {
            System.out.println("Group of duplicate files:");
            group.forEach(file -> System.out.println(" - " + file.getName()));
        });
    }
}

IO Package:

FileOperation Interface:

The FileOperation interface defines a standardized contract for performing file management operations such as:

  1. Loading files.
  2. Moving files to a specified directory.
  3. Deleting files.

It supports operations on both collections of files and individual arrays of files. Default methods ensure flexibility by delegating array-based operations to their corresponding collection-based methods.


Methods:


List<File> load(Collection<File> files) throws IOException

  • Parameters:

    • files - A Collection of File objects to be loaded.
  • Returns:

    • List<File> - A list containing the loaded files.
  • Throws:

    • IOException - If an I/O error occurs while loading the files.
  • Purpose: Loads the provided collection of files. The operation may involve verifying file existence, reading metadata, or other preparatory operations.


default List<File> load(File... files) throws IOException

  • Parameters:

    • files - An array of files to be loaded.
  • Returns:

    • List<File> - A list containing the loaded files.
  • Throws:

    • IOException - If an I/O error occurs while loading the files.
  • Purpose: Loads the provided array of files. Delegates to the collection-based load(Collection<File>) method.


void move(File destination, Collection<File> files) throws IOException

  • Parameters:

    • destination - The target directory for the moved files.
    • files - The Collection of File objects to be moved.
  • Throws:

    • IOException - If an I/O error occurs while moving the files.
  • Purpose: Moves the provided collection of files to a specified destination directory.


default void move(File destination, File... files) throws IOException

  • Parameters:

    • destination - The target directory for the moved files.
    • files - An array of files to be moved.
  • Throws:

    • IOException - If an I/O error occurs while moving the files.
  • Purpose: Moves the provided array of files to the specified destination directory. Delegates to the collection-based move(File, Collection<File>) method.


void delete(Collection<File> files) throws IOException

  • Parameters:

    • files - The Collection of File objects to be deleted.
  • Throws:

    • IOException - If an I/O error occurs while deleting the files.
  • Purpose: Deletes the provided collection of files.


default void delete(File... files) throws IOException

  • Parameters:

    • files - An array of files to be deleted.
  • Throws:

    • IOException - If an I/O error occurs while deleting the files.
  • Purpose: Deletes the provided array of files. Delegates to the collection-based delete(Collection<File>) method.


Usage Example:


public class SimpleFileOperation implements FileOperation {

    @Override
    public List<File> load(Collection<File> files) throws IOException {
        for (File file : files) {
            if (!file.exists()) {
                throw new IOException("File not found: " + file.getName());
            }
        }
        return List.copyOf(files);
    }

    @Override
    public void move(File destination, Collection<File> files) throws IOException {
        if (!destination.isDirectory()) {
            throw new IOException("Destination must be a directory.");
        }

        for (File file : files) {
            File target = new File(destination, file.getName());
            if (!file.renameTo(target)) {
                throw new IOException("Failed to move file: " + file.getName());
            }
        }
    }

    @Override
    public void delete(Collection<File> files) throws IOException {
        for (File file : files) {
            if (!file.delete()) {
                throw new IOException("Failed to delete file: " + file.getName());
            }
        }
    }

    public static void main(String[] args) {
        FileOperation fileOperation = new SimpleFileOperation();

        try {
            File file1 = new File("file1.txt");
            File file2 = new File("file2.txt");
            File destination = new File("targetDirectory");

            fileOperation.load(file1, file2);
            fileOperation.move(destination, file1, file2);
            fileOperation.delete(file1, file2);

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

FileOperator Class:

The FileOperator class provides an efficient and asynchronous implementation of file operations such as loading, moving, and deleting files. It is designed for I/O-bound operations and utilizes virtual threads and CompletableFuture for parallelism and non-blocking processing.

The operations performed by FileOperator include:

  1. Pre-validation: Ensures that provided files exist and are accessible.
  2. Regular File Validation: Processes and validates individual files using a FilePredicate.
  3. Directory Validation: Recursively processes directories up to a specified depth, validating and collecting files.

It uses a configurable depth for directory traversal and an ExecutorService for asynchronous processing.


Constructors:


FileOperator(FilePredicate filePredicate, int depth)

  • Parameters:

    • filePredicate - The FilePredicate used to validate files.
    • depth - The directory traversal depth.
  • Purpose: Creates a FileOperator with the specified file predicate, directory traversal depth, and a default virtual thread executor.


FileOperator(FilePredicate filePredicate, int depth, ExecutorService executorService)

  • Parameters:

    • filePredicate - The FilePredicate used to validate files.
    • depth - The directory traversal depth.
    • executorService - The ExecutorService used in asynchronous operations.
  • Purpose: Allows injecting a custom ExecutorService for task execution.


Methods:


void setDepth(int depth)

  • Parameters:

    • depth - the depth to set.
  • Purpose: Sets the depth for directory traversal.


List<File> load(Collection<File> files) throws IOException

  • Parameters:

    • files - The Collection of File objects to be loaded.
  • Returns:

    • A List<File> of validated files.
  • Throws:

    • IOException - If pre-validation fails.
  • Purpose: Loads, validates, and processes the provided collection of files and directories.

    1. Pre-validation: Ensures files exist.
    2. Regular File Validation: Validates regular files concurrently.
    3. Directory Validation: Recursively processes directories to extract and validate files up to the specified depth.

private List<File> handleRegularFiles(Collection<File> files)

  • Parameters:

    • files - The Collection of File objects to be validated.
  • Returns:

    • A List<File> of validated files.
  • Purpose:

    • This method filters the provided collection of files to include only regular files (not directories), and processes each file asynchronously using the configured ExecutorService. Each file is validated using the FileValidator. If a file is valid, it is included in the result list; otherwise, it is ignored.

private List<File> handleDirectories(Collection<File> files)

  • Parameters:

    • files - The Collection of File objects to be validated.
  • Returns:

    • A List<File> of validated files.
  • Purpose:

    • This method filters the provided collection of files to include only directories, and processes each directory asynchronously using the configured ExecutorService. It recursively walks through each directory up to the specified depth, extracting all files and validating them.

void move(File destination, Collection<File> files) throws IOException

  • Parameters:

    • destination - Target directory where files will be moved.
    • files - The Collection of File objects to move.
  • Throws:

    • IOException - If a file move operation fails.
  • Purpose: Moves files to the specified destination directory.

    • Each file move operation is executed asynchronously using CompletableFuture.
    • If a file cannot be moved, it logs the error and throws IOException.

void delete(Collection<File> files) throws IOException

  • Parameters:

    • files - The Collection of File objects to delete.
  • Throws:

    • IOException - If a file deletion operation fails.
  • Purpose: Deletes the specified collection of files.

    • Each file delete operation is executed asynchronously using CompletableFuture.
    • If a file cannot be deleted, it logs the error and throws IOException.

Key Design Notes:


  1. Virtual Threads: Ensures scalability for I/O-bound operations.
  2. Error Handling: Errors are logged, and exceptions are rethrown as IOException for consistency.
  3. Custom Predicate: Validation logic can be customized using the FilePredicate.
  4. Depth Control: Directory recursion is limited by the specified depth.

Usage Example:


public class FileOperatorExample {
    public static void main(String[] args) {
        // Define a predicate for file validation
        FilePredicate filePredicate = file -> file.getName().endsWith(".txt");

        // Initialize FileOperator with depth of 2
        FileOperator fileOperator = new FileOperator(filePredicate, 2);

        // Files and directories to process
        List<File> files = Arrays.asList(new File("file1.txt"), new File("directory1"));

        try {
            // Load files
            List<File> validatedFiles = fileOperator.load(files);
            System.out.println("Validated Files: " + validatedFiles);

            // Move files
            File destination = new File("destination");
            fileOperator.move(destination, validatedFiles);

            // Delete files
            fileOperator.delete(validatedFiles);
            System.out.println("Files successfully deleted.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

FileValidator Class:

The FileValidator class serves as a utility for validating files based on a user-defined FilePredicate. It provides the following functionality:

  1. Pre-validation: Ensures all files in a given collection exist; otherwise, it throws an IOException.
  2. Validation: Checks individual files or paths against a provided FilePredicate.
  3. Error Handling: Wraps and rethrows file-related exceptions in a clean and consistent manner.

Constructor:


FileValidator(FilePredicate predicate)

  • Parameters:
    • predicate - The condition to validate files against.

Methods:


void preValidate(Collection<File> files) throws IOException

  • Parameters:

    • files - The Collection of File objects to validate.
  • Throws:

    • IOException - If any file in the collection doesn't exist.
  • Purpose: Ensures all files in the given collection exist.

    • Filters out files that do not exist using File::exists.
    • Throws an IOException if any file is missing.
    • The exception includes the absolute path of the first missing file.

boolean validate(File file) throws IOException

  • Parameters:

    • file - The File to validate.
  • Returns:

    • true if the file is valid according to the predicate, false otherwise.
  • Throws:

    • IOException - If an error occurs while accessing the file.
  • Purpose: Validates a single File object.

    • Checks if the file is a regular file (file.isFile()).
    • Tests the file against the provided FilePredicate.
    • Returns true if the file passes validation, false otherwise.
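
Per the description, the check reduces to a minimal sketch (assuming a predicate field):

boolean validate(File file) throws IOException {
    return file.isFile() && predicate.test(file); // a regular file that satisfies the predicate
}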

boolean validate(Path path) throws IOException

  • Parameters:

    • path - The Path of the file to validate.
  • Returns:

    • true if the file is valid according to the predicate, false otherwise.
  • Throws:

    • IOException - If an error occurs while accessing the file.
  • Purpose: Validates a file represented as a Path object.

    • Converts the Path to File using path.toFile().
    • Delegates to the validate(File) method.

Usage Example:


Path filePath = Path.of("example.txt");
FilePredicate predicate = f -> f.length() > 0; // Files with content
FileValidator validator = new FileValidator(predicate);

try {
    boolean isValid = validator.validate(filePath);
    System.out.println("File is valid: " + isValid);
} catch (IOException e) {
    System.err.println("Error: " + e.getMessage());
}

FileVisitor Class:

The FileVisitor class is an implementation of SimpleFileVisitor designed to perform asynchronous file processing during a file tree traversal operation. It works with a FileValidator to validate files and collects valid files in a thread-safe Set<Path>.

The processing is carried out asynchronously using an ExecutorService, allowing efficient parallel execution.


Constructor:


FileVisitor(ExecutorService executorService, FileValidator fileValidator)

  • Parameters:
    • executorService - The ExecutorService for asynchronous execution.
    • fileValidator - The FileValidator used to check files against conditions.

Methods:


FileVisitResult visitFile(Path file, BasicFileAttributes attrs)

  • Parameters:

    • file - The Path of the file being visited.
    • attrs - File attributes for the visited file.
  • Returns:

    • FileVisitResult.CONTINUE - Indicates that the traversal should continue.
  • Purpose: Processes each file encountered during a file tree walk.

    • Validates the file asynchronously using CompletableFuture.
    • If the file is valid (regular file and satisfies the FileValidator), it is added to the files set.

FileVisitResult visitFileFailed(Path file, IOException exc)

  • Parameters:

    • file - The Path of the file that could not be visited.
    • exc - The exception thrown during visitation.
  • Returns:

    • FileVisitResult.CONTINUE - Indicates that the traversal should continue even after a failure.
  • Purpose: Handles scenarios where a file cannot be visited due to an error.

    • Logs a warning message indicating the file path and cause of failure.

Set<Path> getFiles()

  • Returns:

    • A Set<Path> containing the paths of valid files.
  • Purpose: Returns the set of valid files collected during the file tree walk.

    • Waits for all asynchronous tasks (CompletableFuture) to complete using CompletableFuture::join.
    • Ensures that the method blocks until all file validation is completed.

Key Design Notes:


  1. Initialization:
  • The class is initialized with an ExecutorService and a FileValidator.
  2. Traversal:
  • visitFile(...) processes each file and validates it asynchronously.
  • visitFileFailed(...) logs errors for files that cannot be accessed.
  3. Asynchronous Validation:
  • Files are validated in parallel using CompletableFuture.
  • Valid files are added to the concurrent set.
  4. Result Retrieval:
  • getFiles() blocks until all validation tasks are complete and returns the collected files.

Usage Example:


ExecutorService executorService = Executors.newFixedThreadPool(4);
FileValidator validator = new FileValidator(f -> f.getName().endsWith(".txt"));

FileVisitor visitor = new FileVisitor(executorService, validator);

try {
    Files.walkFileTree(Path.of("/my-directory"), visitor);
    Set<Path> validFiles = visitor.getFiles();
    validFiles.forEach(System.out::println);
} catch (IOException e) {
    System.err.println("Error traversing files: " + e.getMessage());
} finally {
    executorService.shutdown();
}

Predicates Package:

FilePredicate Interface:

The FilePredicate interface represents a functional interface for performing file-based validation or checks. It is an enhanced version of java.util.function.Predicate, designed specifically for File objects and capable of throwing IOException.


Methods:


boolean test(File file) throws IOException

  • Parameters:

    • file - The File to evaluate or validate.
  • Returns:

    • true if the file matches the condition (predicate), false otherwise.
  • Throws:

    • IOException - If an I/O error occurs while evaluating the file.
  • Purpose: Evaluates the given File object against a specific condition.


Key Design Notes


  1. IOException Handling:
  • Unlike java.util.function.Predicate, this interface supports methods that may throw an IOException.
  • This is essential for I/O-based operations, such as checking file content or accessing metadata.
  2. Lambda Support:
  • The single abstract method makes the interface suitable for lambda expressions and concise code.
  3. Flexibility:
  • Allows various implementations, from basic checks (file size, existence) to complex content validation.

Usage Examples:


1. Checking if a File is readable

FilePredicate readablePredicate = file -> file.canRead();

File file = new File("example.txt");
try {
    if (readablePredicate.test(file)) {
        System.out.println(file.getName() + " is readable.");
    } else {
        System.out.println(file.getName() + " is not readable.");
    }
} catch (IOException e) {
    System.err.println("Error checking file: " + e.getMessage());
}

2. Checking File size (greater than 1 MB)

FilePredicate largeFilePredicate = file -> file.length() > 1024 * 1024;

File file = new File("largeFile.dat");
try {
    if (largeFilePredicate.test(file)) {
        System.out.println(file.getName() + " is larger than 1MB.");
    } else {
        System.out.println(file.getName() + " is smaller than 1MB.");
    }
} catch (IOException e) {
    System.err.println("Error checking file size: " + e.getMessage());
}

3. Checking if File contains specific content

FilePredicate contentCheckPredicate = file -> {
    return Files.lines(file.toPath()).anyMatch(line -> line.contains("TODO"));
};

File file = new File("sourceCode.java");
try {
    if (contentCheckPredicate.test(file)) {
        System.out.println(file.getName() + " contains 'TODO'.");
    } else {
        System.out.println(file.getName() + " does not contain 'TODO'.");
    }
} catch (IOException e) {
    System.err.println("Error reading file: " + e.getMessage());
}

4. Combining Predicates

While FilePredicate doesn't provide default methods like and() or or(), you can combine predicates manually:

FilePredicate readableAndLarge = file -> file.canRead() && file.length() > 1024 * 1024;

File file = new File("data.txt");
try {
    if (readableAndLarge.test(file)) {
        System.out.println("The file is readable and large.");
    } else {
        System.out.println("The file does not satisfy the condition.");
    }
} catch (IOException e) {
    System.err.println("Error: " + e.getMessage());
}

ImageFilePredicate Class:

The ImageFilePredicate class is an implementation of the FilePredicate interface that validates image files based on their magic numbers. Magic numbers are binary signatures at the start of a file that uniquely identify its format. The predicate supports validation for common image formats like JPG, PNG, GIF, BMP, and more.


Supported Image Formats:


The class comes with default magic numbers for the following formats:

  • JPG, JPEG: FFD8FF
  • PNG: 89504E470D0A1A0A
  • GIF: 474946383761, 474946383961
  • BMP: 424D
  • TIFF: 49492A00, 4D4D002A
  • ICO: 00000100
  • JP2, J2K, JPC: 0000000C6A5020200D0A870A, FF4FFF51

Constructors:


ImageFilePredicate(Map<String, Set<String>> magicNumbers)

  • Parameters:

    • magicNumbers - A Map of file extensions to magic numbers.
  • Purpose: Allows custom initialization of file extensions and corresponding magic numbers.


ImageFilePredicate()

  • Purpose: Initializes the class with default magic numbers for common image formats.

Methods:


boolean test(File file) throws IOException

  • Parameters:

    • file - The File to be validated.
  • Returns:

    • true if the file matches one of the magic numbers for its extension, false otherwise.
  • Throws:

    • IOException - If the file cannot be read or is corrupted.
  • Purpose: Tests if a file matches a known magic number based on its extension.


private String getExtension(File file)

  • Parameters:

    • file - The File from which the extension is to be extracted.
  • Returns:

    • The file extension (e.g., PNG, JPG), or an empty string if no extension exists.
  • Purpose: Extracts the file extension in uppercase from a file name.


private String getFileMagicNumber(File file, int bytesToRead) throws IOException

  • Parameters:

    • file - The File to read.
    • bytesToRead - The number of bytes to read from the file.
  • Returns:

    • The magic number as a hexadecimal string.
  • Throws:

    • IOException - If an error occurs while reading the file or if the file is too short.
  • Purpose: Reads the first bytesToRead bytes from a file and converts them into a hexadecimal string to represent the file's magic number.


private String bytesToHex(byte[] bytes)

  • Parameters:

    • bytes - The array of bytes to be converted.
  • Returns:

    • The hexadecimal string representation of the bytes.
  • Purpose: Converts a byte array to a hexadecimal string.
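
A minimal sketch of these two helpers as described:

private String getFileMagicNumber(File file, int bytesToRead) throws IOException {
    try (InputStream in = new FileInputStream(file)) {
        byte[] header = in.readNBytes(bytesToRead); // read the first bytesToRead bytes
        if (header.length < bytesToRead) {
            throw new IOException("File is too short: " + file.getName());
        }
        return bytesToHex(header);
    }
}

private String bytesToHex(byte[] bytes) {
    StringBuilder hex = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
        hex.append(String.format("%02X", b)); // two uppercase hex digits per byte
    }
    return hex.toString();
}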


Key Design Notes:


  1. File Extension Extraction:
  • The getExtension(...) method extracts the file extension (e.g., PNG, JPG) to determine the expected magic numbers.
  2. Reading Magic Numbers:
  • The getFileMagicNumber(...) method reads the first few bytes of the file and converts them to a hexadecimal representation.
  3. Validation:
  • The extracted magic number is compared against the known magic numbers for that extension.
  • If a match is found, the file is considered valid.

Usage Example:


Validating an Image File

FilePredicate imagePredicate = new ImageFilePredicate();

File jpgFile = new File("photo.jpg");
try {
    boolean isValid = imagePredicate.test(jpgFile);
    System.out.println(jpgFile.getName() + " is valid: " + isValid);
} catch (IOException e) {
    System.err.println("Failed to validate file: " + e.getMessage());
}

Adding Custom Magic Numbers

Map<String, Set<String>> customMagicNumbers = Map.of(
    "WEBP", Set.of("52494646") // Magic number for WEBP images
);

FilePredicate customPredicate = new ImageFilePredicate(customMagicNumbers);

File webpFile = new File("image.webp");
try {
    if (customPredicate.test(webpFile)) {
        System.out.println("This is a valid WEBP image.");
    } else {
        System.out.println("Invalid WEBP image.");
    }
} catch (IOException e) {
    System.err.println("Validation failed: " + e.getMessage());
}
