
8. Class Reference

Maksymilian edited this page Dec 8, 2024 · 22 revisions

Work in progress

Processor Class:

The Processor class is the core component responsible for identifying and grouping duplicate files through a multi-step processing workflow. It combines a grouping strategy with a sequence of algorithms to efficiently classify files into sets of similar files.


Constructors:


Processor(Grouper grouper, Collection<Algorithm<?>> algorithms)

  • Parameters:
    • grouper - A Grouper instance to perform the initial division of files based on a distinction predicate (e.g., CRC32 checksum).
    • algorithms - A collection of Algorithm objects applied to the files during the "Algorithm Application" step. The order of the algorithms matters, as they are applied sequentially.
  • Throws:
    • NullPointerException - If grouper or algorithms is null, or if the algorithm collection is empty or contains null elements.
  • Purpose: Initializes the Processor with the provided grouping strategy and set of algorithms for processing the files.
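
The documented contract can be sketched as follows. This is an illustration of the checks described above, not the actual implementation; Grouper and Algorithm are stubbed as empty marker interfaces just so the validation compiles on its own.

```java
import java.util.*;

// Stand-ins for the real interfaces, stubbed only so the sketch is self-contained.
interface Grouper {}
interface Algorithm<T> {}

class ProcessorSketch {
    private final Grouper grouper;
    private final List<Algorithm<?>> algorithms;

    // Mirrors the documented constructor contract: reject null arguments,
    // empty algorithm collections, and collections containing null elements.
    ProcessorSketch(Grouper grouper, Collection<Algorithm<?>> algorithms) {
        this.grouper = Objects.requireNonNull(grouper, "grouper");
        Objects.requireNonNull(algorithms, "algorithms");
        if (algorithms.isEmpty() || algorithms.stream().anyMatch(Objects::isNull))
            throw new NullPointerException("algorithms must be non-empty with no null elements");
        this.algorithms = List.copyOf(algorithms); // defensive copy keeps the caller's order
    }
}
```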

Methods:


Map<File, Set<File>> process(@NotNull Collection<@NotNull File> files) throws IOException

  • Parameters:

    • files - A collection of File objects to be processed. Typically, these files are of the same type (e.g., images) and are grouped based on similarity.
  • Returns:

    • A Map where the key is a file considered the "original" in a group of similar files, and the value is a set of files considered duplicates or similar files.
  • Throws:

    • NullPointerException - If the input collection is null or contains null elements.
    • IOException - If any I/O error occurs during processing.
  • Purpose: This method processes the input collection of files through the following steps:

    1. Initial Division: Files are divided into subsets based on a distinction predicate.
    2. Algorithm Application: A series of algorithms is applied to the subsets to refine the grouping further.
    3. Original File Identification: The first file in each group is identified as the "original", and the groups are reorganized accordingly.
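
The three steps above can be sketched as a small pipeline. This is a simplified illustration, not the real implementation: the Grouper and the algorithms are modelled as plain functions over groups of files rather than the project's Grouper and Algorithm types.

```java
import java.io.File;
import java.util.*;
import java.util.function.Function;

// Minimal sketch of the process() workflow described above.
class ProcessSketch {
    static Map<File, Set<File>> process(
            Collection<File> files,
            Function<Collection<File>, Set<Set<File>>> grouper,
            List<Function<Set<Set<File>>, Set<Set<File>>>> algorithms) {
        Objects.requireNonNull(files, "files");

        // 1. Initial Division: split the input by a distinction predicate.
        Set<Set<File>> groups = grouper.apply(files);

        // 2. Algorithm Application: each pass refines the grouping.
        for (var algorithm : algorithms)
            groups = algorithm.apply(groups);

        // 3. Original File Identification: the first file of each group becomes the key.
        Map<File, Set<File>> result = new HashMap<>();
        for (Set<File> group : groups)
            result.put(group.iterator().next(), new HashSet<>(group));
        return result;
    }
}
```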

private Set<Set<File>> algorithmsApplication(@NotNull Set<Set<File>> groupedFiles) throws IOException

  • Parameters:

    • groupedFiles - A set of sets of files, where each set represents a group of similar files.
  • Returns:

    • A new set of sets of files after applying all algorithms and consolidating the groups.
  • Throws:

    • IOException - If any error occurs during the algorithm application.
  • Purpose: This method applies each algorithm in the algorithms collection to the grouped files and consolidates the results by merging groups with identical keys and removing groups with only one file.
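
A rough sketch of that refinement loop, under two simplifying assumptions: each algorithm is modelled as a key-extracting function (its "characteristic"), and files are bucketed globally by that key, which is one reading of the description above.

```java
import java.io.File;
import java.util.*;
import java.util.function.Function;

// Sketch of the refinement loop: run every algorithm in order over the
// current grouping; after each pass, drop groups reduced to a single file.
class RefineSketch {
    static Set<Set<File>> algorithmsApplication(
            Set<Set<File>> grouped,
            List<Function<File, ?>> algorithms) {
        for (Function<File, ?> algorithm : algorithms) {
            // Bucket every file by the characteristic the algorithm computes.
            Map<Object, Set<File>> byKey = new HashMap<>();
            for (Set<File> group : grouped)
                for (File f : group)
                    byKey.computeIfAbsent(algorithm.apply(f), k -> new HashSet<>()).add(f);
            // Consolidation: a lone file has no duplicates, so drop singletons.
            Set<Set<File>> next = new HashSet<>();
            for (Set<File> g : byKey.values())
                if (g.size() > 1) next.add(g);
            grouped = next;
        }
        return grouped;
    }
}
```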


private <T> Map<T, Set<File>> applyAlgorithm(@NotNull Algorithm<T> algorithm, @NotNull Set<Set<File>> groupedFiles)

  • Parameters:

    • algorithm - The Algorithm to apply to the grouped files.
    • groupedFiles - A set of sets of files to process with the algorithm.
  • Returns:

    • A Map where the key is the characteristic (e.g., perceptual hash or CRC32 checksum) and the value is a set of files sharing that characteristic.
  • Purpose: This method applies a single algorithm to the grouped files and returns a map of results.
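
The bucketing that this method performs can be sketched like this, with the algorithm again modelled as a key-extracting function (an assumption; the real Algorithm type may compute its characteristic differently, e.g. with I/O):

```java
import java.io.File;
import java.util.*;
import java.util.function.Function;

// Sketch: apply one "algorithm" to already-grouped files, bucketing
// every file by the characteristic it produces.
class ApplySketch {
    static <T> Map<T, Set<File>> applyAlgorithm(
            Function<File, T> characteristic, Set<Set<File>> groupedFiles) {
        Map<T, Set<File>> out = new HashMap<>();
        for (Set<File> group : groupedFiles)
            for (File f : group)
                out.computeIfAbsent(characteristic.apply(f), k -> new HashSet<>()).add(f);
        return out;
    }
}
```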


private Set<Set<File>> postAlgorithmConsolidation(@NotNull Map<?, Set<File>> algorithmOutput)

  • Parameters:

    • algorithmOutput - A map containing the results of the algorithm application, where the key is a shared characteristic and the value is a set of files that share that characteristic.
  • Returns:

    • A set of sets of files after consolidating the results by removing groups with only one file and merging groups with identical keys.
  • Purpose: This method consolidates the results of an algorithm by eliminating groups that contain only one file and merging groups with identical keys.
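
A sketch of the consolidation step. Since a Map already guarantees one group per key, the sketch only needs to discard singleton groups; merging groups that ended up under the same key is assumed to have happened while the map was built.

```java
import java.io.File;
import java.util.*;

// Sketch: keep only groups that still contain more than one file --
// a group of one has no duplicates to report.
class ConsolidationSketch {
    static Set<Set<File>> postAlgorithmConsolidation(Map<?, Set<File>> algorithmOutput) {
        Set<Set<File>> result = new HashSet<>();
        for (Set<File> group : algorithmOutput.values())
            if (group.size() > 1)
                result.add(group);
        return result;
    }
}
```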


private Map<File, Set<File>> originalDistinction(@NotNull Set<Set<File>> groupedFiles)

  • Parameters:

    • groupedFiles - A set of sets of files representing groups of similar files.
  • Returns:

    • A new Map where:
      • The key is the "original" file (the first file in each group).
      • The value is a Set of files considered duplicates or similar files.
  • Throws:

    • NullPointerException - If groupedFiles contains null.
  • Purpose: This method identifies the "original" file in each group and reorganizes the groups into a map, where each key is the original file and each value is a set of similar files (including the original file itself).
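
The reorganization can be sketched as below. Note one caveat the sketch makes explicit: which file comes "first" depends on the set's iteration order, so whether the real class imposes a deterministic order (e.g. by sorting) is an assumption not covered here.

```java
import java.io.File;
import java.util.*;

// Sketch: pick the first file of each group as the "original" and key
// the group by it; the value set includes the original itself.
class OriginalSketch {
    static Map<File, Set<File>> originalDistinction(Set<Set<File>> groupedFiles) {
        Map<File, Set<File>> result = new HashMap<>();
        for (Set<File> group : groupedFiles) {
            File original = group.iterator().next();
            result.put(original, new HashSet<>(group));
        }
        return result;
    }
}
```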


private Set<File> consolidate(@NotNull Set<File> s1, @NotNull Set<File> s2)

  • Parameters:

    • s1 - The first set to merge.
    • s2 - The second set to merge.
  • Returns:

    • A new set containing all elements from both s1 and s2.
  • Purpose: This method merges two sets into one, ensuring that all elements from both sets are included.
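
A plausible one-method sketch of that merge, producing a fresh set so neither input is mutated (whether the real method copies or mutates in place is not specified above):

```java
import java.io.File;
import java.util.*;

// Sketch: union two groups into a new set, leaving both inputs untouched.
class MergeSketch {
    static Set<File> consolidate(Set<File> s1, Set<File> s2) {
        Set<File> merged = new HashSet<>(s1);
        merged.addAll(s2);
        return merged;
    }
}
```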


Logger:


  • The Processor class uses a Logger instance (logger) from the SLF4J API to log messages during the various stages of file processing. For example, it logs the start of processing, division of files, application of algorithms, and the identification of original files.

Usage Example:


Grouper grouper = new Crc32Grouper();
List<Algorithm<?>> algorithms = List.of(new PerceptualHash(), new PixelByPixel());
Processor processor = new Processor(grouper, algorithms);

Collection<File> files = List.of(new File("image1.jpg"), new File("image2.jpg"));
Map<File, Set<File>> result = processor.process(files); // throws IOException; handle or declare it
result.forEach((original, duplicates) -> {
    System.out.println("Original: " + original);
    duplicates.forEach(duplicate -> System.out.println("  Duplicate: " + duplicate));
});

Algorithms Package:

Cache Package:

Grouping Package:

Io Package:

Predicates Package:
