8. Class Reference
The `Processor` class is the core component responsible for identifying and grouping duplicate files based on a multi-step processing workflow. It uses various algorithms and grouping strategies to efficiently process and classify files into sets of similar files.
Processor(Grouper grouper, Collection<Algorithm<?>> algorithms)
- Parameters:
  - `grouper` - A `Grouper` instance to perform the initial division of files based on a distinction predicate (e.g., CRC32 checksum).
  - `algorithms` - A collection of `Algorithm` objects applied to the files during the "Algorithm Application" step. The order of the algorithms matters for the processing.
- Throws:
  - `NullPointerException` - If either `grouper` or `algorithms` is null, or if the algorithm collection is empty or contains `null` elements.
- Purpose: Initializes the `Processor` with the provided grouping strategy and set of algorithms for processing the files.
Map<File, Set<File>> process(@NotNull Collection<@NotNull File> files) throws IOException
- Parameters:
  - `files` - A collection of `File` objects to be processed. Typically, these files are of the same type (e.g., images) and are grouped based on similarity.
- Returns:
  - A `Map` where the key is the file considered the "original" in a group of similar files, and the value is a set of files considered duplicates or similar files.
- Throws:
  - `NullPointerException` - If the input collection is null or contains `null` elements.
  - `IOException` - If any I/O error occurs during processing.
- Purpose: This method processes the input collection of files through the following steps:
  - Initial Division: Files are divided into subsets based on a distinction predicate.
  - Algorithm Application: A series of algorithms is applied to the subsets to refine the grouping further.
  - Original File Identification: The first file in each group is identified as the "original", and the groups are reorganized accordingly.
private Set<Set<File>> algorithmsApplication(@NotNull Set<Set<File>> groupedFiles) throws IOException
- Parameters:
  - `groupedFiles` - A set of sets of files, where each set represents a group of similar files.
- Returns:
  - A new set of sets of files after applying all algorithms and consolidating the groups.
- Throws:
  - `IOException` - If any error occurs during the algorithm application.
- Purpose: This method applies each algorithm in the `algorithms` collection to the grouped files and consolidates the results by merging groups with identical keys and removing groups with only one file.
private <T> Map<T, Set<File>> applyAlgorithm(@NotNull Algorithm<T> algorithm, @NotNull Set<Set<File>> groupedFiles)
- Parameters:
  - `algorithm` - The `Algorithm` to apply to the grouped files.
  - `groupedFiles` - A set of sets of files to process with the algorithm.
- Returns:
  - A `Map` where the key is the characteristic (e.g., perceptual hash or CRC32 checksum) and the value is a set of files sharing that characteristic.
- Purpose: This method applies a single algorithm to the grouped files and returns a map of results.
private Set<Set<File>> postAlgorithmConsolidation(@NotNull Map<?, Set<File>> algorithmOutput)
- Parameters:
  - `algorithmOutput` - A map containing the results of the algorithm application, where the key is a shared characteristic and the value is a set of files that share that characteristic.
- Returns:
  - A set of sets of files after consolidating the results by removing groups with only one file and merging groups with identical keys.
- Purpose: This method consolidates the results of an algorithm by eliminating groups that contain only one file and merging groups with identical keys (a sketch of this step follows below).
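Below is a minimal sketch of what this consolidation might look like. The helper name is hypothetical and not part of the documented API; merging by identical keys happens implicitly because a `Map` holds one value set per key.

```java
import java.io.File;
import java.util.*;

class ConsolidationSketch {
    // Hypothetical helper: keeps only groups with at least two files.
    // Files that produced the same key have already been collected into
    // the same value set by the algorithm's output map.
    static Set<Set<File>> consolidate(Map<?, Set<File>> algorithmOutput) {
        Set<Set<File>> result = new HashSet<>();
        for (Set<File> group : algorithmOutput.values()) {
            if (group.size() > 1) { // discard single-file groups
                result.add(group);
            }
        }
        return result;
    }
}
```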
private Map<File, Set<File>> originalDistinction(@NotNull Set<Set<File>> groupedFiles)
- Parameters:
  - `groupedFiles` - A set of sets of files representing groups of similar files.
- Returns:
  - A new `Map` where:
    - The key is the "original" file (the first file in each group).
    - The value is a `Set` of files considered duplicates or similar files.
- Throws:
  - `NullPointerException` - If `groupedFiles` contains `null`.
- Purpose: This method identifies the "original" file in each group and reorganizes the groups into a map, where each key is the original file and each value is a set of similar files (including the original file itself).
private Set<File> consolidate(@NotNull Set<File> s1, @NotNull Set<File> s2)
- Parameters:
  - `s1` - The first set to merge.
  - `s2` - The second set to merge.
- Returns:
  - A new set containing all elements from both `s1` and `s2`.
- Purpose: This method merges two sets into one, ensuring that all elements from both sets are included.
- The `Processor` class uses a `Logger` instance (`logger`) from the SLF4J API to log messages during the various stages of file processing. For example, it logs the start of processing, the division of files, the application of algorithms, and the identification of original files.
```java
Grouper grouper = new CRC32Grouper();
List<Algorithm<?>> algorithms = List.of(new PerceptualHash(), new PixelByPixel());
Processor processor = new Processor(grouper, algorithms);

Collection<File> files = List.of(new File("image1.jpg"), new File("image2.jpg"));
Map<File, Set<File>> result = processor.process(files);

result.forEach((original, duplicates) -> {
    System.out.println("Original: " + original);
    duplicates.forEach(duplicate -> System.out.println("  Duplicate: " + duplicate));
});
```
The `Algorithm` interface represents a functional abstraction for an algorithm that operates on a set of files, dividing them into smaller subsets based on some shared characteristic. The result is a map where each key corresponds to a group of files that share that characteristic.
Map<K, Set<File>> apply(Set<File> group)
- Parameters:
  - `group` - A `Set` of `File` objects to be processed by the algorithm. These files are typically of the same type (e.g., images), and the goal is to group them based on some shared characteristic.
- Returns:
  - A `Map` where each key (`K`) corresponds to a set of files that share the same characteristic (e.g., checksum, hash, metadata). The key is computed from the shared property of the files.
- Purpose:
  - This method applies the algorithm to the given set of files, partitioning them into smaller groups. Each group corresponds to a characteristic shared by all the files in the group. For example, the characteristic could be a checksum, perceptual hash, or file size.
  - The key used to map each set of files should be deterministic. This means that the same group of files will always produce the same output map when the algorithm is applied, ensuring consistency in grouping.
- The `Algorithm` interface is marked with the `@FunctionalInterface` annotation, indicating that it is designed to be used with lambda expressions or method references.
- Purpose of Functional Interface:
  - It can be easily implemented using a lambda expression or a method reference, allowing flexibility in defining various algorithms for grouping files. This allows for the easy application of different strategies, such as comparing files based on checksum, perceptual hash, or other metrics.
```java
Algorithm<String> checksumAlgorithm = group -> {
    // Example algorithm logic to group files by checksum (dummy implementation)
    Map<String, Set<File>> result = new HashMap<>();
    for (File file : group) {
        String checksum = getChecksum(file); // Example method to calculate a checksum
        result.computeIfAbsent(checksum, k -> new HashSet<>()).add(file);
    }
    return result;
};

Set<File> files = new HashSet<>(List.of(new File("file1.txt"), new File("file2.txt")));
Map<String, Set<File>> groupedFiles = checksumAlgorithm.apply(files);
```
The `PerceptualHash` class implements the `Algorithm` interface and is used to compute perceptual hashes for images. This class groups similar images by generating and comparing perceptual hashes, which are unique identifiers derived from an image's content. Generating these hashes allows images to be compared based on their visual similarity rather than their exact content, making this class useful for image de-duplication or similarity detection.
Map<String, Set<File>> apply(@NotNull Set<File> group)
- Parameters:
  - `group` - A `Set` of `File` objects representing the images to be processed. Each image will be grouped based on its perceptual hash.
- Returns:
  - A `Map` where each key is a perceptual hash (a `String`), and each value is a `Set` of files that share the same hash, representing images that are visually similar.
- Purpose:
  - The method processes each image in the input set by resizing it, extracting its pixel values, applying the Discrete Cosine Transform (DCT), and generating a perceptual hash. The images are then grouped based on these hashes, and the result is returned as a map. Images with the same perceptual hash are considered similar and grouped together.
@NotNull private BufferedImage resize(@NotNull File file)
- Parameters:
  - `file` - The image `File` to be resized.
- Returns:
  - A `BufferedImage` that is resized to 8x8 pixels and converted to grayscale.
- Purpose:
  - Resizes the image to a fixed 8x8 pixel size to standardize it for hash generation. The image is also converted to grayscale to simplify the process and reduce the detail that could interfere with the hash calculation.
private double[][] extractSample(BufferedImage image)
- Parameters:
  - `image` - A `BufferedImage` that has already been resized.
- Returns:
  - A 2D `double` array representing the pixel values of the image.
- Purpose:
  - Extracts the pixel values from the resized image and stores them in a matrix (2D array), which will be used in further steps for hash generation.
private String buildHash(double[][] matrix)
- Parameters:
  - `matrix` - A 2D `double` array representing the pixel values of the image.
- Returns:
  - A `String` representing the perceptual hash of the image, generated by comparing each pixel with the average value of the matrix.
- Purpose:
  - Constructs a binary string (the perceptual hash) by comparing each pixel's value with the average value of all pixels in the matrix. If a pixel's value is greater than the average, it is marked as '1'; otherwise, it is marked as '0'.
private double getAvg(double[][] matrix)
- Parameters:
  - `matrix` - A 2D `double` array representing the pixel values of the image.
- Returns:
  - The average pixel value of the matrix, computed across all pixels.
- Purpose:
  - Calculates the average value of the pixel values in the matrix, which is used in the `buildHash` method to compare each pixel against the average for hash generation.
The `PerceptualHash` class generates perceptual hashes for images, allowing for the grouping of similar images. The algorithm performs several steps (see the sketch after this list):
- Resize: Each image is resized to 8x8 pixels.
- Extract Sample: The pixel values of the resized image are extracted into a matrix.
- Discrete Cosine Transform (DCT): The DCT is applied (via the `DCT::apply` method) to reduce high-frequency components and focus on the low-frequency ones.
- Generate Hash: A hash is created by comparing the pixel values with the average value of the matrix, where pixels greater than the average are marked as `1`, and those below the average are marked as `0`.
- Group by Hash: Images are then grouped by their perceptual hashes. Images with the same hash are considered visually similar.

The final result is a map of perceptual hashes, where the key is the hash and the value is a set of images that share that hash.
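As a rough illustration of the hash-generation step, here is a minimal sketch of building the binary hash from a pixel matrix. The method name is a hypothetical stand-in; the real class delegates resizing, sampling, and the DCT to its own helpers.

```java
import java.util.Arrays;

class HashSketch {
    // Builds a binary string by comparing each value to the matrix average,
    // mirroring the buildHash/getAvg logic described above.
    static String buildHash(double[][] matrix) {
        double avg = Arrays.stream(matrix)
                .flatMapToDouble(Arrays::stream)
                .average()
                .orElse(0.0);
        StringBuilder hash = new StringBuilder();
        for (double[] row : matrix) {
            for (double value : row) {
                hash.append(value > avg ? '1' : '0');
            }
        }
        return hash.toString();
    }
}
```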
```java
Set<File> images = new HashSet<>(List.of(new File("image1.jpg"), new File("image2.jpg")));
PerceptualHash perceptualHashAlgorithm = new PerceptualHash();
Map<String, Set<File>> groupedImages = perceptualHashAlgorithm.apply(images);

groupedImages.forEach((hash, files) -> {
    System.out.println("Hash: " + hash);
    files.forEach(file -> System.out.println("  " + file.getName()));
});
```
This example shows how to apply the `PerceptualHash` algorithm to a set of image files. The result is a map where images with the same perceptual hash are grouped together, indicating that they are visually similar.
The `PixelByPixel` class implements the `Algorithm` interface and is used for image matching based on pixel-by-pixel comparison. The goal of this algorithm is to group identical images from a set by comparing them at the pixel level. It efficiently handles large datasets using parallel processing and caching to optimize performance.
Map<File, Set<File>> apply(Set<File> group)
- Parameters:
  - `group` - A `Set` of `File` objects representing the images to be processed.
- Returns:
  - A `Map` where each key is a file, and the corresponding value is a set of files that are identical to the key file. Images that are pixel-identical are grouped together, and each value set also contains its key file.
- Purpose:
  - This method processes a group of image files, comparing them pixel-by-pixel to identify identical images. The images are grouped by their pixel-level equivalence and stored in a map for the result. It uses a queue to manage image files and processes them in parallel.
private void process(@NotNull Map<File, Set<File>> result, @NotNull Queue<File> groupQueue)
- Parameters:
  - `result` - A mutable `Map` that stores the groups of identical images.
  - `groupQueue` - A mutable `Queue` containing the files to be processed.
- Purpose:
  - This method iterates through the queue, selecting a "key" image and comparing it to the other images in the queue. Identical images are removed from the queue and grouped together. The grouping is done in parallel to speed up processing.
private BufferedImage getCachedImage(@NotNull File file)
- Parameters:
  - `file` - The image `File` to retrieve from the cache.
- Returns:
  - A `BufferedImage` corresponding to the given file.
- Purpose:
  - This method retrieves an image from the `AdaptiveCache`. If the image is not found in the cache, it will be loaded from disk and added to the cache for future use. This avoids repeatedly reading the same image from disk, optimizing performance.
private boolean compareImages(@NotNull BufferedImage img1, @NotNull BufferedImage img2)
- Parameters:
  - `img1` - The first image to compare.
  - `img2` - The second image to compare.
- Returns:
  - `true` if the images are identical pixel-by-pixel, otherwise `false`.
- Purpose:
  - This method compares two images by first checking whether their dimensions match. If the dimensions are the same, it then compares the raw pixel data by examining the byte data of each image's raster. If the byte data matches, the images are considered identical (a sketch of this check follows below).
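A minimal sketch of such a dimension-then-raster comparison, under the assumption that both images use a byte-backed raster (as the description above implies); this is illustrative, not the class's exact code.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.util.Arrays;

class CompareSketch {
    // Compares dimensions first, then the raw raster bytes.
    // Assumes both images are backed by a DataBufferByte.
    static boolean compareImages(BufferedImage img1, BufferedImage img2) {
        if (img1.getWidth() != img2.getWidth() || img1.getHeight() != img2.getHeight()) {
            return false;
        }
        byte[] data1 = ((DataBufferByte) img1.getRaster().getDataBuffer()).getData();
        byte[] data2 = ((DataBufferByte) img2.getRaster().getDataBuffer()).getData();
        return Arrays.equals(data1, data2);
    }
}
```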
The `PixelByPixel` class is an image comparison algorithm that uses an exact, pixel-by-pixel method to identify identical images. It operates as follows:
- Load Images: It loads images from disk using a cache (via the `AdaptiveCache` class). If an image is not found in the cache, it is loaded from disk and added to the cache.
- Pixel Comparison: The images are compared pixel by pixel, checking whether they are exactly identical.
- Group Identical Images: Images that are identical (based on pixel comparison) are grouped together in a `Map`. The key of the map is the original image, and the value is a set of identical images.
- Parallel Processing: The comparison is parallelized to speed up processing, especially when handling large datasets of images.
- Cache Usage: The `AdaptiveCache` is used to optimize the image loading process, reducing the need to reload images repeatedly.

The algorithm assumes that all images in the group have the same resolution and format.
```java
Set<File> imageFiles = new HashSet<>(Arrays.asList(file1, file2, file3));
PixelByPixel algorithm = new PixelByPixel();
Map<File, Set<File>> result = algorithm.apply(imageFiles);

result.forEach((key, identicalImages) -> {
    System.out.println("Original Image: " + key.getName());
    identicalImages.forEach(file -> System.out.println("  Identical Image: " + file.getName()));
});
```
In this example, the `PixelByPixel` algorithm is applied to a set of image files. The result is a map where each key is an image file, and the value is a set of images that are identical to the key image. The images are grouped based on pixel-by-pixel comparison.
The `DCT` class is responsible for applying the Discrete Cosine Transform (DCT) and quantization to a given matrix of image coefficients. It serves as a pipeline that combines both transformations sequentially to prepare image data for perceptual hashing, compression, or other applications.
- `pl.magzik.algorithms.math.dct.Transformer` - Handles the Discrete Cosine Transform operation.
- `pl.magzik.algorithms.math.dct.Quantifier` - Handles quantization of the DCT coefficients.
private DCT(Quantifier quantifier, Transformer transformer)
- Parameters:
  - `quantifier` - The `Quantifier` instance that handles quantization of the DCT coefficients.
  - `transformer` - The `Transformer` instance that handles the Discrete Cosine Transform operation.
static double[][] apply(double[][] matrix)
- Parameters:
  - `matrix` - A 2D array of doubles representing the input matrix (e.g., grayscale pixel values from an image).
- Returns:
  - A 2D array of doubles representing the quantized DCT coefficients.
- Purpose:
  - This method performs both DCT and quantization in sequence.
  - It creates new instances of `Quantifier` and `Transformer`, initializes the `DCT` pipeline, and processes the input matrix.
- How It Works:
  - Step 1: The `matrix` is passed to the `transform` method of `Transformer`, which applies the Discrete Cosine Transform.
  - Step 2: The resulting DCT coefficients are passed to the `quantize` method of `Quantifier`, which reduces their precision.
  - Step 3: The final quantized matrix is returned as output.
private double[][] applyInternal(double[][] matrix)
- Parameters:
  - `matrix` - A 2D array of doubles representing the input matrix.
- Returns:
  - A 2D array of doubles representing the quantized DCT coefficients.
- Purpose:
  - This method is used internally to apply the transformation pipeline.
  - It first calls the `transform` method of the `Transformer` instance to compute the DCT.
  - Then, it calls the `quantize` method of the `Quantifier` instance to apply quantization.
The `Quantifier` class is responsible for performing quantization on a matrix of DCT (Discrete Cosine Transform) coefficients.
Quantifier(int[][] quantizationMatrix)
- Parameters:
  - `quantizationMatrix` - A 2D integer array representing the quantization matrix.
Quantifier()
- Default Matrix: The default matrix follows the JPEG standard for 8x8 blocks:
```java
{ {16, 11, 10, 16,  24,  40,  51,  61},
  {12, 12, 14, 19,  26,  58,  60,  55},
  {14, 13, 16, 24,  40,  57,  69,  56},
  {14, 17, 22, 29,  51,  87,  80,  62},
  {18, 22, 37, 56,  68, 109, 103,  77},
  {24, 35, 55, 64,  81, 104, 113,  92},
  {49, 64, 78, 87, 103, 121, 120, 101},
  {72, 92, 95, 98, 112, 100, 103,  99} };
```
double[][] quantize(double[][] coeffs)
- Parameters:
  - `coeffs` - A 2D double array representing the matrix of DCT coefficients.
- Returns:
  - A 2D double array of quantized coefficients.
- Throws:
  - `IllegalArgumentException` - If the dimensions of the input `coeffs` matrix don't match the quantization matrix dimensions.
- Purpose:
  - Applies quantization to the given matrix of DCT coefficients using the quantization matrix.
  - Each DCT coefficient is divided by its corresponding quantization value and then rounded.
- How It Works (a sketch follows below):
  - Loops through each value in the coefficient matrix.
  - Divides each coefficient by the corresponding value in the quantization matrix.
  - Rounds the result and stores it in a new matrix.
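A minimal sketch of that divide-and-round loop; variable names are illustrative, assuming the quantization matrix has the same dimensions as the input.

```java
class QuantizeSketch {
    // Divides each DCT coefficient by its quantization value and rounds,
    // as described in the steps above.
    static double[][] quantize(double[][] coeffs, int[][] quantizationMatrix) {
        if (coeffs.length != quantizationMatrix.length
                || coeffs[0].length != quantizationMatrix[0].length) {
            throw new IllegalArgumentException("Matrix dimensions must match.");
        }
        double[][] quantized = new double[coeffs.length][coeffs[0].length];
        for (int i = 0; i < coeffs.length; i++) {
            for (int j = 0; j < coeffs[i].length; j++) {
                quantized[i][j] = Math.round(coeffs[i][j] / quantizationMatrix[i][j]);
            }
        }
        return quantized;
    }
}
```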
The `Transformer` class provides methods to perform the Discrete Cosine Transform (DCT) on both 1D vectors and 2D matrices. It leverages the efficient `JTransforms` library to compute DCT operations.
double[] transform(double[] vector)
- Parameters:
  - `vector` - A 1D array of `double` values to be transformed.
- Returns:
  - A new `double[]` array containing the transformed values.
- Purpose:
  - Computes the 1D DCT for the given vector using the `DoubleDCT_1D` class from the `JTransforms` library.
- How It Works:
  - Clones the input vector to avoid mutating the original data.
  - Initializes a `DoubleDCT_1D` object with the vector's length.
  - Calls `forward()` with `scaling = true` to normalize the result.
double[][] transform(double[][] matrix)
- Parameters:
  - `matrix` - A 2D array of `double` values representing the input data.
- Returns:
  - A new 2D `double[][]` array containing the transformed values.
- Purpose:
  - Performs a 2D DCT on the input matrix. This is achieved by:
    - Applying a 1D DCT to each row of the matrix.
    - Applying a 1D DCT to each column of the intermediate result.
- How It Works (see the sketch after this list):
  - Copies the input matrix into a new `transformed` array.
  - Transforms each row using the `transform(double[] vector)` method.
  - Extracts each column, transforms it, and writes the result back into the matrix.
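The following sketch shows how such a row-then-column 2D DCT can be written on top of JTransforms' `DoubleDCT_1D` (package name as in JTransforms 3.x). It follows the steps above but is an illustration, not the class's exact source.

```java
import org.jtransforms.dct.DoubleDCT_1D;

class DctSketch {
    // 1D DCT: clone the input, then apply JTransforms' forward DCT with scaling.
    static double[] transform(double[] vector) {
        double[] copy = vector.clone();
        new DoubleDCT_1D(copy.length).forward(copy, true);
        return copy;
    }

    // 2D DCT: transform each row, then each column of the intermediate result.
    static double[][] transform(double[][] matrix) {
        int rows = matrix.length, cols = matrix[0].length;
        double[][] transformed = new double[rows][];
        for (int i = 0; i < rows; i++) {
            transformed[i] = transform(matrix[i]); // rows
        }
        for (int j = 0; j < cols; j++) {
            double[] column = new double[rows];
            for (int i = 0; i < rows; i++) column[i] = transformed[i][j];
            column = transform(column);            // columns
            for (int i = 0; i < rows; i++) transformed[i][j] = column[i];
        }
        return transformed;
    }
}
```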
The `AdaptiveCache` class provides an adaptive memory-based caching solution using the `Caffeine` caching library. It dynamically adjusts memory usage based on the available JVM heap memory, ensuring efficient memory management and performance for image caching.
private static final double MAXIMUM_MEMORY_PERCENTAGE = 0.6
- Limits the cache size to 60% of the JVM heap memory.
private AdaptiveCache(long maximumWeight)
- Parameters:
  - `maximumWeight` - The maximum memory the cache can use.
static AdaptiveCache getInstance()
- Returns:
  - The singleton `AdaptiveCache` instance.
- Purpose:
  - Provides access to the singleton instance of the cache.
BufferedImage get(@NotNull File key) throws IOException
- Parameters:
  - `key` - A `File` object representing the image file.
- Returns:
  - The `BufferedImage` loaded from cache or disk.
- Throws:
  - `IOException` - If the image cannot be loaded.
- Purpose:
  - Retrieves an image from the cache. If the image is not cached, it loads it from disk and stores it in the cache.
void monitor(long period)
- Parameters:
  - `period` - Interval (in seconds) between cache logs.
- Purpose:
  - Starts a periodic task that logs cache statistics at regular intervals.
private int getImageWeight(File key, @NotNull BufferedImage value)
- Parameters:
  - `key` - The image file.
  - `value` - The `BufferedImage` whose weight is calculated.
- Returns:
  - The memory weight of the image in bytes.
- Purpose:
  - Computes the memory weight of an image in bytes.
  - Assumes each pixel is represented by 4 bytes (RGBA).
- Formula: `return value.getWidth() * value.getHeight() * 4;`
private BufferedImage loadImage(@NotNull File key)
- Parameters:
  - `key` - The image file.
- Returns:
  - The loaded `BufferedImage`.
- Throws:
  - `UncheckedIOException` - If the file cannot be read or the format is unsupported.
- Purpose:
  - Loads an image from disk using `ImageIO`.
private static long getMaximumWeight()
- Returns:
  - Maximum cache weight (in bytes).
- Purpose:
  - Calculates the maximum cache size based on the available JVM memory (60% of the heap).
- Adaptive Memory Management:
  - The `maximumWeight` is dynamically calculated based on the JVM heap size (see the sketch below).
- Thread-Safe:
  - The cache itself is thread-safe, as Caffeine provides synchronized operations internally.
  - The monitor uses an `AtomicBoolean` to ensure it starts only once.
- Error Handling:
  - Uses `UncheckedIOException` to propagate `IOException` from `ImageIO` read operations.
  - Logs detailed error information using SLF4J.
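To make the weight-based setup concrete, here is a minimal sketch of how a Caffeine cache with a 60% heap budget and a per-image weigher might be constructed. It uses Caffeine's public builder API, but the exact wiring inside `AdaptiveCache` may differ.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.awt.image.BufferedImage;
import java.io.File;

class CacheSketch {
    // Roughly 60% of the maximum heap, mirroring MAXIMUM_MEMORY_PERCENTAGE.
    static long maximumWeight() {
        return (long) (Runtime.getRuntime().maxMemory() * 0.6);
    }

    static Cache<File, BufferedImage> buildCache() {
        return Caffeine.newBuilder()
                .maximumWeight(maximumWeight())
                // Weigh each image as width * height * 4 bytes (RGBA).
                .weigher((File key, BufferedImage value) ->
                        value.getWidth() * value.getHeight() * 4)
                .recordStats() // enables the statistics used by monitor()
                .build();
    }
}
```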
The `Grouper` interface represents a functional interface designed to group files into subsets that share a common characteristic, such as an identical checksum or other similarity criteria. It abstracts the process of grouping files for organizational or comparison purposes.
Set<Set<File>> divide(Collection<File> col) throws IOException
- Parameters:
  - `col` - A `Collection` of `File` objects to be divided into subsets. Typically these files are of the same type (e.g., images).
- Returns:
  - A set of subsets of files, where each subset (`Set<File>`) contains files that share a common characteristic.
- Throws:
  - `IOException` - If an I/O error occurs while reading or processing the files.
- Purpose: Divides a collection of files into subsets based on a defined distinction or grouping criterion. Each subset contains files that share a common property, such as identical content, checksum, or other user-defined similarities.
- Generality: The `divide` method is intentionally kept generic to accommodate any type of grouping logic.
- Performance Consideration: Implementations of `divide` should be optimized for performance, especially when processing large file collections.
- Immutability of Results: Returning a `Set<Set<File>>` ensures no duplicate groups exist, and each subset of files can be easily iterated over.
```java
public class FileSizeGrouper implements Grouper {
    @Override
    public Set<Set<File>> divide(Collection<File> col) throws IOException {
        // Group files by their size using a map
        Map<Long, Set<File>> sizeGroups = new HashMap<>();
        for (File file : col) {
            if (file.isFile()) {
                long size = Files.size(file.toPath());
                sizeGroups.computeIfAbsent(size, k -> new HashSet<>()).add(file);
            }
        }
        return new HashSet<>(sizeGroups.values());
    }
}
```
```java
public class Main {
    public static void main(String[] args) throws IOException {
        List<File> files = List.of(
            new File("image1.jpg"),
            new File("image2.jpg"),
            new File("duplicate_image1.jpg")
        );

        Grouper grouper = new FileSizeGrouper();
        Set<Set<File>> groupedFiles = grouper.divide(files);

        groupedFiles.forEach(group -> {
            System.out.println("Group:");
            group.forEach(file -> System.out.println(" - " + file.getName()));
        });
    }
}
```
The `CRC32Grouper` class implements the `Grouper` interface to group files based on their CRC32 checksum. Files that share the same checksum are assumed to be identical and grouped together. This approach is useful for detecting duplicate files in a collection.
Set<Set<File>> divide(Collection<File> col)
- Parameters:
  - `col` - A `Collection` of `File` objects to group based on their checksum.
- Returns:
  - A set of subsets of files, where each subset (`Set<File>`) contains files that share the same checksum.
- Purpose:
  - Divides a collection of files into subsets based on their CRC32 checksum values. Files with the same checksum are grouped together.
private long calculateChecksum(File f) throws IOException
- Parameters:
  - `f` - The `File` for which the checksum is to be calculated.
- Returns:
  - `long` - The CRC32 checksum value of the file.
- Throws:
  - `IOException` - If an I/O error occurs while reading the file.
- Purpose: Calculates the CRC32 checksum for a given file. This method reads the file in chunks to optimize memory usage and applies the CRC32 algorithm to generate the checksum (a sketch of this approach follows below).
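A minimal sketch of chunked CRC32 calculation using the JDK's `java.util.zip.CRC32`; the buffer size and stream handling are illustrative choices, not necessarily those of `CRC32Grouper`.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

class ChecksumSketch {
    // Reads the file in fixed-size chunks and feeds them to a CRC32 accumulator.
    static long calculateChecksum(File f) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192]; // illustrative chunk size
        try (InputStream in = new FileInputStream(f)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                crc.update(buffer, 0, read);
            }
        }
        return crc.getValue();
    }
}
```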
- Parallel Stream Processing:
  - Files are processed in parallel for performance. This ensures that large file collections are grouped efficiently.
- Grouping Logic:
  - Uses a `Map` to associate checksum values with sets of files.
  - Files with identical checksums are collected into the same group.
- Filter Unique Groups:
  - Only file groups with more than one file are retained as potential duplicates.
```java
import pl.magzik.grouping.CRC32Grouper;
import pl.magzik.grouping.Grouper; // assumed to live in the same package as CRC32Grouper

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Set;

public class Main {
    public static void main(String[] args) throws IOException {
        List<File> files = List.of(
            new File("file1.txt"),
            new File("file2.txt"),
            new File("duplicate_file1.txt")
        );

        Grouper grouper = new CRC32Grouper();
        Set<Set<File>> groupedFiles = grouper.divide(files);

        groupedFiles.forEach(group -> {
            System.out.println("Group of duplicate files:");
            group.forEach(file -> System.out.println(" - " + file.getName()));
        });
    }
}
```
The `FileOperation` interface defines a standardized contract for performing file management operations such as:
- Loading files.
- Moving files to a specified directory.
- Deleting files.

It supports operations on both collections of files and individual arrays of files. Default methods ensure flexibility by delegating array-based operations to their corresponding collection-based methods (see the sketch below).
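A minimal sketch of how such a delegating default method is typically written, assuming the collection-based variant is the abstract method; this illustrates the pattern, not the interface's verbatim source.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

interface FileOperationSketch {
    // Collection-based operation: the "real" abstract method.
    List<File> load(Collection<File> files) throws IOException;

    // Array-based convenience overload delegates to the collection variant.
    default List<File> load(File... files) throws IOException {
        return load(Arrays.asList(files));
    }
}
```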
List<File> load(Collection<File> files) throws IOException
- Parameters:
  - `files` - A `Collection` of `File` objects to be loaded.
- Returns:
  - `List<File>` - A list containing the loaded files.
- Throws:
  - `IOException` - If an I/O error occurs while loading the files.
- Purpose: Loads the provided collection of files. The operation may involve verifying file existence, reading metadata, or other preparatory operations.
default List<File> load(File... files) throws IOException
- Parameters:
  - `files` - An array of files to be loaded.
- Returns:
  - `List<File>` - A list containing the loaded files.
- Throws:
  - `IOException` - If an I/O error occurs while loading the files.
- Purpose: Loads the provided array of files. Delegates to the collection-based `load(Collection<File>)` method.
void move(File destination, Collection<File> files) throws IOException
- Parameters:
  - `destination` - The target directory for the moved files.
  - `files` - The `Collection` of `File` objects to be moved.
- Throws:
  - `IOException` - If an I/O error occurs while moving the files.
- Purpose: Moves the provided collection of files to a specified destination directory.
default void move(File destination, File... files) throws IOException
- Parameters:
  - `destination` - The target directory for the moved files.
  - `files` - An array of files to be moved.
- Throws:
  - `IOException` - If an I/O error occurs while moving the files.
- Purpose: Moves the provided array of files to the specified destination directory. Delegates to the collection-based `move(File, Collection<File>)` method.
void delete(Collection<File> files) throws IOException
- Parameters:
  - `files` - The `Collection` of `File` objects to be deleted.
- Throws:
  - `IOException` - If an I/O error occurs while deleting the files.
- Purpose: Deletes the provided collection of files.
default void delete(File... files) throws IOException
- Parameters:
  - `files` - An array of files to be deleted.
- Throws:
  - `IOException` - If an I/O error occurs while deleting the files.
- Purpose: Deletes the provided array of files. Delegates to the collection-based `delete(Collection<File>)` method.
```java
public class SimpleFileOperation implements FileOperation {
    @Override
    public List<File> load(Collection<File> files) throws IOException {
        for (File file : files) {
            if (!file.exists()) {
                throw new IOException("File not found: " + file.getName());
            }
        }
        return List.copyOf(files);
    }

    @Override
    public void move(File destination, Collection<File> files) throws IOException {
        if (!destination.isDirectory()) {
            throw new IOException("Destination must be a directory.");
        }
        for (File file : files) {
            File target = new File(destination, file.getName());
            if (!file.renameTo(target)) {
                throw new IOException("Failed to move file: " + file.getName());
            }
        }
    }

    @Override
    public void delete(Collection<File> files) throws IOException {
        for (File file : files) {
            if (!file.delete()) {
                throw new IOException("Failed to delete file: " + file.getName());
            }
        }
    }

    public static void main(String[] args) {
        FileOperation fileOperation = new SimpleFileOperation();
        try {
            File file1 = new File("file1.txt");
            File file2 = new File("file2.txt");
            File destination = new File("targetDirectory");

            fileOperation.load(file1, file2);
            fileOperation.move(destination, file1, file2);
            fileOperation.delete(file1, file2);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
The `FileOperator` class provides an efficient and asynchronous implementation of file operations such as loading, moving, and deleting files. It is designed for I/O-bound operations and utilizes virtual threads and `CompletableFuture` for parallelism and non-blocking processing.

The operations performed by `FileOperator` include:
- Pre-validation: Ensures that provided files exist and are accessible.
- Regular File Validation: Processes and validates individual files using a `FilePredicate`.
- Directory Validation: Recursively processes directories up to a specified depth, validating and collecting files.

It uses a configurable depth for directory traversal and an `ExecutorService` for asynchronous processing (a sketch of the virtual-thread setup follows below).
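For reference, this is how a default virtual-thread executor can be obtained on JDK 21+ and combined with `CompletableFuture`; a sketch of the pattern the class description implies, not its actual constructor body.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class VirtualThreadSketch {
    public static void main(String[] args) {
        // One virtual thread per task: cheap threads suited to I/O-bound work (JDK 21+).
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            CompletableFuture<String> task = CompletableFuture.supplyAsync(
                    () -> "validated", // stand-in for a file validation step
                    executor);
            System.out.println(task.join());
        } // close() waits for submitted tasks to finish
    }
}
```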
FileOperator(FilePredicate filePredicate, int depth)
- Parameters:
  - `filePredicate` - The `FilePredicate` used to validate files.
  - `depth` - The directory traversal depth.
- Purpose: Creates a `FileOperator` with the specified file predicate, directory traversal depth, and a default virtual thread executor.
FileOperator(FilePredicate filePredicate, int depth, ExecutorService executorService)
- Parameters:
  - `filePredicate` - The `FilePredicate` used to validate files.
  - `depth` - The directory traversal depth.
  - `executorService` - The `ExecutorService` used in asynchronous operations.
- Purpose: Allows injecting a custom `ExecutorService` for task execution.
void setDepth(int depth)
- Parameters:
  - `depth` - The depth to set.
- Purpose: Sets the depth for directory traversal.
List<File> load(Collection<File> files) throws IOException
- Parameters:
  - `files` - The `Collection` of `File` objects to be loaded.
- Returns:
  - A `List<File>` of validated files.
- Throws:
  - `IOException` - If pre-validation fails.
- Purpose: Loads, validates, and processes the provided collection of files and directories.
  - Pre-validation: Ensures files exist.
  - Regular File Validation: Validates regular files concurrently.
  - Directory Validation: Recursively processes directories to extract and validate files up to the specified depth.
private List<File> handleRegularFiles(Collection<File> files)
- Parameters:
  - `files` - The `Collection` of `File` objects to be validated.
- Returns:
  - A `List<File>` of validated files.
- Purpose:
  - This method filters the provided collection of files to include only regular files (not directories), and processes each file asynchronously using the configured `ExecutorService`. Each file is validated using the `FileValidator`. If a file is valid, it is included in the result list; otherwise, it is ignored.
private List<File> handleDirectories(Collection<File> files)
- Parameters:
  - `files` - The `Collection` of `File` objects to be validated.
- Returns:
  - A `List<File>` of validated files.
- Purpose:
  - This method filters the provided collection of files to include only directories, and processes each directory asynchronously using the configured `ExecutorService`. It recursively walks through each directory up to the specified depth, extracting all files and validating them.
void move(File destination, Collection<File> files) throws IOException
- Parameters:
  - `destination` - Target directory where files will be moved.
  - `files` - The `Collection` of `File` objects to move.
- Throws:
  - `IOException` - If a file move operation fails.
- Purpose: Moves files to the specified destination directory.
  - Each file move operation is executed asynchronously using `CompletableFuture`.
  - If a file cannot be moved, it logs the error and throws an `IOException`.
void delete(Collection<File> files) throws IOException
- Parameters:
  - `files` - The `Collection` of `File` objects to delete.
- Throws:
  - `IOException` - If a file deletion operation fails.
- Purpose: Deletes the specified collection of files.
  - Each file delete operation is executed asynchronously using `CompletableFuture`.
  - If a file cannot be deleted, it logs the error and throws an `IOException`.
- Virtual Threads: Ensures scalability for I/O-bound operations.
- Error Handling: Errors are logged, and exceptions are rethrown as `IOException` for consistency.
- Custom Predicate: Validation logic can be customized using the `FilePredicate`.
- Depth Control: Directory recursion is limited by the specified depth.
```java
public class FileOperatorExample {
    public static void main(String[] args) {
        // Define a predicate for file validation
        FilePredicate filePredicate = file -> file.getName().endsWith(".txt");

        // Initialize FileOperator with a depth of 2
        FileOperator fileOperator = new FileOperator(filePredicate, 2);

        // Files and directories to process
        List<File> files = Arrays.asList(new File("file1.txt"), new File("directory1"));

        try {
            // Load files
            List<File> validatedFiles = fileOperator.load(files);
            System.out.println("Validated Files: " + validatedFiles);

            // Move files
            File destination = new File("destination");
            fileOperator.move(destination, validatedFiles);

            // Delete files
            fileOperator.delete(validatedFiles);
            System.out.println("Files successfully deleted.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
The `FileValidator` class serves as a utility for validating files based on a user-defined `FilePredicate`. It provides the following functionality:
- Pre-validation: Ensures all files in a given collection exist; otherwise, it throws an `IOException`.
- Validation: Checks individual files or paths against a provided `FilePredicate`.
- Error Handling: Wraps and rethrows file-related exceptions in a clean and consistent manner.
FileValidator(FilePredicate predicate)
- Parameters:
  - `predicate` - The condition to validate files against.
void preValidate(Collection<File> files) throws IOException
- Parameters:
  - `files` - The `Collection` of `File` objects to validate.
- Throws:
  - `IOException` - If any file in the collection doesn't exist.
- Purpose: Ensures all files in the given collection exist (see the sketch after this list).
  - Filters out files that do not exist using `File::exists`.
  - Throws an `IOException` if any file is missing.
  - The exception includes the absolute path of the first missing file.
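A minimal sketch of that existence check with streams; the message format is illustrative, but the behavior (fail on the first missing file, reporting its absolute path) matches the description above.

```java
import java.io.File;
import java.io.IOException;
import java.util.Collection;
import java.util.Optional;

class PreValidateSketch {
    // Fails fast if any file in the collection does not exist.
    static void preValidate(Collection<File> files) throws IOException {
        Optional<File> missing = files.stream()
                .filter(f -> !f.exists())
                .findFirst();
        if (missing.isPresent()) {
            throw new IOException("File not found: " + missing.get().getAbsolutePath());
        }
    }
}
```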
boolean validate(File file) throws IOException
- Parameters:
  - `file` - The `File` to validate.
- Returns:
  - `true` if the file is valid according to the predicate, `false` otherwise.
- Throws:
  - `IOException` - If an error occurs while accessing the file.
- Purpose: Validates a single `File` object.
  - Checks if the file is a regular file (`file.isFile()`).
  - Tests the file against the provided `FilePredicate`.
  - Returns `true` if the file passes validation, `false` otherwise.
boolean validate(Path path) throws IOException
- Parameters:
  - `path` - The `Path` of the file to validate.
- Returns:
  - `true` if the file is valid according to the predicate, `false` otherwise.
- Throws:
  - `IOException` - If an error occurs while accessing the file.
- Purpose: Validates a file represented as a `Path` object.
  - Converts the `Path` to a `File` using `path.toFile()`.
  - Delegates to the `validate(File)` method.
Path filePath = Path.of("example.txt");
FilePredicate predicate = f -> f.length() > 0; // Files with content
FileValidator validator = new FileValidator(predicate);
try {
boolean isValid = validator.validate(filePath);
System.out.println("File is valid: " + isValid);
} catch (IOException e) {
System.err.println("Error: " + e.getMessage());
}
The `FileVisitor` class is an implementation of `SimpleFileVisitor` designed to perform asynchronous file processing during a file tree traversal. It works with a `FileValidator` to validate files and collects valid files in a thread-safe `Set<Path>`.

The processing is carried out asynchronously using an `ExecutorService`, allowing efficient parallel execution.
FileVisitor(ExecutorService executorService, FileValidator fileValidator)
- Parameters:
  - `executorService` - The `ExecutorService` for asynchronous execution.
  - `fileValidator` - The `FileValidator` used to check files against conditions.
FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
- Parameters:
  - `file` - The `Path` of the file being visited.
  - `attrs` - File attributes for the visited file.
- Returns:
  - `FileVisitResult.CONTINUE` - Indicates that the traversal should continue.
- Purpose: Processes each file encountered during a file tree walk.
  - Validates the file asynchronously using `CompletableFuture`.
  - If the file is valid (a regular file that satisfies the `FileValidator`), it is added to the `files` set.
FileVisitResult visitFileFailed(Path file, IOException exc)
- Parameters:
  - `file` - The `Path` of the file that could not be visited.
  - `exc` - The exception thrown during visitation.
- Returns:
  - `FileVisitResult.CONTINUE` - Indicates that the traversal should continue even after a failure.
- Purpose: Handles scenarios where a file cannot be visited due to an error.
  - Logs a warning message indicating the file path and the cause of the failure.
Set<Path> getFiles()
- Returns:
  - A `Set<Path>` containing the paths of valid files.
- Purpose: Returns the set of valid files collected during the file tree walk.
  - Waits for all asynchronous tasks (`CompletableFuture`) to complete using `CompletableFuture::join`.
  - Ensures that the method blocks until all file validation is completed.
- Initialization:
  - The class is initialized with an `ExecutorService` and a `FileValidator`.
- Traversal:
  - `visitFile(...)` processes each file and validates it asynchronously.
  - `visitFileFailed(...)` logs errors for files that cannot be accessed.
- Asynchronous Validation:
  - Files are validated in parallel using `CompletableFuture`.
  - Valid files are added to the concurrent set.
- Result Retrieval:
  - `getFiles()` blocks until all validation tasks are complete and returns the collected files.
```java
ExecutorService executorService = Executors.newFixedThreadPool(4);
FileValidator validator = new FileValidator(f -> f.getName().endsWith(".txt"));
FileVisitor visitor = new FileVisitor(executorService, validator);

try {
    Files.walkFileTree(Path.of("/my-directory"), visitor);
    Set<Path> validFiles = visitor.getFiles();
    validFiles.forEach(System.out::println);
} catch (IOException e) {
    System.err.println("Error traversing files: " + e.getMessage());
} finally {
    executorService.shutdown();
}
```
The `FilePredicate` interface represents a functional interface for performing file-based validation or checks. It is an enhanced version of `java.util.function.Predicate`, designed specifically for `File` objects and capable of throwing `IOException`.
boolean test(File file) throws IOException
- Parameters:
  - `file` - The `File` to evaluate or validate.
- Returns:
  - `true` if the file matches the condition (predicate), `false` otherwise.
- Throws:
  - `IOException` - If an I/O error occurs while evaluating the file.
- Purpose: Evaluates the given `File` object against a specific condition.
- IOException Handling:
  - Unlike `java.util.function.Predicate`, this interface supports methods that may throw an `IOException`.
  - This is essential for I/O-based operations, such as checking file content or accessing metadata.
- Lambda Support:
  - The single abstract method makes the interface suitable for lambda expressions and concise code.
- Flexibility:
  - Allows various implementations, from basic checks (file size, existence) to complex content validation.
```java
FilePredicate readablePredicate = file -> file.canRead();
File file = new File("example.txt");

try {
    if (readablePredicate.test(file)) {
        System.out.println(file.getName() + " is readable.");
    } else {
        System.out.println(file.getName() + " is not readable.");
    }
} catch (IOException e) {
    System.err.println("Error checking file: " + e.getMessage());
}
```
```java
FilePredicate largeFilePredicate = file -> file.length() > 1024 * 1024;
File file = new File("largeFile.dat");

try {
    if (largeFilePredicate.test(file)) {
        System.out.println(file.getName() + " is larger than 1MB.");
    } else {
        System.out.println(file.getName() + " is smaller than 1MB.");
    }
} catch (IOException e) {
    System.err.println("Error checking file size: " + e.getMessage());
}
```
```java
FilePredicate contentCheckPredicate = file -> {
    // Close the stream returned by Files.lines to avoid leaking the file handle
    try (var lines = Files.lines(file.toPath())) {
        return lines.anyMatch(line -> line.contains("TODO"));
    }
};
File file = new File("sourceCode.java");

try {
    if (contentCheckPredicate.test(file)) {
        System.out.println(file.getName() + " contains 'TODO'.");
    } else {
        System.out.println(file.getName() + " does not contain 'TODO'.");
    }
} catch (IOException e) {
    System.err.println("Error reading file: " + e.getMessage());
}
```
While `FilePredicate` doesn't provide default methods like `and()` or `or()`, you can combine predicates manually:
```java
FilePredicate readableAndLarge = file -> file.canRead() && file.length() > 1024 * 1024;
File file = new File("data.txt");

try {
    if (readableAndLarge.test(file)) {
        System.out.println("The file is readable and large.");
    } else {
        System.out.println("The file does not satisfy the condition.");
    }
} catch (IOException e) {
    System.err.println("Error: " + e.getMessage());
}
```
The `ImageFilePredicate` class is an implementation of the `FilePredicate` interface that validates image files based on their magic numbers. Magic numbers are binary signatures at the start of a file that uniquely identify its format. The predicate supports validation for common image formats like JPG, PNG, GIF, BMP, and more.
The class comes with default magic numbers for the following formats:
- JPG, JPEG: `FFD8FF`
- PNG: `89504E470D0A1A0A`
- GIF: `474946383761`, `474946383961`
- BMP: `424D`
- TIFF: `49492A00`, `4D4D002A`
- ICO: `00000100`
- JP2, J2K, JPC: `0000000C6A5020200D0A870A`, `FF4FFF51`
ImageFilePredicate(Map<String, Set<String>> magicNumbers)
- Parameters:
  - `magicNumbers` - A `Map` of file extensions to magic numbers.
- Purpose: Allows custom initialization of file extensions and corresponding magic numbers.
ImageFilePredicate()
- Purpose: Initializes the class with default magic numbers for common image formats.
boolean test(File file) throws IOException
- Parameters:
  - `file` - The `File` to be validated.
- Returns:
  - `true` if the file matches one of the magic numbers for its extension, `false` otherwise.
- Throws:
  - `IOException` - If the file cannot be read or is corrupted.
- Purpose: Tests if a file matches a known magic number based on its extension.
private String getExtension(File file)
- Parameters:
  - `file` - The `File` from which the extension is to be extracted.
- Returns:
  - The file extension (e.g., `PNG`, `JPG`), or an empty string if no extension exists.
- Purpose: Extracts the file extension in uppercase from a file name.
private String getFileMagicNumber(File file, int bytesToRead) throws IOException
- Parameters:
  - `file` - The `File` to read.
  - `bytesToRead` - The number of bytes to read from the file.
- Returns:
  - The magic number as a hexadecimal string.
- Throws:
  - `IOException` - If an error occurs while reading the file or if the file is too short.
- Purpose: Reads the first `bytesToRead` bytes from a file and converts them into a hexadecimal string to represent the file's magic number.
private String bytesToHex(byte[] bytes)
- Parameters:
  - `bytes` - The array of bytes to be converted.
- Returns:
  - The hexadecimal string representation of the bytes.
- Purpose: Converts a byte array to a hexadecimal string (a sketch of this helper pair follows below).
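A minimal sketch of these two helpers: reading a file's leading bytes and rendering them as uppercase hex. The buffer handling and error message are illustrative; the actual class may differ in such details.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class MagicNumberSketch {
    // Reads the first bytesToRead bytes and returns them as an uppercase hex string.
    static String getFileMagicNumber(File file, int bytesToRead) throws IOException {
        byte[] buffer = new byte[bytesToRead];
        try (InputStream in = new FileInputStream(file)) {
            if (in.readNBytes(buffer, 0, bytesToRead) < bytesToRead) {
                throw new IOException("File is too short: " + file.getName());
            }
        }
        return bytesToHex(buffer);
    }

    // Converts each byte to two uppercase hex digits.
    static String bytesToHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }
}
```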
- File Extension Extraction:
  - The `getExtension(...)` method extracts the file extension (e.g., `PNG`, `JPG`) to determine the expected magic numbers.
- Reading Magic Numbers:
  - The `getFileMagicNumber(...)` method reads the first few bytes of the file and converts them to a hexadecimal representation.
- Validation:
  - The extracted magic number is compared against the known magic numbers for that extension.
  - If a match is found, the file is considered valid.
```java
FilePredicate imagePredicate = new ImageFilePredicate();
File jpgFile = new File("photo.jpg");

try {
    boolean isValid = imagePredicate.test(jpgFile);
    System.out.println(jpgFile.getName() + " is valid: " + isValid);
} catch (IOException e) {
    System.err.println("Failed to validate file: " + e.getMessage());
}
```
```java
Map<String, Set<String>> customMagicNumbers = Map.of(
    "WEBP", Set.of("52494646") // Magic number for WEBP images
);
FilePredicate customPredicate = new ImageFilePredicate(customMagicNumbers);
File webpFile = new File("image.webp");

try {
    if (customPredicate.test(webpFile)) {
        System.out.println("This is a valid WEBP image.");
    } else {
        System.out.println("Invalid WEBP image.");
    }
} catch (IOException e) {
    System.err.println("Validation failed: " + e.getMessage());
}
```