The program counts the number of unique ip addresses from the input file according to the following requirements.
- Requirements
- Naive approach
- Target solution
3.1. Pros & Cons
3.2. Testing & measuring
- Installation
- Not very good solutions
5.1. IPv4 Integer representation
5.2. IntHashSet - custom HashSet with Java primitives
5.3. Pagination/Partitioning & Caching
- File size is unlimited
- The program should use as little memory & time as possible
- Java 17 or higher
The file contains ip addresses line by line as follows:
127.0.0.1
4.64.10.13
19.45.6.140
19.45.6.140
3.42.11.8
...
The naive solution is to read the file line by line and use `HashSet<String>` to detect duplicates. The main drawback of this solution is its huge memory requirements.
To store the maximum range of ip addresses in `HashSet<String>`, the following memory is used:
((7 B + 15 B) / 2) * (2^32 - 1)
= 11 B * 4,294,967,295
= 47,244,640,245 B
= 46,137,343 KiB
= 45,055 MiB
= 44 GiB + meta data
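A minimal sketch of this naive approach (class and method names are illustrative, not taken from the repository):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class NaiveCounter {
    // Reads the file line by line and relies on HashSet semantics
    // to keep only unique ip address strings.
    public static long countUnique(Path file) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                seen.add(line); // duplicates are ignored by the set
            }
        }
        return seen.size();
    }
}
```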
It is known that ip address values lie in the range from 0 to 2^32 - 1, so an ip address can be represented as a 32-bit integer value. The basic idea is to use an array, each element of which indicates the presence or absence of an ip address and has two states:
- 0 - ip address is absent and
- 1 - ip address is present.
Then, to check if the ip address is unique, we need to check the state of the element of this array whose index is equal to the 32-bit integer value of the ip address.
According to the possible values, the most memory-efficient structure would be an array of bits. Then the length N of the array is 2^32 bits.
The array structure is shown in the following figure:
ip (int): 0 1 2 3 4 5 --> N
array: [0][0][0][0][1][0] ... [0]
|
ip 4 is present
0.0.0.4 in text representation
Memory usage:
Bit: 2^32 = 4,294,967,296
Byte: 2^29 = 536,870,912
KiB: 2^19 = 524,288
MiB: 2^9 = 512
GiB: 2^-1 = 0.5
Java doesn't provide an array of bits, but we can use an array of byte, short, int, or long instead. This means that each array element contains S bits, depending on the type selected. Then the length M of the array is 2^32 / S, or 2^(32 - log2(S)).
data type | S | log2(S) | M | memory usage |
---|---|---|---|---|
byte | 8 | 3 | 2^29 = 536,870,912 | 2^29 * 8 bit = 512 MiB |
short | 16 | 4 | 2^28 = 268,435,456 | 2^28 * 16 bit = 512 MiB |
int | 32 | 5 | 2^27 = 134,217,728 | 2^27 * 32 bit = 512 MiB |
long | 64 | 6 | 2^26 = 67,108,864 | 2^26 * 64 bit = 512 MiB |
The array structure is shown in the following figure:
array: [0] [i] ... [M]
| |
|------------------------| |------------------------|
ip (int): 0 1 2 3 4 --> S-1 i i+1 i+2 i+3 i+4 --> i+S-1
[0] [0] [0] [0] [1] ... [0] [0] [0] [0] [0] [0] ... [0]
|
ip 4 is present
0.0.0.4 in text representation
To check if a 32-bit ip address p is in the array, the following steps should be performed:
- find an element e by executing `array[p / S]`; however, division doesn't work in Java because all integer types use the high bit as the sign; therefore, to perform unsigned division, we can use the unsigned right shift operator `>>>`, as shown: `array[p >>> log2(S)]`
- check a bit in the found element by executing `e & (1 << (p & (S - 1)))`, where:
  - `p & (S - 1)` finds the bit position as the remainder of dividing p by S; the `p % S` operator does not work because of signed integer types in Java
  - `1 << (...)` shifts the bit to the left by the desired position
  - `e & (...)` checks the bit using bitwise AND
- a result of `0` means that the ip address is unique (has not been processed before); any non-zero result means that the ip address is a duplicate
If the result was `0`, the next action is to write the 32-bit ip address p to the array as follows:
`array[p >>> log2(S)] |= (1 << (p & (S - 1)))`
where the `|=` operator applies bitwise OR and writes the result back to the element e.
To count unique ip addresses, the program calculates the difference between all ip addresses and duplicates.
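The check-and-set steps above can be sketched with a `long[]` bit array (S = 64, log2(S) = 6); class and method names here are illustrative, not taken from the repository:

```java
public class BitArrayCounter {
    private static final int SHIFT = 6;            // log2(64), S = 64 for long
    private final long[] bits = new long[1 << 26]; // 2^26 longs = 2^32 bits = 512 MiB
    private long unique = 0;

    // Processes a 32-bit ip address: checks the bit, and if it was 0,
    // sets it and counts the address as unique.
    public void add(int ip) {
        int index = ip >>> SHIFT;        // unsigned division by S
        long mask = 1L << (ip & 63);     // bit position = p & (S - 1)
        if ((bits[index] & mask) == 0) { // 0 means not seen before
            bits[index] |= mask;         // record the address with bitwise OR
            unique++;
        }
    }

    public long uniqueCount() {
        return unique;
    }
}
```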
Pros:
- the program uses a relatively small amount of memory even when processing very large files
- with proper micro-optimization, this seems like the fastest solution
Cons:
- the program does not adapt to the amount of input data and uses the same amount of memory even when processing a small file
- the solution doesn't scale to work with IPv6 addresses
The program was tested on the following file (download link):
size | total ip addresses | unique ip addresses | avg time |
---|---|---|---|
106 GB | 8,000,000,000 (8B) | 1,000,000,000 (1B) | ~18 min |
Environment:
- OS:
Microsoft Windows 11 Home 10.0.22631 N/A Build 22631
- Java version:
java version "20.0.1" 2023-04-18 Java(TM) SE Runtime Environment (build 20.0.1+9-29) Java HotSpot(TM) 64-Bit Server VM (build 20.0.1+9-29, mixed mode, sharing)
- CPU:
11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz
- RAM:
32.0 GB (31.7 GB usable)
- SSD:
NVMe PC711 NVMe SK hynix 1TB
Necessary environment:
- Installed Java 20+
- Clone the repository:
$ git clone https://github.com/elimxim/ip4counter
- Run the build script:
$ ./gradlew build
- Unzip the distribution, which can be found in the directory:
$ cd build/distributions/ip4counter-1.X
To run the program:
$ ./ip4counter <path>
The first optimization that comes to mind is to use a 32-bit integer representation of the ip address. This will significantly reduce memory requirements, but not enough.
data type | length | memory (bytes) |
---|---|---|
String (latin1) | 7..15 | 7..15 + meta data |
int (primitive) | N/A | 4 (const) |
Integer | N/A | 4 + meta data |
To store the maximum range of ip addresses in `HashSet<Integer>`, the following memory is used:
4 B * (2^32 - 1)
= 4 B * 4,294,967,295
= 17,179,869,180 B
= 16,777,216 KiB
= 16,384 MiB
= 16 GiB + meta data
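The conversion from the text form to the 32-bit integer representation can be sketched with a hand-rolled parser (assuming well-formed input; the class name is illustrative):

```java
public class Ip4Parser {
    // Converts "a.b.c.d" to its 32-bit integer representation by
    // packing the four octets into one int, highest octet first.
    public static int toInt(String ip) {
        int result = 0;
        int octet = 0;
        for (int i = 0; i < ip.length(); i++) {
            char c = ip.charAt(i);
            if (c == '.') {
                result = (result << 8) | octet; // append the finished octet
                octet = 0;
            } else {
                octet = octet * 10 + (c - '0'); // accumulate decimal digits
            }
        }
        return (result << 8) | octet; // append the last octet
    }
}
```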
In addition, the file may be larger than the maximum ip address range. This means that the solution has to be able to save data from memory to disk.
To reduce memory usage and avoid boxing/unboxing of primitives, a custom `HashSet` that stores primitive integers can be used. Additionally, it can load data directly from a file.
The main disadvantage of this structure is the complexity of implementation.
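A minimal sketch of such a set with open addressing and linear probing (simplified: fixed power-of-two capacity and no resizing, so it illustrates the idea rather than a production implementation):

```java
public class IntHashSet {
    private final int[] keys;
    private final boolean[] used; // marks occupied slots, so 0 is a valid key
    private final int mask;
    private int size = 0;

    // capacityPow2 must be a power of two so that (hash & mask) replaces %.
    public IntHashSet(int capacityPow2) {
        keys = new int[capacityPow2];
        used = new boolean[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // Adds a primitive int; returns true if the value was not present.
    public boolean add(int value) {
        int i = mix(value) & mask;
        while (used[i]) {          // linear probing on collision
            if (keys[i] == value) {
                return false;      // already present
            }
            i = (i + 1) & mask;
        }
        keys[i] = value;
        used[i] = true;
        size++;
        return true;
    }

    public int size() {
        return size;
    }

    // Simple integer hash to spread out sequential values.
    private static int mix(int x) {
        x ^= x >>> 16;
        x *= 0x45d9f3b;
        return x ^ (x >>> 16);
    }
}
```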
The basic idea is to be able to save and load `HashSet<Integer>` to and from disk depending on the page/partition. This allows the data to be divided into parts that can be held in memory. This means that each page/partition has its own `HashSet<Integer>` to save and load to or from disk as quickly as possible.
The file format should contain the size of the buckets in the `HashSet` and then the data that can be read. Due to the integer data type, a binary format is the best option.
[size][int1][int2][int3][int4]...[intN]
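Reading and writing this layout can be sketched with `DataOutputStream`/`DataInputStream`, which store ints as 4 big-endian bytes (class and method names are illustrative):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PageIO {
    // Writes the page as [size][int1]...[intN] in binary form.
    public static void write(Path file, int[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(values.length); // size header
            for (int v : values) {
                out.writeInt(v);         // raw 4-byte big-endian ints
            }
        }
    }

    // Reads the page back: first the size, then that many ints.
    public static int[] read(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int size = in.readInt();
            int[] values = new int[size];
            for (int i = 0; i < size; i++) {
                values[i] = in.readInt();
            }
            return values;
        }
    }
}
```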
The page/partition can be the first octet of the ip address. This means that the number of ip addresses sharing a first octet value is at most 2^24, since only the remaining 3 bytes vary.
o1 = 0..255 #first octet
page/partition = o1
4 B * 2^24
= 4 B * 16,777,216
= 67,108,864 B
= 65,536 KiB
= 64 MiB
This amount of data can be stored and loaded as needed when the program processes ip addresses of the first octet.
However, determining the page/partition from the value of the first octet may not be the best solution. A more efficient approach is to control the number of pages/partitions, for example, by decreasing or increasing it in powers of 2.
o1 = 0..255 #first octet
page/partition = p(o1), where 1 <= p(o1) <= 256
Memory utilization depending on the number of pages/partitions is described in the following table. Note that memory usage is specified as maximum for the maximum range of ip addresses.
p | p-size | range-size | octet1 ranges | memory usage |
---|---|---|---|---|
p0 | 1 | 256 / 1 = 256 | 0..255 | 64 MiB * 256 = 16 GiB |
p1 | 2 | 256 / 2 = 128 | 0..127, 128..255 | 64 MiB * 128 = 8 GiB |
p2 | 4 | 256 / 4 = 64 | 0..63, 64..127, 128..191, 192..255 | 64 MiB * 64 = 4 GiB |
pN | 256 | 256 / 256 = 1 | 0, 1, 2, 3, 4, 5, 6, 7, 8, ..., 255 | 64 MiB * 1 = 64 MiB |
To assign ip addresses to pages/partitions, the following function can be used. Since the number of pages/partitions is a power of 2, the modulus can be calculated in Java with a bitwise AND.
p(o1) = o1 mod p-size
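Assuming p-size is a power of two, the assignment function can be sketched as follows (class name is illustrative):

```java
public class Partitioner {
    private final int pSize; // number of pages/partitions, a power of two

    public Partitioner(int pSize) {
        this.pSize = pSize;
    }

    // p(o1) = o1 mod p-size, computed as o1 & (p-size - 1)
    // because p-size is a power of two.
    public int partition(int ip) {
        int o1 = ip >>> 24; // first octet of the ip address
        return o1 & (pSize - 1);
    }
}
```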
To make the program flexible to configure and to avoid unnecessary work with the file system, input data belonging to pages/partitions that are not currently being processed is stored in a cache. When the cache size threshold is reached, the program switches to the winning page/partition, i.e. the page/partition that appears in the cache the most times. And then the process continues.
Memory utilization depending on the cache size is described in the following table.
cache size | memory usage |
---|---|
1024 | 1024 * 4 B = 4 KiB |
1,048,576 | 1,048,576 * 4 B = 4 MiB |
1,048,576 x 50 | 52,428,800 * 4 B = 200 MiB |
To avoid additional memory allocation, the cache is never cleared.