Place offset manager in commons #373

Open
Claudenw wants to merge 37 commits into base: s3-source-release from KCON-57_place_OffsetManager_in_commons

Conversation

@Claudenw (Contributor) commented on Dec 16, 2024:

Fix for KCON-57

While this looks like a large change, in multiple cases files were simply migrated from the s3-source-connector module to the common module, so those files are counted twice in the diff. This change also removes unused classes/files.

Significant changes are in OffsetManager, S3SourceTask, S3SourceRecord and AWSV2SourceClient.

Made OffsetManager generic to handle multiple OffsetManagerEntry types while simplifying access from the sources.

Each source should implement an instance of OffsetManager.OffsetManagerEntry that tracks the data specific to that source.

The OffsetManagerEntry is included in the source-specific record (e.g. S3SourceRecord), is updated as processing continues, and is the source of record for many of the S3- and Kafka-specific values (e.g. partition, topic, S3 object key) as well as some dynamic data such as the current record number.

The Transformer was modified to update the OffsetManagerEntry as records are returned.
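
As a rough illustration of the pattern (a minimal, hypothetical sketch only: the class name, the backing data map, and the helper methods are assumptions made for this example rather than the actual API in commons), a source-specific entry might look something like this:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a source-specific entry; every name here is illustrative.
public class ExampleS3OffsetManagerEntry implements OffsetManager.OffsetManagerEntry {
    public static final String BUCKET = "bucket";
    public static final String OBJECT_KEY = "objectKey";
    public static final String RECORD_COUNT = "recordCount";

    // Holds both the identifying values and the dynamic state written to the offsets topic.
    private final Map<String, Object> data = new HashMap<>();

    public ExampleS3OffsetManagerEntry(final String bucket, final String objectKey) {
        data.put(BUCKET, bucket);
        data.put(OBJECT_KEY, objectKey);
        data.put(RECORD_COUNT, 0L);
    }

    // The key that identifies this entry in the Kafka offsets topic (the sourcePartition side).
    @Override
    public OffsetManager.OffsetManagerKey getManagerKey() {
        return () -> Map.of(BUCKET, data.get(BUCKET), OBJECT_KEY, data.get(OBJECT_KEY));
    }

    // Called by the Transformer as records are returned, so the stored offset tracks progress.
    public void incrementRecordCount() {
        data.put(RECORD_COUNT, ((Long) data.get(RECORD_COUNT)) + 1L);
    }

    // The value map that becomes the sourceOffset of emitted SourceRecords.
    public Map<String, Object> getProperties() {
        return Map.copyOf(data);
    }
}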

Due to a bug in Kafka, this implementation cannot guarantee write-once functionality. See https://issues.apache.org/jira/browse/KAFKA-14947

Added javadoc.

@Claudenw force-pushed the KCON-57_place_OffsetManager_in_commons branch from b5278e0 to 69ea274 on December 17, 2024 15:15
@Claudenw marked this pull request as ready for review on December 19, 2024 08:32
@Claudenw requested review from a team as code owners on December 19, 2024 08:32
@Claudenw (Contributor, Author) commented:

Unit tests pass; there is an issue with the integration tests not picking up the changes in commons.

if (objectListing.isTruncated()) {
    // get the next set of data and create an iterator on it.
    request.setStartAfter(null);
    request.withContinuationToken(objectListing.getContinuationToken());
A contributor commented:

I am pretty sure the continuation token is all that is required here; you can create a new request and only add the continuation token (though the bucket is possibly also required).
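
For illustration, here is a minimal sketch of that idea against the AWS SDK for Java v1 ListObjectsV2Request shown in the snippet above; bucketName is an assumed field of AWSV2SourceClient, not something taken from the actual code:

// Hypothetical sketch: a fresh request carrying only the bucket and the continuation token.
// The token already encodes where the previous listing stopped.
final ListObjectsV2Request nextRequest = new ListObjectsV2Request()
        .withBucketName(bucketName)
        .withContinuationToken(objectListing.getContinuationToken());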

@aindriu-aiven (Contributor) left a comment:

I had a few comments; some are for future follow-ups, but we should create issues for them so we don't miss them.

    throw new AmazonClientException(e);
}
this.s3ObjectIterator = IteratorUtils.filteredIterator(sourceClient.getIteratorOfObjects(null),
        s3Object -> extractOffsetManagerEntry(s3Object));
A contributor commented:

Lambda can be replaced with method reference

Suggested change
-        s3Object -> extractOffsetManagerEntry(s3Object));
+        this::extractOffsetManagerEntry);

* the Abstract Config to use.
* @return a Stream of SchemaAndValue objects.
*/
public final Stream<SchemaAndValue> getRecords(final IOSupplier<InputStream> inputStreamIOSupplier,
A contributor commented:

This is looking great, a much simplified version.

@muralibasani (Contributor) left a comment:

Need to find out why no events are pushed to the Kafka offsets topic.

@@ -119,6 +118,7 @@ public List<SourceRecord> poll() throws InterruptedException {

while (!connectorStopped.get()) {
    try {
        waitForObjects();
        extractSourceRecords(results);
        LOGGER.info("Number of records extracted and sent: {}", results.size());
        return results;
@muralibasani (Contributor) commented on Dec 20, 2024:

I have an extract of what is sent to the Kafka offsets topic, before this PR and with this PR.

Before this PR :

SourceRecord{
	sourcePartition={bucket=test-bucket0, topic=bytesTest, topicPartition=0},
 	sourceOffset={object_key_s3-source-connector-for-apache-kafka-test-2024-12-20T13:34:01.62052/bytesTest-00000-1734698057527.txt=1}
 }
  ConnectRecord{topic='bytesTest', kafkaPartition=0, key=[B@6e96f788, keySchema=null, value=[B@49e57a97, valueSchema=null, timestamp=null, headers=ConnectHeaders(headers=)}

With this PR :

SourceRecord{
	sourcePartition={partition=0, bucket=test-bucket0, objectKey=s3-source-connector-for-apache-kafka-test-2024-12-20T13:28:08.047694/bytesTest-00000-1734697707480.txt, topic=bytesTest},
	sourceOffset={bucket=test-bucket0, topic=bytesTest, partition=0, objectKey=s3-source-connector-for-apache-kafka-test-2024-12-20T13:28:08.047694/bytesTest-00000-1734697707480.txt, recordCount=0}
}
 ConnectRecord{topic='bytesTest', kafkaPartition=0, key=[B@67e2252f, keySchema=null, value=[B@1d001ae2, valueSchema=null, timestamp=null, headers=ConnectHeaders(headers=)}
  • There are some duplicate keys sent in sourcePartition and sourceOffset, which should be removed.
  • I have tested locally, and no events are pushed to the connect-offset-topic- topic.

I am not sure where the problem is; I am going to debug further. It may be something to do with the new structure.

@Claudenw mentioned this pull request on Dec 20, 2024
 */
@Override
public OffsetManager.OffsetManagerKey getManagerKey() {
    return () -> Map.of(BUCKET, data.get(BUCKET), OBJECT_KEY, data.get(OBJECT_KEY));
A contributor commented:

Instead of storing the object key in the keys, it would be better to store partition ids in the key. We would then have fewer keys.

I just verified the Lenses S3 source connector and the Adobe S3 source connector, and they store partition ids.

Can we think about this too?

A contributor commented:

We have the topic.partitions config; our earlier implementation was based on that.

A contributor commented:

@gharris1727 your suggestion will be helpful here.
According to the javadocs of the OffsetStorageReader.offsets() method, I was thinking we would have to store the topic and partition id in the offset storage keys at least?

@Override
public OffsetManager.OffsetManagerKey getManagerKey() {
    return () -> Map.of(BUCKET, data.get(BUCKET), TOPIC, data.get(TOPIC), PARTITION, data.get(PARTITION));
}

When we have several objects under the specified topics and partitions, and we need to retrieve the stored offset map, how can we better structure the keys?
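
For context, a hedged sketch of how a task could read the stored offset back through Kafka Connect's OffsetStorageReader; the key map passed in has to match the sourcePartition map that was used when the records were produced, and the bucket/topic/partition shape below follows the proposal above rather than the current code:

// Hypothetical lookup by topic/partition key; the literal values are only examples.
final Map<String, Object> partitionKey = Map.of(
        "bucket", "test-bucket0",
        "topic", "bytesTest",
        "partition", 0);
final Map<String, Object> storedOffset = context.offsetStorageReader().offset(partitionKey);
// With many objects under the same topic and partition, the offset value itself still needs
// something like the object key and record count to say where processing should resume.

This assumes context is the SourceTask's SourceTaskContext; the open question remains what the offset value should contain once the object key no longer lives in the key.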

IntegrationBase.consumeOffsetMessages(consumer).forEach(s -> {
    offsetRecs.merge(s.getKey(), s.getRecordCount(), (x, y) -> x > y ? x : y);
});
// FIXME after KAFKA-14947 is fixed.
@muralibasani (Contributor) commented on Dec 24, 2024:

But it is already working in the feature branch. Not sure if it's totally related.
