
S3 source connector #317 (Draft)

wants to merge 87 commits into main
Conversation

muralibasani (Contributor):

No description provided.

@muralibasani muralibasani marked this pull request as ready for review October 31, 2024 11:36
@muralibasani muralibasani requested review from a team as code owners October 31, 2024 11:36
}

@Test
void multiPartUploadBytesTest(final TestInfo testInfo) throws ExecutionException, InterruptedException {
Contributor:

I don't think this test is required, as it is testing the multipart upload rather than the source connector.
The multipart upload writes the file on closing of the stream, so it appears as any other file to the source connector.

* AivenKafkaConnectS3SourceConnector is a Kafka Connect Connector implementation that watches a S3 bucket and generates
* tasks to ingest contents.
*/
public class AivenKafkaConnectS3SourceConnector extends SourceConnector {
Contributor:

I'd personally still really like to rename this to S3SourceConnector

Contributor Author:

This was defined to align with the Sink Connector (AivenKafkaConnectS3SinkConnector).
Happy to change if we have a second opinion.

Contributor:

Yeah, this is a personal preference; I feel we should remove the erroneous use of Aiven in class names.

muralibasani (Contributor Author):

@aindriu-aiven thanks for the review. PR #329 open to fix the review.


private void waitForObjects() throws InterruptedException {
while (!sourceRecordIterator.hasNext() && !connectorStopped.get()) {
LOGGER.debug("Blocking until new S3 files are available.");
Contributor:

Looking at this, should we break out of this loop after a certain number of retries and return null, and then wait for the next polling cycle? It is much of a muchness, but I figure we shouldn't have a sleep that, in the event of some mistake in config, could potentially occupy a thread and sleep indefinitely.

Contributor Author:

Agree; we have a ticket for the poll/synchronized work which also mentions this sleep.
I believe the retries will be addressed there as well.
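The bounded-retry idea the reviewer describes can be sketched as follows. This is a minimal, illustrative sketch, not the PR's implementation: `waitForObjects`, `MAX_RETRIES`, and `SLEEP_MS` here are hypothetical names; only `sourceRecordIterator` and `connectorStopped` mirror the snippet quoted above.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: bound the wait so a misconfigured connector cannot sleep forever.
public class BoundedWait {
    static final int MAX_RETRIES = 5;   // hypothetical limit
    static final long SLEEP_MS = 10;    // hypothetical back-off

    /** Returns true if records became available, false after exhausting retries. */
    static boolean waitForObjects(Iterator<?> sourceRecordIterator, AtomicBoolean connectorStopped)
            throws InterruptedException {
        int attempts = 0;
        while (!sourceRecordIterator.hasNext() && !connectorStopped.get()) {
            if (attempts++ >= MAX_RETRIES) {
                return false; // give up; poll() can return null and be invoked again later
            }
            Thread.sleep(SLEEP_MS);
        }
        return sourceRecordIterator.hasNext();
    }

    public static void main(String[] args) throws InterruptedException {
        // Empty iterator and a running connector: retries are exhausted quickly.
        boolean found = waitForObjects(Collections.emptyIterator(), new AtomicBoolean(false));
        System.out.println(found); // prints "false"
    }
}
```

Returning null from a Kafka Connect `SourceTask.poll()` is legal and simply lets the framework call `poll()` again, which is why bailing out after a few retries is safe here.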

public interface IntegrationBase {

String DOCKER_IMAGE_KAFKA = "confluentinc/cp-kafka:7.7.0";
String PLUGINS_S_3_SOURCE_CONNECTOR_FOR_APACHE_KAFKA = "plugins/s3-source-connector-for-apache-kafka/";
Contributor:

Suggested change
String PLUGINS_S_3_SOURCE_CONNECTOR_FOR_APACHE_KAFKA = "plugins/s3-source-connector-for-apache-kafka/";
String PLUGINS_S3_SOURCE_CONNECTOR_FOR_APACHE_KAFKA = "plugins/s3-source-connector-for-apache-kafka/";

This is actually from a point @AnatolyPopov made


String DOCKER_IMAGE_KAFKA = "confluentinc/cp-kafka:7.7.0";
String PLUGINS_S_3_SOURCE_CONNECTOR_FOR_APACHE_KAFKA = "plugins/s3-source-connector-for-apache-kafka/";
String S_3_SOURCE_CONNECTOR_FOR_APACHE_KAFKA_TEST = "s3-source-connector-for-apache-kafka-test-";
Contributor:

Suggested change
String S_3_SOURCE_CONNECTOR_FOR_APACHE_KAFKA_TEST = "s3-source-connector-for-apache-kafka-test-";
String S3_SOURCE_CONNECTOR_FOR_APACHE_KAFKA_TEST = "s3-source-connector-for-apache-kafka-test-";


long currentOffset;

if (offsetManager.getOffsets().containsKey(partitionMap)) {
Contributor:

Would it make sense to move this if-else into the offset manager? We could create an interface and re-use this functionality in all source connectors going forward.

Contributor Author:

Yes, it would be better that way.
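The refactoring agreed on above might look roughly like this. Everything here is a hypothetical sketch: `OffsetManager`, `getOffsetFor`, and the string partition key are illustrative stand-ins, not the connector's actual types.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: fold the "known partition? use stored offset : use default" branch
// into a reusable default method, so every source connector shares it.
interface OffsetManager<K> {
    Map<K, Long> getOffsets();

    /** Returns the stored offset for the partition, or the default for unseen partitions. */
    default long getOffsetFor(K partitionMap, long defaultOffset) {
        return getOffsets().getOrDefault(partitionMap, defaultOffset);
    }
}

public class OffsetManagerSketch implements OffsetManager<String> {
    private final Map<String, Long> offsets = new HashMap<>();

    @Override
    public Map<String, Long> getOffsets() {
        return offsets;
    }

    public static void main(String[] args) {
        OffsetManagerSketch manager = new OffsetManagerSketch();
        manager.getOffsets().put("bucket/key-1", 42L);
        System.out.println(manager.getOffsetFor("bucket/key-1", 0L)); // prints "42"
        System.out.println(manager.getOffsetFor("bucket/key-2", 0L)); // prints "0"
    }
}
```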

@muralibasani muralibasani marked this pull request as draft November 7, 2024 09:55
AnatolyPopov and others added 11 commits November 25, 2024 11:12
Partially fixes https://aiven.atlassian.net/browse/KCON-2

Currently, when max.tasks is set above 1, each of those tasks processes all objects in the bucket, which should not be the case.

This PR does the following (bug fix for distributed mode):

* Based on the hash of the object key, assigns objects to tasks
* Updated integration tests with max tasks > 1
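The hash-based assignment described above can be sketched with stdlib Java only. This is illustrative, assuming the task id and max.tasks values the framework hands each task; `ownedByTask` is a hypothetical helper, not the PR's code.

```java
// Sketch: each task claims only the object keys whose hash maps to its task id,
// so with max.tasks > 1 no object is processed by two tasks.
public class TaskAssignment {
    /** floorMod keeps the result non-negative even when hashCode() is negative. */
    static boolean ownedByTask(String objectKey, int taskId, int maxTasks) {
        return Math.floorMod(objectKey.hashCode(), maxTasks) == taskId;
    }

    public static void main(String[] args) {
        int maxTasks = 3;
        for (String key : new String[] {"topic/part-0.avro", "topic/part-1.avro"}) {
            for (int task = 0; task < maxTasks; task++) {
                if (ownedByTask(key, task, maxTasks)) {
                    System.out.println(key + " -> task " + task);
                }
            }
        }
    }
}
```

The useful property is that for any key exactly one task id satisfies the predicate, which is what restores correctness in distributed mode.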
… all (#351)

Currently the transformers load the files and build a full list of records. This could cause performance issues for large files.

* With Stream/StreamSupport, a record is transformed only when next() is called on the iterator.
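The laziness that commit relies on can be demonstrated with a stdlib-only sketch (not the connector's transformer code): the mapping function stands in for record transformation and is observed to run only when the iterator is pulled.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Sketch: Stream.map is lazy, so wrapping raw input in a mapped Stream means a
// record is transformed only when next() is called on the derived iterator.
public class LazyTransform {
    public static void main(String[] args) {
        AtomicInteger transformed = new AtomicInteger();
        Stream<String> records = Arrays.asList("a", "b", "c").stream()
                .map(raw -> {                       // not executed until an element is pulled
                    transformed.incrementAndGet();
                    return raw.toUpperCase();
                });

        Iterator<String> it = records.iterator();
        System.out.println(transformed.get());      // prints "0": nothing transformed yet
        System.out.println(it.next());              // prints "A"
        System.out.println(transformed.get());      // prints "1": only one record so far
    }
}
```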
RyanSkraba and others added 18 commits November 25, 2024 16:03
* Migrating tests to Awaitility instead of plain Thread.sleep
* Some refactoring to unify message consumption logic in tests where
possible.
Current implementation cannot handle large Avro files, due to the
initialisation of the stream in try-with-resources within the transformer.

- In this custom splitter, the tryAdvance method reads and processes one
  record at a time.
- Updated the integration test with a large number of Avro records in one
  object.
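The one-record-at-a-time spliterator idea can be sketched as below. This is a simplified illustration: newline-delimited text stands in for the Avro container format, and the class and method names are hypothetical, not the PR's.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Sketch: tryAdvance reads and emits one record per call, so a large file is
// never materialised as a full list, and the input is closed at end of stream.
public class OneRecordSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final BufferedReader reader;

    OneRecordSpliterator(BufferedReader reader) {
        super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
        this.reader = reader;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        try {
            String record = reader.readLine();   // one record per call
            if (record == null) {
                reader.close();                  // end of stream: release the input
                return false;
            }
            action.accept(record);
            return true;
        } catch (java.io.IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new StringReader("rec1\nrec2\nrec3"));
        Stream<String> records = StreamSupport.stream(new OneRecordSpliterator(in), false);
        System.out.println(records.count()); // prints "3"
    }
}
```

Because the spliterator owns the reader, the stream no longer needs a try-with-resources block that would close the input before lazy consumption begins, which is the bug the commit describes.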
Move all the common source config so it can be re-used by other source
connectors.
* Adds two config fragments which are logical groupings for source
connector configuration
* Parquet is moved, with library changes to use the same Parquet version
as the sink connector.

---------

Signed-off-by: Aindriu Lavelle <aindriu.lavelle@aiven.io>
Signed-off-by: Aindriu Lavelle <aindriu.lavelle@aiven.io>
This update reworks the SourceRecordIterator to remove the
requirement for an S3Client, and S3-specific knowledge, from the iterator.

The iterator will now also call for more files after the initial set of
files has been processed.

The only remaining work is to remove the construction of the
S3Object into an iterator from the SourceRecordIterator, in a follow-up
PR, which will allow it to be completely re-usable.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Aindriu Lavelle <aindriu.lavelle@aiven.io>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Aindriu Lavelle <aindriu.lavelle@aiven.io>
KCON-36

If a certain number of records in a file were already processed during
DR, do not re-process those records.
The number of processed records is already stored in offset storage; retrieve
it and skip those records in the stream.
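The skip-already-processed logic can be sketched with `Stream.skip`. This is an illustrative stand-in, assuming the stored record count comes from Kafka Connect offset storage; `remaining` and the string records are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: the number of records already processed is read back from offset
// storage, and Stream.skip drops exactly that many before emitting the rest.
public class ResumeFromOffset {
    static List<String> remaining(Stream<String> recordsInFile, long alreadyProcessed) {
        return recordsInFile.skip(alreadyProcessed).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long storedOffset = 2L; // e.g. retrieved from offset storage for this object key
        List<String> left = remaining(Stream.of("r0", "r1", "r2", "r3"), storedOffset);
        System.out.println(left); // prints "[r2, r3]"
    }
}
```

Since the record stream is lazy (see the spliterator change above in this PR), `skip` advances past records without re-emitting them, rather than rebuilding a full list first.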
Add Errors tolerance configuration.

1) Allows configuration of the Connect framework feature and how it
should handle source records which are malformed or unable to be added
to a Kafka topic.
2) Also checks the RecordProcessor and whether a failed or malformed record
should be ignored or cause the failure of the connector.
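The tolerance decision described above can be sketched as a small branch. Kafka Connect's real config key is `errors.tolerance` with values `none`/`all`; everything else here (`ErrorsTolerance`, `handleRecord`, the boolean flag) is a hypothetical illustration, not the PR's RecordProcessor API.

```java
// Sketch: with tolerance ALL a malformed record is skipped and the task keeps
// running; with NONE it fails the connector task.
public class ToleranceSketch {
    enum ErrorsTolerance { NONE, ALL }

    /** Returns true if the record should be forwarded, false if it is skipped. */
    static boolean handleRecord(boolean malformed, ErrorsTolerance tolerance) {
        if (!malformed) {
            return true;
        }
        if (tolerance == ErrorsTolerance.ALL) {
            return false; // log and drop, keep the task alive
        }
        throw new RuntimeException("Malformed record with errors.tolerance=none");
    }

    public static void main(String[] args) {
        System.out.println(handleRecord(false, ErrorsTolerance.NONE)); // prints "true"
        System.out.println(handleRecord(true, ErrorsTolerance.ALL));   // prints "false"
    }
}
```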

---------

Signed-off-by: Aindriu Lavelle <aindriu.lavelle@aiven.io>
* This update means we can now use the PREFIX in the AWS API allowing
users to configure it to be more specific about what they want processed
by the connector.
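In the real change this filtering happens server-side, via the prefix parameter of the S3 ListObjectsV2 API, so unmatched keys are never returned at all. The stdlib-only sketch below merely simulates the selection effect a configured prefix has on which keys get processed; the key names and `configuredPrefix` are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: only keys under the configured prefix are considered by the connector.
public class PrefixFilter {
    public static void main(String[] args) {
        String configuredPrefix = "topics/orders/"; // hypothetical connector config value
        List<String> bucketKeys = Arrays.asList(
                "topics/orders/0.avro", "topics/orders/1.avro", "topics/users/0.avro");
        List<String> selected = bucketKeys.stream()
                .filter(key -> key.startsWith(configuredPrefix))
                .collect(Collectors.toList());
        System.out.println(selected); // prints "[topics/orders/0.avro, topics/orders/1.avro]"
    }
}
```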

---------

Signed-off-by: Aindriu Lavelle <aindriu.lavelle@aiven.io>
Fix for KCON-82

This PR fixes the bugs and simplifies the creation of a proper
transformer by relieving the developer of having to think about the
stream, so they only think about how to create the item.

Streaming Tests are included for all current Transformer
implementations.

Transformer is converted to an abstract class.

An abstract Transformer.StreamSpliterator class is created to handle the
common checking for end of stream and closing the input file(s).
KCON-25

Addresses
#316 (comment)

- Removed the converter instantiation
- For Avro, using AvroData utils to create SchemaAndValue
- For JSON, as there are no utils, relying on the JSON converter
- Deleted the transformation of data (serialization, toConnectData) in
transformers

With this change, the redundant transformation is removed, making it more
flexible for consumers.
Add Service Loader for quick start up

Signed-off-by: Aindriu Lavelle <aindriu.lavelle@aiven.io>
KCON-9: README on how to configure the AWS S3 source connector
The AWS 1.x SDK is in maintenance mode and will be out of support by
December 2025.

Key differences are:
* Use of the builder pattern when creating objects
* get and set removed from getters and setters, e.g. getKey(),
setKey(newKey) -> key(), key(newKey)
* S3Client is immutable
* Different package names
* Additional built-in functionality, removing some of the work from the
connector implementation and having the existing library handle it.

SDK 1.x is still in use by the sink connector, which will also need to be
updated in the future; until then the s3-commons code carries both the 1.x
and 2.x jars.
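The first two differences listed above (builder-built immutable objects, prefix-free accessors) can be mimicked without the SDK. The sketch below is stdlib-only and purely illustrative of the 2.x style; `ObjectRef` is an invented class, not an AWS type.

```java
// Sketch of the SDK 2.x API style: objects are created via an immutable
// builder, and accessors drop the get/set prefix (getKey()/setKey(k)
// become key()/key(k)).
public class BuilderStyle {
    static final class ObjectRef {
        private final String key;

        private ObjectRef(Builder b) {
            this.key = b.key;
        }

        String key() {                 // 2.x-style accessor: key(), not getKey()
            return key;
        }

        static Builder builder() {
            return new Builder();
        }

        static final class Builder {
            private String key;

            Builder key(String key) {  // 2.x-style setter returns the builder for chaining
                this.key = key;
                return this;
            }

            ObjectRef build() {
                return new ObjectRef(this); // resulting object is immutable
            }
        }
    }

    public static void main(String[] args) {
        ObjectRef ref = ObjectRef.builder().key("topics/orders/0.avro").build();
        System.out.println(ref.key()); // prints "topics/orders/0.avro"
    }
}
```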

---------

Signed-off-by: Aindriu Lavelle <aindriu.lavelle@aiven.io>