Apoursam/spark kafka input #1272

Open · eldernewborn wants to merge 12 commits into main
Conversation

@eldernewborn (Collaborator) commented Nov 1, 2024

[vpj] Spark input module to read from a Kafka topic and populate a dataframe

Introduces VenicePubsubSource, a Spark table provider backed by Kafka topics, which materializes the contents of a Venice Kafka topic as a DataFrame for use by Spark jobs. This is a foundational building block for KIF functionality (repushing data from a Kafka topic source) as well as the data consistency checker.

The implementation offers some niceties, such as a splitter that breaks the work into chunks for better parallelism. The splitter considers the available start and end offsets in each topic partition, to account for message TTL and compaction.
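For illustration, here is a minimal sketch of how a table provider like this is typically consumed from Spark. The fully-qualified class path, option key, and topic name are assumptions for the example, not the PR's actual constants; the PR wires its real configuration through VeniceProperties.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class VenicePubsubReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("kif-sketch").getOrCreate();

    // Hypothetical invocation: class path and option key are illustrative only.
    Dataset<Row> df = spark.read()
        .format("com.linkedin.venice.spark.input.pubsub.table.VenicePubsubSource")
        .option("kafka.input.topic", "store_v1")
        .load();

    df.printSchema(); // expected: binary key/value columns per the table schema
  }
}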

Future work:
Add support for chunked messages and compressed messages.

How was this PR tested?

Unit tests, integration tests, and manual testing.

Does this PR introduce any user-facing changes?

No.

@eldernewborn marked this pull request as ready for review November 13, 2024 16:50
@@ -20,6 +20,16 @@ public class SparkConstants {
new StructField[] { new StructField(KEY_COLUMN_NAME, BinaryType, false, Metadata.empty()),
new StructField(VALUE_COLUMN_NAME, BinaryType, true, Metadata.empty()) });

public static final StructType KAFKA_INPUT_TABLE_SCHEMA = new StructType(
Contributor:
Would we need to version this at all?

Collaborator (Author):

My plan is to keep this as-is for the most part (after getting a few runs in EI). For the consistency checker, I'll have an extended form of the table schema that contains the offset vectors and all the goodies prebaked and present for the consistency checker. If there are any specifics you think we need for this (and the ETL functionality), please advise.

Contributor:

I guess what I'm wondering is: do we need to be concerned if this struct type changes? My initial thinking is that it's probably OK, since this is used to generate new datasets, not to process old ones.
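For reference, a hedged sketch of what an extended schema of the kind discussed above could look like. The column names and types here are illustrative assumptions, not the PR's actual KAFKA_INPUT_TABLE_SCHEMA.

import static org.apache.spark.sql.types.DataTypes.BinaryType;
import static org.apache.spark.sql.types.DataTypes.IntegerType;
import static org.apache.spark.sql.types.DataTypes.LongType;

import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class KafkaInputSchemaSketch {
  // Illustrative only: raw key/value bytes plus the pub-sub coordinates
  // (partition, offset) that a consistency checker would need for ordering.
  public static final StructType EXAMPLE_KAFKA_INPUT_TABLE_SCHEMA = new StructType(
      new StructField[] { new StructField("key", BinaryType, false, Metadata.empty()),
          new StructField("value", BinaryType, true, Metadata.empty()),
          new StructField("__partition__", IntegerType, false, Metadata.empty()),
          new StructField("__offset__", LongType, false, Metadata.empty()) });
}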



public class PartitionSplitters {
// need a method called fullPartitionSplitter, takes in list of partition start and end offsets
Contributor:
Why not use proper javadoc style comments?
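For example, the note could be promoted to a javadoc block along these lines. The wording of the description is inferred from the existing comment, and the wrapper class and empty body are placeholders:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionSplittersDocSketch {
  /**
   * Splits each partition's full [startOffset, endOffset] range into the chunks
   * that back the VenicePubsubInputPartition splits.
   *
   * @param partitionOffsetsMap map from partition number to a two-element list
   *        holding that partition's start and end offsets
   * @return map from partition number to its list of [startOffset, endOffset] splits
   */
  public static Map<Integer, List<List<Long>>> fullPartitionSplitter(Map<Integer, List<Long>> partitionOffsetsMap) {
    return new HashMap<>(); // body elided; this sketch is about the javadoc shape
  }
}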

public class PartitionSplitters {
// need a method called fullPartitionSplitter, takes in list of partition start and end offsets
// and returns a list of VenicePubsubInputPartition splits
public static Map<Integer, List<List<Long>>> fullPartitionSplitter(Map<Integer, List<Long>> partitionOffsetsMap) {
Contributor:
It looks like there's an assumption that the list in each entry of Map<Integer, List<Long>> partitionOffsetsMap must have exactly two elements, a start and an end. I'm not sure I see yet in this PR how this method is used, but between this and assembleSegment, an interface which uses a Pair might make more sense?

Or is there a flexibility gain in using a List that I'm missing?
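A minimal sketch of the Pair-based shape being suggested, assuming Apache Commons Lang's Pair; the wrapper class and trivial body are hypothetical:

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.lang3.tuple.Pair;

public class PairBasedSplitterSketch {
  // Hypothetical Pair-based variant of the interface suggested above: each
  // partition maps to an explicit (startOffset, endOffset) pair, so the
  // two-element arity is enforced by the type system instead of by convention.
  public static Map<Integer, List<Pair<Long, Long>>> fullPartitionSplitter(
      Map<Integer, Pair<Long, Long>> partitionOffsetsMap) {
    Map<Integer, List<Pair<Long, Long>>> splits = new HashMap<>();
    partitionOffsetsMap
        .forEach((partition, range) -> splits.put(partition, Collections.singletonList(range)));
    return splits;
  }
}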

return segment;
}

static long computeIntendedSplitLengthBasedOnCount(Map<Integer, List<Long>> partitionOffsetsMap, int totalSegments) {
Contributor:
What is the semantic difference between offsets and segments?
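For context, a sketch of what a count-based split-length computation conventionally looks like; the body below is an assumption inferred from the method name, not the PR's actual implementation. Under this reading, an offset is a position in a partition's log, while a segment is a run of consecutive offsets served by one input split.

import java.util.List;
import java.util.Map;

public class SplitLengthSketch {
  // Assumed behavior: sum the offset ranges (end - start) across partitions and
  // divide by the desired segment count, rounding up so no segment needs to be
  // longer than the intended length.
  static long computeIntendedSplitLengthBasedOnCount(
      Map<Integer, List<Long>> partitionOffsetsMap, int totalSegments) {
    long totalOffsetCount = 0;
    for (List<Long> range: partitionOffsetsMap.values()) {
      totalOffsetCount += range.get(1) - range.get(0); // [startOffset, endOffset]
    }
    return (totalOffsetCount + totalSegments - 1) / totalSegments; // ceiling division
  }
}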

properties.putAll(configs);
// the properties here is the entry point for all the configurations
// we receive from the outer layer.
// schem and partitioning are useless and should be discarded?
Contributor:
I think these were notes?

Properties properties = jobConfig.getPropertiesCopy();
properties.putAll(options.asCaseSensitiveMap());

return new VenicePubsubInputScanBuilder(new VeniceProperties(properties)); // should we flip this to
Contributor:
It looks like this comment isn't necessary?

PubSubTopicRepository pubSubTopicRepository = new PubSubTopicRepository();
PubSubTopic pubSubTopic = pubSubTopicRepository.getTopic(topicName);
PubSubClientsFactory clientsFactory = new PubSubClientsFactory(jobConfig);
// PubSubAdminAdapter pubsubAdminClient =
Contributor:
Let's clean these out
