EPIC: Data ingestion APIs #5222

Open
devinrsmith opened this issue Mar 5, 2024 · 1 comment

This epic covers a wide variety of related topics around data APIs to make it easier to ingest data into the DH system.

At the lowest level, DH provides StreamPublisher / StreamConsumer constructions. They are both chunk-based APIs. The two most relevant stream consumers are StreamToBlinkTableAdapter (to turn a stream into a Table) and a DHE construction (to ingest a stream into the DIS).
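
For orientation, here is a minimal sketch of that chunk-based handshake. The interfaces below are simplified stand-ins for the io.deephaven.stream types (which exchange Deephaven chunk structures and carry more lifecycle methods), so treat every name and signature as illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the StreamPublisher / StreamConsumer pattern
// described above. The real interfaces exchange Deephaven WritableChunks;
// plain long[] columns are used here purely for illustration.
interface StreamConsumerSketch {
    void accept(List<long[]> columns); // one chunk per output column
}

interface StreamPublisherSketch {
    void register(StreamConsumerSketch consumer);
    void flush(); // deliver buffered data to the registered consumer
}

// A toy publisher that buffers values and hands them off column-by-column.
final class TimestampPublisher implements StreamPublisherSketch {
    private StreamConsumerSketch consumer;
    private final List<Long> buffered = new ArrayList<>();

    @Override
    public void register(StreamConsumerSketch consumer) {
        this.consumer = consumer;
    }

    public void add(long timestampNanos) {
        buffered.add(timestampNanos);
    }

    @Override
    public void flush() {
        // Assumes register(...) was called first, as in the real handshake.
        final long[] column = buffered.stream().mapToLong(Long::longValue).toArray();
        buffered.clear();
        consumer.accept(List.of(column));
    }
}
```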

The stream publishers encapsulate the logic specific to the data source, which often includes details about the stream and the underlying data deserialization; for example, Kafka as the stream paired with Avro, Protobuf, or JSON as the serialization format. Right now, Kafka is tightly coupled with the deserialization logic, and part of the goal of this epic is to improve that situation.

ObjectProcessor is one such data ingestion API, introduced to help improve the situation (#4346). ObjectProcessor is a 1-to-1 API (1 record in, 1 row out) that decouples stream-specific logic from deserialization logic.
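
To make the 1-to-1 contract concrete, here is an illustrative sketch. It mirrors the shape of ObjectProcessor (declared output column types, batch processing), but substitutes plain Java collections for Deephaven's chunk types; the names and signatures are not the actual API:

```java
import java.util.List;

// Illustrative 1-to-1 processor: each input record appends exactly one value
// to each typed output column. The real ObjectProcessor writes into
// WritableChunks; List<Object> columns stand in for them here.
interface RecordProcessorSketch<T> {
    List<Class<?>> outputTypes(); // column types this processor emits, in order

    void processAll(List<? extends T> in, List<List<Object>> out);
}

// Example: split a "symbol,price" line into a String column and a double column.
final class CsvQuoteProcessor implements RecordProcessorSketch<String> {
    @Override
    public List<Class<?>> outputTypes() {
        return List.of(String.class, double.class);
    }

    @Override
    public void processAll(List<? extends String> in, List<List<Object>> out) {
        for (String line : in) {
            final String[] parts = line.split(",", 2);
            out.get(0).add(parts[0]);                     // symbol
            out.get(1).add(Double.parseDouble(parts[1])); // price
        }
    }
}
```

Note how nothing in this contract knows where the records came from: the same processor works whether the payloads arrived via Kafka, a file, or a socket, which is the decoupling described above.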

Similar APIs are being built right now that generalize ObjectProcessor further (a rough sketch of the first two follows the list):

  • 1-to-(0|1) (filtering: 1 record in, 0 or 1 rows out)
  • 1-to-n (multi-row: 1 record in, N rows out)
  • type discrimination: record goes to specific stream consumer based on type
  • routing (generalization of type discrimination): record goes to multiple stream consumers (based on configuration)
  • composability and transactionality of above
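
As a thought experiment, the first two bullets can be captured by letting the processor decide how many rows each record appends; the interface below is hypothetical and does not represent the in-progress APIs:

```java
import java.util.List;

// Hypothetical generalization of the 1-to-1 contract (not the in-progress
// API): a record may append zero rows (filtering), one row (the
// ObjectProcessor case), or many rows (multi-row expansion).
interface RowExpandingProcessorSketch<T> {
    List<Class<?>> outputTypes();

    // Returns the number of rows appended for this record: 0 filters the
    // record out, 1 matches ObjectProcessor, >1 expands it into many rows.
    int process(T record, List<List<Object>> out);
}
```

Type discrimination and routing would then layer on top: a dispatcher inspects each record and forwards it to one (or, for routing, several) such processors.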
devinrsmith added the triage and epic Feature Epic (User Story) labels Mar 5, 2024
devinrsmith added this to the 3. Triage milestone Mar 5, 2024
devinrsmith self-assigned this Mar 5, 2024
@devinrsmith (Member, Author) commented:

#2753

devinrsmith added a commit to devinrsmith/deephaven-core that referenced this issue Apr 11, 2024
This PR adds a declarative JSON configuration object that allows users to specify the schema of a JSON message. It is meant to have good out-of-the-box defaults, while still allowing power users to modify some of the finer parsing details (should this int field be parseable from a string? should null values be allowed? what if a field is missing? etc.). The JSON configuration layer is not tied to any specific implementation; it is introspectable, and could have alternative implementations with other parsing backends. (I could imagine a DHE use case with code generation based on the JSON configuration, somewhat like the DHE Avro ObjectProcessor code generator.)
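
For a feel of what the declarative layer might look like, here is a toy model of a per-field spec; every name and shape below is invented for illustration and does not match the actual configuration objects:

```java
import java.util.List;

// A toy model of a declarative per-field JSON spec, invented for illustration;
// the actual configuration objects are richer and named differently.
public class JsonSpecSketch {
    record FieldSpec(String name, Class<?> type, boolean allowString,
            boolean allowNull, boolean allowMissing) {}

    static final List<FieldSpec> QUOTE_SCHEMA = List.of(
            new FieldSpec("symbol", String.class, false, false, false),
            new FieldSpec("price", double.class, true, false, false), // "1.23" parses as 1.23
            new FieldSpec("size", int.class, false, true, true));     // may be null or absent
}
```

Because a structure like this is plain data rather than parsing code, any backend (Jackson streaming, code generation, something else) can walk it, which is the introspectability point above.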

Out of the box, there's an ObjectProcessor implementation based on the Jackson streaming APIs; that is, the data flows from byte[]s (or InputStream, relevant for very large files) to the output WritableChunks without the intermediate Jackson databind layer (TreeNode). This avoids a large layer of allocation that our current Kafka json_spec layer relies upon. Because it is exposed as an ObjectProcessor, it can be used anywhere that accepts ObjectProcessor and wants 1-to-1 record-to-row processing (currently, Kafka).
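
Here is a self-contained sketch of that streaming style using Jackson's streaming API directly (JsonFactory/JsonParser): values flow from raw bytes to primitives with no TreeNode in between. The hard-coded field handling below stands in for what the described implementation derives from the declarative configuration:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.nio.charset.StandardCharsets;

// Streaming parse straight from bytes to primitive values; no databind layer.
// Field handling is hard-coded for brevity; the PR derives it from the
// declarative JSON configuration instead.
public class StreamingJsonSketch {
    public static void main(String[] args) throws Exception {
        byte[] msg = "{\"symbol\":\"AAPL\",\"price\":187.5}".getBytes(StandardCharsets.UTF_8);
        try (JsonParser p = new JsonFactory().createParser(msg)) {
            String symbol = null;
            double price = Double.NaN;
            while (p.nextToken() != JsonToken.END_OBJECT) {
                if (p.currentToken() != JsonToken.FIELD_NAME) {
                    continue; // skip START_OBJECT, value tokens already consumed
                }
                String name = p.currentName();
                p.nextToken(); // advance to the value token
                switch (name) {
                    case "symbol" -> symbol = p.getText();
                    case "price" -> price = p.getDoubleValue();
                    default -> p.skipChildren(); // ignore unexpected fields
                }
            }
            System.out.println(symbol + " @ " + price); // AAPL @ 187.5
        }
    }
}
```

In the real implementation the values would be appended to output WritableChunks rather than local variables, but the allocation story is the same: no per-message tree of nodes.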

Part of deephaven#5222
devinrsmith added a commit that referenced this issue Jun 18, 2024