Releases: Azure/azure-event-hubs-spark
v2.3.5
Release notes:
- IoT Hub system properties are available (see the sketch below)
- Documentation updates
- Unit test improvements
- Various bug fixes. Since the addition of the cached event hub receiver in 2.3.3, there have been some bugs affecting the reliability of the connector. With this latest release, all known bugs have been fixed.
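As a minimal sketch of reading those system properties in Structured Streaming: this assumes a SparkSession named `spark` is in scope (as in spark-shell), uses a placeholder connection string, and assumes IoT Hub populates the `systemProperties` map column with keys such as `iothub-connection-device-id`.

```scala
import org.apache.spark.eventhubs.EventHubsConf
import spark.implicits._

// Placeholder connection string for an IoT Hub's Event Hubs-compatible endpoint.
val ehConf = EventHubsConf("Endpoint=sb://...;EntityPath=...")

val df = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

// systemProperties is a map-typed column; for IoT Hub it carries entries such
// as the sending device's id (the key name below is assumed, not guaranteed).
val deviceEvents = df.select(
  $"body".cast("string").as("body"),
  $"systemProperties".getItem("iothub-connection-device-id").as("deviceId")
)
```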
v2.2.5
This is identical to v2.3.5. It's only compatible with Spark 2.1 and Spark 2.2 (whereas v2.3.5 is compatible with Spark 2.3 and up).
v2.3.4
This adds a critical bug fix to 2.3.3. We recommend everyone upgrade as soon as possible. Release notes:
- Improved exception messaging in the cached receiver
- Fixed a bug in the cached receiver
v2.2.4
This is identical to v2.3.4. It's only compatible with Spark 2.1 and Spark 2.2 (whereas v2.3.4 is compatible with Spark 2.3 and up).
v2.2.3
This is identical to v2.3.3. It's only compatible with Spark 2.1 and Spark 2.2.
v2.3.3
This release focused on bug fixes, refactoring, and cleanup. Release notes:
- Cached receiver is now async
- Various bug fixes in schema type handling, receive calls, and the client
- Properties can now be added to events in the EventHubsSink
- Refactoring of send calls and management calls
v2.2.2
This is identical to v2.3.2. It's only compatible with Spark 2.1 and Spark 2.2.
v2.3.2
Version 2.3.2 of azure-eventhubs-spark_2.11. Release notes:
- Bug fixes
  - Fixed data loss check in `getBatch`
  - Excess warnings are no longer printed
  - Prefetch count can no longer be set below the minimum allowed value
  - Invalid offsets are detected in `translate`
- Enhancements
  - Cached receivers are used. This allows receivers to be reused across batches, which dramatically improves receive times. This change required a move to epoch receivers, which means each Spark application requires its own consumer group in order to run properly (see the sketch after these notes).
  - `properties` has been added to the Structured Streaming schema
  - `partition` has been added to the Structured Streaming schema (both new columns appear in the sketch after these notes)
  - Check for old `fromSeqNos` in DStreams. If a DStream falls behind (e.g. events expire from the Event Hub before they are consumed by Spark), then Spark will move to the earliest valid event. Previously, the job would fail in this case.
  - Added a retry mechanism for certain exceptions thrown in `getRunTimeInfo`. Exceptions are retried if the exception (or the inner exception) is an `EventHubException` and `getIsTransient` returns `true`.
  - All service API calls are now done asynchronously
  - A `ForeachWriter` implementation has been added. This `ForeachWriter` uses asynchronous sends and performs much better than the existing `EventHubsSink`.
  - `translate` has been optimized
  - Javadocs have been added
  - Only the necessary configs are sent to executors (the unneeded ones are trimmed by the driver before they're sent over)
  - Connection string validation added in `EventHubsConf`
  - Improved error messaging
  - All singletons (`ClientConnectionPool` and `CachedEventHubsReceiver`) use globally valid keys
- Various documentation updates
- Various unit tests have been added
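A minimal sketch tying these notes together, assuming a SparkSession named `spark` is in scope, a placeholder connection string, and a hypothetical per-application consumer group name:

```scala
import org.apache.spark.eventhubs.EventHubsConf

// With cached (epoch) receivers, give each Spark application its own
// consumer group. Connection string and group name are placeholders.
val ehConf = EventHubsConf("Endpoint=sb://...;EntityPath=...")
  .setConsumerGroup("spark-app-1")

val stream = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

import spark.implicits._

// The streaming schema now includes the source partition and the event's
// application properties (a map-typed column).
val enriched = stream.select($"body".cast("string"), $"partition", $"properties")
```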
v2.3.1
Release notes:
- Translate method optimization for quicker startup
- Default starting position is EndOfStream
- Added Receiver Identifier for improved tracing
- Moved to latest Event Hubs Java Client
- Various bug fixes
This version is only compatible with Spark 2.3. Please see the Latest Releases
section on the README to see which version of the library works with your version of Spark 👍
v2.3.0
This marks the first release of the rewritten connector. The major changes are listed below:
- No more progress tracker. Instead, use "checkpointLocation" in Structured Streaming or "checkpoint" in Spark Streaming. For Spark Streaming, see "Storing Offsets" in the integration guide. (Checkpointing appears in the first sketch after this list.)
- Switched to sequence number filtering internally. This allows us to know the start and end point of each batch deterministically: a batch of 1000 events starting at sequence number X simply ends at X + 1000, whereas the end point of 1000 events starting at byte offset Y cannot be known a priori. This (plus the progress tracker being removed) allows us to have concurrent jobs.
- Parallelized all API calls
- Connection pooling of EventHub clients
- Thread pooling per EventHub client
- Added EventHubsSink
- Added EventHubsRelation (batch style query)
- Preferred location. Spark now consistently schedules partitions on the same executors across batches. This will allow us to use a prefetch queue across batches in future releases.
- Spark 2.3 support
- Databricks (and Azure Databricks) support
- Added Spark core support
- Moved to EventHubs Java Client 1.0.0
- Java support in Spark Streaming
- Allow users to manage their own offsets with the HasOffsetRanges trait. See the integration guide for details, and the second sketch after this list!
- Per-partition configuration for starting positions, ending positions, and max rates (shown in the first sketch after this list)
- Users can start their jobs from START_OF_STREAM and END_OF_STREAM
- EventHubs receiver timeout and operation timeout are now configurable
- Non-public and international clouds are properly supported
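A configuration sketch for the items above, assuming a SparkSession named `spark` is in scope; the connection string, hub name, sequence number, and paths are placeholders:

```scala
import org.apache.spark.eventhubs.{ EventHubsConf, EventPosition, NameAndPartition }

val ehConf = EventHubsConf("Endpoint=sb://...;EntityPath=...")
  .setStartingPosition(EventPosition.fromStartOfStream) // or fromEndOfStream
  .setStartingPositions(Map(
    // Per-partition override: partition 0 of the (hypothetical) hub "my-hub"
    // starts from sequence number 4200.
    NameAndPartition("my-hub", 0) -> EventPosition.fromSequenceNumber(4200L)
  ))
  .setMaxRatePerPartition(1000) // cap on events per partition per batch

val query = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()
  .writeStream
  .format("parquet")
  .option("path", "/tmp/eventhubs-output")
  .option("checkpointLocation", "/tmp/eventhubs-checkpoint") // replaces the old progress tracker
  .start()
```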
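And a Spark Streaming sketch of managing offsets with HasOffsetRanges, assuming an existing SparkContext `sc`, a placeholder connection string, and that the offset range fields follow the pattern in the integration guide (the field names here are assumptions):

```scala
import org.apache.spark.eventhubs.{ EventHubsConf, EventHubsUtils }
import org.apache.spark.eventhubs.rdd.HasOffsetRanges
import org.apache.spark.streaming.{ Seconds, StreamingContext }

val ssc = new StreamingContext(sc, Seconds(5))
val ehConf = EventHubsConf("Endpoint=sb://...;EntityPath=...")

val stream = EventHubsUtils.createDirectStream(ssc, ehConf)

stream.foreachRDD { rdd =>
  // The underlying RDD implements HasOffsetRanges; cast to read the
  // per-partition sequence-number ranges and persist them yourself.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.nameAndPartition}: ${r.fromSeqNo} -> ${r.untilSeqNo}")
  }
}
```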
Additionally, the repo was improved:
- Documentation rewrite
- README is revamped