Search for large data sources for performance profiling #975
aaronsteers
started this conversation in
Ideas
Replies: 2 comments 7 replies
-
Something in the 20 GB range; show extraction record by record, with SDK-based batching, and with native batching on both sides. This seems like a good resource: https://www.reddit.com/r/bigquery/wiki/datasets
5 replies
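As a rough illustration of what such a benchmark could measure, here is a minimal sketch comparing record-by-record writes against batched writes. The function names and the in-memory `sink` are hypothetical stand-ins for a real tap/target pair, not anything from the SDK itself:

```python
import time

def load_per_record(records, sink):
    """Write one record at a time (simulates record-by-record extraction)."""
    for rec in records:
        sink.append(rec)

def load_batched(records, sink, batch_size=10_000):
    """Accumulate records and flush in batches (simulates SDK-side batching)."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            sink.extend(batch)
            batch.clear()
    if batch:
        sink.extend(batch)

def timed(fn, records):
    """Run a loader against a fresh sink and return (record_count, seconds)."""
    sink = []
    start = time.perf_counter()
    fn(records, sink)
    return len(sink), time.perf_counter() - start

records = [{"id": i} for i in range(100_000)]
n1, t1 = timed(load_per_record, records)
n2, t2 = timed(load_batched, records)
```

With a real 20 GB source the interesting numbers would be `t1` vs. `t2` per loading strategy, not the in-memory timings here.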
-
Would getting good metrics logging in the SDK help with this? I imagine it would at least make it possible to compare record-by-record vs. batch extraction by looking at a record-count timeseries in e.g. Prometheus. Backpressure, for example, would become apparent.
2 replies
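To make the metrics idea concrete, here is a minimal sketch of a per-stream record counter. The `RecordCounter` class and `emit_record` hook are hypothetical; with the real `prometheus_client` library you would use a labeled `Counter` and scrape it, and backpressure would show up as a flattening rate in the resulting timeseries:

```python
from collections import defaultdict

class RecordCounter:
    """Minimal in-process stand-in for a Prometheus counter, labeled by stream."""
    def __init__(self):
        self.counts = defaultdict(int)

    def inc(self, stream):
        self.counts[stream] += 1

counter = RecordCounter()

def emit_record(stream, record, counter=counter):
    # ... serialize the record as a Singer RECORD message here ...
    counter.inc(stream)

for i in range(1000):
    emit_record("users", {"id": i})
```

Sampling `counter.counts` on an interval gives the record-count timeseries described above.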
-
Wanted to open this discussion to locate large data sources for performance benchmarking.
Related to:
A few known options to kick off the discussion:
- tap-socrata to access one or more large datasets in the public domain.
- tap-smoke-test to fabricate large randomized datasets on-demand.
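The tap-smoke-test approach can be sketched with a plain generator: fabricate records on demand rather than materializing a large source file. This is a simplified illustration in the same spirit, not tap-smoke-test's actual implementation; the field names are made up:

```python
import random
import string

def fake_records(n, seed=42):
    """Yield n deterministic pseudo-random records, one at a time,
    so arbitrarily large datasets never need to fit in memory."""
    rng = random.Random(seed)
    for i in range(n):
        yield {
            "id": i,
            "name": "".join(rng.choices(string.ascii_lowercase, k=12)),
            "amount": round(rng.uniform(0, 10_000), 2),
        }

# Stream a sample through without holding the full dataset in memory:
sample = list(fake_records(10))
```

Because the generator is seeded, repeated benchmark runs see identical data, which keeps timings comparable across SDK versions.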