Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[native] Add TPC-DS connector #23067

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

pdabre12
Copy link
Contributor

@pdabre12 pdabre12 commented Jun 24, 2024

Description

Adds the native TPC-DS connector and the codegen (dsdgen) files. The TPC-DS connector will be a wrapper for the data generator (dsdgen) provided by the TPC organization, originally implemented in C. This implementation must mimic the behavior of the original C implementation. DuckDB has a TPC-DS connector that uses C++ files to wrap the C implementation, and we will use these C++ files for our implementation.

Motivation and Context

The goal is to add a TPC-DS connector to enhance our ability to write end-to-end tests for Prestissimo and conduct micro-benchmarks in Velox. This addition will significantly improve our testing capabilities by ensuring comprehensive coverage and performance validation.

Impact

With this change, we can now write e2e tests and microbenchmarks.

Test Plan

Native end-to-end tests.
Future enhancements will include adding SpeedTest and ConnectorTest to the Velox repository.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

TPC-DS Connector Changes
* Add TPC-DS connector for Presto C++. :pr:`23067`

Resolves: #22361

Copy link

linux-foundation-easycla bot commented Jun 24, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@pdabre12 pdabre12 changed the title [native] Adds the TPC-DS connector [WIP] [native] Adds the TPC-DS connector Jul 9, 2024
@steveburnett
Copy link
Contributor

Consider updating the Use Cases in the Presto C++ section of the Presto doc to add TPC-DS to the supported connectors.

@pdabre12 pdabre12 force-pushed the tpcds-connector branch 2 times, most recently from 64ab39e to f08bd87 Compare August 2, 2024 23:49
@steveburnett
Copy link
Contributor

Consider a release note entry similar to the following:

== RELEASE NOTES ==

TPC-DS Connector Changes
* Add TPC-DS connector for Presto C++. :pr:`23067`

steveburnett
steveburnett previously approved these changes Aug 7, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local docs build, looks good. Thanks!

@pramodsatya pramodsatya changed the title [WIP] [native] Adds the TPC-DS connector [native] Adds the TPC-DS connector Aug 7, 2024
@pdabre12 pdabre12 force-pushed the tpcds-connector branch 2 times, most recently from 6d9a0c5 to e69c19d Compare August 9, 2024 20:30
@pdabre12
Copy link
Contributor Author

pdabre12 commented Aug 9, 2024

@aditi-pandit @yingsu00
This PR is now ready for review. Please take a look.
Thanks.

@pdabre12 pdabre12 marked this pull request as ready for review August 9, 2024 20:32
@pdabre12 pdabre12 requested review from elharo and a team as code owners August 9, 2024 20:32
@pdabre12 pdabre12 requested a review from presto-oss August 9, 2024 20:32
@pdabre12
Copy link
Contributor Author

pdabre12 commented Aug 9, 2024

FYI: @czentgr

@pdabre12 pdabre12 requested a review from czentgr August 9, 2024 20:42
Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pramodsatya and @pdabre12.

Have done a quick round of review.

@majetideepak majetideepak changed the title [native] Adds the TPC-DS connector [native] Add TPC-DS connector Sep 6, 2024
Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the dsdgen should be moved to the external folder.

@pdabre12 pdabre12 force-pushed the tpcds-connector branch 2 times, most recently from 47e2545 to cc187f0 Compare September 6, 2024 19:54
Copy link
Contributor

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intiial comments. There might be more but to get you going.

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdabre12 Some comments. I think we can merge the external portion first in a separate commit as they are copied externally. That makes it easy to review the new bits. I see that it is already a separate commit.

@aditi-pandit, @tdcmeehan I think we should have a RFC for this work. What do you think? The RFC can be simple.

* TPCH connector, with ``tpch.naming=standard`` catalog property.

* TPCDS connector.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a column naming issue for TPCDS similar to TPCH?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not aware of the column naming issue for TPCH, can you please elaborate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TPCH columns have a prefix. E.g. for the lineitem table the columns have l_ as prefix.

CREATE TABLE tpch.tiny.lineitem ( "l_orderkey" bigint NOT NULL, "l_partkey" bigint NOT NULL, "l_suppkey" bigint NOT NULL, "l_linenumber" integer NOT NULL, "l_quantity" double NOT NULL, "l_extendedprice" double NOT NULL, "l_discount" double NOT NULL, "l_tax" double NOT NULL, "l_returnflag" varchar(1) NOT NULL, "l_linestatus" varchar(1) NOT NULL, "l_shipdate" date NOT NULL, "l_commitdate" date NOT NULL, "l_receiptdate" date NOT NULL, "l_shipinstruct" varchar(25) NOT NULL, "l_shipmode" varchar(10) NOT NULL, "l_comment" varchar(44) NOT NULL )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the TPCDS column does not have a prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong, actually it does have a prefix.

SELECT cc_call_center_sk, cc_name, cc_manager, cc_mkt_id, cc_mkt_class FROM call_center

#pragma once

#include "velox/common/memory/Memory.h"
#include "velox/vector/ComplexVector.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this header? Can we forward declare?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Should be possible to forward declare RowVectorPtr.

Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdabre12, @pramodsatya : Have a first round of comments.

// TpcdsTransactionHandle is special since
// the corresponding class in Java is an enum.

namespace facebook::presto::protocol {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add API's to PrestoToVeloxConnector to handle transaction handles as well ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, could this be taken up in a separate PR? (since it affects transaction handles for all connectors)

vector_size_t parallel,
vector_size_t child);

void initializeTable(const std::vector<VectorPtr>& children, int table);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comments for each of the methods in this class as this is the final interface seen over dsdgen

@aditi-pandit
Copy link
Contributor

@pdabre12 Some comments. I think we can merge the external portion first in a separate commit as they are copied externally. That makes it easy to review the new bits. I see that it is already a separate commit.

@aditi-pandit, @tdcmeehan I think we should have a RFC for this work. What do you think? The RFC can be simple.

@pdabre12 @majetideepak : Small RFC is okay. Let's bring this to Native worker working group next session so that it gets immediate attention from others.

steveburnett
steveburnett previously approved these changes Sep 24, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull branch, local docs build.

@pramodsatya pramodsatya force-pushed the tpcds-connector branch 3 times, most recently from fb56a83 to 09a13d0 Compare October 8, 2024 21:02
@pdabre12 pdabre12 force-pushed the tpcds-connector branch 2 times, most recently from a373413 to ce1e567 Compare October 14, 2024 21:55
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pramodsatya @pdabre12 : Please add a README for this folder describing the origin of this code and if you have made any changes to the original dsdgen for the Prestissimo connector.

Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments on the first two commits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test to validate the schema is generated correctly with this config

public TpcdsMetadata()
private final boolean toggleCharToVarchar;

public TpcdsMetadata(boolean toggleCharToVarchar)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can retain the old constructor and have it set toggleCharToVarchar as false, so that it can be used by the Java connector. Then you will not need to change tests.

@@ -57,7 +57,7 @@ public TpcdsRecordSet(Results results, List<Column> columns)
this.columns = ImmutableList.copyOf(columns);
ImmutableList.Builder<Type> columnTypes = ImmutableList.builder();
for (Column column : columns) {
columnTypes.add(getPrestoType(column.getType()));
columnTypes.add(getPrestoType(column.getType(), false));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment that this code is only used by the Java connector which doesn't allow this option.

@@ -109,6 +110,16 @@ private int getSplitsPerNode(Map<String, String> properties)
}
}

private boolean toggleCharToVarchar(Map<String, String> properties)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this option is only valid in Native code as the Java RecordSet always uses false for it.

If we expect that the only use-case for this property is for native code temporarily and should be removed after Prestissimo starts supporting char type, then we can remove this config and just generate the TableMetadata correctly depending on isNativeExecutionEnabled().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is odd for users to get different results (types) based on the worker type. Making this an option makes it explicit for the user and allows them to make adjustments to their pipelines.

Copy link
Contributor

@aditi-pandit aditi-pandit Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TPC-DS standard specifies the column as char and Presto Java supports it correctly as well. We are kind of regressing by allowing the option to use varchar in the schema instead.

The Native engine is working around its limitation of types and using varchar for these fields. I feel this is very explicit knowledge for tools/pipelines using Native engine. They sign up for this behavior when they decide to use the tpc-ds connector with native engine. Though we could make it more explicit if they are forced to set a config to start using the "modified" tpc-ds connector.

@pdabre12 : Do we have a notion of native only/java only session config with side-car enabled ? This is a good candidate for a native only config I feel.

@majetideepak : Do you feel that the varchar/char(n) work will come along in the near term ? If yes, then we can use the correct type in the TPC-DS connector from the beginning itself.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are kind of regressing by allowing the option to use varchar in the schema instead.

We can simply throw an error for Java that varchar is not supported. Don't have to add support. Similarly, we will throw an error for Native that char is not supported until we support it.

They sign up for this behavior when they decide to use the tpc-ds connector with native engine.

They will only know this by reading the documentation or realize this after generating some TPC-DS data. The config will give them an advance notice.

Do you feel that the varchar/char(n) work will come along in the near term

This work is getting closer to completion. It does make sense to wait for this to complete instead of adding workarounds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aditi-pandit We should be able to do that, it is still an ongoing effort : #23968

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simply throw an error for Java that varchar is not supported. Don't have to add support. Similarly, we will throw an error for Native that char is not supported until we support it.

This would mean throwing for almost all the tables because most of them have a column with char data type..

@@ -269,6 +275,10 @@ void PrestoServer::run() {
std::make_unique<SystemPrestoToVeloxConnector>("system"));
registerPrestoToVeloxConnector(
std::make_unique<SystemPrestoToVeloxConnector>("$system@system"));
#ifdef PRESTO_ENABLE_TPCDS_CONNECTOR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use this flag for tpcds connector ? Why not always build and enable it by default like TpchConnector ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TPC-H connector option for Velox is present in the Velox CMakeLists.txt.
VELOX_ENABLE_TPCH_CONNECTOR
Since it is ON by default in Velox, it works for Prestissimo. We must fix this.
Having these options reduces build time and avoids dependencies in other cases. We can get a light worker and these options don't hurt.
The ifdef can go to the TPC-DS implementation and keep the usage here clean. See how Parquet and storage adapters are handled.

Map<String, String> tpcdsProperties = ImmutableMap.<String, String>builder()
.put("tpcds.toggle-char-to-varchar", "true")
.build();
queryRunner.createCatalog("tpcds", "tpcds", tpcdsProperties);
Copy link
Contributor

@aditi-pandit aditi-pandit Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires PRESTO_ENABLE_TPCDS_CONNECTOR to always be set. I feel we should remove that compilation option and always build and use TpcdsConnector.

Can you check the Java behavior ? We should make native and java consistent.

for (int i = 0; i < splitsPerNode; i++) {
splits.add(new TpcdsSplit(tableHandle, partNumber, totalParts, ImmutableList.of(node.getHostAndPort()), noSexism));
partNumber++;
if (smallTables.contains(tableHandle.getTableName())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change becomes applicable for both native and java workers. Do we really want to change java behavior ?

const RowTypePtr& type,
size_t vectorSize,
memory::MemoryPool* pool) {
std::vector<VectorPtr> vector;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see what you mean. Maybe call this allocateChildVectors instead.

}
}

std::string toTableName(Table table) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the reverse of the map in fromTableName for toTableName. PlanNode.cpp has some good examples of the usage of such maps e.g. https://github.com/facebookincubator/velox/blob/main/velox/core/PlanNode.cpp#L1452

/// Returns the schema (RowType) for a particular TPC-DS table.
const velox::RowTypePtr getTableSchema(Table table);

Table fromTableName(const std::string_view& tableName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this closer to toTableName(..)

Copy link
Contributor

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this copied code is full of usage of unsafe functions, no initializations, possible buffer overflows (aka no checks if we could write out of bounds into a given buffer - the code will likely have definitions that will allocate enough memory but that doesn't make it much safer) that will trigger the static scan.

Perhaps we can clean up a lot of this and refactor some of these functions?

* TPCH connector, with ``tpch.naming=standard`` catalog property.

* TPCDS connector.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TPCH columns have a prefix. E.g. for the lineitem table the columns have l_ as prefix.

CREATE TABLE tpch.tiny.lineitem ( "l_orderkey" bigint NOT NULL, "l_partkey" bigint NOT NULL, "l_suppkey" bigint NOT NULL, "l_linenumber" integer NOT NULL, "l_quantity" double NOT NULL, "l_extendedprice" double NOT NULL, "l_discount" double NOT NULL, "l_tax" double NOT NULL, "l_returnflag" varchar(1) NOT NULL, "l_linestatus" varchar(1) NOT NULL, "l_shipdate" date NOT NULL, "l_commitdate" date NOT NULL, "l_receiptdate" date NOT NULL, "l_shipinstruct" varchar(25) NOT NULL, "l_shipmode" varchar(10) NOT NULL, "l_comment" varchar(44) NOT NULL )

* TODO: None
*/
StringBuffer_t* InitBuffer(int nSize, int nIncrement) {
StringBuffer_t* pBuf;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this code was copied but to reduce the static scan can we initialize pointers to NULL?

nRemaining += pBuf->nIncrement;
}

strcat(pBuf->pText, pStr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use functions that limit how much data is copied. In this case we can say

strncat(pBuf->pText, pStr, pBuf->nBytesAllocated);

* TODO: None
*/
int mk_address(ds_addr_t* pAddr, int nColumn, DSDGenContext& dsdGenContext) {
int i, nRegion;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets initialize variables before their first usage, unless they are set in the very next call e.g. like it is done for nMaxCities and nMaxCounties.

int ftodec(decimal_t* dest, double f) {
static char valbuf[20];

sprintf(valbuf, "%f", f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

snprintf(valbuf, sizeof(valbuf), "%f", f);

@aditi-pandit
Copy link
Contributor

aditi-pandit commented Nov 13, 2024

@czentgr : Do you have a compiled list of coding patterns to look out for to avoid errors with the security static scan (could be your own doc or something we find online) ? Or is there a way to run the scan on this code from time to time ?Would be really helpful when reviewing code. We can add it to the coding guidelines as well.

@@ -109,6 +110,16 @@ private int getSplitsPerNode(Map<String, String> properties)
}
}

private boolean toggleCharToVarchar(Map<String, String> properties)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tpcds.use-varchar-type is a better name.

pdabre12 and others added 4 commits November 18, 2024 08:33
Co-authored-by: Pramod Satya <pramod.satya@ibm.com>
Co-authored-by: Pramod Satya <pramod.satya@ibm.com>
Co-authored-by: Pratik Joseph Dabre <pdabre12@gmail.com>
Co-authored-by: Pratik Joseph Dabre <pdabre12@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Native] Add Prestissimo TPC-DS connector
6 participants