-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[native] Add TPC-DS connector #23067
base: master
Are you sure you want to change the base?
Conversation
69d9c1b
to
a88f29d
Compare
a88f29d
to
03214b7
Compare
8734eef
to
3f40e32
Compare
Consider updating the Use Cases in the Presto C++ section of the Presto doc to add TPC-DS to the supported connectors. |
4d36367
to
375dc59
Compare
375dc59
to
c9e0947
Compare
64ab39e
to
f08bd87
Compare
Consider a release note entry similar to the following:
|
dc5f0e2
to
8a5fe7c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull updated branch, new local docs build, looks good. Thanks!
32aa4e4
to
42377b8
Compare
1f7db4d
to
1b86b4d
Compare
6d9a0c5
to
e69c19d
Compare
@aditi-pandit @yingsu00 |
FYI: @czentgr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pramodsatya and @pdabre12.
Have done a quick round of review.
presto-native-execution/presto_cpp/presto_protocol/presto_protocol.yml
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/dsdgen/dsdgen-c/StringBuffer.cpp
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/CMakeLists.txt
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/TpcdsGen.cpp
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/TpcdsGen.cpp
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/TpcdsGen.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the dsdgen should be moved to the external folder.
47e2545
to
cc187f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intiial comments. There might be more but to get you going.
presto-native-execution/presto_cpp/main/connectors/tpcds/CMakeLists.txt
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/CMakeLists.txt
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/DSDGenIterator.cpp
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/DSDGenIterator.h
Outdated
Show resolved
Hide resolved
cc187f0
to
eb763e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pdabre12 Some comments. I think we can merge the external portion first in a separate commit as they are copied externally. That makes it easy to review the new bits. I see that it is already a separate commit.
@aditi-pandit, @tdcmeehan I think we should have a RFC for this work. What do you think? The RFC can be simple.
* TPCH connector, with ``tpch.naming=standard`` catalog property. | ||
|
||
* TPCDS connector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a column naming issue for TPCDS similar to TPCH?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not aware of the column naming issue for TPCH, can you please elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TPCH columns have a prefix. E.g. for the lineitem table the columns have l_
as prefix.
CREATE TABLE tpch.tiny.lineitem ( "l_orderkey" bigint NOT NULL, "l_partkey" bigint NOT NULL, "l_suppkey" bigint NOT NULL, "l_linenumber" integer NOT NULL, "l_quantity" double NOT NULL, "l_extendedprice" double NOT NULL, "l_discount" double NOT NULL, "l_tax" double NOT NULL, "l_returnflag" varchar(1) NOT NULL, "l_linestatus" varchar(1) NOT NULL, "l_shipdate" date NOT NULL, "l_commitdate" date NOT NULL, "l_receiptdate" date NOT NULL, "l_shipinstruct" varchar(25) NOT NULL, "l_shipmode" varchar(10) NOT NULL, "l_comment" varchar(44) NOT NULL )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the TPCDS column does not have a prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong, actually it does have a prefix.
SELECT cc_call_center_sk, cc_name, cc_manager, cc_mkt_id, cc_mkt_class FROM call_center
#pragma once | ||
|
||
#include "velox/common/memory/Memory.h" | ||
#include "velox/vector/ComplexVector.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this header? Can we forward declare?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Should be possible to forward declare RowVectorPtr.
presto-native-execution/presto_cpp/main/connectors/tpcds/TpcdsGen.cpp
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/utils/append_info-c.hpp
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/utils/append_info-c.hpp
Outdated
Show resolved
Hide resolved
presto-tpcds/src/main/java/com/facebook/presto/tpcds/TpcdsMetadata.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pdabre12, @pramodsatya : Have a first round of comments.
// TpcdsTransactionHandle is special since | ||
// the corresponding class in Java is an enum. | ||
|
||
namespace facebook::presto::protocol { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add API's to PrestoToVeloxConnector to handle transaction handles as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion, could this be taken up in a separate PR? (since it affects transaction handles for all connectors)
presto-native-execution/presto_cpp/main/connectors/tpcds/utils/append_info-c.hpp
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/utils/append_info-c.hpp
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/utils/append_info-c.hpp
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/DSDGenIterator.h
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/DSDGenIterator.h
Outdated
Show resolved
Hide resolved
vector_size_t parallel, | ||
vector_size_t child); | ||
|
||
void initializeTable(const std::vector<VectorPtr>& children, int table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments for each of the methods in this class as this is the final interface seen over dsdgen
presto-native-execution/presto_cpp/main/connectors/tpcds/DSDGenIterator.h
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/tpcds/DSDGenIterator.h
Outdated
Show resolved
Hide resolved
@pdabre12 @majetideepak : Small RFC is okay. Let's bring this to Native worker working group next session so that it gets immediate attention from others. |
eb763e0
to
ce9914b
Compare
ce9914b
to
9dba598
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull branch, local docs build.
5ff51ef
to
a1afddd
Compare
fb56a83
to
09a13d0
Compare
a373413
to
ce1e567
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pramodsatya @pdabre12 : Please add a README for this folder describing the origin of this code and if you have made any changes to the original dsdgen for the Prestissimo connector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments on the first two commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test to validate the schema is generated correctly with this config
public TpcdsMetadata() | ||
private final boolean toggleCharToVarchar; | ||
|
||
public TpcdsMetadata(boolean toggleCharToVarchar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can retain the old constructor and have it set toggleCharToVarchar as false, so that it can be used by the Java connector. Then you will not need to change tests.
@@ -57,7 +57,7 @@ public TpcdsRecordSet(Results results, List<Column> columns) | |||
this.columns = ImmutableList.copyOf(columns); | |||
ImmutableList.Builder<Type> columnTypes = ImmutableList.builder(); | |||
for (Column column : columns) { | |||
columnTypes.add(getPrestoType(column.getType())); | |||
columnTypes.add(getPrestoType(column.getType(), false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment that this code is only used by the Java connector which doesn't allow this option.
@@ -109,6 +110,16 @@ private int getSplitsPerNode(Map<String, String> properties) | |||
} | |||
} | |||
|
|||
private boolean toggleCharToVarchar(Map<String, String> properties) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this option is only valid in Native code as the Java RecordSet always uses false for it.
If we expect that the only use-case for this property is for native code temporarily and should be removed after Prestissimo starts supporting char type, then we can remove this config and just generate the TableMetadata correctly depending on isNativeExecutionEnabled().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is odd for users to get different results (types) based on the worker type. Making this an option makes it explicit for the user and allows them to make adjustments to their pipelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TPC-DS standard specifies the column as char and Presto Java supports it correctly as well. We are kind of regressing by allowing the option to use varchar in the schema instead.
The Native engine is working around its limitation of types and using varchar for these fields. I feel this is very explicit knowledge for tools/pipelines using Native engine. They sign up for this behavior when they decide to use the tpc-ds connector with native engine. Though we could make it more explicit if they are forced to set a config to start using the "modified" tpc-ds connector.
@pdabre12 : Do we have a notion of native only/java only session config with side-car enabled ? This is a good candidate for a native only config I feel.
@majetideepak : Do you feel that the varchar/char(n) work will come along in the near term ? If yes, then we can use the correct type in the TPC-DS connector from the beginning itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are kind of regressing by allowing the option to use varchar in the schema instead.
We can simply throw an error for Java that varchar is not supported. Don't have to add support. Similarly, we will throw an error for Native that char is not supported until we support it.
They sign up for this behavior when they decide to use the tpc-ds connector with native engine.
They will only know this by reading the documentation or realize this after generating some TPC-DS data. The config will give them an advance notice.
Do you feel that the varchar/char(n) work will come along in the near term
This work is getting closer to completion. It does make sense to wait for this to complete instead of adding workarounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aditi-pandit We should be able to do that, it is still an ongoing effort : #23968
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can simply throw an error for Java that varchar is not supported. Don't have to add support. Similarly, we will throw an error for Native that char is not supported until we support it.
This would mean throwing for almost all the tables because most of them have a column with char data type..
@@ -269,6 +275,10 @@ void PrestoServer::run() { | |||
std::make_unique<SystemPrestoToVeloxConnector>("system")); | |||
registerPrestoToVeloxConnector( | |||
std::make_unique<SystemPrestoToVeloxConnector>("$system@system")); | |||
#ifdef PRESTO_ENABLE_TPCDS_CONNECTOR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use this flag for tpcds connector ? Why not always build and enable it by default like TpchConnector ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TPC-H connector option for Velox is present in the Velox CMakeLists.txt.
VELOX_ENABLE_TPCH_CONNECTOR
Since it is ON by default in Velox, it works for Prestissimo. We must fix this.
Having these options reduces build time and avoids dependencies in other cases. We can get a light worker and these options don't hurt.
The ifdef can go to the TPC-DS implementation and keep the usage here clean. See how Parquet and storage adapters are handled.
Map<String, String> tpcdsProperties = ImmutableMap.<String, String>builder() | ||
.put("tpcds.toggle-char-to-varchar", "true") | ||
.build(); | ||
queryRunner.createCatalog("tpcds", "tpcds", tpcdsProperties); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires PRESTO_ENABLE_TPCDS_CONNECTOR to always be set. I feel we should remove that compilation option and always build and use TpcdsConnector.
Can you check the Java behavior ? We should make native and java consistent.
for (int i = 0; i < splitsPerNode; i++) { | ||
splits.add(new TpcdsSplit(tableHandle, partNumber, totalParts, ImmutableList.of(node.getHostAndPort()), noSexism)); | ||
partNumber++; | ||
if (smallTables.contains(tableHandle.getTableName())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change becomes applicable for both native and java workers. Do we really want to change java behavior ?
const RowTypePtr& type, | ||
size_t vectorSize, | ||
memory::MemoryPool* pool) { | ||
std::vector<VectorPtr> vector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see what you mean. Maybe call this allocateChildVectors instead.
} | ||
} | ||
|
||
std::string toTableName(Table table) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the reverse of the map in fromTableName for toTableName. PlanNode.cpp has some good examples of the usage of such maps e.g. https://github.com/facebookincubator/velox/blob/main/velox/core/PlanNode.cpp#L1452
/// Returns the schema (RowType) for a particular TPC-DS table. | ||
const velox::RowTypePtr getTableSchema(Table table); | ||
|
||
Table fromTableName(const std::string_view& tableName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this closer to toTableName(..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this copied code is full of usage of unsafe functions, no initializations, possible buffer overflows (aka no checks if we could write out of bounds into a given buffer - the code will likely have definitions that will allocate enough memory but that doesn't make it much safer) that will trigger the static scan.
Perhaps we can clean up a lot of this and refactor some of these functions?
* TPCH connector, with ``tpch.naming=standard`` catalog property. | ||
|
||
* TPCDS connector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TPCH columns have a prefix. E.g. for the lineitem table the columns have l_
as prefix.
CREATE TABLE tpch.tiny.lineitem ( "l_orderkey" bigint NOT NULL, "l_partkey" bigint NOT NULL, "l_suppkey" bigint NOT NULL, "l_linenumber" integer NOT NULL, "l_quantity" double NOT NULL, "l_extendedprice" double NOT NULL, "l_discount" double NOT NULL, "l_tax" double NOT NULL, "l_returnflag" varchar(1) NOT NULL, "l_linestatus" varchar(1) NOT NULL, "l_shipdate" date NOT NULL, "l_commitdate" date NOT NULL, "l_receiptdate" date NOT NULL, "l_shipinstruct" varchar(25) NOT NULL, "l_shipmode" varchar(10) NOT NULL, "l_comment" varchar(44) NOT NULL )
* TODO: None | ||
*/ | ||
StringBuffer_t* InitBuffer(int nSize, int nIncrement) { | ||
StringBuffer_t* pBuf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this code was copied but to reduce the static scan can we initialize pointers to NULL?
nRemaining += pBuf->nIncrement; | ||
} | ||
|
||
strcat(pBuf->pText, pStr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use functions that limit how much data is copied. In this case we can say
strncat(pBuf->pText, pStr, pBuf->nBytesAllocated);
* TODO: None | ||
*/ | ||
int mk_address(ds_addr_t* pAddr, int nColumn, DSDGenContext& dsdGenContext) { | ||
int i, nRegion; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets initialize variables before their first usage, unless they are set in the very next call e.g. like it is done for nMaxCities and nMaxCounties.
int ftodec(decimal_t* dest, double f) { | ||
static char valbuf[20]; | ||
|
||
sprintf(valbuf, "%f", f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be
snprintf(valbuf, sizeof(valbuf), "%f", f);
@czentgr : Do you have a compiled list of coding patterns to look out for to avoid errors with the security static scan (could be your own doc or something we find online) ? Or is there a way to run the scan on this code from time to time ?Would be really helpful when reviewing code. We can add it to the coding guidelines as well. |
@@ -109,6 +110,16 @@ private int getSplitsPerNode(Map<String, String> properties) | |||
} | |||
} | |||
|
|||
private boolean toggleCharToVarchar(Map<String, String> properties) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tpcds.use-varchar-type
is a better name.
Co-authored-by: Pramod Satya <pramod.satya@ibm.com>
Co-authored-by: Pramod Satya <pramod.satya@ibm.com>
Co-authored-by: Pratik Joseph Dabre <pdabre12@gmail.com>
Co-authored-by: Pratik Joseph Dabre <pdabre12@gmail.com>
ce1e567
to
2631afe
Compare
Description
Adds the native TPC-DS connector and the codegen (dsdgen) files. The TPC-DS connector will be a wrapper for the data generator (dsdgen) provided by the TPC organization, originally implemented in C. This implementation must mimic the behavior of the original C implementation. DuckDB has a TPC-DS connector that uses C++ files to wrap the C implementation, and we will use these C++ files for our implementation.
Motivation and Context
The goal is to add a TPC-DS connector to enhance our ability to write end-to-end tests for Prestissimo and conduct micro-benchmarks in Velox. This addition will significantly improve our testing capabilities by ensuring comprehensive coverage and performance validation.
Impact
With this change, we can now write e2e tests and microbenchmarks.
Test Plan
Native end-to-end tests.
Future enhancements will include adding SpeedTest and ConnectorTest to the Velox repository.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.
Resolves: #22361