Proposers
- Ajas M M
- Haritha K
- Thanzeel Hassan
- Glerin Pinhero
At present, when a query joins multiple tables, Presto creates a separate TableScanNode for each table. Each TableScanNode selects all the records from its table. The join is then executed in memory in Presto by a JoinNode, which applies the JoinCriteria, FilterPredicate and other criteria (such as ORDER BY and LIMIT).
However, if the query joins tables from the same JDBC datasource, it is usually more efficient to let the datasource handle the join instead of creating a separate TableScanNode for each table and joining them in Presto. Pushing these joins down to the remote JDBC datasource improves query performance, that is, it reduces query execution time. We have seen an average improvement in query times of 3x-4x, and up to 10x in some cases.
Join query performance improvements:
Select t1.custkey as t1_id, t2.custkey as t2_id, t1.name, t1.address, t1.phone, t2.orderstatus, t2.orderdate, t2.totalprice,
t2.orderpriority, t2.orderstatus, t2.clerk, t2.shippriority
from "postgres"."pg".customer t1
join "postgres"."pg".orders t2
on t1.custkey = t2.orderkey;
- customer table has 45 million rows, orders table has 1.5 million rows – Result is 8 rows.
No | Parameter | Normal Presto Flow | Join Pushdown Flow |
---|---|---|---|
1 | Elapsed Time | 3.73m | 59.32s |
2 | Prerequisites Wait Time | 4.01ms | 13.21ms |
3 | Queued Time | 3.99ms | 5.12ms |
4 | Planning Time | 250.37ms | 313.41ms |
5 | Execution Time | 3.73m | 58.98s |
6 | CPU Time | 2.69m | 30.00ms |
7 | Scheduled Time | 3.95m | 58.53s |
8 | Input Rows | 46.5M | 16.0 |
9 | Input Data | 3.93GB | 1.12kB |
10 | Shuffled Rows | 46.5M | 8.00 |
11 | Shuffled Data | 3.72GB | 1.28kB |
12 | Peak User Memory | 142.46MB | 0B |
13 | Peak Total Memory | 192.88MB | 0B |
14 | Cumulative User Memory | 29.5G seconds | 0 seconds |
15 | Cumulative Total | 30.9G seconds | 0 seconds |
16 | Output Rows | 8.00 | 8.00 |
17 | Output Data | 802B | 1.16kB |
For the Postgres join query below, if we push the join down into a single TableScanNode, the Presto plan and performance are as follows:
Join Query
SELECT
b.book_id,
b.copies_available,
b.year_published,
l.total_seating_capacity,
l.number_of_staff
FROM
postgres.pg.books b
JOIN
postgres.pg.libraries l
ON b.library_id = l.library_id;
Original Presto Plan
Explain Analyze output :
Query Plan
"Fragment 1 [HASH]
CPU: 20.03ms, Scheduled: 57.41ms, Input: 10 rows (265B); per task: avg.: 10.00 std.dev.: 0.00, Output: 5 rows (125B), 1 tasks
Output layout: [book_id, copies_available, year_published, total_seating_capacity, number_of_staff]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
- InnerJoin[PlanNodeId 4][("library_id" = "library_id_0")][$hashvalue, $hashvalue_23] => [book_id:integer, copies_available:integer, year_published:integer, total_seating_capacity:integer, number_of_staff:integer]
CPU: 14.00ms (25.93%), Scheduled: 50.00ms (41.32%), Output: 5 rows (125B)
Left (probe) Input avg.: 0.31 rows, Input std.dev.: 387.30%
Right (build) Input avg.: 0.31 rows, Input std.dev.: 186.55%
Distribution: PARTITIONED
- RemoteSource[2] => [book_id:integer, library_id:integer, copies_available:integer, year_published:integer, $hashvalue:bigint]
CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 5 rows (145B)
Input avg.: 0.31 rows, Input std.dev.: 387.30%
- LocalExchange[PlanNodeId 363][HASH][$hashvalue_23] (library_id_0) => [library_id_0:integer, total_seating_capacity:integer, number_of_staff:integer, $hashvalue_23:bigint]
Estimates: {source: CostBasedSourceInfo, rows: ? (?), cpu: ?, memory: 0.00, network: ?}
CPU: 1.00ms (1.85%), Scheduled: 1.00ms (0.83%), Output: 5 rows (120B)
Input avg.: 0.31 rows, Input std.dev.: 387.30%
- RemoteSource[3] => [library_id_0:integer, total_seating_capacity:integer, number_of_staff:integer, $hashvalue_24:bigint]
CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 5 rows (120B)
Input avg.: 0.31 rows, Input std.dev.: 387.30%
Fragment 2 [SOURCE]
CPU: 21.82ms, Scheduled: 38.69ms, Input: 5 rows (0B); per task: avg.: 5.00 std.dev.: 0.00, Output: 5 rows (145B), 1 tasks
Output layout: [book_id, library_id, copies_available, year_published, $hashvalue_22]
Output partitioning: HASH [library_id][$hashvalue_22]
Stage Execution Strategy: UNGROUPED_EXECUTION
- ScanProject[PlanNodeId 0,401][table = TableHandle {connectorId='postgres', connectorHandle='JdbcTableHandle{connectorId=postgres, schemaTableName=pg.books, catalogName=null, schemaName=pg, tableName=books, joinTables=Optional.empty}', layout='Optional[{domains=ALL, additionalPredicate={}}]'}, grouped = false, projectLocality = LOCAL] => [book_id:integer, library_id:integer, copies_available:integer, year_published:integer, $hashvalue_22:bigint]
Estimates: {source: CostBasedSourceInfo, rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}/{source: CostBasedSourceInfo, rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}
CPU: 21.00ms (38.89%), Scheduled: 37.00ms (30.58%), Output: 5 rows (145B)
Input avg.: 5.00 rows, Input std.dev.: 0.00%
$hashvalue_22 := combine_hash(BIGINT'0', COALESCE($operator$hash_code(library_id), BIGINT'0')) (9:5)
LAYOUT: {domains=ALL, additionalPredicate={}}
copies_available := JdbcColumnHandle{connectorId=postgres, columnName=copies_available, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (9:5)
year_published := JdbcColumnHandle{connectorId=postgres, columnName=year_published, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (9:5)
library_id := JdbcColumnHandle{connectorId=postgres, columnName=library_id, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (9:5)
book_id := JdbcColumnHandle{connectorId=postgres, columnName=book_id, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (9:5)
Input: 5 rows (0B), Filtered: 0.00%
Fragment 3 [SOURCE]
CPU: 18.99ms, Scheduled: 35.35ms, Input: 5 rows (0B); per task: avg.: 5.00 std.dev.: 0.00, Output: 5 rows (120B), 1 tasks
Output layout: [library_id_0, total_seating_capacity, number_of_staff, $hashvalue_25]
Output partitioning: HASH [library_id_0][$hashvalue_25]
Stage Execution Strategy: UNGROUPED_EXECUTION
- ScanProject[PlanNodeId 1,402][table = TableHandle {connectorId='postgres', connectorHandle='JdbcTableHandle{connectorId=postgres, schemaTableName=pg.libraries, catalogName=null, schemaName=pg, tableName=libraries, joinTables=Optional.empty}', layout='Optional[{domains=ALL, additionalPredicate={}}]'}, grouped = false, projectLocality = LOCAL] => [library_id_0:integer, total_seating_capacity:integer, number_of_staff:integer, $hashvalue_25:bigint]
Estimates: {source: CostBasedSourceInfo, rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}/{source: CostBasedSourceInfo, rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}
CPU: 18.00ms (33.33%), Scheduled: 33.00ms (27.27%), Output: 5 rows (120B)
Input avg.: 5.00 rows, Input std.dev.: 0.00%
$hashvalue_25 := combine_hash(BIGINT'0', COALESCE($operator$hash_code(library_id_0), BIGINT'0')) (11:5)
LAYOUT: {domains=ALL, additionalPredicate={}}
library_id_0 := JdbcColumnHandle{connectorId=postgres, columnName=library_id, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (11:5)
total_seating_capacity := JdbcColumnHandle{connectorId=postgres, columnName=total_seating_capacity, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (11:5)
number_of_staff := JdbcColumnHandle{connectorId=postgres, columnName=number_of_staff, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (11:5)
Input: 5 rows (0B), Filtered: 0.00%",
Join pushdown Presto plan
Explain Analyze output :
Query Plan
"Fragment 1 [SOURCE]
CPU: 23.40ms, Scheduled: 46.42ms, Input: 5 rows (0B); per task: avg.: 5.00 std.dev.: 0.00, Output: 5 rows (175B), 1 tasks
Output layout: [book_id, library_id, copies_available, year_published, library_id_0, total_seating_capacity, number_of_staff]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
- TableScan[PlanNodeId 361][TableHandle {connectorId='postgres', connectorHandle='JdbcTableHandle{connectorId=postgres, schemaTableName=pg.books, catalogName=null, schemaName=pg, tableName=books, joinTables=Optional[[JdbcTableHandle{connectorId=postgres, schemaTableName=pg.books, catalogName=null, schemaName=pg, tableName=books, joinTables=Optional.empty}, JdbcTableHandle{connectorId=postgres, schemaTableName=pg.libraries, catalogName=null, schemaName=pg, tableName=libraries, joinTables=Optional.empty}]]}', layout='Optional[{domains=ALL, additionalPredicate={}}]'}, grouped = false] => [book_id:integer, library_id:integer, copies_available:integer, year_published:integer, library_id_0:integer, total_seating_capacity:integer, number_of_staff:integer]
CPU: 23.00ms (100.00%), Scheduled: 45.00ms (100.00%), Output: 5 rows (175B)
Input avg.: 5.00 rows, Input std.dev.: 0.00%
LAYOUT: {domains=ALL, additionalPredicate={}}
year_published := JdbcColumnHandle{connectorId=postgres, columnName=year_published, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (9:5)
total_seating_capacity := JdbcColumnHandle{connectorId=postgres, columnName=total_seating_capacity, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (11:5)
copies_available := JdbcColumnHandle{connectorId=postgres, columnName=copies_available, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (9:5)
book_id := JdbcColumnHandle{connectorId=postgres, columnName=book_id, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (9:5)
library_id := JdbcColumnHandle{connectorId=postgres, columnName=library_id, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (9:5)
library_id_0 := JdbcColumnHandle{connectorId=postgres, columnName=library_id, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (11:5)
number_of_staff := JdbcColumnHandle{connectorId=postgres, columnName=number_of_staff, jdbcTypeHandle=JdbcTypeHandle{jdbcType=4, jdbcTypeName=int4, columnSize=10, decimalDigits=0, arrayDimensions=null}, columnType=integer, nullable=true, comment=Optional.empty} (11:5)
Input: 5 rows (0B), Filtered: 0.00%",
Below image shows the flow of a query from user input to execution.
For JDBC connectors the join order is syntactic: Presto builds the join graph in the order specified by the SQL statement. This holds when the join reordering strategy is set to NONE, or when it is set to AUTOMATIC but no table statistics are available for reordering. Most JDBC connectors do not provide table statistics, so we will not see any join reordering. The order and position of the tables in the join query therefore play an important role in determining whether join pushdown happens and, if it does, to what extent.
Below is the example of PlanNode that is created for the join query.
Currently, for a JoinNode, a separate TableScanNode is created for each JDBC source table.
In our proposed implementation, instead of creating one TableScanNode per JDBC table, we create a single TableScanNode that represents the result of pushing down the join to the underlying JDBC source.
No | Node Description | SQL query to JDBC source (with join pushed down) |
---|---|---|
1 | TableScanNode [mypg_table1, mypg_table2] for PostgreSQL | select * from postgresql.pg.mypg_table1 t1, postgresql.pg.mypg_table2 t2 where t1.pgfirsttablecolumn=t2.pgsecondtablecolumn |
2 | TableScanNode [mydb2_table1, mydb2_table2, mydb2_table3] for DB2 | select * from db2.db2.mydb2_table1 t3, db2.db2.mydb2_table2 t4, db2.db2.mydb2_table3 t5 where t3.dbthirdtablecolumn = t4.dbfourthtablecolumn and t4.dbfourthtablecolumn = t5.dbfifthtablecolumn |
To perform this JDBC join pushdown, we need to create two logical optimizers: GroupInnerJoinsByConnector and JdbcJoinPushdown.
GroupInnerJoinsByConnector - an optimizer rule that, if enabled, attempts to group sources by connector into a single TableScanNode that represents the result of pushing down the INNER JOINs between these sources. The low level design is available here
JdbcJoinPushdown - a JDBC connector optimizer that operates on these 'grouped' table sets to build the pushdown SQL that achieves the join result. The low level design is available here
After all optimizations, the flow passes to the presto-base-jdbc module to create the final join query. The final join query is prepared at the connector level using the QueryBuilder. It is explained in the low level design here.
SQL query:
select t1.intcolumn2
from postgresql.pg.mypg_table1 t1
join postgresql.pg.mypg_table2 t2 on t1.pgfirsttablecolumn = t2.pgsecondtablecolumn
Join db2.db2.mydb2_table1 t3 on t3.dbthirdtablecolumn = t2.pgsecondtablecolumn
JOIN db2.db2.mydb2_table2 t4 ON t3.dbthirdtablecolumn = t4.dbfourthtablecolumn
JOIN db2.db2.mydb2_table3 t5 ON t4.dbfourthtablecolumn = t5.dbfifthtablecolumn
The diagram below shows the optimization flow:
Presto validates the specification of the join operation (PlanNode) before performing join pushdown. The specifics of supported join pushdown vary by data source, and therefore by connector. However, there are some generic conditions that must be met for a join to be pushed down in a JDBC connector:
1. The JDBC connector should be able to process the join operation.
The Presto JDBC connector can process almost every join operation, except those involving Presto functions and operators.
When an aggregate, a math operation or a datatype conversion is used in a join query, it is converted to a Presto function and applied to the join operation. Any join query that creates intermediate Presto functions cannot be handled by the connector and hence will not be pushed down.
No | Condition that creates a Presto function | SQL Query |
---|---|---|
1 | abs(int_column) = int_column2 | Select * from table a join table b on abs(a.col1) = b.col2; |
2 | int_sum_column = int_value1_column1 + int_value1_column2 | Select * from table a join table b on a.col1 = b.col2 + b.col3; |
3 | cast(varchar_20_column as varchar(100)) = varchar100_column | Select * from table a join table b on cast(a.varchar_20_column as varchar(100)) = b.varchar100_column; |
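The rule illustrated by the table can be sketched as a small check. This is only an illustration under stated assumptions: `Expr`, `Column` and `Call` are hypothetical stand-ins invented for this example (Presto's planner works on its own expression classes), and the real validation may consider more cases than a top-level function call.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified expression model; not Presto's own expression classes.
final class JoinConditionCheck
{
    interface Expr {}

    // A direct reference to a table column, e.g. a.col1
    static final class Column
            implements Expr
    {
        final String name;

        Column(String name)
        {
            this.name = name;
        }
    }

    // A function or operator applied to other expressions, e.g. abs(a.col1) or b.col2 + b.col3
    static final class Call
            implements Expr
    {
        final String function;
        final List<Expr> arguments;

        Call(String function, Expr... arguments)
        {
            this.function = function;
            this.arguments = Arrays.asList(arguments);
        }
    }

    // A comparison is a pushdown candidate only when neither side needs a Presto function.
    static boolean isPushdownCandidate(Expr left, Expr right)
    {
        return left instanceof Column && right instanceof Column;
    }

    public static void main(String[] args)
    {
        Expr col1 = new Column("a.col1");
        Expr col2 = new Column("b.col2");
        Expr col3 = new Column("b.col3");

        System.out.println(isPushdownCandidate(col1, col2));                        // a.col1 = b.col2
        System.out.println(isPushdownCandidate(new Call("abs", col1), col2));       // abs(a.col1) = b.col2
        System.out.println(isPushdownCandidate(col1, new Call("add", col2, col3))); // a.col1 = b.col2 + b.col3
    }
}
```

Running `main` prints `true`, `false`, `false`: only the plain column-to-column comparison qualifies, matching rows 1 and 2 of the table.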
2. Join operation should be an INNER JOIN or a SELF JOIN.
A SELF JOIN is when a table is joined with itself.
Note: Other optimizers in Presto may change the join operation; we can call this inference. Sometimes Presto changes a pushdown-capable inner join into another join operation that cannot be pushed down (e.g. inferring that the join condition/predicate can be removed from the plan), which removes the pushdown capability. And sometimes Presto changes a join operation into a pushdown-capable one (e.g. inferring an inner join from a right/left join).
Example of Presto changing an inner join to another join operation:
Suppose we have a query like this:
Select * from table a join table b on a.col1 = b.col2 and a.col1 = 5;
Presto will change this from an inner join to two different select statements like this:
Select * from table a where a.col1 = 5;
Select * from table b where b.col2 = 5;
Then it does a cross join with these two results. We will not do pushdown in this case.
3. Join criteria (joining columns) should use datatypes and operators that support join pushdown.
No | Datatypes that support join pushdown | Operators |
---|---|---|
1 | TinyINT | =, <, >, <=, >=, !=, <> |
2 | SmallINT | =, <, >, <=, >=, !=, <> |
3 | Integer | =, <, >, <=, >=, !=, <> |
4 | BigINT | =, <, >, <=, >=, !=, <> |
5 | Boolean | =, !=, <> |
6 | Real | =, <, >, <=, >=, !=, <> |
7 | Double | =, <, >, <=, >=, !=, <> |
8 | Decimal | =, <, >, <=, >=, !=, <> |
9 | Varchar | =, <, >, <=, >=, !=, <> |
10 | Char | =, <, >, <=, >=, !=, <> |
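As a rough illustration of how the table above could be consulted, the sketch below encodes it as a lookup from base type name to allowed operators. The class `JoinCriteriaSupport` and its `supports` method are hypothetical names for this example; type names are plain strings here rather than Presto type objects, and parameterized names such as `varchar(100)` would need to be reduced to their base name first.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical lookup table mirroring the datatype/operator table above.
final class JoinCriteriaSupport
{
    private static final Set<String> ALL_COMPARISONS =
            new HashSet<>(Arrays.asList("=", "<", ">", "<=", ">=", "!=", "<>"));
    private static final Set<String> EQUALITY_ONLY =
            new HashSet<>(Arrays.asList("=", "!=", "<>"));
    private static final Map<String, Set<String>> SUPPORTED = new HashMap<>();

    static {
        for (String type : Arrays.asList("tinyint", "smallint", "integer", "bigint",
                "real", "double", "decimal", "varchar", "char")) {
            SUPPORTED.put(type, ALL_COMPARISONS);
        }
        // Boolean columns are listed with equality and inequality only, not ordering.
        SUPPORTED.put("boolean", EQUALITY_ONLY);
    }

    // Returns true when the table lists the operator for the given base type name.
    static boolean supports(String baseTypeName, String operator)
    {
        Set<String> operators = SUPPORTED.get(baseTypeName.toLowerCase(Locale.ROOT));
        return operators != null && operators.contains(operator);
    }

    public static void main(String[] args)
    {
        System.out.println(supports("Integer", "<=")); // listed in the table
        System.out.println(supports("Boolean", "<"));  // ordering is not listed for Boolean
        System.out.println(supports("date", "="));     // type is not in the table
    }
}
```

Running `main` prints `true`, `false`, `false`.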
4. All tables from the same connector are grouped based on the above specifications and pushed down to the underlying datasource.
5. Enable Presto join pushdown by setting the session flag optimizer_inner_join_pushdown_enabled = true.
To perform JDBC join pushdown, we need to introduce 2 new optimizers and also use the existing predicate pushdown optimizer and JDBC compute pushdown optimizer.
We are going to create a new optimizer (GroupInnerJoinsByConnector) that implements PlanOptimizer and another optimizer (JdbcJoinPushdown) that implements ConnectorPlanOptimizer.
After the GroupInnerJoinsByConnector optimization completes, the JdbcJoinPushdown optimizer is invoked. After that, the predicate pushdown optimizer is invoked to recreate the join criteria from the filter node of the JoinNode.
Below is the overall process :
- Run GroupInnerJoinsByConnector Optimizer (new)
- Run JdbcJoinPushdown Optimizer (new)
- Run Predicate Pushdown Optimizer (existing)
- Run Jdbc compute Pushdown Optimizer (existing)
- Optimizing is over, execution starts
- From JdbcSplit the new values are passed to Query builder
- Query Builder checks if pushdown is happening and builds join query accordingly.
- The built join query is passed to BaseJdbcClient for execution.
GroupInnerJoinsByConnector Optimizer is implemented inside the presto-main module. This optimizer is used to group the tables (which are part of inner joins) in a query so that we can push down these grouped tables.
- GroupInnerJoinsByConnector uses the SimplePlanRewriter methods visitJoin and visitFilter to traverse the nodes. We need to traverse the JoinNode to identify whether the join query (Presto plan) can be processed by the datasource. For this we traverse all the nodes under the JoinNode and validate all 5 conditions listed above.
2. Flatten all TableScanNode, filter, outputVariables and assignments to a new data structure called MultiJoinNode
- Presto already has an existing data structure called multiJoinNode which is used to flatten Plan nodes into list of source nodes. We are using a similar approach to create multiJoinNode.
- Grouping only happens for sources that have one of the following structures:
- TableScanNode
- FilterNode -> TableScanNode
- ProjectNode -> FilterNode -> TableScanNode
- 3.1. We take each item of multiJoinNode's sourceList and check whether it belongs to a connector that supports join pushdown. For this we have introduced a new capability in ConnectorCapabilities (in the SPI module) called "SUPPORTS_JOIN_PUSHDOWN".
public enum ConnectorCapabilities
{
NOT_NULL_COLUMN_CONSTRAINT,
SUPPORTS_REWINDABLE_SPLIT_SOURCE,
SUPPORTS_PAGE_SINK_COMMIT,
PRIMARY_KEY_CONSTRAINT,
UNIQUE_CONSTRAINT,
ENFORCE_CONSTRAINTS,
ALTER_COLUMN,
SUPPORTS_JOIN_PUSHDOWN
}
- 3.2. In JdbcConnector, we set this capability to enable join pushdown, so that all JDBC connectors get this join pushdown capability.
@Override
public Set<ConnectorCapabilities> getCapabilities()
{
return immutableEnumSet(NOT_NULL_COLUMN_CONSTRAINT, SUPPORTS_JOIN_PUSHDOWN);
}
- 3.3. Once it identifies the connector as pushdown supported, it creates a Map with key as connector name and value as a List of tables which are from the connector.
- 3.4. This ensures that no other connector is affected by this optimizer. Only connectors with Join pushdown capability will be pushed down.
4. Grouping tables for creating join query - based on JDBC datasource capability
- 4.1. JoinTables (List of ConnectorTableHandle) creation happens from the Map which is created above. [Point number 3.3]
- 4.2. For each item in map, based on connector, we get a list of tables/nodes. Each node is then analyzed for join pushdown capability and either added to JoinTables List or added back to rewrittenList (If it can not be pushed down).
5. If we are able to create a JoinTables list, then we create a single table scan for that and then add to the rewrittenList.
- 5.1. If there are 4 tables in JoinTables list against Postgres, then we create a single table scan node with ConnectorHandleSet
- 5.2. Inside the ConnectorHandleSet, these 4 tables will be there.
- 5.3. This rewrittenList is used to create another multiJoinNode (rewrittenMultiJoinNode).
- Iterate over the rewrittenMultiJoinNode, for each sourceList, call createLeftDeepJoinTree() method. This creates a joinNode with all the nodes in the sourceList.
- A new FilterNode is created with the combinedFilters of the multiJoinNode as the predicate. This is finally returned.
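The grouping in steps 3 to 5 can be sketched as follows. This is a minimal illustration, not the optimizer's code: `Source` is a hypothetical stand-in for a flattened TableScanNode source, tables are represented by name only, and the per-node eligibility check from step 4.2 is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of grouping join sources by connector.
final class GroupByConnectorSketch
{
    // Stand-in for one flattened source: the connector it belongs to and its table name.
    static final class Source
    {
        final String connector;
        final String table;

        Source(String connector, String table)
        {
            this.connector = connector;
            this.table = table;
        }
    }

    // Returns the rewritten source list. Sources from connectors without the
    // join pushdown capability are kept as single-table entries; sources from
    // capable connectors are collected into one entry per connector.
    static List<List<String>> group(List<Source> sources, Set<String> pushdownConnectors)
    {
        Map<String, List<String>> byConnector = new LinkedHashMap<>();
        List<List<String>> rewritten = new ArrayList<>();
        for (Source source : sources) {
            if (pushdownConnectors.contains(source.connector)) {
                byConnector.computeIfAbsent(source.connector, key -> new ArrayList<>()).add(source.table);
            }
            else {
                rewritten.add(Collections.singletonList(source.table));
            }
        }
        // Each map entry corresponds to one combined table scan for that connector.
        rewritten.addAll(byConnector.values());
        return rewritten;
    }

    public static void main(String[] args)
    {
        List<Source> sources = Arrays.asList(
                new Source("postgresql", "mypg_table1"),
                new Source("postgresql", "mypg_table2"),
                new Source("db2", "mydb2_table1"),
                new Source("db2", "mydb2_table2"),
                new Source("db2", "mydb2_table3"));
        Set<String> capable = new HashSet<>(Arrays.asList("postgresql", "db2"));
        System.out.println(group(sources, capable));
    }
}
```

For the five-table example query used earlier, `main` prints `[[mypg_table1, mypg_table2], [mydb2_table1, mydb2_table2, mydb2_table3]]`, that is, one group per JDBC connector.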
The JdbcJoinPushdown optimizer is implemented inside the presto-base-jdbc module. It is called after GroupInnerJoinsByConnector and converts a JoinTableSet into a list of ConnectorTableHandles that JdbcTableHandle can understand.
- JdbcJoinPushdown Optimizer is added as Logical Plan Optimizer in JdbcPlanOptimizerProvider.
@Override
public Set<ConnectorPlanOptimizer> getLogicalPlanOptimizers()
{
return ImmutableSet.of(new JdbcJoinPushdown());
}
- When the logical optimization of Jdbc connector happens, it invokes JdbcJoinPushdown optimizer visitTableScan() to rewrite JoinTableSet to List of ConnectorTableHandles
Inside the visitTableScan() :
- We check if connectorHandle of tableHandle is an instance of JoinTableSet
- If that is the case, make a new JdbcTableHandle with joinTables as the tableHandles.getConnectorTableHandles()
- If not, return the node.
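The three steps above can be sketched like this. The types below are simplified stand-ins written for this example, not the real SPI or presto-base-jdbc classes (for instance, the proposed JoinTableSet holds JoinTableInfo objects, while this sketch stores table handles directly). Using the first grouped table as the outer handle is an assumption based on the join pushdown plan output shown earlier, where the `pg.books` handle carries `joinTables`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the JoinTableSet -> JdbcTableHandle rewrite.
final class VisitTableScanSketch
{
    interface ConnectorTableHandle {}

    static final class JdbcTableHandle
            implements ConnectorTableHandle
    {
        final String tableName;
        final Optional<List<ConnectorTableHandle>> joinTables;

        JdbcTableHandle(String tableName, Optional<List<ConnectorTableHandle>> joinTables)
        {
            this.tableName = tableName;
            this.joinTables = joinTables;
        }
    }

    static final class JoinTableSet
            implements ConnectorTableHandle
    {
        final List<ConnectorTableHandle> tableHandles;

        JoinTableSet(List<ConnectorTableHandle> tableHandles)
        {
            this.tableHandles = tableHandles;
        }
    }

    // Rewrites a grouped handle into a JdbcTableHandle that carries its join tables;
    // any other handle is returned unchanged.
    static ConnectorTableHandle rewrite(ConnectorTableHandle handle)
    {
        if (!(handle instanceof JoinTableSet)) {
            return handle;
        }
        List<ConnectorTableHandle> grouped = ((JoinTableSet) handle).tableHandles;
        JdbcTableHandle first = (JdbcTableHandle) grouped.get(0);
        return new JdbcTableHandle(first.tableName, Optional.of(grouped));
    }

    public static void main(String[] args)
    {
        ConnectorTableHandle books = new JdbcTableHandle("books", Optional.empty());
        ConnectorTableHandle libraries = new JdbcTableHandle("libraries", Optional.empty());

        JdbcTableHandle rewritten = (JdbcTableHandle) rewrite(new JoinTableSet(Arrays.asList(books, libraries)));
        System.out.println(rewritten.tableName + " joins " + rewritten.joinTables.get().size() + " tables");
        System.out.println(rewrite(books) == books);
    }
}
```

`main` prints `books joins 2 tables` followed by `true`, showing that a handle which is not a JoinTableSet passes through untouched.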
The JdbcJoinPushdown optimizer creates a TableScanNode structure that can hold all the JDBC tables grouped by the implementation above. Below is the proposed structure for the new TableScanNode
After creating a single TableScanNode for the grouped tables (refer point 7), we need to push the FilterNode down to the connector level. This FilterNode contains the join criteria and all filters specific to the grouped tables of the new TableScanNode. Filters that can be pushed down are applied at the connector level; any filter that cannot be pushed down stays in a FilterNode for Presto to evaluate. For this we simply invoke predicate pushdown after the JDBC join pushdown optimizer, and no code change is needed.
Using the JdbcComputePushdown optimizer, we push down the join criteria as an additional predicate. To do this, we enhanced the JdbcComputePushdown optimizer by adding a join-predicate-to-SQL translator.
private final JdbcJoinPredicateToSqlTranslator jdbcJoinPredicateToSqlTranslator;
this.jdbcJoinPredicateToSqlTranslator = new JdbcJoinPredicateToSqlTranslator(
functionMetadataManager,
buildFunctionTranslator(ImmutableSet.of(JoinOperatorTranslators.class)),
identifierQuote);
We have also added a visitPlan() method and enhanced the visitFilter() method.
The enhancements to visitFilter() are from this PR: prestodb/presto#16412. Before, if the complete filter could not be pushed down, nothing was pushed down. The changes identify which parts of the filter can be translated and pushed down. The filters that can't be pushed down are kept in a new FilterNode.
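The partial pushdown behaviour can be sketched as splitting a conjunction into the parts that translate to SQL and the parts that do not. This is an illustrative sketch only: conjuncts are plain strings, and `canTranslate` is a hypothetical callback standing in for the real translator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of partial filter pushdown over AND-ed conjuncts.
final class PartialFilterPushdown
{
    static final class Split
    {
        final List<String> pushedDown = new ArrayList<>();
        final List<String> remaining = new ArrayList<>();
    }

    // Conjuncts the translator accepts go to the connector; the rest stay
    // behind for a FilterNode evaluated by Presto.
    static Split split(List<String> conjuncts, Predicate<String> canTranslate)
    {
        Split result = new Split();
        for (String conjunct : conjuncts) {
            if (canTranslate.test(conjunct)) {
                result.pushedDown.add(conjunct);
            }
            else {
                result.remaining.add(conjunct);
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<String> conjuncts = Arrays.asList("t1.a = t2.b", "t1.c LIKE '%FR%'", "t1.d > 5");
        // Assumption for the demo: LIKE is the only untranslatable construct.
        Split result = split(conjuncts, conjunct -> !conjunct.contains(" LIKE "));
        System.out.println("pushed down: " + result.pushedDown);
        System.out.println("kept in FilterNode: " + result.remaining);
    }
}
```

`main` prints `pushed down: [t1.a = t2.b, t1.d > 5]` and `kept in FilterNode: [t1.c LIKE '%FR%']`.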
- Added 2 new classes : JoinTableInfo, JoinTableSet
- JoinTableInfo holds the information about a table that is undergoing pushdown. It has details about tableHandle, assignments and outputVariables.
- JoinTableSet is used to store a set of JoinTableInfo objects.
public class JoinTableInfo
{
private final ConnectorTableHandle tableHandle;
private final Map<VariableReferenceExpression, ColumnHandle> assignments;
private final List<VariableReferenceExpression> outputVariables;
}
public class JoinTableSet
implements ConnectorTableHandle
{
private final Set<JoinTableInfo> innerJoinTableInfos;
}
How these two classes are used is mentioned here
We have added a new connector capability called SUPPORTS_JOIN_PUSHDOWN to the ConnectorCapabilities enum in the SPI. Any connector that wants to participate in join pushdown needs to add this capability to its set of ConnectorCapabilities. JDBC has done it like this
A self join is a scenario where a table is joined with itself. Presto handles this by assigning a unique TableScanNode to each instance of the table being joined. We want to push this self join down to the database level. This is similar to pushing down an inner join, but we faced some difficulties in achieving it.
If we run a query that joins a table with itself, we get 2 TableScans in the plan. There isn't a way to differentiate between the columns of these TableScans.
Example : Assume there is a column 'col_1' in a table 'table_1'. We join table_1 with itself ->
select * from table_1 a join table_1 b on a.col_1 = b.col_1;
The column 'col_1' will be referred to as col_1 (in outputVariables) inside TableScan1 and will point to the 'col_1' JdbcColumnHandle (in assignments).
It will be referred to as col_1_0 (in outputVariables) inside TableScan2 and will point to the 'col_1' JdbcColumnHandle (in assignments).
When our optimizer JdbcJoinRenderByConnector is run, both of these TableScans are converted to a single TableScan, and their assignments, output variables, etc. are combined.
(Note: assignments now has two references to the same column 'col_1', one from 'col_1' and the other from 'col_1_0'.)
This creates a problem in the pushPredicateIntoTableScan method in the PickTableLayout class, which assumes that all the values in assignments are unique.
Presto makes this assumption because, until now, every query it sent to a database was a select from a single table.
We can introduce a 'table alias' to uniquely identify the tables. This is not the table alias that the user passes in the query.
This alias can then be used to uniquely identify columns and assignments even in the case of self joins.
This alias can then be used to build the pushdown filter (WHERE clauses) from the query predicates.
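The uniqueness problem and the effect of an alias can be shown with a small sketch. It models column handles as plain strings, which is a simplification of JdbcColumnHandle, and the alias names `t0` and `t1` are made up for this example.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: assignments map a plan variable name to a column handle (a string here).
final class SelfJoinAliasSketch
{
    // Returns true when every assignment value (column handle) is distinct.
    static boolean hasUniqueColumnHandles(Map<String, String> assignments)
    {
        return new HashSet<>(assignments.values()).size() == assignments.size();
    }

    // Qualifies a column with a generated table alias, e.g. ("t1", "col_1") -> "t1.col_1".
    static String withAlias(String tableAlias, String columnName)
    {
        return tableAlias + "." + columnName;
    }

    public static void main(String[] args)
    {
        // Merged assignments of the two scans of table_1, without aliases.
        Map<String, String> merged = new LinkedHashMap<>();
        merged.put("col_1", "col_1");
        merged.put("col_1_0", "col_1");
        System.out.println(hasUniqueColumnHandles(merged));

        // The same assignments once each scan carries its own table alias.
        Map<String, String> aliased = new LinkedHashMap<>();
        aliased.put("col_1", withAlias("t0", "col_1"));
        aliased.put("col_1_0", withAlias("t1", "col_1"));
        System.out.println(hasUniqueColumnHandles(aliased));
    }
}
```

`main` prints `false` for the merged assignments and `true` once the aliases are applied.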
Addition of JoinTableInfo Class: Please check the outline here. A new class, JoinTableInfo, has been introduced to encapsulate essential information for each join table. This includes:
- A tableHandle representing the table.
- Corresponding assignments to specify how columns are mapped.
- Output variables associated with the table.
Storing JoinTableInfo in JoinTableSet: Please check the outline of JoinTableSet here
- Within the GroupInnerJoinsByConnector, we gather a set of JoinTableInfo objects, each representing one of the join tables.
- This set is stored in the JoinTableSet, which is an implementation of ConnectorTableHandle.
Iterating JoinTableInfo in JdbcJoinPushdown:
- The JdbcJoinPushdown optimizer then iterates through the JoinTableInfo objects in the set.
- During this iteration, an alias is applied to each JdbcTableHandle and to the JdbcColumnHandle objects within its assignments.
Creating the New TableScanNode:
- The updated list of ConnectorTableHandle and the updated assignments are then used to construct the new TableScanNode.
At present we are focusing on the common operators =, <, >, <=, >= and != with common datatypes such as int, bigint, float, real, string, varchar and char. No connector-level implementation is therefore required; a single implementation in the QueryBuilder class serves all supported JDBC connectors.
The issue with handling a larger set of operators + types for the join conditions is being tracked here.
Now we have a new TableScanNode with a list of JoinTableInfos. We need to modify the logic to transfer the new object (Optional<List<ConnectorTableHandle>> joinPushdownTables) to the connector level. If the split contains 'joinTables' details, we pass those details to a new method called 'buildJoinSql()', which builds the join query to be executed.
In buildJoinSql(), we will handle the columns to be selected, the tables from which to query, the join condition and filter conditions, if any. Note that join filters are pushed down as 'regular' WHERE clause filters. The change here, compared to the traditional 'buildSql()', is adding support for building a FROM clause with multiple tables. The rest are plumbing details of how we set the tables, aliases, column names, etc. for this. The table alias that we set earlier will be used here.
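A minimal sketch of the idea follows. The method name `buildJoinSql` comes from the text, but the signature, the `AliasedTable` helper and the alias names are assumptions made for this example; identifier quoting and parameter binding, which a real query builder needs, are omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of building a multi-table FROM clause with join criteria in WHERE.
final class JoinSqlSketch
{
    static final class AliasedTable
    {
        final String qualifiedName;
        final String alias;

        AliasedTable(String qualifiedName, String alias)
        {
            this.qualifiedName = qualifiedName;
            this.alias = alias;
        }
    }

    // Join criteria and ordinary filters are both passed in as predicates,
    // since join filters are pushed down as regular WHERE clause filters.
    static String buildJoinSql(List<String> columns, List<AliasedTable> tables, List<String> predicates)
    {
        List<String> fromItems = new ArrayList<>();
        for (AliasedTable table : tables) {
            fromItems.add(table.qualifiedName + " " + table.alias);
        }
        StringBuilder sql = new StringBuilder()
                .append("SELECT ").append(String.join(", ", columns))
                .append(" FROM ").append(String.join(", ", fromItems));
        if (!predicates.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", predicates));
        }
        return sql.toString();
    }

    public static void main(String[] args)
    {
        String sql = buildJoinSql(
                Arrays.asList("t0.book_id", "t1.number_of_staff"),
                Arrays.asList(new AliasedTable("pg.books", "t0"), new AliasedTable("pg.libraries", "t1")),
                Arrays.asList("t0.library_id = t1.library_id"));
        System.out.println(sql);
    }
}
```

For the books/libraries example, `main` prints `SELECT t0.book_id, t1.number_of_staff FROM pg.books t0, pg.libraries t1 WHERE t0.library_id = t1.library_id`.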
We have a new flag 'optimizer.inner-join-pushdown-enabled'. It is configured in presto-main's config.properties and defaults to false. e.g.:
optimizer.inner-join-pushdown-enabled = true
It can be set from user session to override the above config. eg:
SET SESSION optimizer_inner_join_pushdown_enabled = true
If we do not set this flag, or set it to false (SET SESSION optimizer_inner_join_pushdown_enabled = false), join pushdown will not happen.
The GroupInnerJoinsByConnector optimizer is invoked based on the flag 'optimizer.inner-join-pushdown-enabled'.
We also have one more flag, 'optimizer.inequality-join-pushdown-enabled', in config.properties, with a default value of false. It controls whether join conditions with inequality operators (<, >, <=, >=, !=, <>) are pushed down. e.g.:
optimizer.inner-join-pushdown-enabled = true
optimizer.inequality-join-pushdown-enabled = false
This pushes down only inner joins based on equality conditions (=).
optimizer.inner-join-pushdown-enabled = true
optimizer.inequality-join-pushdown-enabled = true
This pushes down inner joins based on equality conditions (=) and inequality conditions (<, >, <=, >=, !=, <>).
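The interaction of the two flags can be summarized in a small decision function. This is a sketch of the behaviour described above, not the optimizer's code; `JoinPushdownFlags` and `shouldPushDown` are hypothetical names, and the two boolean parameters stand for the resolved values of the two flags.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of how the two flags gate a single join condition.
final class JoinPushdownFlags
{
    private static final Set<String> INEQUALITY_OPERATORS =
            new HashSet<>(Arrays.asList("<", ">", "<=", ">=", "!=", "<>"));

    static boolean shouldPushDown(String operator, boolean innerJoinPushdownEnabled, boolean inequalityJoinPushdownEnabled)
    {
        // Without the main flag, no join is pushed down at all.
        if (!innerJoinPushdownEnabled) {
            return false;
        }
        // Equality conditions need only the main flag.
        if (operator.equals("=")) {
            return true;
        }
        // Inequality conditions additionally need the inequality flag.
        return inequalityJoinPushdownEnabled && INEQUALITY_OPERATORS.contains(operator);
    }

    public static void main(String[] args)
    {
        System.out.println(shouldPushDown("=", true, false));
        System.out.println(shouldPushDown("<", true, false));
        System.out.println(shouldPushDown("<", true, true));
        System.out.println(shouldPushDown("=", false, true));
    }
}
```

`main` prints `true`, `false`, `true`, `false`, matching the two configurations listed above plus the case where the main flag is off.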
How can we measure the impact of this feature?
We can see the impact in the performance improvement in Inner Join Queries involving JDBC connectors. We can also see the change in the Plan that is created by Presto. This can be observed by executing EXPLAIN or EXPLAIN ANALYZE queries.
Based on the discussion, this may need to be updated with feedback from reviewers.
- What impact (if any) will there be on existing users? Are there any new session parameters, configurations, SPI updates, client API updates, or SQL grammar?
  - There will be a new session parameter. Users will need to set it to true if they want queries to be pushed down to JDBC connectors.
- If we are changing behaviour how will we phase out the older behaviour?
  - Not Applicable
- If we need special migration tools, describe them here.
  - Not Applicable
- When will we remove the existing behaviour, if applicable.
  - Not Applicable
- How should this feature be taught to new and existing users? Basically mention if documentation changes/new blog are needed?
  - Yes, documentation changes will be required.
- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?
  - Push down all kinds of Joins
- Some queries may become slower due to this change, if the session flag is set to true.
Added 3 new test classes: TestJdbcJoinPushdown, TestJdbcQueryBuilderJoinPushdown and TestJdbcQueryBuilderJoinPushdownExtended. Added 1 new class, TestJoinQueriesWithPushDown, which extends AbstractTestJoinQueries. There are 221 new test cases in total, and all are passing.
All the test cases in AbstractTestJoinQueries pass with the join pushdown flag enabled. This gives us a lot of confidence in the correctness of the implementation.
We have done a POC on the implementation and we were able to see the following performance improvements:
S.No. | Database | Query | Table sizes | Total input rows | Normal | Pushdown | Improvement | Explain Analyze Normal | Explain Analyze Pushdown |
---|---|---|---|---|---|---|---|---|---|
1 | DB2 | Query | Table 1 - 10 million rows, Table 2 - 10 million rows, Table 3 - 10k rows, Result - 50 rows | 20 million | 105 seconds | 10 seconds | 10.5x | Explain Analyze | Explain Analyze |
2 | DB2 | Query | Table 1 - 10 million rows, Table 2 - 10k rows, Result - 50 rows | 10 million | 42.5 seconds | 7.5 seconds | 5.66x | Explain Analyze | Explain Analyze |
3 | Postgres | Query | Table 1 - 10 million rows, Table 2 - 10 million rows, Result - 25 rows | 20 million | 81 seconds | 14 seconds | 5.8x | Explain Analyze | Explain Analyze |
4 | Postgres | Query | Table 1 - 45 million rows, Table 2 - 1.5 million rows, Result - 8 rows | 46.5 million | 303 seconds | 56 seconds | 5.5x | Explain Analyze | Explain Analyze |
- Dependency on Predicate Pushdown for Join Pushdown:
- Join pushdown relies on pushing down join criteria as a filter object to the underlying datasource, using the existing filter pushdown capabilities. Therefore, join pushdown inherits all the features and limitations of the current filter pushdown functionality.
- Example: Joins with OR conditions, or cases where Presto doesn't infer a join criterion, cannot be pushed down. The same applies to filters that Presto doesn't currently push down, such as LIKE '%FR%': these filters are also not handled in the join criteria, so such joins will not be pushed down to the datasource.
- Compatibility with Database-Supported Join Queries:
- Jdbc join pushdown only works for join queries that the database can fully understand.
- If a query uses filters, projections, conditions, or special keywords along with a join, Presto may add a function or special operator node to that table. This transformation may prevent the datasource from processing the join, making it ineligible for pushdown.