Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[native] Add native plan checker and native endpoint for Velox plan conversion #23596

Merged
merged 2 commits into from
Dec 2, 2024

Conversation

BryanCutler
Copy link
Contributor

@BryanCutler BryanCutler commented Sep 6, 2024

Description

This adds a provider for a native plan checker that will send a plan fragment to the native sidecar where it is validated by performing a conversion to a Velox plan. It is added to the native sidecar plugin and is enabled with the config native-plan-checker.plan-validation-enabled=true from filename etc/plan-checker-providers/native-plan-checker.properties.

A new endpoint is added to the native worker to accept the plan node fragment and return an ok response if the conversion is successful and an error response with an HTTP code 422 if failed, along with the failure information.

This follows the addition from #23649 so that when eager-plan-validation-enabled=true then a query can be validated while queued for correctness on a native worker.

RFC for this addition has been discussed and merged at https://github.com/prestodb/rfcs/blob/main/RFC-0008-plan-checker.md

Motivation and Context

This is a useful addition to help fail queries that are incompatible with Velox before execution begins and cluster resources have been allocated.

Impact

Native plan checker enabled with config native-plan-checker.plan-validation-enabled=true from filename etc/plan-checker-providers/native-plan-checker.properties. The native plan checker is disabled by default.

Test Plan

Added unit tests for Java and C++, as well as manual testing of queries that currently fail on a native worker.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Prestissimo (Native Execution) Changes
* Add native plan checker to the native sidecar plugin and native endpoint for Velox plan conversion. :pr:`23596`

Copy link

linux-foundation-easycla bot commented Sep 6, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@BryanCutler
Copy link
Contributor Author

Making this PR a draft as recent discussions will likely require a change to send the PlanFragment instead of only the PlanNode portion, and the response from the native endpoint would be more useful to also return the converted Velox plan.

cc @tdcmeehan @rschlussel

@BryanCutler
Copy link
Contributor Author

I have separated out the first part of this, adding the session property to enable building and validating the plan eagerly at #23649

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Changes to SPI to support custom PlanChecker](https://github.com/prestodb/presto/pull/23596/commits/27b0c2ef2b4758de723712d8ad0cc142d9757b02)

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes to Java Presto main and server to support custom PlanChecker SPI

@@ -371,6 +377,8 @@ else if (serverConfig.isCoordinator()) {
driftClientBinder(binder).bindDriftClient(ThriftServerInfoClient.class, ForNodeManager.class)
.withAddressSelector(((addressSelectorBinder, annotation, prefix) ->
addressSelectorBinder.bind(AddressSelector.class).annotatedWith(annotation).to(FixedAddressSelector.class)));
// NodeManager instance for plugins to use
binder.bind(NodeManager.class).to(ConnectorAwareNodeManager.class).in(Scopes.SINGLETON);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than inject this, I would just construct it where needed.

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add plan checker provider and plugin for NativePlanChecker


// Create static taskId and empty TableWriteInfo needed for plan conversion
protocol::TaskId taskId = "velox-plan-conversion.0.0.0";
auto tableWriteInfo = std::make_shared<protocol::TableWriteInfo>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: this causes a problem for create table, need to address

Copy link
Contributor Author

@BryanCutler BryanCutler Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be a problem anymore, correct me if I'm missing anything here

I just encountered this again, so looks like there is still an issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed with @tdcmeehan and since TableWriteInfo is not built until after the planning stage, we need to have an alternative to handle fragments with a TableWriterNode. Some alternatives discussed:

  1. Adding a field to the TableWriteInfo to indicate that the converter should proceed and skip the required fields, like ExecutionWriterTarget
  2. Make a dummy TableWriteInfo with the fields populated enough to pass the conversion
  3. Skip the TableWriteNode portion of the fragment, the source node can continue with the conversion

All of these options would result in an incomplete conversion of a fragment with a TableWriteNode, but since that is not available at the time the plan validation happens, there isn't much we can do. Here I implemented option (3) which is the least intrusive to the plan conversion.

if (error.empty()) {
http::sendOkResponse(downstream, json(R"({ "status": "ok" })"));
} else {
http::sendErrorResponse(downstream, json(R"({ "status": "error", "message": ")" + error + R"(")})"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this look like on the user side. e.g. can we report different kinds of error messages depending on what the failure was? For regular tasks, i think we send and OkResponse for most kinds of failures, and put the error information in the TaskInfo. Maybe we should do something similar here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make the response however you think is best. I didn't put too much thought into the message since we discussed that this would eventually be sending the converted plan back. Are you saying if the plan validation fails you would send an OkResponse and not an ErrorResponse?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i think we should follow the model for how errors get passed along in general for native workers, which i think is an OkResponse with the error information in the TaskInfo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rschlussel if I understand correctly, you are asking for the response to be a TaskInfo message with ExecutionFailureInfo to be populated if the conversion fails?

On the Java side, TaskInfo and many of the related classes are not part of the SPI, so things will need to be moved around. Is that what we want here?

On the C++ side, there is lots of information in the TaskInfo struct, I'm not sure how much is really useful in this case and if it needs to be populated with the TaskManager and have a corresponding PrestoTask.

Would it make sense to define a simple response message? cc @tdcmeehan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, the difference with tasks is that 4xx is returned when we can not create the task. What is being reported by the TaskInfo is whether or not the already successfully created task ended up failing. That should be returned as a 200, because the call to check on the task status itself succeeded, and it is merely reporting the underlying failure. When we go to update or create the task, when 4xx is returned, we are returning that only when the task cannot be created.

In this case, what is being returned is the converted plan. That should probably return code 422, because we are returning whether or not the conversion failed. Like all 4xx error codes, this indicates the problem is with the input. It's similar to a task update or creation failing.

String responseBody = response.body() != null ? response.body().string() : "{}";
LOG.error("Native plan checker failed with code: %d, response: %s", response.code(), responseBody);
if (config.isQueryFailOnError()) {
throw new PrestoException(QUERY_REJECTED, "Query failed by native plan checker with code: " + response.code() + ", response: " + responseBody);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QUERY_REJECTED seems like the wrong error code. We should propagate the error from the side car that caused the failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean the response from the side car should have an error code we use here instead of QUERY_REJECTED? The error message is propagated and shown to the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error that the user sees shouldn't be QUERY_REJECTED. It should be whatever error caused the plan checking to fail.

try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful()) {
String responseBody = response.body() != null ? response.body().string() : "{}";
LOG.error("Native plan checker failed with code: %d, response: %s", response.code(), responseBody);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think we need to log the error if we're failing the query with that error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fine, I just want to make sure it's clear that if failed due to native plan checker, so if the exception message contains that and is logged, then that should be fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it doesn't, it should be changed so that it does. It's much more helpful to have the failure reason show up in the query error than to have to scour the coordinator logs for any relevant error messages.

steveburnett
steveburnett previously approved these changes Oct 4, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull branch, new local doc build, looks good. Thanks!

BryanCutler added a commit to BryanCutler/presto that referenced this pull request Oct 9, 2024
Adding new SPI to support plugin custom plan checkers to be provided
for validating intermediate, final, and fragment stages of a logical
plan. The motivation for this is to allow for a native plan checker
that will eagerly validate a plan on the native sidecar to quickly
fail the query if there are incompatibilities.

See also: prestodb#23596
RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0008-plan-checker.md
BryanCutler added a commit to BryanCutler/presto that referenced this pull request Oct 9, 2024
This adds a new SPI to support plugin custom plan checkers to be
provided for validating intermediate, final, and fragment stages of
a logical plan. The motivation for this is to allow for a native plan
checker that will eagerly validate a plan on the native sidecar to
quickly fail the query if there are incompatibilities.

Add unit test to verify that a queued query can be validated
and fail while queued. This is done by using the new custom
plan checker SPI to add plugin that will trigger a failure when
validating the plan.

See also: prestodb#23596
RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0008-plan-checker.md
BryanCutler added a commit that referenced this pull request Oct 10, 2024
This adds a new SPI to support plugin custom plan checkers to be
provided for validating intermediate, final, and fragment stages of
a logical plan. The motivation for this is to allow for a native plan
checker that will eagerly validate a plan on the native sidecar to
quickly fail the query if there are incompatibilities.

Add unit test to verify that a queued query can be validated
and fail while queued. This is done by using the new custom
plan checker SPI to add plugin that will trigger a failure when
validating the plan.

See also: #23596
RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0008-plan-checker.md
@BryanCutler
Copy link
Contributor Author

BryanCutler commented Oct 13, 2024

Apologies for the delay in response @tdcmeehan and @rschlussel , I was focused on getting the other pr ready. I think I addressed most of this feedback, just a couple remaining things, and I'll do some testing then remove from draft soon.

@BryanCutler BryanCutler marked this pull request as ready for review October 14, 2024 04:43
@BryanCutler BryanCutler changed the title [WIP] Add SPI for custom plan checker and plugin for native plan validation [Native] Add SPI for custom plan checker and plugin for native plan validation Oct 14, 2024
Joe-Abraham pushed a commit to Joe-Abraham/presto that referenced this pull request Oct 15, 2024
This adds a new SPI to support plugin custom plan checkers to be
provided for validating intermediate, final, and fragment stages of
a logical plan. The motivation for this is to allow for a native plan
checker that will eagerly validate a plan on the native sidecar to
quickly fail the query if there are incompatibilities.

Add unit test to verify that a queued query can be validated
and fail while queued. This is done by using the new custom
plan checker SPI to add plugin that will trigger a failure when
validating the plan.

See also: prestodb#23596
RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0008-plan-checker.md
Joe-Abraham pushed a commit to Joe-Abraham/presto that referenced this pull request Oct 15, 2024
This adds a new SPI to support plugin custom plan checkers to be
provided for validating intermediate, final, and fragment stages of
a logical plan. The motivation for this is to allow for a native plan
checker that will eagerly validate a plan on the native sidecar to
quickly fail the query if there are incompatibilities.

Add unit test to verify that a queued query can be validated
and fail while queued. This is done by using the new custom
plan checker SPI to add plugin that will trigger a failure when
validating the plan.

See also: prestodb#23596
RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0008-plan-checker.md
@aditi-pandit
Copy link
Contributor

@BryanCutler : Also please prefix the title of your second commit with [native] as well.

Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more minor comments.

@@ -75,6 +78,16 @@ void sendErrorResponse(
.sendWithEOM();
}

void sendErrorJsonResponse(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is both a json and thrift for OkResponse, but the error has only json. Should we consider adding a thrift response as well ?

@tdcmeehan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what this means, but to be clear, we'll only send JSON back from this endpoint, never Thrift.

Copy link
Contributor

@aditi-pandit aditi-pandit Nov 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the error response matters only if the native side-car will work with both JSON and Thrift.
Prestissimo has support for both formats for the control API. Its likely the thrift usage is legacy.

But if its not too much work, lets add a Thrift error response version for consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aditi-pandit , just so I understand what you're asking for a little better - are you asking for an additional API in HttpServer that can send an error formatted as Thrift, e.g. void sendErrorThriftResponse(proxygen::ResponseHandler* downstream,const std::string& body) or are you also asking that the /v1/velox/plan endpoint could also send responses back in Thrift format?

If it's the latter, I think there would have to be some additional conversion of the message to Thrift and it doesn't sound like it would be useful since only JSON is being used for this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler : It is the former.

It's not worth changing /v1/velox/plan to send responses in Thrift format.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "VeloxPlanConversion.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cpp files always include the complete path of the headers even if in the local directory.

aditi-pandit
aditi-pandit previously approved these changes Nov 27, 2024
Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler for working through the iterations.

tdcmeehan
tdcmeehan previously approved these changes Dec 2, 2024
This adds a provider for a native plan checker that will send a plan fragment
to the native sidecar where it is validated by performing a conversion to a
Velox plan. If the conversion succeeds the query will continue, if it fails
then the query will fail with an error from the native sidecar. The provider
is added to the native sidecar plugin and is enabled with the config
`native-plan-checker.plan-validation-enabled=true` from filename
`etc/plan-checker-providers/native-plan-checker.properties`.

See also: prestodb#23649
RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0008-plan-checker.md
This adds an endpoint to the native Presto server that will convert
a Presto plan fragment to Velox. If the conversion is successful,
the server will send an ok response. If it fails, the server will
send an error response with a 422 status code as unprocessable.
The error message will contain a PlanConversionFailureInfo struct
with error type, code and message.

See also prestodb#23649
RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0008-plan-checker.md
@tdcmeehan tdcmeehan merged commit 4479bcc into prestodb:master Dec 2, 2024
60 checks passed
@BryanCutler BryanCutler deleted the native-fail-fast branch December 3, 2024 01:02
@BryanCutler
Copy link
Contributor Author

Thanks @tdcmeehan @rschlussel and @aditi-pandit for reviewing!

BryanCutler added a commit to BryanCutler/presto that referenced this pull request Dec 3, 2024
The previous commit to add the native plan checker mistakenly removed
the binding for creating a SessionPropertyManager. This adds it back
into PrestoServer.

See prestodb#23596
BryanCutler added a commit to BryanCutler/presto that referenced this pull request Dec 3, 2024
The previous commit to add the native plan checker mistakenly removed
the loading of SessionPropertyManager. This adds it back the call to
loadSessionPropertyProviders into PrestoServer.

See prestodb#23596
BryanCutler added a commit that referenced this pull request Dec 3, 2024
The previous commit to add the native plan checker mistakenly removed
the loading of SessionPropertyManager. This adds it back the call to
loadSessionPropertyProviders into PrestoServer.

See #23596
@rschlussel
Copy link
Contributor

This has broken our usage of the plan checker spi because the method to get the plan checkers was moved from Plugin to CoordinatorPlugin, but this fix prestodb/presto-maven-plugin#19 hasn't been released yet. I may revert this change temporarily depending on how quickly that release is ready.

@tdcmeehan
Copy link
Contributor

@rschlussel can you add the service descriptor yourself manually? The presto-maven-plugin makes plugin code generation more convenient convenient, but it shouldn't be a blocker, because the only requirement is that a service descriptor META-INF/services/com.facebook.presto.spi.CoordinatorPlugin be present which contains the fully qualified class name of the module's plugin.

@rschlussel
Copy link
Contributor

I see the native sidecar module uses a dummy plugin to work around the issue (that's needed in addition to adding the meta-inf resources file). We can do the same.

@tdcmeehan
Copy link
Contributor

@rschlussel yes, unfortunately that would be required if you were loading the plugin as a part of a multi-module build and loading the plugin by reading its Pom file. If the plugin were loaded as a JAR then it wouldn't be required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants