Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[native] Cancel pending async callback before resetting client and evb #24308

Merged
merged 1 commit into from
Dec 31, 2024

Conversation

yuandagits
Copy link
Contributor

@yuandagits yuandagits commented Dec 30, 2024

Description

There is a rare race at shutdown time when annoucer_->stop() is called which will reset fields like client_ and the evb_. However, this is race-y because the libevent timer on the evb can go-off, and we set stop_ = true even though sendRequest() has progressed beyond that point. Essentially, it suffers from time-of-check to time-of-use race.

To fix, we can cancel all pending events on the evb.

Note, you may wonder why the comment in sendRequest doesn't work.

stop() calls EventBase's destructor which executed all pending callbacks;
make sure not to do anything if that's the case"

This is because client_ depends on evb_, so client must be destructed before evb_ due to dependency issue, otherwise there will be SIGSEGV in client_'s d-tor. This destruction order is actually not desirable because we want the evb_ to finish all of the clean up before client is destructed, but we cannot because of the order issue.

This is why we must explicitly call cancel on the evb_'s timer itself.

Motivation and Context

Impact

Test Plan

Deploy to test cluster and restarted 4 times without any issues.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* ... :pr:`12345`
* ... :pr:`12345`

Hive Connector Changes
* ... :pr:`12345`
* ... :pr:`12345`

If release note is NOT required, use:

== NO RELEASE NOTE ==

@yuandagits yuandagits marked this pull request as ready for review December 30, 2024 20:42
@yuandagits yuandagits requested a review from a team as a code owner December 30, 2024 20:42
@yuandagits yuandagits marked this pull request as draft December 30, 2024 20:49
@yuandagits yuandagits force-pushed the periodicCrash branch 5 times, most recently from a6854ee to 825a127 Compare December 30, 2024 21:53
@yuandagits yuandagits marked this pull request as ready for review December 30, 2024 22:42
Copy link
Contributor

@xiaoxmeng xiaoxmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuandagits thanks for the quick fix!

@xiaoxmeng xiaoxmeng changed the title Cancel pending async callback before resetting client and evb [native] Cancel pending async callback before resetting client and evb Dec 30, 2024
@yuandagits
Copy link
Contributor Author

There appears to be an unrelated test failure

2024-12-30T23:34:35.8331948Z [ERROR] Tests run: 3756, Failures: 1, Errors: 0, Skipped: 91, Time elapsed: 3,563.859 s <<< FAILURE! - in TestSuite
2024-12-30T23:34:35.8333650Z [ERROR] com.facebook.presto.hive.TestHiveHistoryBasedStatsTracking.testHistoryBasedStatsCalculatorMultipleStrategies  Time elapsed: 0.451 s  <<< FAILURE!
2024-12-30T23:34:35.8335143Z java.lang.AssertionError: 
2024-12-30T23:34:35.8335843Z Plan does not match, expected [
2024-12-30T23:34:35.8336114Z 
2024-12-30T23:34:35.8336254Z - anyTree
2024-12-30T23:34:35.8336739Z     - node(ProjectNode)
2024-12-30T23:34:35.8337119Z         expectedOutputRowCount(48.0)
2024-12-30T23:34:35.8337518Z         - node
2024-12-30T23:34:35.8337695Z 
2024-12-30T23:34:35.8337815Z ] but found [
2024-12-30T23:34:35.8337975Z 
2024-12-30T23:34:35.8338931Z - Output[PlanNodeId 6][orderkey, orderpriority, comment, custkey, ds, ts, _col6] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer]
2024-12-30T23:34:35.8340493Z         Estimates: {source: HistoryBasedSourceInfo, rows: 48 (5.32kB), cpu: ?, memory: ?, network: ?}
2024-12-30T23:34:35.8341213Z         _col6 := expr_5 (1:11)
2024-12-30T23:34:35.8342272Z     - RemoteStreamingExchange[PlanNodeId 247][GATHER - COLUMNAR] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer]
2024-12-30T23:34:35.8343635Z             Estimates: {source: CostBasedSourceInfo, rows: 50 (9.44kB), cpu: ?, memory: ?, network: ?}
2024-12-30T23:34:35.8346258Z         - ScanProject[PlanNodeId 0,168][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpch, tableName=test_orders, analyzePartitionValues=Optional.empty}', layout='Optional[tpch.test_orders{domains={orderpriority=[ [["1-URGENT"]] ]}}]'}, projectLocality = LOCAL] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer]
2024-12-30T23:34:35.8349599Z                 Estimates: {source: CostBasedSourceInfo, rows: 50 (9.44kB), cpu: ?, memory: ?, network: ?}/{source: CostBasedSourceInfo, rows: 50 (9.44kB), cpu: ?, memory: ?, network: ?}
2024-12-30T23:34:35.8350621Z                 orderpriority := VARCHAR'1-URGENT'
2024-12-30T23:34:35.8351140Z                 ds := VARCHAR'2020-09-02'
2024-12-30T23:34:35.8351710Z                 expr_5 := INTEGER'2'
2024-12-30T23:34:35.8352370Z                 LAYOUT: tpch.test_orders{domains={orderpriority=[ [["1-URGENT"]] ]}}
2024-12-30T23:34:35.8352939Z                 orderkey := orderkey:bigint:0:REGULAR (1:18)
2024-12-30T23:34:35.8353376Z                 custkey := custkey:bigint:3:REGULAR (1:18)
2024-12-30T23:34:35.8354037Z                 ts := ts:varchar(5):-14:PARTITION_KEY (1:18)
2024-12-30T23:34:35.8357251Z                     :: [["00:02"]]
2024-12-30T23:34:35.8357723Z                 comment := comment:varchar(79):2:REGULAR (1:18)
2024-12-30T23:34:35.8358350Z                 orderpriority:varchar(15):1:REGULAR
2024-12-30T23:34:35.8358773Z                     :: [["1-URGENT"]]
2024-12-30T23:34:35.8359185Z                 ds:varchar(10):-13:PARTITION_KEY
2024-12-30T23:34:35.8359592Z                     :: [["2020-09-02"]]
2024-12-30T23:34:35.8359833Z 
2024-12-30T23:34:35.8359941Z ]
2024-12-30T23:34:35.8360578Z 	at com.facebook.presto.sql.planner.assertions.PlanAssert.assertPlan(PlanAssert.java:57)
2024-12-30T23:34:35.8361677Z 	at com.facebook.presto.sql.planner.assertions.PlanAssert.assertPlan(PlanAssert.java:41)
2024-12-30T23:34:35.8362890Z 	at com.facebook.presto.tests.AbstractTestQueryFramework.lambda$assertPlan$7(AbstractTestQueryFramework.java:509)
2024-12-30T23:34:35.8364221Z 	at com.facebook.presto.transaction.TransactionBuilder.execute(TransactionBuilder.java:151)
2024-12-30T23:34:35.8365469Z 	at com.facebook.presto.tests.AbstractTestQueryFramework.assertPlan(AbstractTestQueryFramework.java:507)
2024-12-30T23:34:35.8366849Z 	at com.facebook.presto.tests.AbstractTestQueryFramework.assertPlan(AbstractTestQueryFramework.java:499)
2024-12-30T23:34:35.8369052Z 	at com.facebook.presto.hive.TestHiveHistoryBasedStatsTracking.testHistoryBasedStatsCalculatorMultipleStrategies(TestHiveHistoryBasedStatsTracking.java:207)
2024-12-30T23:34:35.8370707Z 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2024-12-30T23:34:35.8371586Z 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2024-12-30T23:34:35.8372930Z 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2024-12-30T23:34:35.8373971Z 	at java.lang.reflect.Method.invoke(Method.java:498)
2024-12-30T23:34:35.8374968Z 	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
2024-12-30T23:34:35.8376110Z 	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
2024-12-30T23:34:35.8377141Z 	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
2024-12-30T23:34:35.8378153Z 	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
2024-12-30T23:34:35.8379288Z 	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
2024-12-30T23:34:35.8380452Z 	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
2024-12-30T23:34:35.8381628Z 	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
2024-12-30T23:34:35.8382822Z 	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
2024-12-30T23:34:35.8383860Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2024-12-30T23:34:35.8384921Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2024-12-30T23:34:35.8385681Z 	at java.lang.Thread.run(Thread.java:750)
2024-12-30T23:34:35.8385988Z 
2024-12-30T23:34:38.4884447Z [INFO] 
2024-12-30T23:34:38.4884786Z [INFO] Results:
2024-12-30T23:34:38.4885102Z [INFO] 
2024-12-30T23:34:38.4885386Z [ERROR] Failures: 
2024-12-30T23:34:38.4887597Z [ERROR]   TestHiveHistoryBasedStatsTracking.testHistoryBasedStatsCalculatorMultipleStrategies:207->AbstractTestQueryFramework.assertPlan:499->AbstractTestQueryFramework.assertPlan:507->AbstractTestQueryFramework.lambda$assertPlan$7:509 Plan does not match, expected [
2024-12-30T23:34:38.4889916Z 
2024-12-30T23:34:38.4890054Z - anyTree
2024-12-30T23:34:38.4890439Z     - node(ProjectNode)
2024-12-30T23:34:38.4891504Z         expectedOutputRowCount(48.0)
2024-12-30T23:34:38.4891958Z         - node
2024-12-30T23:34:38.4892143Z 
2024-12-30T23:34:38.4892269Z ] but found [
2024-12-30T23:34:38.4892443Z 
2024-12-30T23:34:38.4893639Z - Output[PlanNodeId 6][orderkey, orderpriority, comment, custkey, ds, ts, _col6] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer]
2024-12-30T23:34:38.4895271Z         Estimates: {source: HistoryBasedSourceInfo, rows: 48 (5.32kB), cpu: ?, memory: ?, network: ?}
2024-12-30T23:34:38.4895964Z         _col6 := expr_5 (1:11)
2024-12-30T23:34:38.4897203Z     - RemoteStreamingExchange[PlanNodeId 247][GATHER - COLUMNAR] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer]
2024-12-30T23:34:38.4898249Z             Estimates: {source: CostBasedSourceInfo, rows: 50 (9.44kB), cpu: ?, memory: ?, network: ?}
2024-12-30T23:34:38.4900221Z         - ScanProject[PlanNodeId 0,168][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpch, tableName=test_orders, analyzePartitionValues=Optional.empty}', layout='Optional[tpch.test_orders{domains={orderpriority=[ [["1-URGENT"]] ]}}]'}, projectLocality = LOCAL] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer]
2024-12-30T23:34:38.4901981Z                 Estimates: {source: CostBasedSourceInfo, rows: 50 (9.44kB), cpu: ?, memory: ?, network: ?}/{source: CostBasedSourceInfo, rows: 50 (9.44kB), cpu: ?, memory: ?, network: ?}
2024-12-30T23:34:38.4902580Z                 orderpriority := VARCHAR'1-URGENT'
2024-12-30T23:34:38.4902835Z                 ds := VARCHAR'2020-09-02'
2024-12-30T23:34:38.4903071Z                 expr_5 := INTEGER'2'
2024-12-30T23:34:38.4903383Z                 LAYOUT: tpch.test_orders{domains={orderpriority=[ [["1-URGENT"]] ]}}
2024-12-30T23:34:38.4903903Z                 orderkey := orderkey:bigint:0:REGULAR (1:18)
2024-12-30T23:34:38.4904295Z                 custkey := custkey:bigint:3:REGULAR (1:18)
2024-12-30T23:34:38.4904570Z                 ts := ts:varchar(5):-14:PARTITION_KEY (1:18)
2024-12-30T23:34:38.4904853Z                     :: [["00:02"]]
2024-12-30T23:34:38.4905102Z                 comment := comment:varchar(79):2:REGULAR (1:18)
2024-12-30T23:34:38.4905382Z                 orderpriority:varchar(15):1:REGULAR
2024-12-30T23:34:38.4905631Z                     :: [["1-URGENT"]]
2024-12-30T23:34:38.4905867Z                 ds:varchar(10):-13:PARTITION_KEY
2024-12-30T23:34:38.4906111Z                     :: [["2020-09-02"]]
2024-12-30T23:34:38.4906263Z 
2024-12-30T23:34:38.4906329Z ]
2024-12-30T23:34:38.4906470Z [INFO] 
2024-12-30T23:34:38.4906686Z [ERROR] Tests run: 3756, Failures: 1, Errors: 0, Skipped: 91
2024-12-30T23:34:38.4906957Z [INFO] 
2024-12-30T23:34:38.4929471Z [INFO] ------------------------------------------------------------------------
2024-12-30T23:34:38.4930338Z [INFO] BUILD FAILURE
2024-12-30T23:34:38.4931027Z [INFO] ------------------------------------------------------------------------
2024-12-30T23:34:38.4941363Z [INFO] Total time:  59:31 min (Wall Clock)
2024-12-30T23:34:38.4942801Z [INFO] Finished at: 2024-12-30T23:34:38Z
2024-12-30T23:34:38.4943340Z [INFO] ------------------------------------------------------------------------
2024-12-30T23:34:38.4958403Z [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M7:test (default-test) on project presto-hive: There are test failures.
2024-12-30T23:34:38.4959307Z [ERROR] 
2024-12-30T23:34:38.4959891Z [ERROR] Please refer to /home/runner/work/presto/presto/presto-hive/target/surefire-reports for the individual test results.
2024-12-30T23:34:38.4960627Z [ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
2024-12-30T23:34:38.4961059Z [ERROR] -> [Help 1]
2024-12-30T23:34:38.4961250Z [ERROR] 
2024-12-30T23:34:38.4961758Z [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
2024-12-30T23:34:38.4962211Z [ERROR] Re-run Maven using the -X switch to enable full debug logging.
2024-12-30T23:34:38.4962514Z [ERROR] 
2024-12-30T23:34:38.4962965Z [ERROR] For more information about the errors and possible solutions, please read the following articles:
2024-12-30T23:34:38.4963564Z [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
2024-12-30T23:34:38.5124890Z ##[error]Process completed with exit code 1.
2024-12-30T23:34:38.5208022Z Post job cleanup.
2024-12-30T23:34:38.6468597Z Post job cleanup.
2024-12-30T23:34:38.7461201Z [command]/usr/bin/git version

@xiaoxmeng xiaoxmeng merged commit 87f81cd into prestodb:master Dec 31, 2024
67 of 68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants