[Improvement] Yarn resources cannot be released immediately after submitting an application type task and stopping it #3273

Closed
2 of 3 tasks
Zzm0809 opened this issue Mar 11, 2024 · 2 comments
Labels: Optimization


Zzm0809 commented Mar 11, 2024

Search before asking

  • I have searched the issues and found no similar issues.

What happened

After an application-mode task is submitted and then stopped, its YARN resources are not released immediately.

What you expected to happen

YARN resources should be released immediately after an application-mode task is stopped.

How to reproduce

Submit a job:

-- -----------------------------------------------------------------
-- @Description: ${1:}
-- @Creator: ${2:}
-- @Create DateTime: ${3:}
-- -----------------------------------------------------------------

-- ADD CUSTOMJAR is a Dinky extension syntax that works like ADD JAR; this is the recommended approach
ADD customjar 'rs:/udf/scps_udf.jar';
--  create temporary function ip2int as 'com.sopei.udf.Ip2Int';

CREATE TABLE demo_log_01 (
    user_id BIGINT,
    item_id BIGINT,
    behavior STRING,
    dt STRING,
    hh STRING
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '1'
);
CREATE TABLE demo_log_02 (
    user_id BIGINT,
    item_id BIGINT,
    behavior STRING,
    dt STRING,
    hh STRING
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '1'
);
CREATE TABLE demo_log_05 (
    user_id BIGINT,
    item_id BIGINT,
    behavior STRING,
    dt STRING,
    hh STRING,
    ip BIGINT
) WITH (
  'connector' = 'print'
);
insert into demo_log_05
select a.user_id,b.item_id,b.behavior,a.dt,a.hh,1 from 
(select  user_id,item_id,behavior,dt,hh from demo_log_01 where dt>='2023-17-02') a 
left join 
(select  user_id,item_id,behavior,dt,hh from demo_log_02) b
on a.user_id = b.user_id;

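Cluster configuration:

[image]

After a successful submission, the application is visible in YARN:

[image]

Stop the task:

[image]
[image]

[image]

Both the Flink job status and the task status on the Dinky side are already canceled.

However, the application still appears in the YARN application list:

[image]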
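
To check whether the application is still registered with YARN, one option is to derive the application ID from one of the container IDs in the JobManager log below and query the YARN CLI. This is a minimal Python sketch, assuming the usual `container_e<epoch>_<clusterTimestamp>_<appSequence>_<attempt>_<container>` ID layout. The helper names `application_id` and `yarn_commands` are hypothetical, and the commands are only built, not executed:

```python
import re

# Assumed container ID layout:
#   container_[e<epoch>_]<clusterTimestamp>_<appSequence>_<attempt>_<container>
CONTAINER_RE = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+$")


def application_id(container_id: str) -> str:
    """Derive the YARN application ID from a container ID (hypothetical helper)."""
    match = CONTAINER_RE.match(container_id)
    if match is None:
        raise ValueError(f"unrecognized container id: {container_id}")
    cluster_ts, app_seq = match.groups()
    return f"application_{cluster_ts}_{app_seq}"


def yarn_commands(container_id: str) -> dict:
    """Build (but do not run) the YARN CLI calls for inspecting or killing the app."""
    app_id = application_id(container_id)
    return {
        "status": ["yarn", "application", "-status", app_id],
        "kill": ["yarn", "application", "-kill", app_id],
    }


if __name__ == "__main__":
    # Container ID taken from the JobManager log in this issue.
    commands = yarn_commands("container_e481_1710116763206_0003_01_000002")
    for name, command in commands.items():
        print(name, "->", " ".join(command))
```

The resulting lists can be passed to `subprocess.run` on a host where the `yarn` client is configured. Killing the application is a workaround only and does not explain why the release is delayed.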

The JobManager (JM) log for this task is as follows:

(Some unrelated log lines omitted...)


2024-03-11 16:25:32,648 INFO  org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] - Processing Event EventType: START_CONTAINER for Container container_e481_1710116763206_0003_01_000002
2024-03-11 16:25:36,435 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Registering TaskManager with ResourceID container_e481_1710116763206_0003_01_000002(hadoop-03:45454) (akka.tcp://flink@hadoop-03:41634/user/rpc/taskmanager_0) at ResourceManager
2024-03-11 16:25:36,457 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_e481_1710116763206_0003_01_000002(hadoop-03:45454) is registered.
2024-03-11 16:25:36,458 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_e481_1710116763206_0003_01_000002(hadoop-03:45454) with resource spec WorkerResourceSpec {cpuCores=1.0, taskHeapSize=384.000mb (402653174 bytes), taskOffHeapSize=0 bytes, networkMemSize=128.000mb (134217730 bytes), managedMemSize=512.000mb (536870920 bytes), numSlots=1} was requested in current attempt. Current pending count after registering: 0.
2024-03-11 16:25:36,596 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: demo_log_01[1] -> Calc[2] (1/1) (221a5269e20bba5711084dac2cb7f76e_cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from SCHEDULED to DEPLOYING.
2024-03-11 16:25:36,597 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying Source: demo_log_01[1] -> Calc[2] (1/1) (attempt #0) with attempt id 221a5269e20bba5711084dac2cb7f76e_cbc357ccb763df2852fee8c4fc7d55f2_0_0 and vertex id cbc357ccb763df2852fee8c4fc7d55f2_0 to container_e481_1710116763206_0003_01_000002 @ hadoop-03 (dataPort=44707) with allocation id 3dd57df6ed13d75470f28c28909074da
2024-03-11 16:25:36,602 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: demo_log_02[4] -> Calc[5] (1/1) (221a5269e20bba5711084dac2cb7f76e_6cdc5bb954874d922eaee11a8e7b5dd5_0_0) switched from SCHEDULED to DEPLOYING.
2024-03-11 16:25:36,602 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying Source: demo_log_02[4] -> Calc[5] (1/1) (attempt #0) with attempt id 221a5269e20bba5711084dac2cb7f76e_6cdc5bb954874d922eaee11a8e7b5dd5_0_0 and vertex id 6cdc5bb954874d922eaee11a8e7b5dd5_0 to container_e481_1710116763206_0003_01_000002 @ hadoop-03 (dataPort=44707) with allocation id 3dd57df6ed13d75470f28c28909074da
2024-03-11 16:25:36,603 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Join[7] -> Calc[8] -> Sink: demo_log_05[9] (1/1) (221a5269e20bba5711084dac2cb7f76e_8b481b930a189b6b1762a9d95a61ada1_0_0) switched from SCHEDULED to DEPLOYING.
2024-03-11 16:25:36,603 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying Join[7] -> Calc[8] -> Sink: demo_log_05[9] (1/1) (attempt #0) with attempt id 221a5269e20bba5711084dac2cb7f76e_8b481b930a189b6b1762a9d95a61ada1_0_0 and vertex id 8b481b930a189b6b1762a9d95a61ada1_0 to container_e481_1710116763206_0003_01_000002 @ hadoop-03 (dataPort=44707) with allocation id 3dd57df6ed13d75470f28c28909074da
2024-03-11 16:25:36,980 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Join[7] -> Calc[8] -> Sink: demo_log_05[9] (1/1) (221a5269e20bba5711084dac2cb7f76e_8b481b930a189b6b1762a9d95a61ada1_0_0) switched from DEPLOYING to INITIALIZING.
2024-03-11 16:25:36,988 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: demo_log_01[1] -> Calc[2] (1/1) (221a5269e20bba5711084dac2cb7f76e_cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from DEPLOYING to INITIALIZING.
2024-03-11 16:25:36,992 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: demo_log_02[4] -> Calc[5] (1/1) (221a5269e20bba5711084dac2cb7f76e_6cdc5bb954874d922eaee11a8e7b5dd5_0_0) switched from DEPLOYING to INITIALIZING.
2024-03-11 16:25:37,267 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: demo_log_02[4] -> Calc[5] (1/1) (221a5269e20bba5711084dac2cb7f76e_6cdc5bb954874d922eaee11a8e7b5dd5_0_0) switched from INITIALIZING to RUNNING.
2024-03-11 16:25:37,268 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: demo_log_01[1] -> Calc[2] (1/1) (221a5269e20bba5711084dac2cb7f76e_cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from INITIALIZING to RUNNING.
2024-03-11 16:25:37,356 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Join[7] -> Calc[8] -> Sink: demo_log_05[9] (1/1) (221a5269e20bba5711084dac2cb7f76e_8b481b930a189b6b1762a9d95a61ada1_0_0) switched from INITIALIZING to RUNNING.
2024-03-11 16:30:58,215 ERROR org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointTriggerHandler [] - Exception occurred in REST handler: Config key [state.savepoints.dir] is not set. Property [target-directory] must be provided.
2024-03-11 16:30:59,226 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job ADDCUSTOMJAR 任务测试 (9040e8ab6c52e768c4db0f7de9366973) switched from state RUNNING to CANCELLING.
2024-03-11 16:30:59,226 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: demo_log_01[1] -> Calc[2] (1/1) (221a5269e20bba5711084dac2cb7f76e_cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from RUNNING to CANCELING.
2024-03-11 16:30:59,229 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: demo_log_02[4] -> Calc[5] (1/1) (221a5269e20bba5711084dac2cb7f76e_6cdc5bb954874d922eaee11a8e7b5dd5_0_0) switched from RUNNING to CANCELING.
2024-03-11 16:30:59,229 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Join[7] -> Calc[8] -> Sink: demo_log_05[9] (1/1) (221a5269e20bba5711084dac2cb7f76e_8b481b930a189b6b1762a9d95a61ada1_0_0) switched from RUNNING to CANCELING.
2024-03-11 16:30:59,248 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: demo_log_01[1] -> Calc[2] (1/1) (221a5269e20bba5711084dac2cb7f76e_cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from CANCELING to CANCELED.
2024-03-11 16:30:59,251 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Join[7] -> Calc[8] -> Sink: demo_log_05[9] (1/1) (221a5269e20bba5711084dac2cb7f76e_8b481b930a189b6b1762a9d95a61ada1_0_0) switched from CANCELING to CANCELED.
2024-03-11 16:30:59,251 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: demo_log_02[4] -> Calc[5] (1/1) (221a5269e20bba5711084dac2cb7f76e_6cdc5bb954874d922eaee11a8e7b5dd5_0_0) switched from CANCELING to CANCELED.
2024-03-11 16:30:59,253 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job ADDCUSTOMJAR 任务测试 (9040e8ab6c52e768c4db0f7de9366973) switched from state CANCELLING to CANCELED.
2024-03-11 16:30:59,253 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping checkpoint coordinator for job 9040e8ab6c52e768c4db0f7de9366973.
2024-03-11 16:30:59,253 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Clearing resource requirements of job 9040e8ab6c52e768c4db0f7de9366973
2024-03-11 16:30:59,258 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 9040e8ab6c52e768c4db0f7de9366973 reached terminal state CANCELED.
2024-03-11 16:30:59,262 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 9040e8ab6c52e768c4db0f7de9366973 has been registered for cleanup in the JobResultStore after reaching a terminal state.
2024-03-11 16:30:59,265 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Stopping the JobMaster for job 'ADDCUSTOMJAR 任务测试' (9040e8ab6c52e768c4db0f7de9366973).
2024-03-11 16:30:59,269 INFO  org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down
2024-03-11 16:30:59,270 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Disconnect TaskExecutor container_e481_1710116763206_0003_01_000002(hadoop-03:45454) because: Stopping JobMaster for job 'ADDCUSTOMJAR 任务测试' (9040e8ab6c52e768c4db0f7de9366973).
2024-03-11 16:30:59,271 INFO  org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [3dd57df6ed13d75470f28c28909074da].
2024-03-11 16:30:59,271 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Close ResourceManager connection 2e3c544f5551f7ebeaf1ff0797293bd6: Stopping JobMaster for job 'ADDCUSTOMJAR 任务测试' (9040e8ab6c52e768c4db0f7de9366973).
2024-03-11 16:30:59,273 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@hadoop-03:41462/user/rpc/jobmanager_2 for job 9040e8ab6c52e768c4db0f7de9366973 from the resource manager.
2024-03-11 16:31:51,966 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - need release 1 workers, current worker number 1, declared worker number 0
2024-03-11 16:31:51,966 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Stopping worker container_e481_1710116763206_0003_01_000002(hadoop-03:45454).
2024-03-11 16:31:51,966 INFO  org.apache.flink.yarn.YarnResourceManagerDriver              [] - Stopping container container_e481_1710116763206_0003_01_000002(hadoop-03:45454).
2024-03-11 16:31:51,967 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Closing TaskExecutor connection container_e481_1710116763206_0003_01_000002(hadoop-03:45454) because: slot manager has determined that the resource is no longer needed
2024-03-11 16:31:51,967 INFO  org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] - Processing Event EventType: STOP_CONTAINER for Container container_e481_1710116763206_0003_01_000002
2024-03-11 16:31:51,984 WARN  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Discard registration from TaskExecutor container_e481_1710116763206_0003_01_000002(hadoop-03:45454) at (akka.tcp://flink@hadoop-03:41634/user/rpc/taskmanager_0) because the framework did not recognize it
2024-03-11 16:33:09,757 ERROR org.dinky.app.util.FlinkAppUtil                              [] - send hook failed,retry later taskId:9,jobId:9040e8ab6c52e768c4db0f7de9366973,ConnectException: 连接超时 (Connection timed out)
2024-03-11 16:35:18,009 ERROR org.dinky.app.util.FlinkAppUtil                              [] - send hook failed,retry later taskId:9,jobId:9040e8ab6c52e768c4db0f7de9366973,ConnectException: 连接超时 (Connection timed out)
2024-03-11 16:37:26,265 ERROR org.dinky.app.util.FlinkAppUtil                              [] - send hook failed,retry later taskId:9,jobId:9040e8ab6c52e768c4db0f7de9366973,ConnectException: 连接超时 (Connection timed out)
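
The timestamps in the log above show about 53 seconds between the job reaching CANCELED and the TaskManager container being stopped, followed by `send hook failed` retries roughly every 128 seconds. The sketch below measures those gaps from log lines. It is only a measurement aid: it assumes the `yyyy-MM-dd HH:mm:ss,SSS` timestamp prefix seen above, the helper names are hypothetical, and the sample lines are abridged copies of the lines in this issue. It does not establish whether the hook retries are what keeps the application alive.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # e.g. 2024-03-11 16:30:59,258

# Abridged excerpt of the JobManager log in this issue (logger names shortened).
SAMPLE_LOG = [
    "2024-03-11 16:30:59,258 INFO  StandaloneDispatcher [] - Job 9040e8ab6c52e768c4db0f7de9366973 reached terminal state CANCELED.",
    "2024-03-11 16:31:51,966 INFO  YarnResourceManagerDriver [] - Stopping container container_e481_1710116763206_0003_01_000002(hadoop-03:45454).",
    "2024-03-11 16:33:09,757 ERROR FlinkAppUtil [] - send hook failed,retry later taskId:9",
    "2024-03-11 16:35:18,009 ERROR FlinkAppUtil [] - send hook failed,retry later taskId:9",
    "2024-03-11 16:37:26,265 ERROR FlinkAppUtil [] - send hook failed,retry later taskId:9",
]


def parse_ts(line: str) -> datetime:
    """Parse the 23-character timestamp prefix of a log line."""
    return datetime.strptime(line[:23], TS_FORMAT)


def first_match(lines, needle: str) -> datetime:
    """Return the timestamp of the first line containing `needle`."""
    for line in lines:
        if needle in line:
            return parse_ts(line)
    raise ValueError(f"no log line contains {needle!r}")


def summarize(lines) -> dict:
    """Compute the cancel-to-container-stop gap and the hook retry intervals in seconds."""
    canceled = first_match(lines, "reached terminal state CANCELED")
    stopped = first_match(lines, "Stopping container")
    hooks = [parse_ts(line) for line in lines if "send hook failed" in line]
    return {
        "cancel_to_container_stop_s": (stopped - canceled).total_seconds(),
        "hook_retry_intervals_s": [
            (later - earlier).total_seconds()
            for earlier, later in zip(hooks, hooks[1:])
        ],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

Running `summarize` on a full log file (for example `summarize(open(path, encoding="utf-8").read().splitlines())`, with `path` pointing to a local copy of the log) gives the same measurements for other runs.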


Anything else

No response

Version

dev

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Zzm0809 added the Bug and Waiting for reply labels Mar 11, 2024
Zzm0809 added the Optimization label and removed the Bug and Waiting for reply labels Mar 11, 2024
Zzm0809 changed the title from "[Bug] Yarn resources cannot be released immediately after submitting an application type task and stopping it" to "[Improvement] Yarn resources cannot be released immediately after submitting an application type task and stopping it" Mar 11, 2024
Zzm0809 added this to the 1.0.1 milestone Mar 12, 2024

Zzm0809 commented Mar 13, 2024

This may be a problem with my own YARN setup, so I am closing this issue.

@Zzm0809 Zzm0809 closed this as completed Mar 13, 2024
@data-server

@Zzm0809 Has this problem been resolved? I have run into the same issue.
