
K8s deployment of a Kuscia centralized cluster, RunP mode: privacy-preserving computation fails #427

Closed

Meng-xiangkun opened this issue Sep 13, 2024 · 36 comments

@Meng-xiangkun

Issue Type

Feature

Search for existing issues similar to yours

Yes

Kuscia Version

0.10.0b0

Link to Relevant Documentation

No response

Question Details

This error occurs when running a private set intersection (PSI) job with the kuscia-secretflow:latest image:
Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry
Failed to update kuscia job "dppm" status, Operation cannot be fulfilled on kusciajobs.kuscia.secretflow "dppm": the object has been modified; please apply your changes to the latest version and try again
@Meng-xiangkun
Author

(screenshot)

2024-09-12 18:30:34.303 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.317 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.317 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (13.420693ms)

2024-09-12 18:30:34.317 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (13.470899ms)

2024-09-12 18:30:34.317 INFO resources/kusciajob.go:82 update kuscia job dppm

2024-09-12 18:30:34.329 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (12.672843ms)

2024-09-12 18:30:34.330 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.343 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.343 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (13.248207ms)

2024-09-12 18:30:34.343 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (13.29884ms)

2024-09-12 18:30:34.345 INFO handler/job_scheduler.go:323 Create kuscia tasks: dppm-qvxgwzap-node-35

2024-09-12 18:30:34.357 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.369 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing

2024-09-12 18:30:34.369 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry

2024-09-12 18:30:34.370 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.370 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (25.113735ms)

2024-09-12 18:30:34.370 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (25.15742ms)

2024-09-12 18:30:34.370 INFO handler/job_scheduler.go:661 jobStatusPhaseFrom readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=}}, kusciaJobId=dppm

2024-09-12 18:30:34.370 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.383 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing

2024-09-12 18:30:34.383 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry

2024-09-12 18:30:34.385 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.386 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (15.795756ms)

2024-09-12 18:30:34.386 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (15.879731ms)

2024-09-12 18:30:34.388 INFO handler/job_scheduler.go:661 jobStatusPhaseFrom readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=}}, kusciaJobId=dppm

2024-09-12 18:30:34.388 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (488.279µs)

2024-09-12 18:30:34.399 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing

2024-09-12 18:30:34.399 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry

2024-09-12 18:30:34.423 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing

2024-09-12 18:30:34.424 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry

2024-09-12 18:30:34.472 INFO resources/kusciatask.go:69 Start updating kuscia task "dppm-qvxgwzap-node-35" status

2024-09-12 18:30:34.488 INFO resources/kusciatask.go:71 Finish updating kuscia task "dppm-qvxgwzap-node-35" status

2024-09-12 18:30:34.488 INFO kusciatask/controller.go:521 Finished syncing kusciatask "dppm-qvxgwzap-node-35" (24.193535ms)

2024-09-12 18:30:34.490 INFO handler/job_scheduler.go:661 jobStatusPhaseFrom readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=Failed}}, kusciaJobId=dppm

2024-09-12 18:30:34.490 INFO handler/job_scheduler.go:679 jobStatusPhaseFrom failed readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=Failed}}, kusciaJobId=dppm

2024-09-12 18:30:34.491 WARN handler/failed_handler.go:62 Get task resource group dppm-qvxgwzap-node-35 failed, skip setting its status to failed, taskresourcegroup.kuscia.secretflow "dppm-qvxgwzap-node-35" not found

2024-09-12 18:30:34.491 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.491 INFO resources/kusciatask.go:69 Start updating kuscia task "dppm-qvxgwzap-node-35" status

2024-09-12 18:30:34.505 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.505 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (14.950352ms)

2024-09-12 18:30:34.505 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (14.972553ms)

2024-09-12 18:30:34.510 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.510 INFO resources/kusciatask.go:71 Finish updating kuscia task "dppm-qvxgwzap-node-35" status

2024-09-12 18:30:34.510 INFO kusciatask/controller.go:521 Finished syncing kusciatask "dppm-qvxgwzap-node-35" (19.491329ms)

2024-09-12 18:30:34.510 INFO kusciatask/controller.go:489 KusciaTask "dppm-qvxgwzap-node-35" was finished, skipping

2024-09-12 18:30:34.523 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.523 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (13.33302ms)

2024-09-12 18:30:34.523 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (13.376915ms)

2024-09-12 18:30:34.523 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.534 WARN resources/kusciajob.go:122 Failed to update kuscia job "dppm" status, Operation cannot be fulfilled on kusciajobs.kuscia.secretflow "dppm": the object has been modified; please apply your changes to the latest version and try again

2024-09-12 18:30:34.542 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.554 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.555 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (31.853225ms)

2024-09-12 18:30:34.555 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (31.901265ms)

2024-09-12 18:30:34.555 INFO handler/job_scheduler.go:700 KusciaJob dppm was finished, skipping

2024-09-12 18:30:34.555 INFO kusciajob/controller.go:266 KusciaJob "dppm" should not reconcile again, skipping

2024-09-12 18:30:34.555 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (111.519µs)

@lanyy9527

The error log shows that the "secretflow-image" AppImage is missing. You can check whether it exists with kuscia get appimage; if it does exist, please also provide the engine pod logs.

@Meng-xiangkun
Author

The error log shows that the "secretflow-image" AppImage is missing. You can check whether it exists with kuscia get appimage; if it does exist, please also provide the engine pod logs.

(screenshot)
Is this what you mean?

@lanyy9527

You can add your namespace (-n <name>), or use -A to list all namespaces.
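The two suggestions above can be combined into a quick check; a minimal sketch, assuming kubectl is run where it can reach the master's apiserver (resource and object names below come from this thread):

```shell
# List AppImage resources across all namespaces.
kubectl get appimage -A
# If secretflow-image shows up, dump it to confirm the engine image name and tag it references.
kubectl get appimage secretflow-image -o yaml
```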

@Meng-xiangkun
Author

You can add your namespace (-n <name>), or use -A to list all namespaces.

(screenshot)
Still the same result.

@Meng-xiangkun
Author

The error log shows that the "secretflow-image" AppImage is missing. You can check whether it exists with kuscia get appimage; if it does exist, please also provide the engine pod logs.

Detailed task information:

sh-4.4# kubectl get kt jaqj-qvxgwzap-node-35 -n cross-domain -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2024-09-12T10:49:29Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-id: jaqj
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: jaqj-qvxgwzap-node-35
  name: jaqj-qvxgwzap-node-35
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: jaqj
    uid: 9a2a5920-c23d-409d-afdc-14d82e5e53e4
  resourceVersion: "14340"
  uid: 73a41e0d-4b9d-4d03-b5eb-261efb760b15
spec:
  initiator: bob
  parties:
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: |-
    {
      "sf_datasource_config": {
        "bob": {
          "id": "default-data-source"
        },
        "alice": {
          "id": "default-data-source"
        }
      },
      "sf_cluster_desc": {
        "parties": ["bob", "alice"],
        "devices": [{
          "name": "spu",
          "type": "spu",
          "parties": ["bob", "alice"],
          "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
        }, {
          "name": "heu",
          "type": "heu",
          "parties": ["bob", "alice"],
          "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
        }],
        "ray_fed_config": {
          "cross_silo_comm_backend": "brpc_link"
        }
      },
      "sf_node_eval_param": {
        "domain": "data_prep",
        "name": "psi",
        "version": "0.0.5",
        "attr_paths": ["input/receiver_input/key", "input/sender_input/key", "protocol", "sort_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "fill_value_int", "ecdh_curve"],
        "attrs": [{
          "is_na": false,
          "ss": ["id1"]
        }, {
          "is_na": false,
          "ss": ["id2"]
        }, {
          "is_na": false,
          "s": "PROTOCOL_RR22"
        }, {
          "b": true,
          "is_na": false
        }, {
          "is_na": false,
          "s": "no"
        }, {
          "is_na": true
        }, {
          "is_na": true
        }, {
          "is_na": false,
          "s": "CURVE_FOURQ"
        }],
        "inputs": [{
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "alice.csv",
            "party": "alice",
            "format": "csv"
          }]
        }, {
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "bob.csv",
            "party": "bob",
            "format": "csv"
          }]
        }],
        "checkpoint_uri": "ckjaqj-qvxgwzap-node-35-output-0"
      },
      "sf_output_uris": ["jaqj-qvxgwzap-node-35-output-0"],
      "sf_input_ids": ["alice-table", "bob-table"],
      "sf_output_ids": ["jaqj-qvxgwzap-node-35-output-0"]
    }
status:
  completionTime: "2024-09-12T10:49:29Z"
  conditions:
  - lastTransitionTime: "2024-09-12T10:49:29Z"
    message: Failed to create kusciaTask related resources, failed to build domain
      bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow
      "secretflow-image" not found
    reason: KusciaTaskCreateFailed
    status: "False"
    type: ResourceCreated
  lastReconcileTime: "2024-09-12T10:49:29Z"
  message: 'KusciaTask failed after 3x retry, last error: failed to build domain bob
    kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow
    "secretflow-image" not found'
  phase: Failed
  startTime: "2024-09-12T10:49:29Z"

@yushiqie
Contributor

Double-check the node deployment steps; the AppImage needs to be created manually: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/deployment/K8s_deployment_kuscia/K8s_master_lite_cn#appimage
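Per the linked deployment doc, the registration step looks roughly like this; a sketch assuming AppImage.yaml is the manifest taken from that doc:

```shell
# Register the AppImage that the KusciaTask references via appImageRef.
kubectl apply -f AppImage.yaml
# Confirm the controller can now resolve it.
kubectl get appimage secretflow-image
```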

@Meng-xiangkun
Author

Double-check the node deployment steps; the AppImage needs to be created manually: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/deployment/K8s_deployment_kuscia/K8s_master_lite_cn#appimage

(screenshot)
The file does not exist. Where should this file be placed?


@lanyy9527

(screenshot) The file does not exist. Where should this file be placed?

Check whether the AppImage.yaml file exists in your current directory.
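A quick way to verify, run from the directory where the deployment commands were executed (a minimal sketch):

```shell
# Print the file if present, otherwise report where we looked.
ls -l AppImage.yaml 2>/dev/null || echo "AppImage.yaml not found in $(pwd)"
```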

@Meng-xiangkun
Author

Meng-xiangkun commented Sep 13, 2024

(screenshot) The file does not exist. Where should this file be placed?

Check whether the AppImage.yaml file exists in your current directory.

I don't have this file. Should I upload one, and where should it be placed?

@Meng-xiangkun
Author

The issue above is resolved. Now the task stays pending because it cannot fetch the secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8 image. Is there another way to obtain this image besides pulling from secretflow-registry.cn-hangzhou.cr.aliyuncs.com? The cluster environment does not allow pulling external images.

sh-4.4# kubectl get kt -n cross-domain
NAME                    STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
fpvu-alice              3m30s       3m30s            3m30s               Failed
gere-bob                3m30s       3m30s            3m30s               Failed
alzf-qvxgwzap-node-35   2m36s                        2m18s               Pending
sh-4.4# kubectl get kt alzf-qvxgwzap-node-35 -n cross-domain -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2024-09-13T06:27:39Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-id: alzf
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: alzf-qvxgwzap-node-35
  name: alzf-qvxgwzap-node-35
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: alzf
    uid: 1c3bf688-1a1d-4ba1-98dc-9239ec113ebd
  resourceVersion: "2736"
  uid: 7a1f8356-82da-44f4-8b10-cab10b0a87be
spec:
  initiator: bob
  parties:
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: |-
    {
      "sf_datasource_config": {
        "bob": {
          "id": "default-data-source"
        },
        "alice": {
          "id": "default-data-source"
        }
      },
      "sf_cluster_desc": {
        "parties": ["bob", "alice"],
        "devices": [{
          "name": "spu",
          "type": "spu",
          "parties": ["bob", "alice"],
          "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
        }, {
          "name": "heu",
          "type": "heu",
          "parties": ["bob", "alice"],
          "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
        }],
        "ray_fed_config": {
          "cross_silo_comm_backend": "brpc_link"
        }
      },
      "sf_node_eval_param": {
        "domain": "data_prep",
        "name": "psi",
        "version": "0.0.5",
        "attr_paths": ["input/receiver_input/key", "input/sender_input/key", "protocol", "sort_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "fill_value_int", "ecdh_curve"],
        "attrs": [{
          "is_na": false,
          "ss": ["id1"]
        }, {
          "is_na": false,
          "ss": ["id2"]
        }, {
          "is_na": false,
          "s": "PROTOCOL_RR22"
        }, {
          "b": true,
          "is_na": false
        }, {
          "is_na": false,
          "s": "no"
        }, {
          "is_na": true
        }, {
          "is_na": true
        }, {
          "is_na": false,
          "s": "CURVE_FOURQ"
        }],
        "inputs": [{
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "alice.csv",
            "party": "alice",
            "format": "csv"
          }]
        }, {
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "bob.csv",
            "party": "bob",
            "format": "csv"
          }]
        }],
        "checkpoint_uri": "ckalzf-qvxgwzap-node-35-output-0"
      },
      "sf_output_uris": ["alzf-qvxgwzap-node-35-output-0"],
      "sf_input_ids": ["alice-table", "bob-table"],
      "sf_output_ids": ["alzf-qvxgwzap-node-35-output-0"]
    }
status:
  allocatedPorts:
  - domainID: alice
    namedPort:
      alzf-qvxgwzap-node-35-0/client-server: 31454
      alzf-qvxgwzap-node-35-0/fed: 31450
      alzf-qvxgwzap-node-35-0/global: 31451
      alzf-qvxgwzap-node-35-0/node-manager: 31452
      alzf-qvxgwzap-node-35-0/object-manager: 31453
      alzf-qvxgwzap-node-35-0/spu: 31449
  - domainID: bob
    namedPort:
      alzf-qvxgwzap-node-35-0/client-server: 32739
      alzf-qvxgwzap-node-35-0/fed: 32741
      alzf-qvxgwzap-node-35-0/global: 32742
      alzf-qvxgwzap-node-35-0/node-manager: 32737
      alzf-qvxgwzap-node-35-0/object-manager: 32738
      alzf-qvxgwzap-node-35-0/spu: 32740
  conditions:
  - lastTransitionTime: "2024-09-13T06:27:39Z"
    status: "True"
    type: ResourceCreated
  lastReconcileTime: "2024-09-13T06:27:57Z"
  phase: Pending
  podStatuses:
    alice/alzf-qvxgwzap-node-35-0:
      createTime: "2024-09-13T06:27:39Z"
      message: 'container[secretflow] waiting state reason: "ImageInspectError", message:
        "Failed to inspect image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\":
        failed to get image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        manifest, detail-> image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        not exist in local repository"'
      namespace: alice
      nodeName: kuscia-lite-alice-9b7cdf6fd-l8dt5
      podName: alzf-qvxgwzap-node-35-0
      podPhase: Pending
      reason: ImageInspectError
      startTime: "2024-09-13T06:27:41Z"
    bob/alzf-qvxgwzap-node-35-0:
      createTime: "2024-09-13T06:27:39Z"
      message: 'container[secretflow] waiting state reason: "ImageInspectError", message:
        "Failed to inspect image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\":
        failed to get image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        manifest, detail-> image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        not exist in local repository"'
      namespace: bob
      nodeName: kuscia-lite-bob-7df5b89f5-vcrl9
      podName: alzf-qvxgwzap-node-35-0
      podPhase: Pending
      reason: ImageInspectError
      startTime: "2024-09-13T06:27:41Z"
  serviceStatuses:
    alice/alzf-qvxgwzap-node-35-0-fed:
      createTime: "2024-09-13T06:27:39Z"
      namespace: alice
      portName: fed
      portNumber: 31450
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-fed
    alice/alzf-qvxgwzap-node-35-0-global:
      createTime: "2024-09-13T06:27:39Z"
      namespace: alice
      portName: global
      portNumber: 31451
      readyTime: "2024-09-13T06:27:41Z"
      scope: Domain
      serviceName: alzf-qvxgwzap-node-35-0-global
    alice/alzf-qvxgwzap-node-35-0-spu:
      createTime: "2024-09-13T06:27:39Z"
      namespace: alice
      portName: spu
      portNumber: 31449
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-spu
    bob/alzf-qvxgwzap-node-35-0-fed:
      createTime: "2024-09-13T06:27:39Z"
      namespace: bob
      portName: fed
      portNumber: 32741
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-fed
    bob/alzf-qvxgwzap-node-35-0-global:
      createTime: "2024-09-13T06:27:39Z"
      namespace: bob
      portName: global
      portNumber: 32742
      readyTime: "2024-09-13T06:27:41Z"
      scope: Domain
      serviceName: alzf-qvxgwzap-node-35-0-global
    bob/alzf-qvxgwzap-node-35-0-spu:
      createTime: "2024-09-13T06:27:39Z"
      namespace: bob
      portName: spu
      portNumber: 32740
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-spu
  startTime: "2024-09-13T06:27:39Z"

@yushiqie
Contributor

In kuscia 0.10.x, the RunP container runtime does not support pulling task images dynamically. You can take one of the following approaches:

  1. Build kuscia and secretflow into a single image via docker build -f kuscia-secretflow.Dockerfile . (kuscia-secretflow.Dockerfile)
  2. Upgrade kuscia to 0.11.x
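Option 1 above can be sketched as follows, assuming the kuscia repo is checked out on the release/0.10.x branch (the output tag below is illustrative, not a required name):

```shell
# Build a combined kuscia + secretflow image from the repo root.
docker build \
  -f build/dockerfile/kuscia-secretflow.Dockerfile \
  -t kuscia-secretflow:0.10.0 \
  .
```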

@Meng-xiangkun
Author

In kuscia 0.10.x, the RunP container runtime does not support pulling task images dynamically. You can take one of the following approaches:

  1. Build kuscia and secretflow into a single image via docker build -f kuscia-secretflow.Dockerfile . (kuscia-secretflow.Dockerfile)
  2. Upgrade kuscia to 0.11.x

ERROR: failed to solve: secretflow/anolis8-python:3.10.13: failed to resolve source metadata for docker.io/secretflow/anolis8-python:3.10.13: failed to do request: Head "https://registry-1.docker.io/v2/secretflow/anolis8-python/manifests/3.10.13": dial tcp 108.160.169.185:443: connect: connection refused
Is there another address this image can be pulled from?

@yushiqie
Contributor

You can use secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/anolis8-python:3.10.13 instead.
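Since the build fails resolving docker.io/secretflow/anolis8-python:3.10.13, one workaround is to pre-pull the Aliyun copy and retag it under the name the Dockerfile expects; a sketch:

```shell
# Pull the base image from the Aliyun mirror, which is reachable from this network.
docker pull secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/anolis8-python:3.10.13
# Retag so the docker.io reference in the Dockerfile resolves from the local image cache.
docker tag \
  secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/anolis8-python:3.10.13 \
  secretflow/anolis8-python:3.10.13
```

Depending on the builder, the local cache may not be consulted for base images (BuildKit can re-resolve remotely), so a classic docker build or an explicit --pull=false may also be needed.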

@wangzul
Contributor

wangzul commented Sep 13, 2024

(screenshot)
secretflow/secretpad#130

@Meng-xiangkun
Author

Meng-xiangkun commented Sep 13, 2024

In kuscia 0.10.x, the RunP container runtime does not support pulling task images dynamically. You can take one of the following approaches:

  1. Build kuscia and secretflow into a single image via docker build -f kuscia-secretflow.Dockerfile . (kuscia-secretflow.Dockerfile)
  2. Upgrade kuscia to 0.11.x

I used the image that bundles kuscia and secretflow together, but I still get this error:
"Failed to inspect image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0":
failed to get image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0"
manifest, detail-> image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0"
not exist in local repository"
Do I need to change some configuration so that the image can be found?

@Meng-xiangkun
Author

(screenshot) secretflow/secretpad#130

I configured it as shown in the screenshot.

@yushiqie
Contributor

@Meng-xiangkun
Copy link
Author

Check the secretflow version the dockerfile imports by default: https://github.com/secretflow/kuscia/blob/release/0.10.x/build/dockerfile/kuscia-secretflow.Dockerfile#L15
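To see which secretflow version that dockerfile pins, a quick grep over the checked-out repo works; a sketch, assuming the repo root as the working directory (the exact build-arg name in the file may differ):

```shell
# Show lines mentioning secretflow in the 0.10.x dockerfile, with line numbers.
grep -ni "secretflow" build/dockerfile/kuscia-secretflow.Dockerfile | head
```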

The image problem is solved, but now a PSI job on locally uploaded datasets fails:

apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  annotations:
    kuscia.secretflow/job-id: gsid
    kuscia.secretflow/self-cluster-as-participant: "true"
    kuscia.secretflow/task-alias: gsid-dwdkvwbe-node-35
  creationTimestamp: "2024-09-14T02:37:41Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-uid: 25d3045f-2277-41d3-8cb6-eeb23747073b
  name: gsid-dwdkvwbe-node-35
  namespace: cross-domain
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: gsid
    uid: 25d3045f-2277-41d3-8cb6-eeb23747073b
  resourceVersion: "12285"
  uid: 3f11ec51-7e6c-4928-89f6-b16374ef50b5
spec:
  initiator: bob
  parties:
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: |-
    {
      "sf_datasource_config": {
        "bob": {
          "id": "default-data-source"
        },
        "alice": {
          "id": "default-data-source"
        }
      },
      "sf_cluster_desc": {
        "parties": ["bob", "alice"],
        "devices": [{
          "name": "spu",
          "type": "spu",
          "parties": ["bob", "alice"],
          "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
        }, {
          "name": "heu",
          "type": "heu",
          "parties": ["bob", "alice"],
          "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
        }],
        "ray_fed_config": {
          "cross_silo_comm_backend": "brpc_link"
        }
      },
      "sf_node_eval_param": {
        "domain": "data_prep",
        "name": "psi",
        "version": "0.0.5",
        "attr_paths": ["input/receiver_input/key", "input/sender_input/key", "protocol", "sort_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "fill_value_int", "ecdh_curve"],
        "attrs": [{
          "is_na": false,
          "ss": ["id"]
        }, {
          "is_na": false,
          "ss": ["id"]
        }, {
          "is_na": false,
          "s": "PROTOCOL_RR22"
        }, {
          "b": true,
          "is_na": false
        }, {
          "is_na": false,
          "s": "no"
        }, {
          "is_na": true
        }, {
          "is_na": true
        }, {
          "is_na": false,
          "s": "CURVE_FOURQ"
        }],
        "inputs": [{
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "alice1_1010363635.csv",
            "party": "alice",
            "format": "csv"
          }]
        }, {
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "bob1_1907238687.csv",
            "party": "bob",
            "format": "csv"
          }]
        }],
        "checkpoint_uri": "ckgsid-dwdkvwbe-node-35-output-0"
      },
      "sf_output_uris": ["gsid-dwdkvwbe-node-35-output-0"],
      "sf_input_ids": ["astrqxxq", "yxcxhdat"],
      "sf_output_ids": ["gsid-dwdkvwbe-node-35-output-0"]
    }
status:
  allocatedPorts:
  - domainID: bob
    namedPort:
      gsid-dwdkvwbe-node-35-0/client-server: 20393
      gsid-dwdkvwbe-node-35-0/fed: 20395
      gsid-dwdkvwbe-node-35-0/global: 20390
      gsid-dwdkvwbe-node-35-0/node-manager: 20391
      gsid-dwdkvwbe-node-35-0/object-manager: 20392
      gsid-dwdkvwbe-node-35-0/spu: 20394
  - domainID: alice
    namedPort:
      gsid-dwdkvwbe-node-35-0/client-server: 21057
      gsid-dwdkvwbe-node-35-0/fed: 21059
      gsid-dwdkvwbe-node-35-0/global: 21054
      gsid-dwdkvwbe-node-35-0/node-manager: 21055
      gsid-dwdkvwbe-node-35-0/object-manager: 21056
      gsid-dwdkvwbe-node-35-0/spu: 21058
  completionTime: "2024-09-14T02:37:57Z"
  conditions:
  - lastTransitionTime: "2024-09-14T02:37:41Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2024-09-14T02:37:43Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2024-09-14T02:37:57Z"
    status: "False"
    type: Success
  lastReconcileTime: "2024-09-14T02:37:57Z"
  message: The remaining no-failed party task counts 1 are less than the threshold
    2 that meets the conditions for task success. pending party[], running party[alice],
    successful party[], failed party[bob]
  partyTaskStatus:
  - domainID: bob
    phase: Failed
  - domainID: alice
    phase: Failed
  phase: Failed
  podStatuses:
    alice/gsid-dwdkvwbe-node-35-0:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      nodeName: kuscia-lite-alice-784b59647f-55mdx
      podName: gsid-dwdkvwbe-node-35-0
      podPhase: Failed
      readyTime: "2024-09-14T02:37:44Z"
      startTime: "2024-09-14T02:37:43Z"
    bob/gsid-dwdkvwbe-node-35-0:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      nodeName: kuscia-lite-bob-6d7d6c998f-zhtll
      podName: gsid-dwdkvwbe-node-35-0
      podPhase: Failed
      readyTime: "2024-09-14T02:37:43Z"
      reason: Error
      startTime: "2024-09-14T02:37:43Z"
      terminationLog: 'container[secretflow] terminated state reason "Error", message:
        "... Ignore 12413 characters at the beginning ...\ning_failure'': True}\n\x1b[36m(SenderReceiverProxyActor
        pid=9199)\x1b[0m I0914 10:37:52.646880  9199 external/com_github_brpc_brpc/src/brpc/server.cpp:1181]
        Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on
        port=20395.\n\x1b[36m(SenderReceiverProxyActor pid=9199)\x1b[0m W0914 10:37:52.646909  9199
        external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are
        disabled according to ServerOptions.has_builtin_services\n\x1b[36m(SenderReceiverProxyActor
        pid=9199)\x1b[0m I0914 10:37:53.321158  9421 external/com_github_brpc_brpc/src/brpc/span.cpp:506]
        Opened ./rpc_data/rpcz/20240914.103753.9199/id.db and ./rpc_data/rpcz/20240914.103753.9199/time.db\n2024-09-14
        10:37:53.676 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create
        receiver proxy actor.\n2024-09-14 10:37:53.676 INFO barriers.py:520 [bob]
        -- [Anonymous_job] Try ping [''alice''] at 0 attemp, up to 3600 attemps.\n2024-09-14
        10:37:53.685 WARNING psi.py:361 [bob] -- [Anonymous_job] {''cluster_def'':
        {''nodes'': [{''party'': ''bob'', ''address'': ''0.0.0.0:20394'', ''listen_address'':
        ''''}, {''party'': ''alice'', ''address'': ''http://gsid-dwdkvwbe-node-35-0-spu.alice.svc:80'',
        ''listen_address'': ''''}], ''runtime_config'': {''protocol'': 2, ''field'':
        3}}, ''link_desc'': {''connect_retry_times'': 60, ''connect_retry_interval_ms'':
        1000, ''brpc_channel_protocol'': ''http'', ''brpc_channel_connection_type'':
        ''pooled'', ''recv_timeout_ms'': 1200000, ''http_timeout_ms'': 1200000}}\n2024-09-14
        10:37:55.340 ERROR component.py:1130 [bob] -- [Anonymous_job] eval on domain:
        \"data_prep\"\nname: \"psi\"\nversion: \"0.0.5\"\nattr_paths: \"input/receiver_input/key\"\nattr_paths:
        \"input/sender_input/key\"\nattr_paths: \"protocol\"\nattr_paths: \"sort_result\"\nattr_paths:
        \"allow_duplicate_keys\"\nattr_paths: \"allow_duplicate_keys/no/skip_duplicates_check\"\nattr_paths:
        \"fill_value_int\"\nattr_paths: \"ecdh_curve\"\nattrs {\n  ss: \"id\"\n}\nattrs
        {\n  ss: \"id\"\n}\nattrs {\n  s: \"PROTOCOL_RR22\"\n}\nattrs {\n  b: true\n}\nattrs
        {\n  s: \"no\"\n}\nattrs {\n  is_na: true\n}\nattrs {\n  is_na: true\n}\nattrs
        {\n  s: \"CURVE_FOURQ\"\n}\ninputs {\n  name: \"alice1\"\n  type: \"sf.table.individual\"\n  meta
        {\n    type_url: \"type.googleapis.com/secretflow.spec.v1.IndividualTable\"\n    value:
        \"\\n\\t\\022\\002id*\\003int\\020\\377\\377\\377\\377\\377\\377\\377\\377\\377\\001\"\n  }\n  data_refs
        {\n    uri: \"alice1_1010363635.csv\"\n    party: \"alice\"\n    format: \"csv\"\n  }\n}\ninputs
        {\n  name: \"bob1\"\n  type: \"sf.table.individual\"\n  meta {\n    type_url:
        \"type.googleapis.com/secretflow.spec.v1.IndividualTable\"\n    value: \"\\n\\t\\022\\002id*\\003int\\020\\377\\377\\377\\377\\377\\377\\377\\377\\377\\001\"\n  }\n  data_refs
        {\n    uri: \"bob1_1907238687.csv\"\n    party: \"bob\"\n    format: \"csv\"\n  }\n}\noutput_uris:
        \"gsid-dwdkvwbe-node-35-output-0\"\ncheckpoint_uri: \"ckgsid-dwdkvwbe-node-35-output-0\"\n
        failed, error <\x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n  At
        least one of the input arguments for this task could not be computed:\nray.exceptions.RayTaskError:
        \x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py\",
        line 156, in _run\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 839, in download_file\n    comp_storage.download_file(uri, output_path)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py\",
        line 32, in download_file\n    impl.download_file(remote_fn, local_fn)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py\",
        line 171, in download_file\n    assert os.path.exists(full_remote_fn)\nAssertionError>\n2024-09-14
        10:37:55.341 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly...\n2024-09-14
        10:37:55.341 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending.\n2024-09-14
        10:37:55.342 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message
        polling thread[DataSendingQueueThread] to exit.\n2024-09-14 10:37:55.342 INFO
        message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread]
        to exit.\n2024-09-14 10:37:55.342 INFO api.py:384 [bob] -- [Anonymous_job]
        Shutdowned rayfed.\n\x1b[33m(raylet)\x1b[0m [2024-09-14 10:37:54,186 I 9422
        9422] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL
        to -1\x1b[32m [repeated 3x across cluster] (Ray deduplicates logs by default.
        Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication
        for more options.)\x1b[0m\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/runpy.py\",
        line 196, in _run_module_as_main\n    return _run_code(code, main_globals,
        None,\n  File \"/usr/local/lib/python3.10/runpy.py\", line 86, in _run_code\n    exec(code,
        run_globals)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\",
        line 547, in <module>\n    main()\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1157, in __call__\n    return self.main(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1078, in main\n    rv = self.invoke(ctx)\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1434, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File
        \"/usr/local/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n    return
        __callback(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\",
        line 527, in main\n    res = comp_eval(sf_node_eval_param, storage_config,
        sf_cluster_config)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py\",
        line 176, in comp_eval\n    res = comp.eval(\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\",
        line 1132, in eval\n    raise e from None\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\",
        line 1127, in eval\n    ret = self.__eval_callback(ctx=ctx, **kwargs)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\",
        line 371, in two_party_balanced_psi_eval_fn\n    download_files(ctx, uri,
        input_path)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 847, in download_files\n    wait(waits)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\",
        line 213, in wait\n    reveal([o.device(lambda o: None)(o) for o in objs])\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line
        162, in reveal\n    all_object = sfd.get(all_object_refs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py\",
        line 156, in get\n    return fed.get(object_refs)\n  File \"/usr/local/lib/python3.10/site-packages/fed/api.py\",
        line 621, in get\n    values = ray.get(ray_refs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py\",
        line 22, in auto_init_wrapper\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py\",
        line 103, in wrapper\n    return func(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/worker.py\",
        line 2624, in get\n    raise value.as_instanceof_cause()\nray.exceptions.RayTaskError(AssertionError):
        \x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n  At
        least one of the input arguments for this task could not be computed:\nray.exceptions.RayTaskError:
        \x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py\",
        line 156, in _run\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 839, in download_file\n    comp_storage.download_file(uri, output_path)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py\",
        line 32, in download_file\n    impl.download_file(remote_fn, local_fn)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py\",
        line 171, in download_file\n    assert os.path.exists(full_remote_fn)\nAssertionError\n"'
  serviceStatuses:
    alice/gsid-dwdkvwbe-node-35-0-fed:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      portName: fed
      portNumber: 21059
      readyTime: "2024-09-14T02:37:44Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-fed
    alice/gsid-dwdkvwbe-node-35-0-global:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      portName: global
      portNumber: 21054
      readyTime: "2024-09-14T02:37:44Z"
      scope: Domain
      serviceName: gsid-dwdkvwbe-node-35-0-global
    alice/gsid-dwdkvwbe-node-35-0-spu:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      portName: spu
      portNumber: 21058
      readyTime: "2024-09-14T02:37:44Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-spu
    bob/gsid-dwdkvwbe-node-35-0-fed:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      portName: fed
      portNumber: 20395
      readyTime: "2024-09-14T02:37:43Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-fed
    bob/gsid-dwdkvwbe-node-35-0-global:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      portName: global
      portNumber: 20390
      readyTime: "2024-09-14T02:37:43Z"
      scope: Domain
      serviceName: gsid-dwdkvwbe-node-35-0-global
    bob/gsid-dwdkvwbe-node-35-0-spu:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      portName: spu
      portNumber: 20394
      readyTime: "2024-09-14T02:37:43Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-spu
  startTime: "2024-09-14T02:37:41Z"

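The failing assertion in the status above (`assert os.path.exists(full_remote_fn)` in `storage_impl.py`) indicates the input CSV was not found under the `local_fs` working directory on the executing node. A minimal sketch of a check one could run inside each party's container, assuming the `wd` and file URIs shown in the logs (adjust for your deployment; each party only hosts its own file):

```python
import os

def check_inputs(wd, uris):
    """Return the subset of uris that do not exist under wd."""
    return [u for u in uris if not os.path.exists(os.path.join(wd, u))]

# Values taken from storage_config / data_refs in the job status above.
missing = check_inputs(
    "/home/kuscia/var/storage/data",
    ["alice1_1010363635.csv", "bob1_1907238687.csv"],
)
print("missing:", missing)
```

If a file shows up as missing, verify the DomainData registration and that the dataset was actually placed under the domain's storage path.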
@wangzul
Contributor

wangzul commented Sep 14, 2024

Please refer to this document and provide the pod logs from both parties: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed#id6

@Meng-xiangkun
Author

Please refer to this document and provide the pod logs from both parties: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed#id6

Pod logs from the alice node:

WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
2024-09-14 10:37:47,052|alice|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='gsid-dwdkvwbe-node-35-0-global.alice.svc', ray_node_manager_port=21055, ray_object_manager_port=21056, ray_client_server_port=21057, ray_worker_ports=[], ray_gcs_port=21054)
2024-09-14 10:37:47,058|alice|INFO|secretflow|entry.py:start_ray:67| Trying to start ray head node at gsid-dwdkvwbe-node-35-0-global.alice.svc, start command: ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=gsid-dwdkvwbe-node-35-0-global.alice.svc --port=21054 --node-manager-port=21055 --object-manager-port=21056 --ray-client-server-port=21057
2024-09-14 10:37:51,042|alice|INFO|secretflow|entry.py:start_ray:80| 2024-09-14 10:37:47,713    INFO usage_lib.py:423 -- Usage stats collection is disabled.
2024-09-14 10:37:47,713 INFO scripts.py:744 -- Local node IP: gsid-dwdkvwbe-node-35-0-global.alice.svc
2024-09-14 10:37:50,726 SUCC scripts.py:781 -- --------------------
2024-09-14 10:37:50,727 SUCC scripts.py:782 -- Ray runtime started.
2024-09-14 10:37:50,727 SUCC scripts.py:783 -- --------------------
2024-09-14 10:37:50,727 INFO scripts.py:785 -- Next steps
2024-09-14 10:37:50,727 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-09-14 10:37:50,727 INFO scripts.py:791 --   ray start --address='gsid-dwdkvwbe-node-35-0-global.alice.svc:21054'
2024-09-14 10:37:50,727 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-09-14 10:37:50,728 INFO scripts.py:802 -- import ray
2024-09-14 10:37:50,728 INFO scripts.py:803 -- ray.init(_node_ip_address='gsid-dwdkvwbe-node-35-0-global.alice.svc')
2024-09-14 10:37:50,728 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-09-14 10:37:50,728 INFO scripts.py:835 --   ray stop
2024-09-14 10:37:50,728 INFO scripts.py:838 -- To view the status of the cluster, use
2024-09-14 10:37:50,728 INFO scripts.py:839 --   ray status

2024-09-14 10:37:51,042|alice|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at gsid-dwdkvwbe-node-35-0-global.alice.svc.
2024-09-14 10:37:51,047|alice|INFO|secretflow|entry.py:main:510| datasource.access_directly True
sf_node_eval_param  {
  "domain": "data_prep",
  "name": "psi",
  "version": "0.0.5",
  "attrPaths": [
    "input/receiver_input/key",
    "input/sender_input/key",
    "protocol",
    "sort_result",
    "allow_duplicate_keys",
    "allow_duplicate_keys/no/skip_duplicates_check",
    "fill_value_int",
    "ecdh_curve"
  ],
  "attrs": [
    {
      "ss": [
        "id"
      ]
    },
    {
      "ss": [
        "id"
      ]
    },
    {
      "s": "PROTOCOL_RR22"
    },
    {
      "b": true
    },
    {
      "s": "no"
    },
    {
      "isNa": true
    },
    {
      "isNa": true
    },
    {
      "s": "CURVE_FOURQ"
    }
  ],
  "inputs": [
    {
      "type": "sf.table.individual",
      "meta": {
        "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
        "lineCount": "-1"
      },
      "dataRefs": [
        {
          "uri": "alice1_1010363635.csv",
          "party": "alice",
          "format": "csv"
        }
      ]
    },
    {
      "type": "sf.table.individual",
      "meta": {
        "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
        "lineCount": "-1"
      },
      "dataRefs": [
        {
          "uri": "bob1_1907238687.csv",
          "party": "bob",
          "format": "csv"
        }
      ]
    }
  ],
  "checkpointUri": "ckgsid-dwdkvwbe-node-35-output-0"
}
2024-09-14 10:37:51,059|alice|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:51,059|alice|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id astrqxxq to
...........
name: "alice1"
type: "sf.table.individual"
meta {
  type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
  value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
  uri: "alice1_1010363635.csv"
  party: "alice"
  format: "csv"
}

....
2024-09-14 10:37:51,070|alice|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:51,070|alice|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id yxcxhdat to
...........
name: "bob1"
type: "sf.table.individual"
meta {
  type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
  value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
  uri: "bob1_1907238687.csv"
  party: "bob"
  format: "csv"
}

....
2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:169|
--
Secretflow 1.7.0b0
Build time (Jun 25 2024, 11:25:31) with commit id: d08547cb86d07d5515e8b997236fad81972cdef7
--

2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:170|
--
*param*

domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
  ss: "id"
}
attrs {
  ss: "id"
}
attrs {
  s: "PROTOCOL_RR22"
}
attrs {
  b: true
}
attrs {
  s: "no"
}
attrs {
  is_na: true
}
attrs {
  is_na: true
}
attrs {
  s: "CURVE_FOURQ"
}
inputs {
  name: "alice1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "alice1_1010363635.csv"
    party: "alice"
    format: "csv"
  }
}
inputs {
  name: "bob1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "bob1_1907238687.csv"
    party: "bob"
    format: "csv"
  }
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"

--

2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:171|
--
*storage_config*

type: "local_fs"
local_fs {
  wd: "/home/kuscia/var/storage/data"
}

--

2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:172|
--
*cluster_config*

desc {
  parties: "bob"
  parties: "alice"
  devices {
    name: "spu"
    type: "spu"
    parties: "bob"
    parties: "alice"
    config: "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
  }
  devices {
    name: "heu"
    type: "heu"
    parties: "bob"
    parties: "alice"
    config: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
  }
  ray_fed_config {
    cross_silo_comm_backend: "brpc_link"
  }
}
public_config {
  ray_fed_config {
    parties: "bob"
    parties: "alice"
    addresses: "gsid-dwdkvwbe-node-35-0-fed.bob.svc:80"
    addresses: "0.0.0.0:21059"
  }
  spu_configs {
    name: "spu"
    parties: "bob"
    parties: "alice"
    addresses: "http://gsid-dwdkvwbe-node-35-0-spu.bob.svc:80"
    addresses: "0.0.0.0:21058"
  }
}
private_config {
  self_party: "alice"
  ray_head_addr: "gsid-dwdkvwbe-node-35-0-global.alice.svc:21054"
}

--

2024-09-14 10:37:51,074|alice|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-09-14 10:37:51,074 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: gsid-dwdkvwbe-node-35-0-global.alice.svc:21054...
2024-09-14 10:37:51,087|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005728 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,087|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005728 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,087|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005728 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,088|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005728 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,092|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,092|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005824 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005824 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005584 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005584 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005824 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005824 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005584 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005584 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095 INFO worker.py:1724 -- Connected to Ray cluster.
2024-09-14 10:37:51.870 INFO api.py:233 [alice] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'bob': 'http://gsid-dwdkvwbe-node-35-0-fed.bob.svc:80', 'alice': '0.0.0.0:21059'}, 'CURRENT_PARTY_NAME': 'alice', 'TLS_CONFIG': {}}
(raylet) [2024-09-14 10:37:52,467 I 9291 9291] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(SenderReceiverProxyActor pid=9291) 2024-09-14 10:37:53.277 INFO link.py:38 [alice] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=9291) I0914 10:37:53.306789  9291 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=21059.
(SenderReceiverProxyActor pid=9291) W0914 10:37:53.306837  9291 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
2024-09-14 10:37:53.675 INFO barriers.py:465 [alice] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-09-14 10:37:53.675 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 0 attemp, up to 3600 attemps.
2024-09-14 10:37:53.683 WARNING psi.py:361 [alice] -- [Anonymous_job] {'cluster_def': {'nodes': [{'party': 'bob', 'address': 'http://gsid-dwdkvwbe-node-35-0-spu.bob.svc:80', 'listen_address': ''}, {'party': 'alice', 'address': '0.0.0.0:21058', 'listen_address': ''}], 'runtime_config': {'protocol': 2, 'field': 3}}, 'link_desc': {'connect_retry_times': 60, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'recv_timeout_ms': 1200000, 'http_timeout_ms': 1200000}}
(SenderReceiverProxyActor pid=9291) I0914 10:37:53.680885  9513 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240914.103753.9291/id.db and ./rpc_data/rpcz/20240914.103753.9291/time.db
2024-09-14 10:37:55.665 ERROR component.py:1130 [alice] -- [Anonymous_job] eval on domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
  ss: "id"
}
attrs {
  ss: "id"
}
attrs {
  s: "PROTOCOL_RR22"
}
attrs {
  b: true
}
attrs {
  s: "no"
}
attrs {
  is_na: true
}
attrs {
  is_na: true
}
attrs {
  s: "CURVE_FOURQ"
}
inputs {
  name: "alice1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "alice1_1010363635.csv"
    party: "alice"
    format: "csv"
  }
}
inputs {
  name: "bob1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "bob1_1907238687.csv"
    party: "bob"
    format: "csv"
  }
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"
 failed, error <ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError>
2024-09-14 10:37:55.666 INFO api.py:342 [alice] -- [Anonymous_job] Shutdowning rayfed intendedly...
2024-09-14 10:37:55.666 INFO api.py:356 [alice] -- [Anonymous_job] No wait for data sending.
2024-09-14 10:37:55.668 INFO message_queue.py:72 [alice] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-09-14 10:37:55.669 INFO message_queue.py:72 [alice] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-09-14 10:37:55.669 INFO api.py:384 [alice] -- [Anonymous_job] Shutdowned rayfed.
2024-09-14 10:37:55.670 WARNING cleanup.py:154 [alice] -- [Anonymous_job] Failed to send ObjectRef(82891771158d68c1fcce2f44215c103cf6cd60270100000001000000) to bob, error: ray::SenderReceiverProxyActor.send() (pid=9291, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc, actor_id=fcce2f44215c103cf6cd602701000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fec182ddde0>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError,upstream_seq_id: 7#0, downstream_seq_id: 9.
2024-09-14 10:37:55.670 INFO cleanup.py:161 [alice] -- [Anonymous_job] Sending error  to bob.
Exception in thread DataSendingQueueThread:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/fed/cleanup.py", line 152, in _process_data_sending_task_return
    res = ray.get(obj_ref)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::SenderReceiverProxyActor.send() (pid=9291, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc, actor_id=fcce2f44215c103cf6cd602701000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fec182ddde0>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/site-packages/fed/_private/message_queue.py", line 51, in _loop
    res = self._msg_handler(message)
  File "/usr/local/lib/python3.10/site-packages/fed/cleanup.py", line 47, in <lambda>
    lambda msg: self._process_data_sending_task_return(msg),
  File "/usr/local/lib/python3.10/site-packages/fed/cleanup.py", line 166, in _process_data_sending_task_return
    send(
  File "/usr/local/lib/python3.10/site-packages/fed/proxy/barriers.py", line 502, in send
    get_global_context().get_cleanup_manager().push_to_sending(
AttributeError: 'NoneType' object has no attribute 'get_cleanup_manager'
(raylet) [2024-09-14 10:37:54,180 I 9514 9514] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 527, in main
    res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 176, in comp_eval
    res = comp.eval(
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1132, in eval
    raise e from None
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1127, in eval
    ret = self.__eval_callback(ctx=ctx, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py", line 371, in two_party_balanced_psi_eval_fn
    download_files(ctx, uri, input_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 847, in download_files
    wait(waits)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 213, in wait
    reveal([o.device(lambda o: None)(o) for o in objs])
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 162, in reveal
    all_object = sfd.get(all_object_refs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py", line 156, in get
    return fed.get(object_refs)
  File "/usr/local/lib/python3.10/site-packages/fed/api.py", line 621, in get
    values = ray.get(ray_refs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError

@Meng-xiangkun
Author

Please provide the pod logs of both parties, following this document: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed#id6

Pod logs from the bob node:

WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
2024-09-14 10:37:46,688|bob|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='gsid-dwdkvwbe-node-35-0-global.bob.svc', ray_node_manager_port=20391, ray_object_manager_port=20392, ray_client_server_port=20393, ray_worker_ports=[], ray_gcs_port=20390)
2024-09-14 10:37:46,694|bob|INFO|secretflow|entry.py:start_ray:67| Trying to start ray head node at gsid-dwdkvwbe-node-35-0-global.bob.svc, start command: ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=gsid-dwdkvwbe-node-35-0-global.bob.svc --port=20390 --node-manager-port=20391 --object-manager-port=20392 --ray-client-server-port=20393
2024-09-14 10:37:50,465|bob|INFO|secretflow|entry.py:start_ray:80| 2024-09-14 10:37:47,288      INFO usage_lib.py:423 -- Usage stats collection is disabled.
2024-09-14 10:37:47,288 INFO scripts.py:744 -- Local node IP: gsid-dwdkvwbe-node-35-0-global.bob.svc
2024-09-14 10:37:50,314 SUCC scripts.py:781 -- --------------------
2024-09-14 10:37:50,314 SUCC scripts.py:782 -- Ray runtime started.
2024-09-14 10:37:50,314 SUCC scripts.py:783 -- --------------------
2024-09-14 10:37:50,314 INFO scripts.py:785 -- Next steps
2024-09-14 10:37:50,315 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-09-14 10:37:50,315 INFO scripts.py:791 --   ray start --address='gsid-dwdkvwbe-node-35-0-global.bob.svc:20390'
2024-09-14 10:37:50,315 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-09-14 10:37:50,315 INFO scripts.py:802 -- import ray
2024-09-14 10:37:50,315 INFO scripts.py:803 -- ray.init(_node_ip_address='gsid-dwdkvwbe-node-35-0-global.bob.svc')
2024-09-14 10:37:50,315 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-09-14 10:37:50,315 INFO scripts.py:835 --   ray stop
2024-09-14 10:37:50,315 INFO scripts.py:838 -- To view the status of the cluster, use
2024-09-14 10:37:50,315 INFO scripts.py:839 --   ray status

2024-09-14 10:37:50,465|bob|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at gsid-dwdkvwbe-node-35-0-global.bob.svc.
2024-09-14 10:37:50,470|bob|INFO|secretflow|entry.py:main:510| datasource.access_directly True
sf_node_eval_param  {
  "domain": "data_prep",
  "name": "psi",
  "version": "0.0.5",
  "attrPaths": [
    "input/receiver_input/key",
    "input/sender_input/key",
    "protocol",
    "sort_result",
    "allow_duplicate_keys",
    "allow_duplicate_keys/no/skip_duplicates_check",
    "fill_value_int",
    "ecdh_curve"
  ],
  "attrs": [
    {
      "ss": [
        "id"
      ]
    },
    {
      "ss": [
        "id"
      ]
    },
    {
      "s": "PROTOCOL_RR22"
    },
    {
      "b": true
    },
    {
      "s": "no"
    },
    {
      "isNa": true
    },
    {
      "isNa": true
    },
    {
      "s": "CURVE_FOURQ"
    }
  ],
  "inputs": [
    {
      "type": "sf.table.individual",
      "meta": {
        "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
        "lineCount": "-1"
      },
      "dataRefs": [
        {
          "uri": "alice1_1010363635.csv",
          "party": "alice",
          "format": "csv"
        }
      ]
    },
    {
      "type": "sf.table.individual",
      "meta": {
        "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
        "lineCount": "-1"
      },
      "dataRefs": [
        {
          "uri": "bob1_1907238687.csv",
          "party": "bob",
          "format": "csv"
        }
      ]
    }
  ],
  "checkpointUri": "ckgsid-dwdkvwbe-node-35-output-0"
}
2024-09-14 10:37:50,482|bob|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:50,482|bob|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id astrqxxq to
...........
name: "alice1"
type: "sf.table.individual"
meta {
  type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
  value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
  uri: "alice1_1010363635.csv"
  party: "alice"
  format: "csv"
}

....
2024-09-14 10:37:50,492|bob|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:50,492|bob|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id yxcxhdat to
...........
name: "bob1"
type: "sf.table.individual"
meta {
  type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
  value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
  uri: "bob1_1907238687.csv"
  party: "bob"
  format: "csv"
}

....
2024-09-14 10:37:50,492|bob|WARNING|secretflow|entry.py:comp_eval:169|
--
Secretflow 1.7.0b0
Build time (Jun 25 2024, 11:25:31) with commit id: d08547cb86d07d5515e8b997236fad81972cdef7
--

2024-09-14 10:37:50,493|bob|WARNING|secretflow|entry.py:comp_eval:170|
--
*param*

domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
  ss: "id"
}
attrs {
  ss: "id"
}
attrs {
  s: "PROTOCOL_RR22"
}
attrs {
  b: true
}
attrs {
  s: "no"
}
attrs {
  is_na: true
}
attrs {
  is_na: true
}
attrs {
  s: "CURVE_FOURQ"
}
inputs {
  name: "alice1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "alice1_1010363635.csv"
    party: "alice"
    format: "csv"
  }
}
inputs {
  name: "bob1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "bob1_1907238687.csv"
    party: "bob"
    format: "csv"
  }
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"

--

2024-09-14 10:37:50,493|bob|WARNING|secretflow|entry.py:comp_eval:171|
--
*storage_config*

type: "local_fs"
local_fs {
  wd: "/home/kuscia/var/storage/data"
}

--

2024-09-14 10:37:50,493|bob|WARNING|secretflow|entry.py:comp_eval:172|
--
*cluster_config*

desc {
  parties: "bob"
  parties: "alice"
  devices {
    name: "spu"
    type: "spu"
    parties: "bob"
    parties: "alice"
    config: "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
  }
  devices {
    name: "heu"
    type: "heu"
    parties: "bob"
    parties: "alice"
    config: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
  }
  ray_fed_config {
    cross_silo_comm_backend: "brpc_link"
  }
}
public_config {
  ray_fed_config {
    parties: "bob"
    parties: "alice"
    addresses: "0.0.0.0:20395"
    addresses: "gsid-dwdkvwbe-node-35-0-fed.alice.svc:80"
  }
  spu_configs {
    name: "spu"
    parties: "bob"
    parties: "alice"
    addresses: "0.0.0.0:20394"
    addresses: "http://gsid-dwdkvwbe-node-35-0-spu.alice.svc:80"
  }
}
private_config {
  self_party: "bob"
  ray_head_addr: "gsid-dwdkvwbe-node-35-0-global.bob.svc:20390"
}

--

2024-09-14 10:37:50,495|bob|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-09-14 10:37:50,496 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: gsid-dwdkvwbe-node-35-0-global.bob.svc:20390...
2024-09-14 10:37:50,508|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971734048 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,509|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971734048 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,509|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971734048 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,509|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971734048 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,513|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971734144 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971734144 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971733904 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971733904 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971734144 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971734144 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971733904 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971733904 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516 INFO worker.py:1724 -- Connected to Ray cluster.
2024-09-14 10:37:51.327 INFO api.py:233 [bob] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'bob': '0.0.0.0:20395', 'alice': 'http://gsid-dwdkvwbe-node-35-0-fed.alice.svc:80'}, 'CURRENT_PARTY_NAME': 'bob', 'TLS_CONFIG': {}}
(raylet) [2024-09-14 10:37:51,273 I 7581 7581] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(SenderReceiverProxyActor pid=9199) 2024-09-14 10:37:52.620 INFO link.py:38 [bob] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=9199) I0914 10:37:52.646880  9199 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=20395.
(SenderReceiverProxyActor pid=9199) W0914 10:37:52.646909  9199 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
(SenderReceiverProxyActor pid=9199) I0914 10:37:53.321158  9421 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240914.103753.9199/id.db and ./rpc_data/rpcz/20240914.103753.9199/time.db
2024-09-14 10:37:53.676 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-09-14 10:37:53.676 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 0 attemp, up to 3600 attemps.
2024-09-14 10:37:53.685 WARNING psi.py:361 [bob] -- [Anonymous_job] {'cluster_def': {'nodes': [{'party': 'bob', 'address': '0.0.0.0:20394', 'listen_address': ''}, {'party': 'alice', 'address': 'http://gsid-dwdkvwbe-node-35-0-spu.alice.svc:80', 'listen_address':''}], 'runtime_config': {'protocol': 2, 'field': 3}}, 'link_desc': {'connect_retry_times': 60, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'recv_timeout_ms': 1200000, 'http_timeout_ms': 1200000}}
2024-09-14 10:37:55.340 ERROR component.py:1130 [bob] -- [Anonymous_job] eval on domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
  ss: "id"
}
attrs {
  ss: "id"
}
attrs {
  s: "PROTOCOL_RR22"
}
attrs {
  b: true
}
attrs {
  s: "no"
}
attrs {
  is_na: true
}
attrs {
  is_na: true
}
attrs {
  s: "CURVE_FOURQ"
}
inputs {
  name: "alice1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "alice1_1010363635.csv"
    party: "alice"
    format: "csv"
  }
}
inputs {
  name: "bob1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "bob1_1907238687.csv"
    party: "bob"
    format: "csv"
  }
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"
 failed, error <ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError>
2024-09-14 10:37:55.341 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly...
2024-09-14 10:37:55.341 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending.
2024-09-14 10:37:55.342 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-09-14 10:37:55.342 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-09-14 10:37:55.342 INFO api.py:384 [bob] -- [Anonymous_job] Shutdowned rayfed.
(raylet) [2024-09-14 10:37:54,186 I 9422 9422] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 527, in main
    res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 176, in comp_eval
    res = comp.eval(
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1132, in eval
    raise e from None
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1127, in eval
    ret = self.__eval_callback(ctx=ctx, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py", line 371, in two_party_balanced_psi_eval_fn
    download_files(ctx, uri, input_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 847, in download_files
    wait(waits)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 213, in wait
    reveal([o.device(lambda o: None)(o) for o in objs])
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 162, in reveal
    all_object = sfd.get(all_object_refs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py", line 156, in get
    return fed.get(object_refs)
  File "/usr/local/lib/python3.10/site-packages/fed/api.py", line 621, in get
    values = ray.get(ray_refs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError

@yushiqie
Contributor

The error shows that the actual physical file cannot be found. For user-supplied data, the physical files need to be placed under /home/kuscia/var/storage/data on the alice and bob nodes respectively.
https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.9.0b0/deployment/K8s_deployment_kuscia/K8s_master_lite_cn#id11
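The AssertionError in both pods boils down to a simple precondition: with a local_fs datasource, the storage layer joins the working directory (`/home/kuscia/var/storage/data` in the logs above) with the DomainData URI and asserts that the resulting file exists on that party's node. A minimal sketch of that check (the function name is illustrative, not the actual secretflow API):

```python
import os

def check_local_fs_uri(wd: str, uri: str) -> str:
    """Mimic the precondition asserted by storage_impl.download_file in the
    traceback: for local_fs, the remote path is simply wd/uri and the file
    must already be present on this party's node."""
    full_remote_fn = os.path.join(wd, uri)
    # This is the line that raises AssertionError in storage_impl.py
    assert os.path.exists(full_remote_fn), f"missing physical file: {full_remote_fn}"
    return full_remote_fn
```

If this check fails inside the bob pod for `bob1_1907238687.csv`, the CSV simply is not on that node's filesystem, regardless of what metadata secretpad shows.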

@Meng-xiangkun
Author

The error shows that the actual physical file cannot be found. For user-supplied data, the physical files need to be placed under /home/kuscia/var/storage/data on the alice and bob nodes respectively. https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.9.0b0/deployment/K8s_deployment_kuscia/K8s_master_lite_cn#id11

I uploaded the data source through the secretpad frontend page. Why is there still a manual step to prepare the data?

@yushiqie
Contributor

Is kuscia deployed on k8s? In your current setup, how does secretpad interact with the k8s-deployed kuscia?

@Meng-xiangkun
Author

Is kuscia deployed on k8s? In your current setup, how does secretpad interact with the k8s-deployed kuscia?

kuscia is deployed on k8s. secretpad is an image built from source and also deployed on k8s, in the same environment. Below is the secretpad configuration file:

server:
  tomcat:
    accesslog:
      enabled: true
      directory: /var/log/secretpad
  servlet:
    session:
      timeout: 30m
  http-port: 8080
  http-port-inner: 9001
  port: 443
  ssl:
    enabled: true
    key-store: "file:./config/server.jks"
    key-store-password: ${KEY_PASSWORD:secretpad}
    key-alias: secretpad-server
    key-password: ${KEY_PASSWORD:secretpad}
    key-store-type: JKS
  compression:
    enabled: true
    mime-types:
      - application/javascript
      - text/css
    min-response-size: 1024
spring:
  task:
    scheduling:
      pool:
        size: 10
  application:
    name: secretpad
  jpa:
    database-platform: org.hibernate.community.dialect.SQLiteDialect
    show-sql: false
    properties:
      hibernate:
        format_sql: false
    open-in-view: false
  datasource:
    driver-class-name: org.sqlite.JDBC
    url: jdbc:sqlite:./db/secretpad.sqlite
    hikari:
      idle-timeout: 60000
      maximum-pool-size: 1
      connection-timeout: 6000
  flyway:
    baseline-on-migrate: true
    locations:
      - filesystem:./config/schema/center

  #datasource used for mysql
  #spring:
  #  task:
  #    scheduling:
  #      pool:
  #        size: 10
  #  application:
  #    name: secretpad
  #  jpa:
  #    database-platform: org.hibernate.dialect.MySQLDialect
  #    show-sql: false
  #    properties:
  #      hibernate:
  #        format_sql: false
  #  datasource:
  #    driver-class-name: com.mysql.cj.jdbc.Driver
  #    url: your mysql url
  #    username:
  #    password:
  #    hikari:
  #      idle-timeout: 60000
  #      maximum-pool-size: 10
  #      connection-timeout: 5000
  jackson:
    deserialization:
      fail-on-missing-external-type-id-property: false
      fail-on-ignored-properties: false
      fail-on-unknown-properties: false
    serialization:
      fail-on-empty-beans: false
  web:
    locale: zh_CN # default locale, overridden by request "Accept-Language" header.
  cache:
    jcache:
      config:
        classpath:ehcache.xml
springdoc:
  api-docs:
    enabled: true
management:
  endpoints:
    web:
      exposure:
        include: health,info,readiness,prometheus
    enabled-by-default: false
kusciaapi:
  protocol: ${KUSCIA_PROTOCOL:notls}

kuscia:
  nodes:
    - domainId: kuscia-system
      mode: master
      host: ${KUSCIA_API_ADDRESS:kuscia-master.data-develop-operate-dev.svc.cluster.local}
      port: ${KUSCIA_API_PORT:8083}
      protocol: ${KUSCIA_PROTOCOL:notls}
      cert-file: config/certs/client.crt
      key-file: config/certs/client.pem
      token: config/certs/token

    - domainId: alice
      mode: lite
      host: ${KUSCIA_API_LITE_ALICE_ADDRESS:kuscia-lite-alice.data-develop-operate-dev.svc.cluster.local}
      port: ${KUSCIA_API_PORT:8083}
      protocol: ${KUSCIA_PROTOCOL:notls}
      cert-file: config/certs/alice/client.crt
      key-file: config/certs/alice/client.pem
      token: config/certs/alice/token

    - domainId: bob
      mode: lite
      host: ${KUSCIA_API_LITE_BOB_ADDRESS:kuscia-lite-bob.data-develop-operate-dev.svc.cluster.local}
      port: ${KUSCIA_API_PORT:8083}
      protocol: ${KUSCIA_PROTOCOL:notls}
      cert-file: config/certs/bob/client.crt
      key-file: config/certs/bob/client.pem
      token: config/certs/bob/token


job:
  max-parallelism: 1

secretpad:
  logs:
    path: ${SECRETPAD_LOG_PATH:../log}
  deploy-mode: ${DEPLOY_MODE:ALL-IN-ONE} # MPC TEE ALL-IN-ONE
  platform-type: CENTER
  node-id: kuscia-system
  center-platform-service: secretpad.master.svc
  gateway: ${KUSCIA_GW_ADDRESS:127.0.0.1:80}
  auth:
    enabled: true
    pad_name: ${SECRETPAD_USER_NAME}
    pad_pwd: ${SECRETPAD_PASSWORD}
  response:
    extra-headers:
      Content-Security-Policy: "base-uri 'self';frame-src 'self';worker-src blob: 'self' data:;object-src 'self';"
  upload-file:
    max-file-size: -1    # -1 means not limit, e.g.  200MB, 1GB
    max-request-size: -1 # -1 means not limit, e.g.  200MB, 1GB
  data:
    dir-path: /app/data/
  datasync:
    center: true
    p2p: false
  version:
    secretpad-image: ${SECRETPAD_IMAGE:0.5.0b0}
    kuscia-image: ${KUSCIA_IMAGE:0.6.0b0}
    secretflow-image: ${SECRETFLOW_IMAGE:1.4.0b0}
    secretflow-serving-image: ${SECRETFLOW_SERVING_IMAGE:0.2.0b0}
    tee-app-image: ${TEE_APP_IMAGE:0.1.0b0}
    tee-dm-image: ${TEE_DM_IMAGE:0.1.0b0}
    capsule-manager-sim-image: ${CAPSULE_MANAGER_SIM_IMAGE:0.1.2b0}

  component:
    hide:
      - secretflow/io/read_data:0.0.1
      - secretflow/io/write_data:0.0.1
      - secretflow/io/identity:0.0.1
      - secretflow/model/model_export:0.0.1
      - secretflow/ml.train/slnn_train:0.0.1
      - secretflow/ml.predict/slnn_predict:0.0.2

sfclusterDesc:
  deviceConfig:
    spu: "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
    heu: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
  rayFedConfig:
    crossSiloCommBackend: "brpc_link"

tee:
  capsule-manager: capsule-manager.#.svc

data:
  sync:
    - org.secretflow.secretpad.persistence.entity.ProjectDO
    - org.secretflow.secretpad.persistence.entity.ProjectNodeDO
    - org.secretflow.secretpad.persistence.entity.NodeDO
    - org.secretflow.secretpad.persistence.entity.NodeRouteDO
    - org.secretflow.secretpad.persistence.entity.ProjectJobDO
    - org.secretflow.secretpad.persistence.entity.ProjectTaskDO
    - org.secretflow.secretpad.persistence.entity.ProjectDatatableDO
    - org.secretflow.secretpad.persistence.entity.VoteRequestDO
    - org.secretflow.secretpad.persistence.entity.VoteInviteDO
    - org.secretflow.secretpad.persistence.entity.TeeDownLoadAuditConfigDO
    - org.secretflow.secretpad.persistence.entity.NodeRouteApprovalConfigDO
    - org.secretflow.secretpad.persistence.entity.TeeNodeDatatableManagementDO
    - org.secretflow.secretpad.persistence.entity.ProjectModelServingDO
    - org.secretflow.secretpad.persistence.entity.ProjectGraphNodeKusciaParamsDO
    - org.secretflow.secretpad.persistence.entity.ProjectModelPackDO
    - org.secretflow.secretpad.persistence.entity.FeatureTableDO
    - org.secretflow.secretpad.persistence.entity.ProjectFeatureTableDO
    - org.secretflow.secretpad.persistence.entity.ProjectGraphDomainDatasourceDO

inner-port:
  path:
    - /api/v1alpha1/vote_sync/create
    - /api/v1alpha1/user/node/resetPassword
    - /sync
    - /api/v1alpha1/data/sync
# ip block config (None of them are allowed in the configured IP list)
ip:
  block:
    enable: true
    list:
      - 0.0.0.0/32
      - 127.0.0.1/8
      - 10.0.0.0/8
      - 11.0.0.0/8
      - 30.0.0.0/8
      - 100.64.0.0/10
      - 172.16.0.0/12
      - 192.168.0.0/16
      - 33.0.0.0/8

@yushiqie
Contributor

With docker deployment, secretpad + kuscia share data by mounting the same data directory. With a k8s deployment you currently need to mount the same volume to get the same sharing; for k8s, an OSS data source is recommended.

@Meng-xiangkun
Author

With docker deployment, secretpad + kuscia share data by mounting the same data directory. With a k8s deployment you currently need to mount the same volume to get the same sharing; for k8s, an OSS data source is recommended.

OK. Is there documentation on mounting the same volume in a k8s deployment to achieve this?

@yushiqie
Contributor

Not at the moment; you can refer to the k8s deployment documentation.

@yushiqie
Contributor

Volumes depend on the storage driver available in the cluster. In a single-replica setup, you can schedule secretpad and kuscia onto the same node and mount that node's hostPath. For production, an external data source such as OSS is still recommended.
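A minimal sketch of the shared-volume idea, assuming a single replica and hostPath storage; the label, container name, and host path below are illustrative assumptions, not values taken from this thread:

```yaml
# Illustrative only: pin secretpad and the kuscia lite pod to the same node
# and mount the same hostPath so both see /home/kuscia/var/storage/data.
spec:
  nodeSelector:
    kuscia-data: shared            # label the chosen node accordingly
  containers:
    - name: kuscia-lite-alice
      volumeMounts:
        - name: kuscia-data
          mountPath: /home/kuscia/var/storage/data
  volumes:
    - name: kuscia-data
      hostPath:
        path: /data/kuscia/alice   # mount the same path into the secretpad pod
        type: DirectoryOrCreate
```

The same hostPath volume would be mounted into the secretpad pod at its data directory, so files uploaded through the frontend land where the engine expects them.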

@Meng-xiangkun
Author

OK. One more question: I ran the official joint audience-selection (联合圈人) demo. The PSI result I downloaded is an empty table with no data, while the full-table statistics result does have data. What is the problem?

@yushiqie
Contributor

It is the same problem as the training data not being found above: with k8s deployment, secretpad and kuscia do not share physical data, only data metadata. The PSI result is stored on the kuscia node and is actual physical data, whereas the full-table statistics output is metrics, which are metadata, so it can be downloaded through secretpad.

@Meng-xiangkun
Author

It is the same problem as the training data not being found above: with k8s deployment, secretpad and kuscia do not share physical data, only data metadata. The PSI result is stored on the kuscia node and is actual physical data, whereas the full-table statistics output is metrics, which are metadata, so it can be downloaded through secretpad.

Understood, thanks for all the answers.

@Meng-xiangkun
Author

@Meng-xiangkun Which branch was the secretpad image built from? Has org.secretflow.secretpad.web.controller.DataController#download been modified?

The secretpad image was built from the 0.9.0b0 branch; org.secretflow.secretpad.web.controller.DataController#download has not been modified.


Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 21, 2024