
Unable to mount volumes for pod Learner #152

Open
JunFugithub opened this issue Dec 6, 2018 · 8 comments

@JunFugithub

JunFugithub commented Dec 6, 2018

What happened:
Hi there, thanks a lot for your work. It's impressive, so I tried to deploy it on a local Minikube and a local DIND cluster, but neither worked properly. I've been stuck on an issue for a few days, so I'd like to ask you for help. I happened to find something similar to my issue in your docs, but under different conditions:

  1. My local Minikube hit the issue recorded in the DIND-TRAINING doc -- all pods worked as expected:
alertmanager-7bd87d99cc-jhp2b                                     1/1       Running             0          6h
etcd0                                                             1/1       Running             0          6h
ffdl-lcm-8d555c7bf-dqqhg                                          1/1       Running             0          6h
ffdl-restapi-7f5c57c77d-k67pm                                     1/1       Running             0          6h
ffdl-trainer-6777dd5756-xkk65                                     1/1       Running             0          6h
ffdl-trainingdata-696b99ff5c-tvbtc                                1/1       Running             0          6h
ffdl-ui-95d6464c7-bv2sn                                           1/1       Running             0          6h
jobmonitor-0d296791-2adc-4336-4f01-b280090460c3-cbdb48cfd-qqsvz   1/1       Running             0          1h
learner-0d296791-2adc-4336-4f01-b280090460c3-0                    0/1       ContainerCreating   0          1h
lhelper-0d296791-2adc-4336-4f01-b280090460c3-54858658b-p7vfc      2/2       Running             0          1h
mongo-0                                                           1/1       Running             4          6h
prometheus-67fb854b59-c884p                                       2/2       Running             0          6h
pushgateway-5665768d5c-jdlnl                                      2/2       Running             0          6h
storage-0                                                         1/1       Running             0          6h

except the learner pod, which stayed pending forever because of the following warning:

Unable to mount volumes for pod "learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0_default(33f78708-f963-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0". list of unmounted volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f]. list of unattached volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f learner-entrypoint-files jobdata]

Here are the details of the learner pod:

Name:               learner-0d296791-2adc-4336-4f01-b280090460c3-0
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               minikube/10.0.2.15
Start Time:         Thu, 06 Dec 2018 17:05:52 +0100
Labels:             controller-revision-hash=learner-0d296791-2adc-4336-4f01-b280090460c3-999bf4986
                    service=dlaas-learner
                    statefulset.kubernetes.io/pod-name=learner-0d296791-2adc-4336-4f01-b280090460c3-0
                    training_id=training-bFEXXGPmR
                    user_id=test-user
Annotations:        scheduler.alpha.kubernetes.io/nvidiaGPU={ "AllocationPriority": "Dense" }
                    scheduler.alpha.kubernetes.io/tolerations=[ { "key": "dedicated", "operator": "Equal", "value": "gpu-task" } ]
Status:             Pending
IP:
Controlled By:      StatefulSet/learner-0d296791-2adc-4336-4f01-b280090460c3
Containers:
  learner:
    Container ID:
    Image:         tensorflow/tensorflow:1.5.0-py3
    Image ID:
    Ports:         22/TCP, 2222/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      bash
      -c
      export PATH=/usr/local/bin/:$PATH; cp /entrypoint-files/*.sh /usr/local/bin/; chmod +x /usr/local/bin/*.sh;
                        if [ ! -f /job/load-model.exit ]; then
                          while [ ! -f /job/load-model.start ]; do sleep 2; done ;
                          date "+%s%N" | cut -b1-13 > /job/load-model.start_time ;

                        echo "Starting Training $TRAINING_ID"
                        mkdir -p "$MODEL_DIR" ;
                        python -m zipfile -e $RESULT_DIR/_submitted_code/model.zip $MODEL_DIR  ;
                          echo $? > /job/load-model.exit ;
                        fi
                        echo "Done load-model" ;
                        if [ ! -f /job/learner.exit ]; then
                          while [ ! -f /job/learner.start ]; do sleep 2; done ;
                          date "+%s%N" | cut -b1-13 > /job/learner.start_time ;

                        for i in ${!ALERTMANAGER*} ${!DLAAS*} ${!ETCD*} ${!GRAFANA*} ${!HOSTNAME*} ${!KUBERNETES*} ${!MONGO*} ${!PUSHGATEWAY*}; do unset $i; done;
                        export LEARNER_ID=$((${DOWNWARD_API_POD_NAME##*-} + 1)) ;
                        mkdir -p $RESULT_DIR/learner-$LEARNER_ID ;
                        mkdir -p $CHECKPOINT_DIR ;bash -c 'train.sh >> $JOB_STATE_DIR/latest-log 2>&1 ; exit ${PIPESTATUS[0]}' ;
                          echo $? > /job/learner.exit ;
                        fi
                        echo "Done learner" ;
                        if [ ! -f /job/store-logs.exit ]; then
                          while [ ! -f /job/store-logs.start ]; do sleep 2; done ;
                          date "+%s%N" | cut -b1-13 > /job/store-logs.start_time ;

                        echo Calling copy logs.
                        mv -nf $LOG_DIR/* $RESULT_DIR/learner-$LEARNER_ID ;
                        ERROR_CODE=$? ;
                        echo $ERROR_CODE > $RESULT_DIR/learner-$LEARNER_ID/.log-copy-complete ;
                        bash -c 'exit $ERROR_CODE' ;
                          echo $? > /job/store-logs.exit ;
                        fi
                        echo "Done store-logs" ;
                      while true; do sleep 2; done ;
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:             500m
      memory:          1048576k
      nvidia.com/gpu:  0
    Requests:
      cpu:             500m
      memory:          1048576k
      nvidia.com/gpu:  0
    Environment:
      LOG_DIR:                     /job/logs
      GPU_COUNT:                   0.000000
      TRAINING_COMMAND:            python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz   --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz   --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001   --trainingIters 2000
      TRAINING_ID:                 training-bFEXXGPmR
      DATA_DIR:                    /mnt/data/tf_training_data
      MODEL_DIR:                   /job/model-code
      RESULT_DIR:                  /mnt/results/tf_trained_model/training-bFEXXGPmR
      DOWNWARD_API_POD_NAME:       learner-0d296791-2adc-4336-4f01-b280090460c3-0 (v1:metadata.name)
      DOWNWARD_API_POD_NAMESPACE:  default (v1:metadata.namespace)
      LEARNER_NAME_PREFIX:         learner-0d296791-2adc-4336-4f01-b280090460c3
      TRAINING_ID:                 training-bFEXXGPmR
      NUM_LEARNERS:                1
      JOB_STATE_DIR:               /job
      CHECKPOINT_DIR:              /mnt/results/tf_trained_model/_wml_checkpoints
      RESULT_BUCKET_DIR:           /mnt/results/tf_trained_model
    Mounts:
      /entrypoint-files from learner-entrypoint-files (rw)
      /job from jobdata (rw)
      /mnt/data/tf_training_data from cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
      /mnt/results/tf_trained_model from cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cosinputmount-0d296791-2adc-4336-4f01-b280090460c3:
    Type:       FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
    Driver:     ibm/ibmc-s3fs
    FSType:
    SecretRef:  &{cossecretdata-0d296791-2adc-4336-4f01-b280090460c3}
    ReadOnly:   false
    Options:    map[debug-level:warn endpoint:http://192.168.99.105:31172 tls-cipher-suite:DEFAULT cache-size-gb:0 chunk-size-mb:52 curl-debug:false kernel-cache:true multireq-max:20 bucket:tf_training_data ensure-disk-free:0 parallel-count:5 region:us-standard s3fs-fuse-retry-count:30 stat-cache-size:100000]
  cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3:
    Type:       FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
    Driver:     ibm/ibmc-s3fs
    FSType:
    SecretRef:  &{cossecretresults-0d296791-2adc-4336-4f01-b280090460c3}
    ReadOnly:   false
    Options:    map[cache-size-gb:0 curl-debug:false endpoint:http://192.168.99.105:31172 parallel-count:2 bucket:tf_trained_model debug-level:warn s3fs-fuse-retry-count:30 stat-cache-size:100000 chunk-size-mb:52 kernel-cache:false ensure-disk-free:2048 region:us-standard tls-cipher-suite:DEFAULT multireq-max:20]
  learner-entrypoint-files:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      learner-entrypoint-files
    Optional:  false
  jobdata:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     dedicated=gpu-task:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason       Age               From               Message
  ----     ------       ----              ----               -------
  Warning  FailedMount  1m (x40 over 1h)  kubelet, minikube  Unable to mount volumes for pod "learner-0d296791-2adc-4336-4f01-b280090460c3-0_default(ce612f9d-f970-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-0d296791-2adc-4336-4f01-b280090460c3-0". list of unmounted volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3]. list of unattached volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 learner-entrypoint-files jobdata]
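A FailedMount timeout like this usually means the FlexVolume driver was never installed (or failed to initialize) on the node the pod was scheduled to. As a sketch of how one might check this — assuming the default kubelet plugin directory, which your kubelet may override — the following commands could help narrow it down:

```shell
# Check whether the ibmc-s3fs FlexVolume driver is present on the Minikube node
# (default kubelet volume-plugin directory; adjust if your kubelet is configured differently).
minikube ssh -- ls /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs

# Look for FlexVolume errors in the kubelet logs on the node.
minikube ssh -- "sudo journalctl -u kubelet | grep -i flexvolume | tail -n 20"

# Confirm the secrets referenced by the cos volumes actually exist.
kubectl get secret | grep cossecret
```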
  2. My local DIND failed during training with an uninformative FAILED error. All the platform pods were running, but there were no jobmonitor, learner, or lhelper pods.
Deploying model with manifest 'manifest_testrun.yml' and model files in '.'...
Handling connection for 31404
Handling connection for 31404
FAILED
Error 200: OK

What you expected to happen:
FfDL to work properly on either local DIND or Minikube.

Environment:
OS: Darwin local 17.4.0 Darwin Kernel Version 17.4.0:
MINIKUBE:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

How to reproduce it (as minimally and precisely as possible):

I was just following README.md with these make commands:

make deploy-plugin
make quickstart-deploy
make test-push-data-s3
make test-job-submit

Anything else we need to know?:

In situation 2, I followed exactly the steps above.
In situation 1, the first errors hinted at an NFS problem, and I remembered reading in a Minikube doc that, for persistent volumes, it only supports the hostPath type, so I created a PV and PVC. Here are the details:

$ kubectl describe pv hostpathtest
Name:            hostpathtest
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller=yes
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:
Status:          Bound
Claim:           default/static-volume-1
Reclaim Policy:  Retain
Access Modes:    RWO
Capacity:        20Gi
Node Affinity:   <none>
Message:
Source:
    Type:          HostPath (bare host directory volume)
    Path:          /data/hostpath_test
    HostPathType:
Events:            <none>
$ kubectl describe pvc learner-1
Name:          learner-1
Namespace:     default
StorageClass:
Status:        Bound
Volume:        hostpathtest-learner
Labels:        type=dlaas-static-volume
Annotations:   kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{"volume.beta.kubernetes.io/storage-class":""},"labels":{"type":"dlaas-stat...
               pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-class=
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      20Gi
Access Modes:  RWO
Events:        <none>
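For reference, a hostPath PV/PVC pair along the lines of the describe output above could be sketched as follows. The names, capacity, label, and path are taken from that output, but this is an illustrative manifest, not necessarily the exact one I applied:

```shell
# Apply a hostPath PV and a matching PVC (illustrative sketch; the real
# cluster above has separate PV/PVC bindings, e.g. hostpathtest -> static-volume-1).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hostpathtest
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /data/hostpath_test
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: learner-1
  labels:
    type: dlaas-static-volume
spec:
  storageClassName: ""
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
EOF
```

An empty `storageClassName` keeps the default StorageClass provisioner from intercepting the claim, so it binds statically to the pre-created PV.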

Thanks in advance for any advice, and have a good day.

@JunFugithub
Author

Hi, sorry to bother you again. I recently deployed FfDL on Google Cloud, but one of the pods, ibmcloud-object-storage-deployer (which runs in the kube-system namespace), fails for the following reason:

+ DRIVER_LOCATION=/host/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
+ KUBELET_SVC_CONFIG=/host/lib/systemd/system/kubelet.service
+ apt-get -y update
Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [83.2 kB]
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:5 http://security.ubuntu.com/ubuntu bionic-security/universe Sources [32.0 kB]
Get:6 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [133 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic-updates/universe Sources [167 kB]
Get:8 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [1367 B]
Get:9 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [281 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [900 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [6931 B]
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [599 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [10.7 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic-backports/universe amd64 Packages [3655 B]
Fetched 2381 kB in 1s (1754 kB/s)
Reading package lists...
+ apt-get -y install s3fs
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  ca-certificates file fuse libasn1-8-heimdal libcurl3-gnutls libfuse2
  libgssapi3-heimdal libhcrypto4-heimdal libheimbase1-heimdal
  libheimntlm0-heimdal libhx509-5-heimdal libicu60 libkrb5-26-heimdal
  libldap-2.4-2 libldap-common libmagic-mgc libmagic1 libnghttp2-14 libpsl5
  libroken18-heimdal librtmp1 libsasl2-2 libsasl2-modules libsasl2-modules-db
  libsqlite3-0 libssl1.1 libwind0-heimdal libxml2 mime-support openssl
  publicsuffix xz-utils
Suggested packages:
  libsasl2-modules-gssapi-mit | libsasl2-modules-gssapi-heimdal
  libsasl2-modules-ldap libsasl2-modules-otp libsasl2-modules-sql
The following NEW packages will be installed:
  ca-certificates file fuse libasn1-8-heimdal libcurl3-gnutls libfuse2
  libgssapi3-heimdal libhcrypto4-heimdal libheimbase1-heimdal
  libheimntlm0-heimdal libhx509-5-heimdal libicu60 libkrb5-26-heimdal
  libldap-2.4-2 libldap-common libmagic-mgc libmagic1 libnghttp2-14 libpsl5
  libroken18-heimdal librtmp1 libsasl2-2 libsasl2-modules libsasl2-modules-db
  libsqlite3-0 libssl1.1 libwind0-heimdal libxml2 mime-support openssl
  publicsuffix s3fs xz-utils
0 upgraded, 33 newly installed, 0 to remove and 33 not upgraded.
Need to get 13.3 MB of archives.
After this operation, 52.2 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libssl1.1 amd64 1.1.0g-2ubuntu4.3 [1130 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 openssl amd64 1.1.0g-2ubuntu4.3 [532 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/main amd64 ca-certificates all 20180409 [151 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic-mgc amd64 1:5.32-2ubuntu0.1 [184 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic1 amd64 1:5.32-2ubuntu0.1 [68.4 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 file amd64 1:5.32-2ubuntu0.1 [22.1 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic/main amd64 libicu60 amd64 60.2-3ubuntu3 [8054 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsqlite3-0 amd64 3.22.0-1 [496 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libxml2 amd64 2.9.4+dfsg1-6.1ubuntu1.2 [663 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic/main amd64 mime-support all 3.60ubuntu1 [30.1 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic/main amd64 xz-utils amd64 5.2.2-1.3 [83.8 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic/main amd64 libfuse2 amd64 2.9.7-1ubuntu1 [80.9 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic/main amd64 fuse amd64 2.9.7-1ubuntu1 [24.5 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic/main amd64 libpsl5 amd64 0.19.1-5build1 [41.8 kB]
Get:15 http://archive.ubuntu.com/ubuntu bionic/main amd64 publicsuffix all 20180223.1310-1 [97.6 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic/main amd64 libroken18-heimdal amd64 7.5.0+dfsg-1 [41.3 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic/main amd64 libasn1-8-heimdal amd64 7.5.0+dfsg-1 [175 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic/main amd64 libheimbase1-heimdal amd64 7.5.0+dfsg-1 [29.3 kB]
Get:19 http://archive.ubuntu.com/ubuntu bionic/main amd64 libhcrypto4-heimdal amd64 7.5.0+dfsg-1 [85.9 kB]
Get:20 http://archive.ubuntu.com/ubuntu bionic/main amd64 libwind0-heimdal amd64 7.5.0+dfsg-1 [47.8 kB]
Get:21 http://archive.ubuntu.com/ubuntu bionic/main amd64 libhx509-5-heimdal amd64 7.5.0+dfsg-1 [107 kB]
Get:22 http://archive.ubuntu.com/ubuntu bionic/main amd64 libkrb5-26-heimdal amd64 7.5.0+dfsg-1 [206 kB]
Get:23 http://archive.ubuntu.com/ubuntu bionic/main amd64 libheimntlm0-heimdal amd64 7.5.0+dfsg-1 [14.8 kB]
Get:24 http://archive.ubuntu.com/ubuntu bionic/main amd64 libgssapi3-heimdal amd64 7.5.0+dfsg-1 [96.5 kB]
Get:25 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsasl2-modules-db amd64 2.1.27~101-g0780600+dfsg-3ubuntu2 [14.8 kB]
Get:26 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsasl2-2 amd64 2.1.27~101-g0780600+dfsg-3ubuntu2 [49.2 kB]
Get:27 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libldap-common all 2.4.45+dfsg-1ubuntu1.1 [16.6 kB]
Get:28 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libldap-2.4-2 amd64 2.4.45+dfsg-1ubuntu1.1 [155 kB]
Get:29 http://archive.ubuntu.com/ubuntu bionic/main amd64 libnghttp2-14 amd64 1.30.0-1ubuntu1 [77.8 kB]
Get:30 http://archive.ubuntu.com/ubuntu bionic/main amd64 librtmp1 amd64 2.4+20151223.gitfa8646d.1-1 [54.2 kB]
Get:31 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libcurl3-gnutls amd64 7.58.0-2ubuntu3.5 [212 kB]
Get:32 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsasl2-modules amd64 2.1.27~101-g0780600+dfsg-3ubuntu2 [48.7 kB]
Get:33 http://archive.ubuntu.com/ubuntu bionic/universe amd64 s3fs amd64 1.82-1 [200 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 13.3 MB in 2s (8077 kB/s)
Selecting previously unselected package libssl1.1:amd64.
(Reading database ... 
(Reading database ... 4458 files and directories currently installed.)
Preparing to unpack .../00-libssl1.1_1.1.0g-2ubuntu4.3_amd64.deb ...
Unpacking libssl1.1:amd64 (1.1.0g-2ubuntu4.3) ...
Selecting previously unselected package openssl.
Preparing to unpack .../01-openssl_1.1.0g-2ubuntu4.3_amd64.deb ...
Unpacking openssl (1.1.0g-2ubuntu4.3) ...
Selecting previously unselected package ca-certificates.
Preparing to unpack .../02-ca-certificates_20180409_all.deb ...
Unpacking ca-certificates (20180409) ...
Selecting previously unselected package libmagic-mgc.
Preparing to unpack .../03-libmagic-mgc_1%3a5.32-2ubuntu0.1_amd64.deb ...
Unpacking libmagic-mgc (1:5.32-2ubuntu0.1) ...
Selecting previously unselected package libmagic1:amd64.
Preparing to unpack .../04-libmagic1_1%3a5.32-2ubuntu0.1_amd64.deb ...
Unpacking libmagic1:amd64 (1:5.32-2ubuntu0.1) ...
Selecting previously unselected package file.
Preparing to unpack .../05-file_1%3a5.32-2ubuntu0.1_amd64.deb ...
Unpacking file (1:5.32-2ubuntu0.1) ...
Selecting previously unselected package libicu60:amd64.
Preparing to unpack .../06-libicu60_60.2-3ubuntu3_amd64.deb ...
Unpacking libicu60:amd64 (60.2-3ubuntu3) ...
Selecting previously unselected package libsqlite3-0:amd64.
Preparing to unpack .../07-libsqlite3-0_3.22.0-1_amd64.deb ...
Unpacking libsqlite3-0:amd64 (3.22.0-1) ...
Selecting previously unselected package libxml2:amd64.
Preparing to unpack .../08-libxml2_2.9.4+dfsg1-6.1ubuntu1.2_amd64.deb ...
Unpacking libxml2:amd64 (2.9.4+dfsg1-6.1ubuntu1.2) ...
Selecting previously unselected package mime-support.
Preparing to unpack .../09-mime-support_3.60ubuntu1_all.deb ...
Unpacking mime-support (3.60ubuntu1) ...
Selecting previously unselected package xz-utils.
Preparing to unpack .../10-xz-utils_5.2.2-1.3_amd64.deb ...
Unpacking xz-utils (5.2.2-1.3) ...
Selecting previously unselected package libfuse2:amd64.
Preparing to unpack .../11-libfuse2_2.9.7-1ubuntu1_amd64.deb ...
Unpacking libfuse2:amd64 (2.9.7-1ubuntu1) ...
Selecting previously unselected package fuse.
Preparing to unpack .../12-fuse_2.9.7-1ubuntu1_amd64.deb ...
Unpacking fuse (2.9.7-1ubuntu1) ...
Selecting previously unselected package libpsl5:amd64.
Preparing to unpack .../13-libpsl5_0.19.1-5build1_amd64.deb ...
Unpacking libpsl5:amd64 (0.19.1-5build1) ...
Selecting previously unselected package publicsuffix.
Preparing to unpack .../14-publicsuffix_20180223.1310-1_all.deb ...
Unpacking publicsuffix (20180223.1310-1) ...
Selecting previously unselected package libroken18-heimdal:amd64.
Preparing to unpack .../15-libroken18-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libroken18-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libasn1-8-heimdal:amd64.
Preparing to unpack .../16-libasn1-8-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libasn1-8-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libheimbase1-heimdal:amd64.
Preparing to unpack .../17-libheimbase1-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libheimbase1-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libhcrypto4-heimdal:amd64.
Preparing to unpack .../18-libhcrypto4-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libhcrypto4-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libwind0-heimdal:amd64.
Preparing to unpack .../19-libwind0-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libwind0-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libhx509-5-heimdal:amd64.
Preparing to unpack .../20-libhx509-5-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libhx509-5-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libkrb5-26-heimdal:amd64.
Preparing to unpack .../21-libkrb5-26-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libkrb5-26-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libheimntlm0-heimdal:amd64.
Preparing to unpack .../22-libheimntlm0-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libheimntlm0-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libgssapi3-heimdal:amd64.
Preparing to unpack .../23-libgssapi3-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libgssapi3-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libsasl2-modules-db:amd64.
Preparing to unpack .../24-libsasl2-modules-db_2.1.27~101-g0780600+dfsg-3ubuntu2_amd64.deb ...
Unpacking libsasl2-modules-db:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Selecting previously unselected package libsasl2-2:amd64.
Preparing to unpack .../25-libsasl2-2_2.1.27~101-g0780600+dfsg-3ubuntu2_amd64.deb ...
Unpacking libsasl2-2:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Selecting previously unselected package libldap-common.
Preparing to unpack .../26-libldap-common_2.4.45+dfsg-1ubuntu1.1_all.deb ...
Unpacking libldap-common (2.4.45+dfsg-1ubuntu1.1) ...
Selecting previously unselected package libldap-2.4-2:amd64.
Preparing to unpack .../27-libldap-2.4-2_2.4.45+dfsg-1ubuntu1.1_amd64.deb ...
Unpacking libldap-2.4-2:amd64 (2.4.45+dfsg-1ubuntu1.1) ...
Selecting previously unselected package libnghttp2-14:amd64.
Preparing to unpack .../28-libnghttp2-14_1.30.0-1ubuntu1_amd64.deb ...
Unpacking libnghttp2-14:amd64 (1.30.0-1ubuntu1) ...
Selecting previously unselected package librtmp1:amd64.
Preparing to unpack .../29-librtmp1_2.4+20151223.gitfa8646d.1-1_amd64.deb ...
Unpacking librtmp1:amd64 (2.4+20151223.gitfa8646d.1-1) ...
Selecting previously unselected package libcurl3-gnutls:amd64.
Preparing to unpack .../30-libcurl3-gnutls_7.58.0-2ubuntu3.5_amd64.deb ...
Unpacking libcurl3-gnutls:amd64 (7.58.0-2ubuntu3.5) ...
Selecting previously unselected package libsasl2-modules:amd64.
Preparing to unpack .../31-libsasl2-modules_2.1.27~101-g0780600+dfsg-3ubuntu2_amd64.deb ...
Unpacking libsasl2-modules:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Selecting previously unselected package s3fs.
Preparing to unpack .../32-s3fs_1.82-1_amd64.deb ...
Unpacking s3fs (1.82-1) ...
Setting up libicu60:amd64 (60.2-3ubuntu3) ...
Setting up libnghttp2-14:amd64 (1.30.0-1ubuntu1) ...
Setting up mime-support (3.60ubuntu1) ...
Setting up libldap-common (2.4.45+dfsg-1ubuntu1.1) ...
Setting up libpsl5:amd64 (0.19.1-5build1) ...
Setting up libfuse2:amd64 (2.9.7-1ubuntu1) ...
Setting up libsasl2-modules-db:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Setting up libsasl2-2:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Setting up libroken18-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-1) ...
Setting up libxml2:amd64 (2.9.4+dfsg1-6.1ubuntu1.2) ...
Setting up libmagic-mgc (1:5.32-2ubuntu0.1) ...
Setting up libmagic1:amd64 (1:5.32-2ubuntu0.1) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Setting up publicsuffix (20180223.1310-1) ...
Setting up libssl1.1:amd64 (1.1.0g-2ubuntu4.3) ...
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install the Term::ReadLine module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)
debconf: falling back to frontend: Teletype
Setting up xz-utils (5.2.2-1.3) ...
update-alternatives: using /usr/bin/xz to provide /usr/bin/lzma (lzma) in auto mode
update-alternatives: warning: skip creation of /usr/share/man/man1/lzma.1.gz because associated file /usr/share/man/man1/xz.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/unlzma.1.gz because associated file /usr/share/man/man1/unxz.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzcat.1.gz because associated file /usr/share/man/man1/xzcat.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzmore.1.gz because associated file /usr/share/man/man1/xzmore.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzless.1.gz because associated file /usr/share/man/man1/xzless.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzdiff.1.gz because associated file /usr/share/man/man1/xzdiff.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzcmp.1.gz because associated file /usr/share/man/man1/xzcmp.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzgrep.1.gz because associated file /usr/share/man/man1/xzgrep.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzegrep.1.gz because associated file /usr/share/man/man1/xzegrep.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzfgrep.1.gz because associated file /usr/share/man/man1/xzfgrep.1.gz (of link group lzma) doesn't exist
Setting up libheimbase1-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up openssl (1.1.0g-2ubuntu4.3) ...
Setting up libsqlite3-0:amd64 (3.22.0-1) ...
Setting up libsasl2-modules:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Setting up ca-certificates (20180409) ...
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install the Term::ReadLine module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)
debconf: falling back to frontend: Teletype
Updating certificates in /etc/ssl/certs...
133 added, 0 removed; done.
Setting up fuse (2.9.7-1ubuntu1) ...
Setting up libwind0-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libasn1-8-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libhcrypto4-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up file (1:5.32-2ubuntu0.1) ...
Setting up libhx509-5-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libkrb5-26-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libheimntlm0-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libgssapi3-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libldap-2.4-2:amd64 (2.4.45+dfsg-1ubuntu1.1) ...
Setting up libcurl3-gnutls:amd64 (7.58.0-2ubuntu3.5) ...
Setting up s3fs (1.82-1) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Processing triggers for ca-certificates (20180409) ...
Updating certificates in /etc/ssl/certs...
0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
+ cp /root/bin/s3fs /host/usr/local/bin/
cp: cannot create regular file '/host/usr/local/bin/': Not a directory

I took a look at ffdl/ibmcloud-object-storage-deployer:v0.1 on Docker Hub, but there is no Dockerfile there. Thanks in advance for any ideas.
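For what it's worth, the `Not a directory` error above usually means the target directory does not exist on the node. A minimal sketch of the likely fix, using demo stand-in paths so it can run anywhere (on the real node, `HOST_ROOT` would be the pod's `/host` mount and the fix is a `mkdir -p` before the `cp`):

```shell
# The deployer's cp fails because /host/usr/local/bin is missing on the VM.
# HOST_ROOT and s3fs-demo are demo stand-ins for the real /host mount and
# the /root/bin/s3fs binary.
HOST_ROOT="${HOST_ROOT:-./host-demo}"
mkdir -p "$HOST_ROOT/usr/local/bin"   # create the target directory first
touch s3fs-demo                       # stand-in for /root/bin/s3fs
cp s3fs-demo "$HOST_ROOT/usr/local/bin/"
ls "$HOST_ROOT/usr/local/bin/"
```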

@animeshsingh

@sboagibm @fplk Please look into this

@fplk
Contributor

fplk commented Dec 20, 2018

GKE should use Container-Optimized OS underneath, cmp. https://cloud.google.com/container-optimized-os/ and it is possible the open source driver will not work without modification on that. If you want to deploy to GKE, you would have to first make sure https://github.com/IBM/ibmcloud-object-storage-plugin works. Since I don't have access to GKE, I cannot test or fix this for you.

Regarding the general FfDL setup, it should cleanly deploy against IBM Cloud. Unfortunately, I have two hard deadlines at the end of the week, so I cannot look into deployment on Minikube and DIND right now. DIND 1.10 worked a while ago, I briefly tried to deploy against 1.12 not too long ago and also ran into problems.

@JunFugithub
Author

Thanks for the advice on locating the issue; I'm still trying.

@sboagibm
Contributor

@JunFugithub For minikube, could you look for the StatefulSet that is created, run a kubectl get sts/xxxxx -o yaml, and send the results? For DIND, do the same, but also run a kubectl describe on the failed pod?

@JunFugithub
Author

  • minikube
    As for the minikube part, there are a few things to mention.
  1. I created a PV and PVC (learner-1) using hostPath instead of the previous NFS, and I manually edited the PVC section of the lhelper deployment and the learner StatefulSet, because I suspected NFS might be part of the problem.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: 2018-12-20T14:53:36Z
  generation: 2
  labels:
    service: dlaas-learner
    training_id: training-pTOewHymR
    user_id: test-user
  name: learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1
  namespace: default
  resourceVersion: "22353"
  selfLink: /apis/apps/v1/namespaces/default/statefulsets/learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1
  uid: 0785179e-0467-11e9-a165-c2aacdd61c5f
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      service: dlaas-learner
      training_id: training-pTOewHymR
      user_id: test-user
  serviceName: learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/nvidiaGPU: '{ "AllocationPriority": "Dense"
          }'
        scheduler.alpha.kubernetes.io/tolerations: '[ { "key": "dedicated", "operator":
          "Equal", "value": "gpu-task" } ]'
      creationTimestamp: null
      labels:
        service: dlaas-learner
        training_id: training-pTOewHymR
        user_id: test-user
    spec:
      automountServiceAccountToken: false
      containers:
      - command:
        - bash
        - -c
        - "export PATH=/usr/local/bin/:$PATH; cp /entrypoint-files/*.sh /usr/local/bin/;
          chmod +x /usr/local/bin/*.sh;\n\t\t\tif [ ! -f /job/load-model.exit ]; then\n\t\t\t\twhile
          [ ! -f /job/load-model.start ]; do sleep 2; done ;\n\t\t\t\tdate \"+%s%N\"
          | cut -b1-13 > /job/load-model.start_time ;\n\t\t\t\t\n\t\t\techo \"Starting
          Training $TRAINING_ID\"\n\t\t\tmkdir -p \"$MODEL_DIR\" ;\n\t\t\tpython -m
          zipfile -e $RESULT_DIR/_submitted_code/model.zip $MODEL_DIR  ;\n\t\t\t\techo
          $? > /job/load-model.exit ;\n\t\t\tfi\n\t\t\techo \"Done load-model\" ;\n\t\t\tif
          [ ! -f /job/learner.exit ]; then\n\t\t\t\twhile [ ! -f /job/learner.start
          ]; do sleep 2; done ;\n\t\t\t\tdate \"+%s%N\" | cut -b1-13 > /job/learner.start_time
          ;\n\t\t\t\t\n\t\t\tfor i in ${!ALERTMANAGER*} ${!DLAAS*} ${!ETCD*} ${!GRAFANA*}
          ${!HOSTNAME*} ${!KUBERNETES*} ${!MONGO*} ${!PUSHGATEWAY*}; do unset $i;
          done;\n\t\t\texport LEARNER_ID=$((${DOWNWARD_API_POD_NAME##*-} + 1)) ;\n\t\t\tmkdir
          -p $RESULT_DIR/learner-$LEARNER_ID ;\n\t\t\tmkdir -p $CHECKPOINT_DIR ;bash
          -c 'train.sh >> $JOB_STATE_DIR/latest-log 2>&1 ; exit ${PIPESTATUS[0]}'
          ;\n\t\t\t\techo $? > /job/learner.exit ;\n\t\t\tfi\n\t\t\techo \"Done learner\"
          ;\n\t\t\tif [ ! -f /job/store-logs.exit ]; then\n\t\t\t\twhile [ ! -f /job/store-logs.start
          ]; do sleep 2; done ;\n\t\t\t\tdate \"+%s%N\" | cut -b1-13 > /job/store-logs.start_time
          ;\n\t\t\t\t\n\t\t\techo Calling copy logs.\n\t\t\tmv -nf $LOG_DIR/* $RESULT_DIR/learner-$LEARNER_ID
          ;\n\t\t\tERROR_CODE=$? ;\n\t\t\techo $ERROR_CODE > $RESULT_DIR/learner-$LEARNER_ID/.log-copy-complete
          ;\n\t\t\tbash -c 'exit $ERROR_CODE' ;\n\t\t\t\techo $? > /job/store-logs.exit
          ;\n\t\t\tfi\n\t\t\techo \"Done store-logs\" ;\n\t\twhile true; do sleep
          2; done ;"
        env:
        - name: DATA_DIR
          value: /mnt/data/tf_training_data
        - name: LOG_DIR
          value: /job/logs
        - name: RESULT_DIR
          value: /mnt/results/tf_trained_model/training-pTOewHymR
        - name: MODEL_DIR
          value: /job/model-code
        - name: TRAINING_COMMAND
          value: 'python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz   --trainLabelsFile
            ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz   --testLabelsFile
            ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001   --trainingIters
            2000 '
        - name: TRAINING_ID
          value: training-pTOewHymR
        - name: GPU_COUNT
          value: "0.000000"
        - name: DOWNWARD_API_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: DOWNWARD_API_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: LEARNER_NAME_PREFIX
          value: learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1
        - name: TRAINING_ID
          value: training-pTOewHymR
        - name: NUM_LEARNERS
          value: "1"
        - name: JOB_STATE_DIR
          value: /job
        - name: CHECKPOINT_DIR
          value: /mnt/results/tf_trained_model/_wml_checkpoints
        - name: RESULT_BUCKET_DIR
          value: /mnt/results/tf_trained_model
        image: tensorflow/tensorflow:1.5.0-py3
        imagePullPolicy: IfNotPresent
        name: learner
        ports:
        - containerPort: 22
          protocol: TCP
        - containerPort: 2222
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 1048576k
            nvidia.com/gpu: "0"
          requests:
            cpu: 500m
            memory: 1048576k
            nvidia.com/gpu: "0"
        securityContext:
          capabilities:
            drop:
            - CHOWN
            - DAC_OVERRIDE
            - FOWNER
            - FSETID
            - KILL
            - SETPCAP
            - NET_RAW
            - MKNOD
            - SETFCAP
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/data/tf_training_data
          name: cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1
        - mountPath: /mnt/results/tf_trained_model
          name: cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1
        - mountPath: /job
          name: jobdata
          subPath: training-pTOewHymR
        - mountPath: /entrypoint-files
          name: learner-entrypoint-files
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: regcred
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Equal
        value: gpu-task
      volumes:
      - flexVolume:
          driver: ibm/ibmc-s3fs
          options:
            bucket: tf_training_data
            cache-size-gb: "0"
            chunk-size-mb: "52"
            curl-debug: "false"
            debug-level: warn
            endpoint: http://192.168.64.25:31971
            ensure-disk-free: "0"
            kernel-cache: "true"
            multireq-max: "20"
            parallel-count: "5"
            region: us-standard
            s3fs-fuse-retry-count: "30"
            stat-cache-size: "100000"
            tls-cipher-suite: DEFAULT
          secretRef:
            name: cossecretdata-3a77bbc9-7418-44d6-7797-e697a1d43fd1
        name: cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1
      - flexVolume:
          driver: ibm/ibmc-s3fs
          options:
            bucket: tf_trained_model
            cache-size-gb: "0"
            chunk-size-mb: "52"
            curl-debug: "false"
            debug-level: warn
            endpoint: http://192.168.64.25:31971
            ensure-disk-free: "2048"
            kernel-cache: "false"
            multireq-max: "20"
            parallel-count: "2"
            region: us-standard
            s3fs-fuse-retry-count: "30"
            stat-cache-size: "100000"
            tls-cipher-suite: DEFAULT
          secretRef:
            name: cossecretresults-3a77bbc9-7418-44d6-7797-e697a1d43fd1
        name: cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1
      - configMap:
          defaultMode: 420
          name: learner-entrypoint-files
        name: learner-entrypoint-files
      - name: jobdata
        persistentVolumeClaim:
          claimName: learner-1
  updateStrategy:
    type: OnDelete
status:
  collisionCount: 0
  currentRevision: learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-7df856b884
  observedGeneration: 2
  replicas: 1
  updateRevision: learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-5dc4cfdf78
  updatedReplicas: 1
  2. I found that the same issue with the ibmcloud-object-storage-deployer pod occurs at the very beginning. I chose to ignore that pod because all the other pods deployed in the namespace (generally the default namespace) worked. From @fplk's advice, I realized that neither s3fs nor the ibm volume plugin was present inside minikube. I tried again yesterday, following the steps in this doc and copying the ibmc-s3fs volume plugin to /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs inside minikube. However, I don't really know how to install s3fs itself inside minikube manually, which is probably based on an OS called Buildroot 2018.05. (From my understanding, the ibmcloud-object-storage-deployer pod handles this step somehow, but I don't know why it failed.) It failed again, though with a subtle difference from before; following are the warnings from the pods.
$ kubectl describe po learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0

Events:
  Type     Reason                 Age               From               Message
  ----     ------                 ----              ----               -------
  Normal   Scheduled              9m                default-scheduler  Successfully assigned learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0 to minikube
  Normal   SuccessfulMountVolume  9m                kubelet, minikube  MountVolume.SetUp succeeded for volume "learner-entrypoint-files"
  Normal   SuccessfulMountVolume  9m                kubelet, minikube  MountVolume.SetUp succeeded for volume "hostpathtest"
  Warning  FailedMount            5m (x2 over 7m)   kubelet, minikube  Unable to mount volumes for pod "learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0_default(06ed9d78-0529-11e9-a165-c2aacdd61c5f)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0". list of unmounted volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1]. list of unattached volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 learner-entrypoint-files jobdata]
  Warning  FailedMount            3m (x11 over 9m)  kubelet, minikube  MountVolume.SetUp failed for volume "cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed:
  Warning  FailedMount            3m (x11 over 9m)  kubelet, minikube  MountVolume.SetUp failed for volume "cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed:

# truncated, same results over and over again
$ minikube logs
Dec 21 14:06:29 minikube kubelet[17241]: E1221 14:06:29.648527   17241 driver-call.go:258] mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed:
Dec 21 14:06:29 minikube kubelet[17241]: E1221 14:06:29.649222   17241 nestedpendingoperations.go:267] Operation for "\"flexvolume-ibm/ibmc-s3fs/06ed9d78-0529-11e9-a165-c2aacdd61c5f-cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\" (\"06ed9d78-0529-11e9-a165-c2aacdd61c5f\")" failed. No retries permitted until 2018-12-21 14:08:31.649187316 +0000 UTC m=+474.175452305 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\" (UniqueName: \"flexvolume-ibm/ibmc-s3fs/06ed9d78-0529-11e9-a165-c2aacdd61c5f-cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\") pod \"learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0\" (UID: \"06ed9d78-0529-11e9-a165-c2aacdd61c5f\") : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: "
Dec 21 14:06:29 minikube kubelet[17241]: E1221 14:06:29.953671   17241 driver-call.go:258] mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed:
Dec 21 14:06:29 minikube kubelet[17241]: E1221 14:06:29.954160   17241 nestedpendingoperations.go:267] Operation for "\"flexvolume-ibm/ibmc-s3fs/06ed9d78-0529-11e9-a165-c2aacdd61c5f-cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\" (\"06ed9d78-0529-11e9-a165-c2aacdd61c5f\")" failed. No retries permitted until 2018-12-21 14:08:31.954118575 +0000 UTC m=+474.480383265 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\" (UniqueName: \"flexvolume-ibm/ibmc-s3fs/06ed9d78-0529-11e9-a165-c2aacdd61c5f-cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\") pod \"learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0\" (UID: \"06ed9d78-0529-11e9-a165-c2aacdd61c5f\") : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: "
Dec 21 14:06:30 minikube kubelet[17241]: W1221 14:06:30.227089   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-trainer-858b8ccf95-fpttp due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:06:34 minikube kubelet[17241]: E1221 14:06:34.923885   17241 kubelet.go:1635] Unable to mount volumes for pod "learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0_default(06ed9d78-0529-11e9-a165-c2aacdd61c5f)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0". list of unmounted volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1]. list of unattached volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 learner-entrypoint-files jobdata]; skipping pod
Dec 21 14:06:34 minikube kubelet[17241]: E1221 14:06:34.924016   17241 pod_workers.go:186] Error syncing pod 06ed9d78-0529-11e9-a165-c2aacdd61c5f ("learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0_default(06ed9d78-0529-11e9-a165-c2aacdd61c5f)"), skipping: timeout expired waiting for volumes to attach or mount for pod "default"/"learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0". list of unmounted volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1]. list of unattached volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 learner-entrypoint-files jobdata]
Dec 21 14:06:42 minikube kubelet[17241]: W1221 14:06:42.227372   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-ui-55f5754ffb-d8msw due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:19 minikube kubelet[17241]: W1221 14:07:19.227941   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/jobmonitor-3a77bbc9-7418-44d6-7797-e697a1d43fd1-6c7d4d484942m5x due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:19 minikube kubelet[17241]: W1221 14:07:19.231865   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/lhelper-3a77bbc9-7418-44d6-7797-e697a1d43fd1-f7c7d96c5-4qqhh due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:20 minikube kubelet[17241]: W1221 14:07:20.229235   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-restapi-6fc48bd5b5-wdwbr due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:21 minikube kubelet[17241]: W1221 14:07:21.227647   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-lcm-6d96b5767b-g2nn6 due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:23 minikube kubelet[17241]: W1221 14:07:23.226763   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-trainingdata-c57f5cddd-bsfm4 due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:31 minikube kubelet[17241]: W1221 14:07:31.227217   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-trainer-858b8ccf95-fpttp due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:47 minikube kubelet[17241]: W1221 14:07:47.226982   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-ui-55f5754ffb-d8msw due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:08:23 minikube kubelet[17241]: W1221 14:08:23.233435   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/lhelper-3a77bbc9-7418-44d6-7797-e697a1d43fd1-f7c7d96c5-4qqhh due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:08:25 minikube kubelet[17241]: W1221 14:08:25.230748   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-trainingdata-c57f5cddd-bsfm4 due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:08:27 minikube kubelet[17241]: W1221 14:08:27.226867   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-restapi-6fc48bd5b5-wdwbr due to secrets "regcred" not found.  The image pull may not succeed.

reproduce

$ minikube start --insecure-registry 9.0.0.0/8 --insecure-registry 10.0.0.0/8 \
                 --cpus 4 \
                 --memory 4096 --disk-size=40g\
                 --vm-driver=hyperkit --apiserver-ips 127.0.0.1 --apiserver-name localhost --logtostderr
$ make deploy-plugin
$ make quickstart-deploy
$ make test-push-data-s3
$ make test-job-submit
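The manual flexvolume plugin copy mentioned above can be sketched as follows. PLUGIN_ROOT defaults to a demo directory so the sketch runs anywhere; inside minikube the real root is /usr/libexec/kubernetes/kubelet-plugins/volume/exec, and the driver binary would come from the plugin image rather than a `touch` stand-in:

```shell
# Hedged sketch: install the ibmc-s3fs flexvolume driver under the kubelet's
# plugin directory. The kubelet discovers drivers at
# <plugin-root>/<vendor>~<driver>/<driver>.
PLUGIN_ROOT="${PLUGIN_ROOT:-./demo-kubelet-plugins/volume/exec}"
mkdir -p "$PLUGIN_ROOT/ibm~ibmc-s3fs"
touch ibmc-s3fs                                   # stand-in for the real driver binary
cp ibmc-s3fs "$PLUGIN_ROOT/ibm~ibmc-s3fs/ibmc-s3fs"
chmod +x "$PLUGIN_ROOT/ibm~ibmc-s3fs/ibmc-s3fs"   # the kubelet executes it directly
```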
  • dind
    As for DIND, everything went well until the training part. When I ran make test-job-submit, it got stuck. It's abnormal because no further pods were created afterwards, so I can provide neither the YAML of the learner StatefulSet nor the description of the failed pods. It printed FAILED,\n Error 200: OK. I thought there was a request problem, so I retrieved the logs from the ffdl-restapi-xx pod; I got an rpc error in the pod log, but have no idea what happened.
$ kubectl logs ffdl-restapi-7f5c57c77d-lp4k2
time="2018-12-21T14:31:06Z" level=debug msg="Log level set to 'debug'"
time="2018-12-21T14:31:06Z" level=debug msg="Milli CPU is: 60"
time="2018-12-21T14:31:06Z" level=info msg="GetTrainingDataMemInMB() returns 300"
time="2018-12-21T14:31:06Z" level=debug msg="Training Data Mem in MB is: 300"
time="2018-12-21T14:31:06Z" level=debug msg="No config file 'config-dev.yml' found. Using environment variables only."
{"level":"info","msg":"DLaaS REST API v1 serving on :8080","time":"2018-12-21T14:31:10Z"}
{"level":"info","method":"POST","msg":"Started handling request","remote":"127.0.0.1:40906","request":"/v1/models?version=2017-02-13","time":"2018-12-21T14:43:46Z"}
{"level":"debug","msg":"Enter into auth handler","time":"2018-12-21T14:43:46Z"}
{"level":"debug","msg":"request: \u0026{Method:POST URL:/v1/models?version=2017-02-13 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[application/json] Authorization:[Basic dGVzdC11c2VyOnRlc3Q=] Content-Type:[multipart/form-data; boundary=79f1ce044563b1c04bbc0fef5a4af5484d5361472883bc5af5b39e48168e] X-Watson-Userinfo:[bluemix-instance-id=test-user] Accept-Encoding:[gzip] User-Agent:[Go-http-client/1.1]] Body:0xc420374e00 GetBody:\u003cnil\u003e ContentLength:-1 TransferEncoding:[chunked] Close:false Host:localhost:32605 Form:map[] PostForm:map[] MultipartForm:\u003cnil\u003e Trailer:map[] RemoteAddr:127.0.0.1:40906 RequestURI:/v1/models?version=2017-02-13 TLS:\u003cnil\u003e Cancel:\u003cnil\u003e Response:\u003cnil\u003e ctx:0xc420374e40}","time":"2018-12-21T14:43:46Z"}
{"level":"debug","msg":"Writing to header in callBefore \"Access-Control-Allow-Origin: *\"","time":"2018-12-21T14:43:46Z"}
{"level":"debug","msg":"wmlTenantID: ","time":"2018-12-21T14:43:46Z"}
{"level":"debug","msg":"X-DLaaS-UserID: test-user","time":"2018-12-21T14:43:46Z"}
{"Accept":["application/json"],"Accept-Encoding":["gzip"],"Authorization":["Basic dGVzdC11c2VyOnRlc3Q="],"Content-Type":["multipart/form-data; boundary=79f1ce044563b1c04bbc0fef5a4af5484d5361472883bc5af5b39e48168e"],"User-Agent":["Go-http-client/1.1"],"X-Dlaas-Userid":["test-user"],"X-Watson-Userinfo":["bluemix-instance-id=test-user"],"level":"debug","msg":"Request headers:","time":"2018-12-21T14:43:46Z"}
{"caller_info":"server/models_impl.go:63 postModel -","level":"debug","model_filename":"manifest_testrun.yml","module":"rest-api","msg":"postModel invoked: map[Accept:[application/json] Authorization:[Basic dGVzdC11c2VyOnRlc3Q=] Content-Type:[multipart/form-data; boundary=79f1ce044563b1c04bbc0fef5a4af5484d5361472883bc5af5b39e48168e] X-Watson-Userinfo:[bluemix-instance-id=test-user] Accept-Encoding:[gzip] X-Dlaas-Userid:[test-user] User-Agent:[Go-http-client/1.1]]","time":"2018-12-21T14:43:46Z","user_id":"test-user"}
{"caller_info":"server/models_impl.go:59 postModel -","level":"debug","model_filename":"manifest_testrun.yml","module":"rest-api","msg":"Loading Manifest","time":"2018-12-21T14:43:46Z","user_id":"test-user"}
{"level":"info","msg":"dialing to target with scheme: \"\"","time":"2018-12-21T14:43:47Z"}
{"level":"info","msg":"ccResolverWrapper: sending new addresses to cc: [{ffdl-trainer.default.svc.cluster.local:80 0  \u003cnil\u003e}]","time":"2018-12-21T14:43:47Z"}
{"level":"info","msg":"ClientConn switching balancer to \"pick_first\"","time":"2018-12-21T14:43:47Z"}
{"level":"info","msg":"pickfirstBalancer: HandleSubConnStateChange: 0xc420281ea0, CONNECTING","time":"2018-12-21T14:43:47Z"}
{"level":"info","msg":"pickfirstBalancer: HandleSubConnStateChange: 0xc420281ea0, READY","time":"2018-12-21T14:43:47Z"}
{"caller_info":"server/manifest.go:237 manifest2TrainingRequest -","level":"debug","model_filename":"manifest_testrun.yml","module":"rest-api","msg":"EMExtractionSpec ImageTag: ","time":"2018-12-21T14:43:47Z","user_id":"test-user"}
{"caller_info":"server/models_impl.go:117 postModel -","error":"rpc error: code = Canceled desc = context canceled","level":"error","model_filename":"manifest_testrun.yml","module":"rest-api","msg":"Trainer service call failed","time":"2018-12-21T14:43:56Z","user_id":"test-user"}
{"caller_info":"server/models_impl.go:857 error500 -","level":"error","model_filename":"manifest_testrun.yml","module":"rest-api","msg":"Returning 500 error: ","time":"2018-12-21T14:43:56Z","user_id":"test-user"}
{"level":"info","measure#rest-api.latency":9943356700,"method":"POST","msg":"Completed handling request","remote":"127.0.0.1:40906","request":"/v1/models?version=2017-02-13","status":500,"text_status":"Internal Server Error","time":"2018-12-21T14:43:56Z","took":9943356700}

Yesterday @fplk mentioned the DIND version. I downloaded DIND 1.10.9, but gave up on it because its installation failed.

FYI, there are two more things I'd like to mention. I left the environment variable SHARED_VOLUME_STORAGE_CLASS empty on both the minikube and DIND VMs; I hope that is unrelated. Also, the S3 service part works well: after running make test-push-data-s3, I verified that the S3 buckets do contain the training data on both DIND and minikube.

@sboagibm Thanks a lot for any of your suggestions.

@fplk
Contributor

fplk commented Jan 3, 2019

I apologize for the delay due to the holidays. I think I can reproduce the error you encountered and have been able to get it working. A couple of things:

a) The scripts in https://github.com/IBM/FfDL/tree/master/bin/dind_scripts should largely work with the exception that you need to update DIND in launch_kubernetes.sh (RawGit is deprecated and 1.9 is old, so I successfully used https://github.com/kubernetes-sigs/kubeadm-dind-cluster/releases/download/v0.1.0/dind-cluster-v1.13.sh - just replace all occurrences in the file accordingly)
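A hedged sketch of that replacement. The RawGit URL in the stub file below is an assumption about what launch_kubernetes.sh contains (remove the stub line when running against the real repo); the sed pattern replaces any old dind-cluster-v*.sh URL with the maintained release:

```shell
# Demo stub standing in for the real launch_kubernetes.sh (the exact old URL
# is an assumption).
printf 'wget https://cdn.rawgit.com/kubernetes-sigs/kubeadm-dind-cluster/master/fixed/dind-cluster-v1.9.sh\n' > launch_kubernetes.sh

# Replace every occurrence of an old dind-cluster-v*.sh URL with the
# maintained v0.1.0 release script.
NEW_URL="https://github.com/kubernetes-sigs/kubeadm-dind-cluster/releases/download/v0.1.0/dind-cluster-v1.13.sh"
sed -i.bak -E "s|https://[^\" ]*dind-cluster-v[0-9.]+\.sh|$NEW_URL|g" launch_kubernetes.sh
grep 'dind-cluster' launch_kubernetes.sh
```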

b) Here are my steps:

ssh root@<machine>.sl.cloud9.ibm.com
apt install -y git software-properties-common
mkdir -p /home/ffdlr/go/src/github.com/IBM/ && cd $_ && git clone https://github.com/IBM/FfDL.git && cd FfDL
# Replace DIND version as explained in (a)
cd bin/dind_scripts/
chmod +x create_user.sh
. create_user.sh
# Enter new password and get kicked out

ssh ffdlr@<machine>.sl.cloud9.ibm.com
cd /home/ffdlr/go/src/github.com/IBM/FfDL/bin/dind_scripts/
sudo chmod +x experimental_master.sh
. experimental_master.sh

Build own manifest with:

name: tf_convolutional_network_tutorial
description: Convolutional network model using tensorflow
version: "1.0"
gpus: 0
cpus: 0.5
memory: 1Gb
learners: 1

# Object stores that allow the system to retrieve training data.
data_stores:
  - id: sl-internal-os
    type: mount_cos
    training_data:
      container: REPLACE_INPUT_BUCKET
    training_results:
      container: REPLACE_OUTPUT_BUCKET
    connection:
      auth_url: REPLACE_ENDPOINT
      user_name: REPLACE_KEY_ID
      password: REPLACE_KEY

framework:
  name: tensorflow
  version: "1.5.0-py3"
  command: >
    python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
      --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
      --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001
      --trainingIters 2000
  # Change trainingIters to 20000 if you want your model to have over 80% Accuracy rate.

evaluation_metrics:
  type: tensorboard
  in: "$JOB_STATE_DIR/logs/tb"
  # (Eventual) Available event types: 'images', 'distributions', 'histograms', 'images'
  # 'audio', 'scalars', 'tensors', 'graph', 'meta_graph', 'run_metadata'
  #  event_types: [scalars]

Run the following from within /home/ffdlr/go/src/github.com/IBM/FfDL/etc/examples/tf-model:

DLAAS_URL=http://10.192.0.3:31826 DLAAS_USERNAME=test-user DLAAS_PASSWORD=test /home/ffdlr/go/src/github.com/IBM/FfDL/cli/bin/ffdl-linux train mymanifest.yml .

c) This should work, but is currently not the most user-friendly way of setting things up. I can try to push some changes to improve usability - it looks from the outside like the manifest creation gets garbled up somewhere, but until then this should get you running.

d) Minikube is a suboptimal environment due to unfixed storage bugs on their side. FfDL should be able to run against GKE in principle, but I'm not sure the open source S3 driver will work against that. I think the storage team supports DIND and IBM Cloud and their architecture should work against any cloud provider, but you would have to test the driver manually and PR minor changes if it does not work against your target provider out of the box. Or use a different driver.
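To "test the driver manually" as suggested, one can invoke the flexvolume call interface directly. The stub below is a demo driver, not the real one; it only illustrates the contract the kubelet expects per the Kubernetes flexvolume spec (the real binary at .../volume/exec/ibm~ibmc-s3fs/ibmc-s3fs can be smoke-tested the same way with its `init` subcommand):

```shell
# Demo stub of the flexvolume call interface: the kubelet invokes the driver
# binary with subcommands (init, mount, unmount, ...) and expects a JSON
# status object on stdout.
cat > demo-driver.sh <<'EOF'
#!/bin/sh
case "$1" in
  init) echo '{"status":"Success","capabilities":{"attach":false}}' ;;
  *)    echo '{"status":"Not supported"}'; exit 1 ;;
esac
EOF
chmod +x demo-driver.sh
./demo-driver.sh init
```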

@fplk fplk mentioned this issue Jan 4, 2019
@fplk
Contributor

fplk commented Jan 4, 2019

OK, with #158 I can deploy FfDL with the typical 4 commands from the README against DIND on macOS and Linux as well as IBM Cloud. Please let me know if it helps.
