This document describes how to deploy Fluid with Helm, then use Fluid to create a dataset and speed up data access for your application.
- Kubernetes 1.14+

  If you don't have a Kubernetes cluster yet, we highly recommend using a cloud Kubernetes service. You can usually get your own cluster in a few steps. Here are some certified cloud Kubernetes services:
  - Amazon Elastic Kubernetes Service
  - Google Kubernetes Engine
  - Azure Kubernetes Service
  - Aliyun Container Service for Kubernetes

  Note: Although convenient, Minikube is not recommended for deploying Fluid because of its limited functionality.
- kubectl 1.14+

  Please make sure your kubectl is properly configured to interact with your Kubernetes environment.
- Helm 3

  In the following steps, we'll deploy Fluid with Helm 3.
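The version requirements above can be checked before you start. The following is a minimal sketch, not part of Fluid: `version_at_least` is a hypothetical helper that compares only the major and minor parts of a version string, which is enough for the "1.14+" and "Helm 3" requirements listed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Fluid): compare MAJOR.MINOR versions.

# _num VALUE: keep the leading digits of VALUE and print them as a
# base-10 integer (empty input becomes 0).
_num() {
  local n="${1%%[!0-9]*}"
  echo "$(( 10#${n:-0} ))"
}

# version_at_least ACTUAL MINIMUM
# Succeeds when ACTUAL >= MINIMUM. A leading "v" and anything after the
# minor number (patch level, build metadata) are ignored.
version_at_least() {
  local actual="${1#v}" minimum="${2#v}"
  local a_major a_minor m_major m_minor
  IFS=. read -r a_major a_minor _ <<<"$actual"
  IFS=. read -r m_major m_minor _ <<<"$minimum"
  a_major=$(_num "$a_major"); a_minor=$(_num "$a_minor")
  m_major=$(_num "$m_major"); m_minor=$(_num "$m_minor")
  if (( a_major != m_major )); then
    (( a_major > m_major ))
  else
    (( a_minor >= m_minor ))
  fi
}

# Example with literal version strings:
if version_at_least "v1.18.3" "1.14"; then
  echo "Kubernetes version is new enough"
fi
```

To check a real installation, you could pass in whatever version string your tools report, for example `version_at_least "$(helm version --short)" 3.0`, assuming your Helm build prints a string such as `v3.2.4+g0ad800e` for that command.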
- Create a namespace for Fluid

      $ kubectl create ns fluid-system
- Download the latest Fluid release from the GitHub release page
- Deploy Fluid with Helm

      $ helm install fluid fluid-<version>.tgz
      NAME: fluid
      LAST DEPLOYED: Tue Jul 7 11:22:07 2020
      NAMESPACE: default
      STATUS: deployed
      REVISION: 1
      TEST SUITE: None
- Check the running status of Fluid

      $ kubectl get po -n fluid-system
      NAME                                         READY   STATUS    RESTARTS   AGE
      alluxioruntime-controller-64948b68c9-zzsx2   1/1     Running   0          108s
      csi-nodeplugin-fluid-2mfcr                   2/2     Running   0          108s
      csi-nodeplugin-fluid-l7lv6                   2/2     Running   0          108s
      dataset-controller-5465c4bbf9-5ds5p          1/1     Running   0          108s
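If you want to script the status check instead of reading the table by eye, something like the following sketch may help. `all_pods_ready` is a hypothetical helper, not a Fluid or kubectl command. It assumes the default `kubectl get po` column layout shown above (NAME, READY, STATUS, ...) and treats any pod that is not `Running` with all containers ready as a failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get po` table output on stdin.
# Succeeds only if at least one pod is listed and every pod is Running
# with all of its containers ready (READY column of the form "n/n").
all_pods_ready() {
  awk '
    $1 == "NAME" { next }            # skip the header row
    NF == 0      { next }            # skip blank lines
    {
      seen++
      split($2, ready, "/")          # READY column, e.g. "2/2"
      if ($3 != "Running" || ready[1] != ready[2]) {
        printf "not ready: %s (%s, %s)\n", $1, $2, $3
        bad++
      }
    }
    END { exit ((seen == 0 || bad > 0) ? 1 : 0) }
  '
}

# Usage against a live cluster (requires kubectl):
#   kubectl get po -n fluid-system | all_pods_ready && echo "Fluid is running"
```

Parsing the human-readable table keeps the example close to the output shown in this guide. For anything long-lived, a structured output format is likely a more robust choice.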
Fluid provides cloud-native data acceleration and management capabilities, and uses the dataset as a high-level abstraction to simplify data management for users. Here we will show you how to create a dataset with Fluid.
- Create a Dataset object through the CRD file, which describes the source of the dataset.

      $ cat<<EOF >dataset.yaml
      apiVersion: data.fluid.io/v1alpha1
      kind: Dataset
      metadata:
        name: demo
      spec:
        mounts:
          - mountPoint: https://mirrors.bit.edu.cn/apache/spark/spark-3.0.1/
            name: spark
      EOF
      $ kubectl create -f dataset.yaml
- Create an AlluxioRuntime CRD object to support the dataset we created. We use Alluxio as its runtime here.

      $ cat<<EOF >runtime.yaml
      apiVersion: data.fluid.io/v1alpha1
      kind: AlluxioRuntime
      metadata:
        name: demo
      spec:
        replicas: 1
        tieredstore:
          levels:
            - mediumtype: MEM
              path: /dev/shm
              quota: 2Gi
              high: "0.95"
              low: "0.7"
        properties:
          alluxio.user.block.size.bytes.default: 256MB
          alluxio.user.streaming.reader.chunk.size.bytes: 256MB
          alluxio.user.local.reader.chunk.size.bytes: 256MB
          alluxio.worker.network.reader.buffer.size: 256MB
        fuse:
          args:
            - fuse
            - --fuse-opts=kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty,max_readahead=0
      EOF

  Create the Alluxio runtime with kubectl:

      $ kubectl create -f runtime.yaml
- Next, we create an application to access this dataset. Here we will access the same data multiple times and compare the time consumed by each access.

      $ cat<<EOF >app.yaml
      apiVersion: v1
      kind: Pod
      metadata:
        name: demo-app
      spec:
        containers:
          - name: demo
            image: nginx
            volumeMounts:
              - mountPath: /data
                name: demo
        volumes:
          - name: demo
            persistentVolumeClaim:
              claimName: demo
      EOF

  Create the Pod with kubectl:

      $ kubectl create -f app.yaml
- Open a shell in the container and access the data. The first access takes longer.

      $ kubectl exec -it demo-app -- bash
      $ du -sh /data/spark/spark-3.0.1-bin-without-hadoop.tgz
      150M    /data/spark/spark-3.0.1-bin-without-hadoop.tgz
      $ time cp /data/spark/spark-3.0.1-bin-without-hadoop.tgz /dev/null
      real    0m13.171s
      user    0m0.002s
      sys     0m0.028s
- To rule out other factors such as the page cache, we delete the previous container, create the same application again, and access the same file. Because the file has now been cached by Alluxio, the access takes significantly less time (0.344s versus 13.171s in this run).

      $ kubectl delete -f app.yaml && kubectl create -f app.yaml
      $ kubectl exec -it demo-app -- bash
      $ time cp /data/spark/spark-3.0.1-bin-without-hadoop.tgz /dev/null
      real    0m0.344s
      user    0m0.002s
      sys     0m0.020s
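To put a number on the difference between the two runs, you can convert the `real` values reported by `time` to seconds and divide them. This is a small sketch with two hypothetical helpers, `to_seconds` and `speedup`. It assumes the `NmS.SSSs` format that bash's `time` prints above, and uses the two measurements from this guide as input.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing two `time` measurements.

# to_seconds VALUE: convert a value such as "0m13.171s" to seconds.
to_seconds() {
  LC_ALL=C awk -v t="$1" 'BEGIN {
    split(t, parts, "m")             # parts[1] = minutes, parts[2] = "13.171s"
    sub(/s$/, "", parts[2])          # drop the trailing "s"
    printf "%.3f\n", parts[1] * 60 + parts[2]
  }'
}

# speedup FIRST SECOND: how many times faster SECOND is than FIRST.
# Fails if SECOND is zero.
speedup() {
  LC_ALL=C awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" 'BEGIN {
    if (b == 0) exit 1
    printf "%.1f\n", a / b
  }'
}

# "real" times of the first (uncached) and second (cached) access above
speedup 0m13.171s 0m0.344s   # prints 38.3
```

In this run the cached read is roughly 38 times faster. Your own numbers will depend on the network, the file size, and the cache medium.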
We've now created a dataset and managed it in a very simple way. For more details about Fluid, we provide several sample docs for you: