To get started, clone the Datashim repository:

```
git clone https://github.com/datashim-io/datashim.git
cd datashim
```
After you have checked out the project and the correct branch, proceed with the installation of Minio. If you already have a cloud object store, you can skip this step.
```
kubectl apply -n dlf -f examples/minio/
```

This installs a demo Minio instance in the `dlf` namespace.
A final step is to create a secret named `minio-conf` in the `dlf` namespace, which holds the connection information for the cloud object store you will be using. If you have provisioned our demo Minio instance, execute the command below; otherwise, adapt the connection details to reflect your setup.
```
kubectl create secret generic minio-conf --from-literal='AWS_ACCESS_KEY_ID=minio' --from-literal='AWS_SECRET_ACCESS_KEY=minio123' --from-literal='ENDPOINT=http://minio-service:9000' -n dlf
```
You can check the status of the installation:
```
watch kubectl get pods -n dlf
```
```
NAME                                READY   STATUS      RESTARTS   AGE
csi-attacher-nfsplugin-0            2/2     Running     0          3m1s
csi-attacher-s3-0                   1/1     Running     0          3m1s
csi-hostpath-attacher-0             1/1     Running     0          3m1s
csi-hostpath-provisioner-0          1/1     Running     0          3m1s
csi-hostpathplugin-0                3/3     Running     0          3m1s
csi-nodeplugin-nfsplugin-vs7d9      2/2     Running     0          3m1s
csi-provisioner-s3-0                1/1     Running     0          3m1s
csi-s3-mrndx                        2/2     Running     0          3m1s
dataset-operator-76798546cf-9d6wj   1/1     Running     0          3m1s
generate-keys-n7m5l                 0/1     Completed   0          3m1s
minio-7979c89d5c-khncd              0/1     Running     0          3m
```
Now we can create a Dataset based on a remote archive as follows:
```
cat <<EOF | kubectl apply -f -
apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  type: "ARCHIVE"
  url: "https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-data-jfk-airport.tar.gz"
  format: "application/x-tar"
EOF
```
A PVC with the same name as the Dataset is created and bound:

```
$ kubectl get pvc
NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
example-dataset   Bound    pvc-c58852a6-a597-4eb8-a05b-23d9899226bf   9314Gi     RWX            csi-s3         15s
```
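Since the Dataset is a custom resource (the `dataset` resource name is also used for deletion later in this guide), you can inspect it directly as well:

```
kubectl get datasets
```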
Next, we label a pod so that the Dataset is mounted into it:

```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    dataset.0.id: "example-dataset"
    dataset.0.useas: "mount"
spec:
  containers:
  - name: nginx
    image: nginx
EOF
```
The Dataset contents are now available inside the pod under `/mnt/datasets/example-dataset`:

```
$ kubectl exec -it nginx /bin/bash
root@nginx:/# ls /mnt/datasets/example-dataset/
noaa-weather-data-jfk-airport
root@nginx:/# ls /mnt/datasets/example-dataset/noaa-weather-data-jfk-airport/
LICENSE.txt  README.txt  clean_data.py  jfk_weather.csv  jfk_weather_cleaned.csv
```
Next, we set up a Ceph cluster with Rook to back the caching plugin. Inside the `plugins/ceph-cache-plugin/deploy/rook` directory, execute:
```
kubectl create -f common.yaml
kubectl create -f operator.yaml
```
Edit `cluster.yaml` and declare the devices Ceph should use under `storage.nodes`, e.g.:

```
storage:
  useAllNodes: false
  useAllDevices: false
  nodes:
  - name: "minikube"
    devices:
    - name: "sdb"
      config:
        storeType: bluestore
        osdsPerDevice: "1"
```
Then create the cluster:

```
kubectl create -f cluster.yaml
```
After a while, you should see the Ceph pods running in the `rook-ceph` namespace:

```
rook-ceph-mgr-a-5f8f5c978-xgcpw      1/1   Running   0   79s
rook-ceph-mon-a-6879b87656-bxbrw     1/1   Running   0   89s
rook-ceph-operator-86f9b59b8-2fvkb   1/1   Running   0   5m29s
rook-ceph-osd-0-9dcb46c48-hrzvz      1/1   Running   0   43s
```
To tear the cluster down:

```
kubectl delete -f cluster.yaml
```

You also need to delete the paths defined in `dataDirHostPath` and `directories.path`.
Now we can proceed with installing DLF. If you are after maximum performance, we strongly advise setting up your Ceph cluster according to the method above. However, for testing purposes, or if you lack disk devices, we describe a method to test this inside minikube and provide a script, `plugins/ceph-cache-plugin/deploy/rook/setup_ceph_cluster.sh`, that installs Rook with the csi-lvm storage class.
First we need to have a working cluster:

```
minikube start --memory='6G' --cpus=4 --disk-size='40g' --driver=virtualbox -p rooktest
```
NOTE: run `./minikube/fix_minikube_losetup.py` to bypass the current issue of minikube with losetup.

NOTE 2: if you change the disk-size in the minikube command, make sure to tune the parameters below accordingly.

Before invoking the script, tune the following attributes according to your needs:
| Attribute | File | Description |
|---|---|---|
| `GIGA_SPACE` | `plugins/ceph-cache-plugin/deploy/rook/csi-lvm-setup/create-loops.yaml` | Size of the loop device that csi-lvm will create on each node |
| `spec.mon.volumeClaimTemplate.spec.resources.requests.storage` | `plugins/ceph-cache-plugin/deploy/rook/cluster-on-pvc.yaml` | Storage size of the Ceph mon service |
| `spec.storage.storageClassDeviceSets.volumeClaimTemplates.spec.resources.requests.storage` | `plugins/ceph-cache-plugin/deploy/rook/cluster-on-pvc.yaml` | Storage size of the Ceph OSDs |
| `spec.storage.storageClassDeviceSets.count` | `plugins/ceph-cache-plugin/deploy/rook/cluster-on-pvc.yaml` | Total number of Ceph OSDs |
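For orientation, a sketch of where these fields sit in `cluster-on-pvc.yaml` (the values shown are illustrative, not the file's defaults):

```yaml
spec:
  mon:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 10Gi        # storage size of the mon service
  storage:
    storageClassDeviceSets:
    - name: set1
      count: 3                   # total number of Ceph OSDs
      volumeClaimTemplates:
      - spec:
          resources:
            requests:
              storage: 10Gi      # storage size of each OSD
```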
The command line arguments of the script are the names of the nodes on which csi-lvm should create loop devices and on which the corresponding Ceph services will run, e.g.:

```
cd plugins/ceph-cache-plugin/deploy/rook && \
./setup_ceph_cluster.sh nodename1 ...
```
Keep in mind that the script will uninstall any previous installations of csi-lvm and rook-ceph that were made through the script. If no command line arguments are passed to the script, this will result in uninstalling everything.
To install DLF itself, go to the root of the repository and execute:

```
make deployment
```

The pods in the default namespace should look like this:
```
csi-attacher-nfsplugin-0            2/2   Running   0   7s
csi-attacher-s3-0                   1/1   Running   0   8s
csi-nodeplugin-nfsplugin-nqgtl      2/2   Running   0   7s
csi-provisioner-s3-0                2/2   Running   0   8s
csi-s3-k9b5j                        2/2   Running   0   8s
dataset-operator-7b8f65f7d4-hg8n5   1/1   Running   0   6s
```
Create a file named `my-dataset.yaml` with the following content:

```
apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: "COS"
    accessKeyID: "{AWS_ACCESS_KEY_ID}"
    secretAccessKey: "{AWS_SECRET_ACCESS_KEY}"
    endpoint: "{S3_SERVICE_URL}"
    bucket: "{BUCKET_NAME}"
    region: "" # it can be empty
```

and apply it:

```
kubectl create -f my-dataset.yaml
```
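For instance, with the demo Minio instance from the installation section, the `spec.local` fields could look like this (the bucket name is a hypothetical placeholder for a bucket in your object store):

```yaml
local:
  type: "COS"
  accessKeyID: "minio"
  secretAccessKey: "minio123"
  endpoint: "http://minio-service:9000"
  bucket: "my-bucket"   # hypothetical bucket name
  region: ""
```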
Now if you check the DatasetInternal objects and the PVCs, you will see `example-dataset`:
```
kubectl get datasetsinternal
kubectl get pvc
```
You can delete the Dataset with:

```
kubectl delete dataset/example-dataset
```

If you execute `kubectl describe datasetinternal/example-dataset`, you will see the credentials and the endpoints you originally specified.
Let's try to add the caching plugin. Change into the `plugins/ceph-cache-plugin` directory and invoke:

```
make deployment
```

Let's create the same dataset now that the plugin is deployed:
```
kubectl create -f my-dataset.yaml
```
You should see a new rgw pod starting up in the `rook-ceph` namespace:

```
rook-ceph-rgw-test-a-77f78b7b69-z5kp9   1/1   Running   0   4m43s
```
If you run `kubectl describe datasetinternal/example-dataset`, you will notice that the credentials are now different and point to the Rados Gateway instance; the PVC therefore reflects the cached version of the dataset.
+ One new Custom Resource Definition: the Dataset. Essentially this CRD is a declarative way to reference an existing data source. Moreover, we provide a mount-point in user's pod for each Dataset and expose an interface for caching mechanisms to leverage. +Current implementation supports S3- and NFS-based data sources.
Is a Dataset just a new kind of PVC? Not quite. For every Dataset we create one Persistent Volume Claim, which users can mount directly in their pods. We have implemented that logic as a regular Kubernetes operator.
Since the introduction of the Container Storage Interface (CSI), more and more storage providers have become available in Kubernetes environments. However, we feel that for inexperienced Kubernetes users, the installation, maintenance, and configuration needed to leverage the available CSI plugins and gain access to remote data sources from their pods can be a high barrier. By introducing a higher level of abstraction (the Dataset) and by taking care of all the necessary work around invoking the appropriate CSI plugin and configuring and provisioning the PVC, we aim to improve the user experience of data access in Kubernetes.
Is Datashim a replacement for CSI plugins? On the contrary! Every type of data source we support actually comes with its own completely standalone CSI implementation. We aspire to be a meta-framework for CSI plugins. If we have to make a comparison, we want to make different types of data sources accessible the same way Kubeflow makes machine learning frameworks accessible on Kubernetes.
Is Datashim an alternative to COSI? Absolutely not. COSI aims to manage the full lifecycle of a bucket (provisioning, configuring access, etc.), which is beyond our scope; we just want to offer a mount point for COS buckets.
We believe that by introducing the Dataset as a CRD, you can accomplish higher-level orchestration and contribute in areas such as:

- Performance: we have attempted to create a pluggable caching interface, with the Ceph Caching Plugin as an example implementation.
- Security: another effort we are exploring is a common access-management layer for the credentials of the different types of data sources.
To contribute, we'll roughly follow the GitHub development flow used by the Kubernetes project.
Visit https://github.com/datashim-io/datashim and fork your own copy of Datashim to your GitHub account. For the sake of illustration, let's say this fork corresponds to https://github.com/$user/datashim, where `$user` is your username.
Go to the source directory of your Go workspace and clone your fork there. Assuming your workspace is in `$HOME/goprojects` (see the development environment setup below):
```
$> mkdir -p $HOME/goprojects/src
$> cd $HOME/goprojects/src
$> git clone https://github.com/$user/datashim.git
```
Set the Datashim repo as your upstream and rebase:
```
$> cd $HOME/goprojects/src/datashim
$> git remote add upstream https://github.com/datashim-io/datashim
$> git remote set-url --push upstream no_push
```
Verify your remotes with `git remote -v`, then fetch and rebase:

```
$> git fetch upstream
$> git checkout master
$> git rebase upstream/master
```
Create a new branch to work on a feature or fix. Before this, please create an issue in the main Datashim repository that describes the problem or feature. Note the issue number (e.g. `nnn`) and assign it to yourself. In your local repository, create a branch to work on the fix. Use a short title (2 or 3 words) formed from the issue title/description, along with the issue number, as the branch name:

```
$> git checkout -b nnn-short-title
```
Commit your changes with a sign-off and push the branch to your fork:

```
$> git commit -s -m "short descriptive message"
$> git push $your_remote nnn-short-title
```
When you are ready to submit a Pull Request (PR) for your completed feature or fix, visit your fork on GitHub and click the button titled Compare and Pull Request next to your `nnn-short-title` branch. This will submit the PR to Datashim for review. After the review, prepare your PR for merging by squashing your commits.
Visit https://go.dev/doc/install to download and install Go on your computer. Alternatively, you can use a package manager for your operating system (e.g. Homebrew for macOS). Once installed, run `go version` to verify that the installation is working.
(Recommended) Go uses the `GOPATH` variable to point to the current workspace. Package install commands such as `go install` use it as their destination. If you are using a package as well as extending it, it is better to set up a separate workspace for development. To do this, create a separate directory, e.g. `$HOME/goprojects`, set it up with `bin`, `src`, and `pkg` sub-directories, and point `GOPATH` to it when developing. You can also use VSCode to modify `GOPATH` per project (see below).
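As a sketch of that one-time setup in a shell (the workspace path is just the example used above):

```sh
# create a dedicated development workspace
mkdir -p $HOME/goprojects/{bin,src,pkg}
# point Go at it while developing (add to your shell profile to persist)
export GOPATH=$HOME/goprojects
export PATH=$PATH:$GOPATH/bin
```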
Download VSCode from https://code.visualstudio.com/download. Open the Extensions tab and search for Go, or go to https://marketplace.visualstudio.com/items?itemName=golang.go. Verify that the extension is by the Go team at Google. Install the extension in VSCode and test it with a sample program.
Before following the suggestions below, please ensure that you have checked out Datashim following the git workflow for development. Datashim is a collection of multiple Go projects, including the Dataset Operator, CSI-S3, the Ceph cache plugin, etc. Therefore, the VSCode setup is not as straightforward as with a single Go project.
Start VSCode. Open a new window (File -> New Window). Select the Explorer view (generally the topmost icon on the left pane).
Add a folder to the workspace (File -> Add Folder To Workspace). In the file picker dialog, traverse to `$HOME/goprojects/src/github.com/$user/datashim` and then deeper into the subprojects (i.e. the `src/` folder). At this point, add the subfolder representing the project that you want to work on (e.g. `dataset-operator`). Do not add the project root folder to the VSCode workspace. Your Explorer view will then show the project in the side panel.
"go.toolsGopath": "$HOME/go",
+"go.gopath": "$HOME/goprojects",
+
The order of the features/milestones below loosely represents the order in which development will start.
S3-to-S3 caching is currently only supported by the Ceph/Rook-based plugin. However, we have been facing various problems, as its setup/configuration is not fully dynamic the way Noobaa's is. In the wiki page Caching-Remote-Buckets-(User-Guide) we have a few hints about how to provision the cache buckets, and this logic would be reflected in the Noobaa caching plugin.
Our current approach is based on our modified version of csi-s3, which is not maintained. The Object Bucket API will reduce the code we have to maintain, as the S3 operations would be supported in a more Kubernetes-native manner with the new API. All the S3-related operations should be replaced with the Object Bucket API once it's ready to be used.
In our current approach, the credentials for datasets that require them are stored in Secrets, the de facto Kubernetes solution for storing credentials. However, there are some problems when it comes to datasets: we might want to restrict access to datasets between users in the same namespace. We would like to support scenarios where UserA and UserB are in the same namespace but UserA has datasets which only they can access. We plan to leverage TSI for this.
Assuming Spectrum Scale is installed on the hosts, we could leverage ibm-spectrum-scale-csi to provide the same S3 caching functionality as the Ceph-based and Noobaa-based plugins.
In the one implementation of a caching plugin we currently have, every dataset is cached without priorities or checks (e.g. whether the cache is full). We need to tackle this. The most naive solution is not to cache a newly created dataset when the cache is full. A more sophisticated approach would be to monitor the usage of datasets and decide what to evict based on configurable policies.
In our current approach, the only possible transformation is Dataset -> DatasetInternal -> PVC. In the future we would like to support any number of transformations of any type, so that plugins could handle a flow like this: Dataset(S3) -(caching)-> DatasetInternal(S3) -(expose)-> DatasetInternal(NFS) -> PVC. That would give users the capability to cache and export their datasets in the format of their preference.
Since we are aware of the nodes where a dataset is cached, we can potentially offer this information to external schedulers, or decorate the pods with `nodeAffinity` to help the default Kubernetes scheduler place pods closer to the cached data. This is expected to improve the performance of the pods using those datasets.
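As a sketch of such a decoration, assuming a hypothetical node label like `dataset.example-dataset/cached=true` applied to nodes holding the cache:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: consumer
spec:
  affinity:
    nodeAffinity:
      # prefer, but do not require, nodes that hold the cached data
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: dataset.example-dataset/cached   # hypothetical label
            operator: In
            values: ["true"]
  containers:
  - name: app
    image: nginx
```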