Install Ryax on Kubernetes¶
We assume that you are comfortable with Kubernetes. To keep this guide short, we leave out details on the Kubernetes commands.
Requirements¶
All you need to install Ryax is a Kubernetes cluster and Docker installed on your machine. You can get a managed Kubernetes instance from any Cloud provider. For a local development installation, please refer to the Getting Started Guide.
Supported Kubernetes versions:
kubernetes > 1.19
Hardware:
At least 2 CPU cores
4GB of memory
40GB of disk available
Note that depending on the Actions that you run on your cluster you might need more resources.
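The version requirement above can be checked with a small helper. The following is a minimal sketch, not part of Ryax: the function names major_minor, version_gt, and check_k8s_version are hypothetical, it assumes a sort command that supports -V (version sort), and it reads "> 1.19" as "a minor release strictly above 1.19".

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Ryax) to compare a Kubernetes version
# string against the minimum listed above.

# Reduce "v1.27.3" or "1.27.3+k3s1" to "1.27".
major_minor() {
  printf '%s\n' "$1" | sed -E 's/^v?([0-9]+)\.([0-9]+).*/\1.\2/'
}

# Succeed when version $1 is strictly greater than version $2.
# Assumes a sort that understands -V (version sort).
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Print "supported" or "unsupported" for a server version string.
# Assumption: "> 1.19" means a minor release strictly above 1.19.
check_k8s_version() {
  if version_gt "$(major_minor "$1")" "1.19"; then
    echo "supported"
  else
    echo "unsupported"
  fi
}
```

For example, check_k8s_version v1.27.3 prints supported. The version to pass in is the server version reported by kubectl version.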
Preparatory Steps¶
Make sure your configuration points to the intended cluster:
kubectl config current-context
Your Kubernetes cluster should be dedicated to Ryax: we offer no guarantee that Ryax runs smoothly alongside other applications.
Make sure you have complete admin access to the cluster. Try to run kubectl auth can-i create ns or kubectl auth can-i create pc, for instance.
$ kubectl auth can-i create ns
Warning: resource 'namespaces' is not namespace scoped
yes
Have access to a DNS server where you can add a new A or CNAME entry for your cluster.
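The cluster checks above can be grouped into one function. This is a minimal sketch under stated assumptions: preflight_access is a hypothetical name, not a Ryax or kubectl command, and it relies on kubectl auth can-i returning a zero exit status when the action is allowed.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of Ryax): prints the current
# context and checks the two permissions mentioned above, relying on the
# exit status of `kubectl auth can-i` (0 when allowed).
preflight_access() {
  local resource failed=0
  echo "context: $(kubectl config current-context)"
  for resource in ns pc; do
    if kubectl auth can-i create "$resource" >/dev/null 2>&1; then
      echo "create $resource: allowed"
    else
      echo "create $resource: denied"
      failed=1
    fi
  done
  # Non-zero when at least one permission is missing.
  return "$failed"
}
```

Calling preflight_access prints one line per check and returns a non-zero status if any permission is missing, so it can gate the rest of an installation script.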
Configure your Installation¶
Installing Ryax is analogous to installing a Helm chart. To begin we will start with a default configuration, and make a few tweaks so that everything is compatible with your Kubernetes provider. Be assured however that you will be able to fine-tune your installation later on.
Warning
Special warning for EKS (AWS Elastic Kubernetes Service)
Ryax requires persistent storage and by default, EKS does not provide any storage driver. Please, install the EBS CSI plugin with:
# Get this from `eksctl get clusters`
cluster_name=<My cluster name>
eksctl utils associate-iam-oidc-provider --cluster=$cluster_name --approve
eksctl create iamserviceaccount \
--name ebs-csi-controller-sa \
--namespace kube-system \
--cluster $cluster_name \
--role-name AmazonEKS_EBS_CSI_DriverRole \
--role-only \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--approve
eksctl create addon --name aws-ebs-csi-driver --cluster $cluster_name --service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole --force
See the official documentation for more details.
Also be aware that you cannot use Fargate because it does not support persistent storage.
Initialize¶
First create a directory to organize the Ryax installation and initialize it with the default configuration:
mkdir ryax_install
cd ryax_install
docker run \
-v $PWD:/data/volume \
ryaxtech/ryax-adm:latest init --values volume/values.yaml
You are now in the ryax_install folder, and the values.yaml file containing the default config has been created.
Note
All the following commands assume that you are in the ryax_install directory.
To explain the configuration fields, here is an example of a simple configuration file for Ryax:
# The Ryax version.
# Check here to get the latest version: https://github.com/RyaxTech/ryax-engine/releases
version: 24.10.0
# Cluster DNS
clusterName: myclustername
domainName: example.com
# Log level for all Ryax services
logLevel: info
# Set the storage size for each stateful service
datastore:
  pvcSize: 10Gi
minio:
  pvcSize: 40Gi
registry:
  pvcSize: 20Gi
# Enable Prometheus + Grafana monitoring
monitoring:
  enabled: true
# Use HTTPS by default
tls:
  enabled: true
# Automate HTTPS with Let's Encrypt
certManager:
  enabled: false
# Depends on your Kubernetes instance. Leave it empty to use the default
storageClass: ""
The Ryax installation is based on Helm charts, one for each service, with a helmfile to define the whole cluster configuration.
To customize your installation, you can set any configuration field using the values keyword. A detailed description of all the values can be found in ryax-adm/helm-charts/values.yaml.
Settings¶
Set the version field with the Ryax version, for example: 23.10.0. The latest stable version can be found in the releases page.
The clusterName and domainName fields define the name you give to your cluster, which is used in various places. One of those places is the URL of your cluster, which will be <clusterName>.<domainName>; it therefore has to be consistent with your DNS.
If you do not intend to configure a DNS entry for your cluster, leave these fields at their default values and disable the certManager. In this case, be aware that you will access Ryax directly through the IP address and that the HTTPS certificate will be self-signed.
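To see which URL a given configuration leads to, the two fields can be read back from the values file. This is a minimal sketch, not a ryax-adm feature: yaml_top_level and ryax_url are hypothetical helpers that only understand simple top-level "key: value" lines, which is the form clusterName and domainName take in the example configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of ryax-adm). They only handle simple
# top-level "key: value" lines, not nested keys or multi-line values.

# Print the value of a top-level scalar key ($1) from a YAML file ($2).
# Trailing comments and surrounding quotes are dropped.
yaml_top_level() {
  sed -n -E "s/^$1:[[:space:]]*([^#[:space:]]*).*/\1/p" "$2" | head -n 1 | tr -d "\"'"
}

# Print the URL built from clusterName and domainName in a values file.
ryax_url() {
  local cluster domain
  cluster="$(yaml_top_level clusterName "$1")"
  domain="$(yaml_top_level domainName "$1")"
  if [ -z "$cluster" ] || [ -z "$domain" ]; then
    echo "clusterName or domainName missing in $1" >&2
    return 1
  fi
  printf 'https://%s.%s\n' "$cluster" "$domain"
}
```

With the example configuration, ryax_url values.yaml prints https://myclustername.example.com, which is the name your DNS must resolve.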
Warning
Depending on your Kubernetes cluster setup, you might have issues with Cert Manager, which is used to get a valid HTTPS certificate. See the Cert Manager compatibility documentation for more details.
If you want to deal with the certificate yourself, you can disable it with:
certManager:
  enabled: false
An important setting is the storageClass. If it is not set, Ryax uses the default one provided by the Kubernetes cluster for all services. However, the volumes store the internal database (datastore), the object store for workflow I/O (filestore), and the container registry for the Ryax Actions containers (registry), all of which affect the performance of your Ryax instance. It is therefore recommended to use SSD-backed storage for all services to avoid delays in state persistence, deployments, and runs.
For more fine-grained settings, you can set each storage class independently with the storageClass field inside each service.
Regarding the volume size, we recommend that you start small: you can extend the volumes later on with most storage providers. The default values give comfortable volume sizes to start working on the platform.
Install Ryax¶
First, be sure that your Kubernetes context is set properly. Make sure that either your KUBECONFIG variable is set and points to your cluster, or that the ~/.kube/config file contains your cluster configuration. See Preparatory Steps to check your cluster access.
Warning
Depending on the Cloud provider you are using you might have to mount its configuration inside the container. For the following providers add the associated option:
Microsoft Azure:
-v $HOME/.azure:/root/.azure
Google Cloud:
-v $HOME/.config/gcloud:/root/.config/gcloud
AWS:
-v $HOME/.aws:/root/.aws
Once you have customized your configuration, you can install Ryax on your cluster (don’t forget to add the extra option, see the previous warning):
docker run \
-v $PWD:/data/volume \
-v $HOME/.kube/config:/data/kubeconfig.yml \
ryaxtech/ryax-adm:latest apply --values volume/values.yaml --suppress-diff
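To avoid forgetting the provider-specific mount, the command can be assembled by a small wrapper. This is a minimal sketch under stated assumptions: provider_mount and ryax_apply_command are hypothetical names, the provider keywords azure, gcloud, and aws are chosen here for illustration, and the wrapper only prints the command so that it can be reviewed before being run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of ryax-adm): prints the apply command with
# the provider-specific mount from the warning above. Paths containing
# spaces are not handled.

# Map a provider keyword (azure, gcloud, aws) to its extra docker option.
provider_mount() {
  case "$1" in
    azure)  echo "-v $HOME/.azure:/root/.azure" ;;
    gcloud) echo "-v $HOME/.config/gcloud:/root/.config/gcloud" ;;
    aws)    echo "-v $HOME/.aws:/root/.aws" ;;
    *)      echo "" ;;
  esac
}

# Print the full "docker run ... apply" command for an optional provider.
ryax_apply_command() {
  local extra
  extra="$(provider_mount "${1:-}")"
  echo "docker run" \
    "-v $PWD:/data/volume" \
    "-v $HOME/.kube/config:/data/kubeconfig.yml" \
    ${extra:+"$extra"} \
    "ryaxtech/ryax-adm:latest apply --values volume/values.yaml --suppress-diff"
}
```

Running ryax_apply_command aws from the ryax_install directory prints the command with the AWS mount included; without an argument, it prints the plain command.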
Note
Optionally, you can populate your cluster with a few initial actions to use in your workflows (don’t forget to add the extra option, see the previous warning):
docker run \
-v $PWD:/data/volume \
-v $HOME/.kube/config:/data/kubeconfig.yml \
--entrypoint=helm \
ryaxtech/ryax-adm:latest \
upgrade --install ryax-init ./helm-charts/ryax-init -n ryaxns
If the installation fails, check the logs, check your configuration and try again. If you are lost, or have any questions, please join our Discord server. We will be happy to help!
Configure DNS¶
The last step is configuring your DNS so that you can connect to your cluster. The address you should register is <clusterName>.<domainName>.
To retrieve the external IP of your cluster, run this one-liner:
kubectl -n kube-system get svc traefik -o jsonpath='{.status.loadBalancer.ingress[].ip}'
# OR depending on your provider
kubectl -n kube-system get svc traefik -o jsonpath='{.status.loadBalancer.ingress[].hostname}'
Or simply look at the response of kubectl -n kube-system get svc traefik, under “External IP”.
Depending on your Cloud provider, you will have either an IP address, which requires an A entry, or a DNS name (AWS), which requires a CNAME entry.
Now create a DNS entry for the cluster and another for every subdomain using a star entry:
<clusterName>.<domainName>
*.<clusterName>.<domainName>
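The choice between an A and a CNAME entry can be derived from the address returned by the load balancer. This is a minimal sketch, not part of Ryax: record_type and dns_entries are hypothetical helpers, the printed "name type target" lines are only a reminder of what to create at your DNS provider, and IPv6 addresses are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Ryax). IPv6 addresses are not handled.

# Print "A" for a dotted IPv4 address and "CNAME" for anything else.
record_type() {
  if printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "A"
  else
    echo "CNAME"
  fi
}

# Print the two entries to create.
# Usage: dns_entries <target> <clusterName> <domainName>
dns_entries() {
  local target="$1" fqdn="$2.$3" type
  type="$(record_type "$target")"
  printf '%s %s %s\n' "$fqdn" "$type" "$target"
  printf '*.%s %s %s\n' "$fqdn" "$type" "$target"
}
```

For instance, dns_entries 203.0.113.10 myclustername example.com prints one A line for myclustername.example.com and one for the star entry, both pointing to 203.0.113.10.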
Once your entries are created, and only if tls is enabled, you will have to wait for Let’s Encrypt to provide you a valid certificate. You can check with:
kubectl get certificates -n ryaxns
The state should be READY: true.
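When several certificates are listed, the ones that are still pending can be filtered out of the table. This is a minimal sketch: pending_certificates is a hypothetical helper, and it assumes the default column order NAME, READY, SECRET, AGE in the output of kubectl get certificates.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Ryax): reads the table printed by
# `kubectl get certificates -n ryaxns` on stdin and prints the names of the
# certificates whose READY column is not "True" (compared case-insensitively).
# Assumes the default column order NAME, READY, SECRET, AGE.
pending_certificates() {
  awk 'NR > 1 && tolower($2) != "true" { print $1 }'
}
```

It can be used as kubectl get certificates -n ryaxns | pending_certificates; an empty output means that every certificate is ready.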
Access to your cluster¶
You can now access your cluster at https://<clusterName>.<domainName> in your web browser.
Default credentials are user1/pass1.
Warning
Change this password and user as soon as you’re logged in!
Cluster Update¶
The Ryax configuration is declarative, so in order to update your cluster you just have to change the configuration and apply it.
Note
You need to configure your Kubernetes cluster access and to set the Cloud provider specific options; see the installation process for more details.
The Ryax configuration is stored as a secret inside your cluster after each successful apply. You can get the actual cluster configuration from the cluster itself with:
docker run \
-v $PWD:/data/volume \
-v $HOME/.kube/config:/data/kubeconfig.yml \
ryaxtech/ryax-adm:latest init --from-cluster --values volume/ryax_values.yaml
Warning
Before any update, make a backup (see ./create-backups.html) and have a look at the changelog to see whether any extra step is needed.
Now you can simply change the version field in the configuration before applying the configuration, as in the installation steps described above.
Optimizing GPU compute resources with Nvidia’s Multi-instance GPUs (MIG)¶
MIG, or Multi-Instance GPU, is a technology developed by NVIDIA that allows a single GPU to be partitioned into multiple instances. Each instance operates with its own dedicated resources, enabling various workloads to run simultaneously on a single GPU, which optimizes utilization and maximizes data center investment. For AI applications, MIG can be particularly beneficial as it allows for the efficient distribution of resources, ensuring that each task has the necessary computational power without interference from other processes. HPC infrastructure providers, while using the Ryax platform, can leverage MIG to orchestrate calculation resources effectively, facilitating high-performance computing tasks and AI model training with improved quality of service.
For more details, please refer to Nvidia’s User Guide. This is also another concrete example of how MIG is applied to industrial workflows.
Prerequisites¶
In this setup guide, we focus on a particular HPC/Cloud service: Scaleway. This test was carried out in September 2024. For more detailed information on this specific setup, please check the Scaleway documentation here. Otherwise please refer to your Cloud provider’s documentation on how to configure MIG for your Nvidia GPUs.
Configure the MIG profile based on the workload¶
Following the Kubernetes MIG User Manual, configure a MIG profile based on the 3g.40gb configuration and check with nvidia-smi whether the MIG setup is complete. The update can take a few seconds to process. In the specific case of this test on Scaleway, the Kubernetes instructions are as follows:
kubectl label nodes 'node-name' nvidia.com/mig.config=all-3g.40gb --overwrite
Use nvidia-smi to check whether the MIG profile has been instantiated.
kubectl exec nvidia-driver-daemonset-'reference' -t -n kube-system -- nvidia-smi -L
Now try with the 1g.10gb MIG config:
kubectl label nodes 'node-name' nvidia.com/mig.config=all-1g.10gb --overwrite
You can also check the live GPU usage with nvidia-smi:
kubectl exec nvidia-driver-daemonset-'reference' -t -n kube-system -- nvidia-smi
Take into account that in the example above, one selected profile is 3g.40gb and this parameter needs to be adapted based on the required MIG profile supported by your GPU(s) which can be found here.
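The nvidia-smi -L check described above can be summarized per profile. This is a minimal sketch under stated assumptions: count_mig_instances is a hypothetical helper, and it assumes that MIG devices appear in the nvidia-smi -L listing as lines whose first two fields are "MIG" and the profile name (for example "MIG 3g.40gb Device 0: ..."). Verify this against the output of your own driver version.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Ryax): reads `nvidia-smi -L` output on
# stdin and prints one "<profile> <count>" line per MIG profile found.
# Assumption: MIG devices are listed as lines starting with "MIG <profile>".
count_mig_instances() {
  awk '$1 == "MIG" { count[$2]++ }
       END { for (profile in count) print profile, count[profile] }' | sort
}
```

Piping the output of the kubectl exec command above into count_mig_instances gives a quick way to confirm that the number of instances matches the selected profile. An empty output means that no MIG device was listed.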
Updating your action’s metadata with the kubernetes addon supporting MIG¶
Once the desired MIG profile is set up, you can update your Ryax action so that your workload is offloaded to a MIG instance. For your action to be deployed on a specific MIG instance supported by your Nvidia GPU, you need to update the ryax_metadata.yaml file with the kubernetes addon in the resources section.
For example:
resources:
  gpu: 1
  memory: 40G
  cpu: 4
  time: 30m
addons:
  kubernetes:
    node_selector: "nvidia.com/gpu.product=NVIDIA-H100-PCIe-MIG-3g.40gb"
Once the kubernetes addon is activated and the action has been built, you can change the nodeSelector in the Ryax UI. There you can change the MIG instance based on the profiles supported by your GPU. Again, check the supported profiles here.
Use local registry only¶
Ryax uses an internal registry to store actions’ images. To allow other Kubernetes sites to join, you are required to associate a valid domain name with Ryax by setting domainName and clusterName. Then, you need to configure domain name resolution for both *.clusterName.domainName and clusterName.domainName, pointing to the public IP address of the correct Kubernetes cluster.
If your cluster is inaccessible from outside your private network, you need to use a nodePort to connect to the registry. This allows actions’ pods to deploy; however, you will not be able to connect external Kubernetes sites.
To accomplish that, disable tls in ryax_values.yaml. This disables registry credentials and makes the internal registry available from a nodePort:
# Enable ryax to work on local site only, no external access to registry
# Notice that disabling tls you cannot add sites outside your local network
tls:
  enabled: false
This will start one pod per node, named ryax-registry-cert-setup-xxxxx, that configures certificates to access the internal registry through 127.0.0.1:30012.
The pods for actions in the ryaxns-execs namespace will pull images through that nodePort.
Resetting the GPU to a default configuration¶
Conclude MIG testing by reverting MIG to its default state using the no-MIG option. Then, verify that the GPU has returned to a MIG-disabled state by utilizing the nvidia-smi command (please refer to your own cloud provider documentation to see how to use nvidia-smi).
kubectl label nodes 'node-name' nvidia.com/mig.config=all-disabled --overwrite
kubectl exec nvidia-driver-daemonset-'reference' -t -n kube-system -- nvidia-smi -L
Again please refer to your Cloud provider’s documentation to adapt these command lines which are, in this setup example, specific to Scaleway’s GPU setup.
Troubleshooting¶
Cannot upgrade, ryax-adm gives rabbitmq password error¶
When trying to change the configuration using ryax-adm apply, you might experience rabbitmq errors like the one below.
COMBINED OUTPUT:
Error: Failed to render chart: exit status 1: Error: execution error at (rabbitmq/templates/secrets.yaml:4:17):
PASSWORDS ERROR: You must provide your current passwords when upgrading the release.
Note that even after reinstallation, old credentials may be needed as they may be kept in persistent volume claims.
Further information can be obtained at https://docs.bitnami.com/general/how-to/troubleshoot-helm-chart-issues/#credential-errors-while-upgrading-chart-releases
'auth.password' must not be empty, please add '--set auth.password=$RABBITMQ_PASSWORD' to the command. To get the current value:
export RABBITMQ_PASSWORD=$(kubectl get secret --namespace "ryaxns" ryax-broker-secret -o jsonpath="{.data.rabbitmq-password}" | base64 -d)
Use --debug flag to render out invalid YAML
You can find the correct password with:
kubectl get secret --namespace "ryaxns" ryax-broker-secret -o jsonpath="{.data.rabbitmq-password}" | base64 -d
To fix this error, add a section with the correct password like the one below (replace secret with your password):
rabbitmq:
  values:
    auth:
      password: secret
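The section above can be generated from the password stored in the cluster, which avoids copy and paste mistakes. This is a minimal sketch, not a ryax-adm feature: rabbitmq_values_snippet is a hypothetical helper that prints the same rabbitmq section with the password written as a single-quoted YAML string.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ryax-adm): prints the values section shown
# above for the password given as first argument.
rabbitmq_values_snippet() {
  local escaped
  # In a single-quoted YAML string, a quote is escaped by doubling it.
  escaped="$(printf '%s' "$1" | sed "s/'/''/g")"
  printf "rabbitmq:\n  values:\n    auth:\n      password: '%s'\n" "$escaped"
}
```

A possible use is to pass it the output of the kubectl get secret command shown above, for example rabbitmq_values_snippet "$(kubectl get secret --namespace "ryaxns" ryax-broker-secret -o jsonpath="{.data.rabbitmq-password}" | base64 -d)", and to paste the result into your values file.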
All actions’ pods on ryaxns-execs are in imagePullBackOff¶
If you are getting imagePullBackOff for pods on ryaxns-execs, you are probably having trouble accessing the registry through the external domain name. Ensure that your DNS is configured and that the Ryax traefik service is using the correct IP or fully qualified hostname. You can check the Services by typing:
kubectl get service -A | grep -i LoadBalancer
Make sure that the IP/hostname associated with the traefik LoadBalancer is correct.
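To read the address without scanning the whole table, the listing can be reduced to the LoadBalancer services and their external address. This is a minimal sketch: loadbalancer_addresses is a hypothetical helper, and it assumes the default columns NAMESPACE, NAME, TYPE, CLUSTER-IP, EXTERNAL-IP printed by kubectl get service -A.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Ryax): reads `kubectl get service -A`
# output on stdin and prints "<namespace>/<name> <external-ip>" for every
# service of type LoadBalancer.
# Assumes the default columns NAMESPACE, NAME, TYPE, CLUSTER-IP, EXTERNAL-IP.
loadbalancer_addresses() {
  awk '$3 == "LoadBalancer" { print $1 "/" $2, $5 }'
}
```

It can be used as kubectl get service -A | loadbalancer_addresses. An external address shown as <pending> means that the cluster has not assigned one yet, so the DNS entries cannot point to it.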
Make sure to add your DNS entry with a wildcard. For instance, if you configure clusterName as example and domainName as ryax.io, make sure that you have the DNS entries *.example.ryax.io and example.ryax.io pointing to the correct IP address. See also Configure DNS.
If you do not want to configure external access to your cluster you won’t be able to connect external kubernetes workers, but you can always have a local worker. In this case, to configure the internal registry refer to Use local registry only.