Install Ryax on Kubernetes

We assume that you are comfortable with Kubernetes. To keep this guide short, we leave out details on the Kubernetes commands.

Requirements

All you need to install Ryax is a Kubernetes cluster and Docker installed on your machine. You can get a managed Kubernetes instance from any Cloud provider. For a local development installation, please refer to the Getting Started Guide.

Supported Kubernetes versions:

  • Kubernetes > 1.19
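As a rough sketch, the version requirement can be checked with a small shell helper. The function name version_gt is hypothetical (it is not part of Ryax or kubectl), and the version string "1.27.3" is only an example value; on a real cluster you would read the server version from kubectl version. The comparison relies on sort -V, which is assumed to be available (GNU coreutils). Note that under this comparison a patch release such as 1.19.5 also counts as greater than 1.19.

```shell
# Hypothetical helper: succeeds when version $1 is strictly greater than $2.
# Relies on `sort -V` (version sort) to order the two strings.
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Example value only: replace it with the server version of your cluster.
if version_gt "1.27.3" "1.19"; then
  echo "supported"
else
  echo "unsupported"
fi
```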

Hardware:

  • At least 2 CPU cores

  • 4GB of memory

  • 40GB of disk available

Note that depending on the Actions that you run on your cluster you might need more resources.

Preparatory Steps

  • Make sure your configuration points to the intended cluster: kubectl config current-context.

  • Your Kubernetes cluster should be dedicated to Ryax: we offer no guarantee that Ryax runs smoothly alongside other applications.

  • Make sure you have complete admin access to the cluster. Try to run kubectl auth can-i create ns or kubectl auth can-i create pc, for instance.

    $ kubectl auth can-i create ns
    Warning: resource 'namespaces' is not namespace scoped
    yes
    
  • Have access to a DNS server where you can add a new A or CNAME entry for your cluster.
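The access checks above can be grouped into one pre-flight function. This is a minimal sketch: the function name check_admin_access is hypothetical, and the two resources tested (ns and pc) are simply the ones mentioned in this guide, not an exhaustive list of what Ryax needs.

```shell
# Hypothetical pre-flight helper: checks that the current context may create
# the cluster-scoped resources mentioned above (namespaces, priority classes).
check_admin_access() {
  check_status=0
  for resource in ns pc; do
    # Keep only the last output line, in case a warning is printed first.
    answer=$(kubectl auth can-i create "$resource" 2>/dev/null | tail -n 1) || true
    if [ "$answer" = "yes" ]; then
      echo "ok: can create $resource"
    else
      echo "missing permission: create $resource"
      check_status=1
    fi
  done
  return "$check_status"
}
```

Running check_admin_access before the installation returns a non-zero status as soon as one of the permissions is missing.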

Configure your Installation

Installing Ryax is analogous to installing a Helm chart. To begin we will start with a default configuration, and make a few tweaks so that everything is compatible with your Kubernetes provider. Be assured however that you will be able to fine-tune your installation later on.

Warning

Special warning for EKS (AWS Elastic Kubernetes Service)

Ryax requires persistent storage and, by default, EKS does not provide any storage driver. Please install the EBS CSI plugin with:

# Get this from `eksctl get clusters`
cluster_name=<My cluster name>

eksctl utils associate-iam-oidc-provider --cluster=$cluster_name --approve

eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster $cluster_name \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

eksctl create addon --name aws-ebs-csi-driver --cluster $cluster_name --service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole --force

See the official documentation for more details.

Also be aware that you cannot use Fargate because it does not support persistent storage.

Initialize

First create a directory to organize the Ryax installation and initialize it with the default configuration:

mkdir ryax_install
cd ryax_install
docker run \
  -v $PWD:/data/volume \
  ryaxtech/ryax-adm:latest init --values volume/values.yaml

You are now in the ryax_install folder, and the values.yaml file containing the default configuration has been created.

Note

All the following commands assume that you are in the ryax_install directory.

To explain the configuration fields, here is an example of a simple configuration file for Ryax:

# The Ryax version.
# Check here to get the latest version: https://github.com/RyaxTech/ryax-engine/releases
version: 24.10.0

# Cluster DNS
clusterName: myclustername
domainName: example.com

# Log level for all Ryax services
logLevel: info

# Set the storage size for each stateful service
datastore:
  pvcSize: 10Gi
minio:
  pvcSize: 40Gi
registry:
  pvcSize: 20Gi

# Enable Prometheus + Grafana monitoring
monitoring:
  enabled: true

# Use HTTPS by default
tls:
  enabled: true

# Automate HTTPS with Let's Encrypt
certManager:
  enabled: false

# Depends on your Kubernetes instance. Leave it empty to use the default
storageClass: ""

The Ryax installation is based on Helm charts, one for each service, with a helmfile that defines the whole cluster configuration.

To customize your installation, you can set any configuration field using the values keyword. A detailed description of all the values can be found in ryax-adm/helm-charts/values.yaml.

Settings

Set the version field to the Ryax version, for example: 23.10.0. The latest stable version can be found on the releases page.

The clusterName and domainName fields define the name you give to your cluster, which is used in various places. One of those places is the URL of your cluster, which will be <clusterName>.<domainName>, so it has to be consistent with your DNS.

If you do not intend to configure DNS for your cluster, leave these fields at their default values and disable certManager. In this case, be aware that you will access Ryax through the IP address directly and that the HTTPS certificate will be self-signed.

Warning

Depending on your Kubernetes cluster setup, you might have issues with Cert Manager, which is used to get a valid HTTPS certificate. See the Cert Manager compatibility documentation for more details.

If you want to deal with the certificate yourself, you can disable it with:

certManager:
  enabled: false

An important setting is storageClass. If it is not set, Ryax uses the default storage class provided by the Kubernetes cluster for all services. The volumes store the internal database (datastore), the object store for workflow I/O (filestore), and the container registry for Ryax Action images (registry). All of them affect the performance of your Ryax instance, so we recommend SSD-backed storage for all services to avoid delays in state persistence, deployments, and runs. For finer-grained settings, you can set each storage class independently with the storageClass field inside each service. Regarding volume size, we recommend that you start small: most storage providers let you extend volumes later on. The default values give comfortable volume sizes to start working on the platform.

Install Ryax

First, make sure that your Kubernetes context is set properly: either your KUBECONFIG variable is set and points to your cluster, or the ~/.kube/config file contains your cluster configuration. See Preparatory Steps to check your cluster access.

Warning

Depending on the Cloud provider you are using, you might have to mount its configuration inside the container. For the following providers, add the associated option:

  • Microsoft Azure: -v $HOME/.azure:/root/.azure

  • Google Cloud: -v $HOME/.config/gcloud:/root/.config/gcloud

  • AWS: -v $HOME/.aws:/root/.aws

Once you have customized your configuration, you can install Ryax on your cluster (don’t forget to add the extra option, see the previous warning):

docker run \
  -v $PWD:/data/volume \
  -v $HOME/.kube/config:/data/kubeconfig.yml \
  ryaxtech/ryax-adm:latest apply --values volume/values.yaml --suppress-diff
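To make the provider-specific option harder to forget, the command can be assembled by a small script. The sketch below uses two hypothetical helpers, provider_mount and ryax_apply_command, which are not part of ryax-adm. It only prints the command so that it can be reviewed before being run, and it assumes that $HOME and $PWD contain no spaces.

```shell
# Hypothetical helper: maps a provider keyword to the extra mount listed above.
provider_mount() {
  case "$1" in
    azure)  echo "-v $HOME/.azure:/root/.azure" ;;
    gcloud) echo "-v $HOME/.config/gcloud:/root/.config/gcloud" ;;
    aws)    echo "-v $HOME/.aws:/root/.aws" ;;
    *)      echo "" ;;  # no extra mount for other providers
  esac
}

# Hypothetical helper: prints (does not run) the apply command for a provider.
ryax_apply_command() {
  echo "docker run -v $PWD:/data/volume -v $HOME/.kube/config:/data/kubeconfig.yml" \
       "$(provider_mount "$1") ryaxtech/ryax-adm:latest apply --values volume/values.yaml --suppress-diff"
}

ryax_apply_command aws
```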

Note

Optionally, you can populate your cluster with a first set of actions to use in your workflows (don’t forget to add the extra option, see the previous warning):

docker run \
  -v $PWD:/data/volume \
  -v $HOME/.kube/config:/data/kubeconfig.yml \
  --entrypoint=helm \
  ryaxtech/ryax-adm:latest \
  upgrade --install ryax-init ./helm-charts/ryax-init -n ryaxns

If the installation fails, check the logs, check your configuration and try again. If you are lost, or have any questions, please join our Discord server. We will be happy to help!

Configure DNS

The last step is configuring your DNS so that you can connect to your cluster. The address you should register is <clusterName>.<domainName>.

To retrieve the external IP of your cluster, run this one-liner:

kubectl -n kube-system get svc traefik -o jsonpath='{.status.loadBalancer.ingress[].ip}'
# OR depending on your provider
kubectl -n kube-system get svc traefik -o jsonpath='{.status.loadBalancer.ingress[].hostname}'

Or simply look at the response of kubectl -n kube-system get svc traefik, under “External IP”.

Depending on your Cloud provider, you will get either an IP address, which requires an A entry, or a DNS name (on AWS, for instance), which requires a CNAME entry.

Now create a DNS entry for the cluster and another for every subdomain using a star entry:

  • <clusterName>.<domainName>

  • *.<clusterName>.<domainName>
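The choice between an A and a CNAME entry, and the two names to register, can be sketched as follows. Both function names are hypothetical, the address 203.0.113.10 is a documentation-only example, and only IPv4 addresses are recognized (an IPv6 address would need an AAAA entry, which this sketch does not handle).

```shell
# Hypothetical helper: "A" for an IPv4 address, "CNAME" for anything else.
dns_record_type() {
  if printf '%s' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "A"
  else
    echo "CNAME"
  fi
}

# Hypothetical helper: prints the two entries to create.
# $1: clusterName, $2: domainName, $3: external IP or hostname of the cluster.
dns_entries() {
  record_type=$(dns_record_type "$3")
  echo "$1.$2 $record_type $3"
  echo "*.$1.$2 $record_type $3"
}

dns_entries myclustername example.com 203.0.113.10
```

The printed lines are only a checklist; the entries themselves still have to be created with your DNS provider.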

Once your entries are created, and only if tls is enabled, you will have to wait for Let’s Encrypt to provide you a valid certificate. You can check with:

kubectl get certificates -n ryaxns

The READY column should show True.
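Instead of polling by hand, kubectl wait can block until the certificates are ready. This is a sketch only: the function name wait_for_certificates and the 600s default timeout are assumptions, and the right timeout depends on how long DNS propagation takes in your setup.

```shell
# Hypothetical helper: blocks until every certificate in the namespace
# reports the Ready condition, or until the timeout expires.
# $1: namespace (defaults to ryaxns), $2: timeout (defaults to 600s).
wait_for_certificates() {
  kubectl wait --for=condition=Ready certificate --all \
    -n "${1:-ryaxns}" --timeout="${2:-600s}"
}
```

Calling wait_for_certificates without arguments waits on the ryaxns namespace used throughout this guide.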

Access your cluster

You can now access your cluster at https://<clusterName>.<domainName> in your web browser.

Default credentials are user1/pass1.

Warning

Change this password and user as soon as you’re logged in!

Cluster Update

The Ryax configuration is declarative, so in order to update your cluster you just have to change the configuration and apply it.

Note

You need to configure your Kubernetes cluster access and set the Cloud provider specific options; see the installation process for more details.

The Ryax configuration is stored as a secret inside your cluster after each successful apply. You can get the actual cluster configuration from the cluster itself with:

docker run \
    -v $PWD:/data/volume \
    -v $HOME/.kube/config:/data/kubeconfig.yml \
    ryaxtech/ryax-adm:latest init --from-cluster --values volume/ryax_values.yaml

Warning

Before any update, create a backup (see ./create-backups.html) and have a look at the changelog to see if any extra step is needed.

Now you can simply change the version field in the configuration before applying the configuration like in the installation steps described above.
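Changing the version field can also be scripted. The helper below is a hypothetical sketch, not part of ryax-adm: it assumes the values file has a single top-level line starting with version:, as in the example configuration of this guide, and rewrites only that line.

```shell
# Hypothetical helper: rewrites the top-level `version:` line of a values file.
# $1: path to the values file, $2: new Ryax version.
set_ryax_version() {
  tmp_values=$(mktemp) || return 1
  # Write to a temporary file first, then copy back to keep file permissions.
  sed "s/^version:.*/version: $2/" "$1" > "$tmp_values" &&
    cat "$tmp_values" > "$1"
  rm -f "$tmp_values"
}
```

For example, set_ryax_version ryax_values.yaml 24.10.0 updates the file retrieved from the cluster, which can then be applied as in the installation steps.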

Optimizing GPU compute resources with Nvidia’s Multi-Instance GPU (MIG)

MIG, or Multi-Instance GPU, is a technology developed by NVIDIA that allows a single GPU to be partitioned into multiple instances. Each instance operates with its own dedicated resources, enabling various workloads to run simultaneously on a single GPU, which optimizes utilization and maximizes data center investment. For AI applications, MIG can be particularly beneficial as it allows for the efficient distribution of resources, ensuring that each task has the necessary computational power without interference from other processes. HPC infrastructure providers, while using the Ryax platform, can leverage MIG to orchestrate calculation resources effectively, facilitating high-performance computing tasks and AI model training with improved quality of service.

For more details, please refer to Nvidia’s User Guide. This is also another concrete example of how MIG is applied to industrial workflows.

Prerequisites

This setup guide focuses on a particular HPC/Cloud service: Scaleway. The test was carried out in September 2024. For more detailed information on this specific setup, please check the Scaleway documentation here. Otherwise, please refer to your Cloud provider’s documentation on how to configure MIG for your Nvidia GPUs.

Configure the MIG profile based on the workload

Following the Kubernetes MIG User Manual, configure a MIG profile based on the 3g.40gb configuration and check with nvidia-smi that the MIG setup is complete. The update can take a few seconds to process. In the specific case of this test on Scaleway, the Kubernetes instructions are as follows:

kubectl label nodes 'node-name' nvidia.com/mig.config=all-3g.40gb --overwrite

Use nvidia-smi to check that the MIG profile has been instantiated:

kubectl exec nvidia-driver-daemonset-'reference'  -t -n kube-system -- nvidia-smi -L

Now try with the 1g.10gb MIG config:

kubectl label nodes 'node-name' nvidia.com/mig.config=all-1g.10gb --overwrite

You can also check the live GPU usage with nvidia-smi:

kubectl exec nvidia-driver-daemonset-'reference'  -t -n kube-system -- nvidia-smi

Note that in the first example above the selected profile is 3g.40gb. This parameter must be adapted to the MIG profile you need, among those supported by your GPU(s), which can be found here.

Updating your action’s metadata with the kubernetes addon supporting MIG

Once the desired MIG profile is set up, you can update your Ryax action so that your workload is offloaded to a MIG instance. For your action to be deployed on a specific MIG instance supported by your Nvidia GPU, you need to update the ryax_metadata.yaml file with the kubernetes addon in the resources section.

For example:

  resources:
    gpu: 1
    memory: 40G
    cpu: 4
    time: 30m
  addons:
    kubernetes:
      node_selector: "nvidia.com/gpu.product=NVIDIA-H100-PCIe-MIG-3g.40gb"
 

Once the kubernetes addon is activated in the action and the action has been built, you can change the nodeSelector in the Ryax UI. There you can change the MIG instance based on the profiles supported by your GPU. Again, check the supported profiles here.

Use local registry only

Ryax uses an internal registry to store actions’ images. To allow other Kubernetes sites to join, you must associate a valid domain name with Ryax by setting domainName and clusterName. Then, you need to configure domain name resolution so that both *.clusterName.domainName and clusterName.domainName point to the correct Kubernetes cluster public IP address.

If your cluster is inaccessible from outside your private network, you need to use a nodePort to connect to the registry. This allows actions’ pods to deploy; however, you will not be able to connect external Kubernetes sites. To do so, disable tls in ryax_values.yaml, which disables registry credentials and makes the internal registry available from a nodePort:

# Enable ryax to work on local site only, no external access to registry
# Notice that disabling tls you cannot add sites outside your local network
tls:
  enabled: false

This will start one pod per node, named ryax-registry-cert-setup-xxxxx, that configures certificates to access the internal registry through 127.0.0.1:30012. The action pods in the ryaxns-execs namespace will pull their images through that nodePort.

Resetting the GPU to a default configuration

Conclude MIG testing by reverting MIG to its default state using the no-MIG option. Then, verify that the GPU has returned to a MIG-disabled state by utilizing the nvidia-smi command (please refer to your own cloud provider documentation to see how to use nvidia-smi).

kubectl label nodes 'node-name' nvidia.com/mig.config=all-disabled --overwrite
kubectl exec nvidia-driver-daemonset-'reference'  -t -n kube-system -- nvidia-smi -L

Again please refer to your Cloud provider’s documentation to adapt these command lines which are, in this setup example, specific to Scaleway’s GPU setup.

Troubleshooting

Cannot upgrade, ryax-adm gives rabbitmq password error

When trying to change configuration using ryax-adm apply you might experience rabbitmq errors like below.

COMBINED OUTPUT:
  Error: Failed to render chart: exit status 1: Error: execution error at (rabbitmq/templates/secrets.yaml:4:17):
  PASSWORDS ERROR: You must provide your current passwords when upgrading the release.
                   Note that even after reinstallation, old credentials may be needed as they may be kept in persistent volume claims.
                   Further information can be obtained at https://docs.bitnami.com/general/how-to/troubleshoot-helm-chart-issues/#credential-errors-while-upgrading-chart-releases
      'auth.password' must not be empty, please add '--set auth.password=$RABBITMQ_PASSWORD' to the command. To get the current value:
          export RABBITMQ_PASSWORD=$(kubectl get secret --namespace "ryaxns" ryax-broker-secret -o jsonpath="{.data.rabbitmq-password}" | base64 -d)
  Use --debug flag to render out invalid YAML

You can find the correct password with:

kubectl get secret --namespace "ryaxns" ryax-broker-secret -o jsonpath="{.data.rabbitmq-password}" | base64 -d

To fix this error, add a rabbitmq section with the correct password to your configuration, as shown below (replace secret with your password):

rabbitmq:
  values:
    auth:
      password: secret
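The two steps above (reading the password, then writing the section) can be chained. This is a sketch under assumptions: both function names are hypothetical, and the password is printed unquoted, so a password containing YAML special characters would need to be quoted by hand.

```shell
# Hypothetical helper: prints the values section shown above for a password.
rabbitmq_values_snippet() {
  printf 'rabbitmq:\n  values:\n    auth:\n      password: %s\n' "$1"
}

# Hypothetical helper: reads the current password from the cluster (same
# command as above) and prints the matching section.
print_rabbitmq_section() {
  current_password=$(kubectl get secret --namespace "ryaxns" ryax-broker-secret \
    -o jsonpath="{.data.rabbitmq-password}" | base64 -d)
  rabbitmq_values_snippet "$current_password"
}
```

The output of print_rabbitmq_section can be reviewed and then pasted into the values file before running ryax-adm apply again.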

All actions’ pods on ryaxns-execs are in ImagePullBackOff

If you are getting ImagePullBackOff for pods on ryaxns-execs, you are probably having trouble accessing the registry through the external domain name. Ensure that your DNS is configured and that the Ryax traefik service is using the correct IP or fully qualified hostname. You can check Services by typing:

kubectl get service -A  | grep -i LoadBalancer

Make sure that the IP/hostname associated with the traefik LoadBalancer is correct. Make sure to add your DNS entry with a wildcard. For instance, if you configure clusterName as example and domainName as ryax.io, make sure that you have DNS entries *.example.ryax.io and example.ryax.io pointing to the correct IP address. See also Configure DNS.

If you do not want to configure external access to your cluster, you won’t be able to connect external Kubernetes workers, but you can always have a local worker. In this case, to configure the internal registry, refer to Use local registry only.