Enable Multi-Site on Ryax¶
Warning
This documentation assumes that you already have a working Ryax installation, with a public IP and a configured DNS. See the Ryax installation on Kubernetes documentation (install_ryax_kubernetes.html) for more details.
Ryax is able to use multiple computing infrastructures at once, even within a single run of a workflow.
To enable the multi-site mode you will need to install a Ryax Worker for each site.
Ryax Workers currently support two types of sites: SLURM_SSH and KUBERNETES.
This document explains how to install and configure Workers.
SLURM_SSH Worker¶
Requirements¶
Because the SLURM_SSH worker uses SSH to connect to the SLURM cluster, the simplest way to deploy a SLURM_SSH worker is on the Ryax main site. So, the only things you’ll need are:
SLURM installed on the cluster
SSH access to the cluster with credentials to run SLURM commands
Python3 available on the Slurm login node
(Recommended) Singularity installed on the cluster to run Ryax Actions containers
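As a quick sanity check, you can verify these requirements over SSH from the main site. The ryax@my.hpc-site.com account below is only a placeholder matching the configuration example further down; adjust it to your cluster:
# Check SLURM, Python3 and (optionally) Singularity on the login node
ssh ryax@my.hpc-site.com 'sinfo --version && python3 --version && singularity --version'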
Note
Using a custom script, you can run commands directly on the cluster and avoid Singularity, but Action packaging will be completely bypassed.
Configuration¶
Note
SSH workers must be installed on the same cluster and namespace as the one running ryax-runner.
Ryax allows you to register one or more SLURM partitions to run your actions on. To do so, you need to define the partition name and node resources in the configuration.
Here is a simple example of configuration:
config:
  site:
    name: HPCSite-1
    type: SLURM_SSH
    spec:
      partitions:
        - name: default
          cpu: 16
          memory: 24G
          gpu: 1
          time: 2H
    credentials:
      server: my.hpc-site.com
      username: ryax
Each field explained in detail:
site.name: the name that identifies the site in Ryax
site.type: the type of the site (can be SLURM_SSH or KUBERNETES)
site.spec.partitions: the partition definitions. Ryax only supports partitions with homogeneous nodes for now. Each resource value is given per node.
name: name of the partition.
cpu: number of allocatable CPU cores per node.
memory: amount of allocatable memory per node, in bytes (unit suffixes such as 24G work, as in the example above).
site.credentials: the credentials used to SSH to the HPC cluster login node.
To extract the partition information we provide a helper script to run on the Slurm login node:
wget "https://gitlab.com/ryax-tech/ryax/ryax-runner/-/raw/master/slurm-ryax-config.py"
chmod +x ./slurm-ryax-config.py
./slurm-ryax-config.py > worker-hpc.yaml
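If you want to cross-check the generated file against what Slurm itself reports, a command along these lines on the login node prints each partition with its per-node CPU count, memory, generic resources (GPUs) and time limit:
sinfo -o "%P %c %m %G %l"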
Now you can edit the file to add your credentials so that Ryax is able to SSH to the HPC login node:
config:
  site:
    spec:
      # ...
    credentials:
      username: ryax
      server: hpc.example.com
Note
Names must be unique among workers on the main site.
The private key will be injected during the installation phase.
For more details about the Worker configuration, please see the Worker reference documentation.
Installation¶
Now you can install the Worker on the Ryax main site. To do so, we will use the configuration defined above. We will also inject the SSH private key required to access the HPC cluster over SSH.
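If you do not yet have a dedicated key pair authorized on the HPC login node, one possible way to create and install one is sketched below; the my-ssh-private-key file name and the ryax@hpc.example.com account are just the placeholders used elsewhere in this page:
# Generate a passphrase-less RSA key pair for the worker
ssh-keygen -t rsa -N "" -f ./my-ssh-private-key
# Authorize the public key on the HPC login node
ssh-copy-id -i ./my-ssh-private-key.pub ryax@hpc.example.com
Then install the worker with: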
helm upgrade --install worker-hpc-1 \
oci://registry.ryax.org/release-charts/worker \
--namespace ryaxns \
--values worker-hpc.yaml \
--set-file hpcPrivateKeyFile=./my-ssh-private-key
worker-hpc-1: name of the helm release
worker-hpc.yaml: file containing the configuration of the worker
my-ssh-private-key: RSA private key file that is authorized to log in
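You can check that the release deployed correctly and that its pods start, for example with:
helm status worker-hpc-1 --namespace ryaxns
kubectl get pods --namespace ryaxns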
Once the worker is up and running, you should see a new site available in the UI, in the workflow editor, in the Deploy tab of each action. Now you just have to select the SLURM_SSH site in the Deploy configuration to tell Ryax to execute your Action there on the next run.
If you want more control over the way Slurm deploys your action (e.g., to run parallel jobs), add the HPC addon to your action. See the HPC offloading reference for more details.
Kubernetes Worker¶
Requirements¶
First, you’ll need a Kubernetes cluster, of course! Be sure that your cluster is able to provision Persistent Volumes (most Kubernetes clusters do, by default).
kubectl get storageclass
This command should show you at least one storage class with the default flag; if this is not the case, you should install one. Make sure that the storage class you want to use is set as default. You can set a storage class as default with:
kubectl patch storageclass YourStorageClassName -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Change YourStorageClassName accordingly. For a simple example, you can use local-path from the Local Provisioner storage class, if available.
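To double-check which storage class is currently marked as default, one option is to print the default-class annotation for each class:
kubectl get storageclass -o custom-columns='NAME:.metadata.name,DEFAULT:.metadata.annotations.storageclass\.kubernetes\.io/is-default-class'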
To install a Ryax Worker on Kubernetes we will use Helm.
Supported versions:
Kubernetes > 1.19
Helm > 3.x
Hardware Requirements:
At least 2 CPU cores
2GB of memory
1GB of disk available
Note that resource requirements really depend on your usage of the cluster.
Configuration¶
In order to configure your Worker, you will need to select one or more node pools (sets of homogeneous nodes) and give the Worker some information about the nodes.
Note
Why do we use node pools? Because they allow Ryax to leverage Kubernetes node autoscaling with scale-to-zero!
Here is a simple example of configuration for AWS:
config:
  site:
    name: aws-kubernetes-cluster-1
    spec:
      nodePools:
        - name: small
          cpu: 2
          memory: 4G
          selector:
            eks.amazonaws.com/nodegroup: default
Let’s explain each field:
site.name: the name that identifies the site in Ryax
site.spec.nodePools: the node pool definitions (a node pool is a set of homogeneous nodes; each resource value is given per node).
name: name of the node pool.
cpu: number of allocatable CPU cores per node.
memory: amount of allocatable memory per node, in bytes (unit suffixes such as 4G work, as in the example above).
These fields might change depending on the cloud provider. Below is an example of configuration for Azure.
config:
  site:
    name: azure-kubernetes-cluster-2
    spec:
      nodePools:
        - name: small
          cpu: 2
          memory: 4G
          selector:
            kubernetes.azure.com/agentpool: default
lokiAPIUrl: "ryax-loki:3100"
promtail:
  config:
    clients:
      - url: "http://ryax-loki:3100/loki/api/v1/push"
Note
If you use a release name different from ryax when running helm upgrade, you must change lokiAPIUrl and promtail.config.clients[0].url to point to the corresponding service. See the troubleshooting section for an in-depth explanation.
All node pool information can be obtained using a simple:
kubectl describe nodes
To obtain the resource values, look for the Allocatable fields. Regarding the selector, you should find the label(s) that uniquely refer to your node pool (little hint: there is often pool in it ☺️).
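For example, the following commands list the node labels (to pick a selector) and the allocatable CPU and memory per node:
# Find a label that uniquely identifies your node pool
kubectl get nodes --show-labels
# Show allocatable resources per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'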
For more details about the Worker configuration, please see the Worker reference documentation.
Once your configuration is ready, save it to a worker.yaml file for later.
Installation¶
Before we proceed, the namespaces ryaxns and ryaxns-execs must be created on the new Worker site.
kubectl create namespace ryaxns
kubectl create namespace ryaxns-execs
For the Worker to communicate securely with the main Ryax site, we need to create a secure connection between the two Kubernetes clusters. In this How-To we will use Skupper, but many other multi-cluster networking technologies might do the job.
Install Skupper (macOS or Linux only):
curl https://skupper.io/install.sh | sh
To initialize Skupper, on the main site run:
skupper init -n ryaxns --site-name main
skupper token create ~/main.token -n ryaxns
On the new site run:
skupper init -n ryaxns --ingress none --site-name site-1 # Change this name for something meaningful!
skupper link create ~/main.token -n ryaxns
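You can then verify that the link is active, for instance with:
# On the new site
skupper link status -n ryaxns
# On either site
skupper status -n ryaxns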
For the Worker to connect with Ryax, we need to give it access to the following services:
registry: to pull action images
filestore: to read and write files (actions’ static parameters, execution I/O)
broker: to communicate with other Ryax services
Let’s expose these services with Skupper. On the main site run:
skupper -n ryaxns expose service minio --address minio-ext
skupper -n ryaxns expose service ryax-broker --address ryax-broker-ext
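Once the link is up, the exposed services should appear on the new site as regular Kubernetes services under the -ext names used above; you can verify this with:
# On the new site
kubectl get services -n ryaxns | grep -E 'minio-ext|ryax-broker-ext'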
The registry is already exposed on the main site, so only the secrets are required.
You can copy the secrets from the Ryax main site and put them on your Kubernetes cluster. We provide a small helper script that copies the secrets and injects the right service name (with the “-ext” suffix) inside them.
# On the main site
wget "https://gitlab.com/ryax-tech/ryax/ryax-runner/-/raw/master/k8s-ryax-config.py"
chmod +x ./k8s-ryax-config.py
./k8s-ryax-config.py
Now apply them to the new site:
# On the new site
kubectl apply -f ./secrets
Now that we have the configuration, a secure connection, and the credentials, we can use Helm to install the latest Worker release:
helm upgrade --install ryax-worker oci://registry.ryax.org/release-charts/worker --values worker.yaml -n ryaxns
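As with the SLURM_SSH worker, you can check that the release deployed correctly before moving on; ryax-worker is the release name used in the command above:
# On the new site
helm status ryax-worker -n ryaxns
kubectl get pods -n ryaxns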
That’s it! Once the worker is up and running, you should see a new site available in the UI, in the workflow editor, in the Deploy tab of each action. Now you just have to select the KUBERNETES site in the Deploy configuration to tell Ryax to execute your Action there on the next run.
Troubleshooting¶
Container pending on persistent volume claim.¶
Check if the pvc was created with the correct storage class.
kubectl get storageclass
kubectl get pvc -n ryaxns
The PVC status should be Bound. If it is not, it is probably because the cloud provider requires some extra configuration to create the volumes. In our tests with AWS, it was necessary to associate an EBS addon.
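Before applying a provider-specific fix, describing the pending PVCs usually shows the underlying provisioning error in their events:
kubectl describe pvc -n ryaxns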
Create an IAM service account with the role for the EBS CSI driver.
eksctl create iamserviceaccount \
--region eu-west-3 \
--name ebs-csi-controller-sa \
--namespace kube-system \
--cluster multi-site-pre-release-test \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--approve \
--role-only \
--role-name AmazonEKS_EBS_CSI_DriverRole
Then create the addon, associating it with the service account role.
eksctl create addon \
--name aws-ebs-csi-driver \
--cluster multi-site-pre-release-test \
--service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole --force
Now make the storage class the default.
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Certificate Is Not Valid¶
For Ryax to have a valid TLS certificate, you need a DNS entry that points to your cluster. Please check the section related to this process in the installation documentation.
You can check the state of the certificate request using:
kubectl get certificaterequests -A
kubectl get orders.acme.cert-manager.io -A
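If a request or order stays pending, describing the corresponding resources usually reveals the reason (for example a failing DNS or HTTP challenge):
kubectl describe certificaterequests -A
kubectl describe orders.acme.cert-manager.io -A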
Logs do not appear on the Ryax interface.¶
If the logs of actions running on the worker site do not appear on the Ryax main site, it is probably because Loki is not configured with the correct service name created by Skupper. You can find the correct service name by listing the services available on the worker site with:
kubectl get services -n ryaxns
The output should look like the example below; the Loki service exposes port 3100.
Note
Tip: the Loki service is named helmreleasename-loki, where helmreleasename is the name of the release you chose when calling helm upgrade --install.
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
worker-loki ClusterIP 10.100.77.222 <none> 3100/TCP,9095/TCP 3d
...
Now you need to change the values in your configuration to match the name of the correct Loki service.
config:
  site:
    name: azure-kubernetes-cluster-2
    spec:
      nodePools:
        - name: small
          cpu: 2
          memory: 4G
          selector:
            kubernetes.azure.com/agentpool: default
# Change this if your release name is different from ryax-worker,
# and replace releasename below with your helm release name
#lokiAPIUrl: "releasename-loki:3100"
#promtail:
#  config:
#    clients:
#      - url: "http://releasename-loki:3100/loki/api/v1/push"
Run the helm upgrade command again with the updated values file.
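Assuming the release is called ryax-worker as in the installation step above (adjust the release name otherwise), that would be:
helm upgrade --install ryax-worker oci://registry.ryax.org/release-charts/worker --values worker.yaml -n ryaxns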