Enable Multi-Site on Ryax

Warning

This documentation assumes that you already have a working Ryax installation, with a public IP and a configured DNS. See [installation doc](./install_ryax_kubernetes.md) for more details.

Ryax is able to use multiple computing infrastructures at once, even during a single run of a workflow. To enable the multi-site mode you will need to install a Ryax Worker for each site. The Worker supports two types of sites: SLURM_SSH and KUBERNETES.

This document explains how to install and configure the Workers.

SLURM_SSH Worker

Requirements

Because the SLURM_SSH worker uses SSH to connect to the SLURM cluster, the simplest way to deploy a SLURM_SSH worker is on the Ryax main site. So, the only things you’ll need are the following (a quick check is sketched after this list):

  • SLURM installed on the cluster

  • SSH access to the cluster with credentials to run SLURM commands

  • Python3 available on the Slurm login node

  • (Recommended) Singularity installed on the cluster to run Ryax Action containers

    Note

    With a custom script you can run commands directly on the cluster and avoid using Singularity, but Action packaging will be completely bypassed.
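A quick way to check these requirements from the machine that will host the Worker is a single SSH command. This is only a sketch; the username and hostname are placeholders matching the example configuration below:

ssh ryax@my.hpc-site.com 'sinfo --version && python3 --version && singularity --version'
# Each command should print a version; a failure points at the missing requirement.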

Configuration

Ryax allows you to register one or more SLURM partitions to run your actions on. To do so, you need to define the partition name and node resources in the configuration.

Here is a simple example of configuration:

config:
  site:
    name: HPCSite-1
    type: SLURM_SSH
    spec:
      partitions:
        - name: default
          cpu: 16
          memory: 24G
          gpu: 1
          time: 2H
      credentials:
        server: my.hpc-site.com
        username: ryax

Each field explained in detail:

  • site.name: the name of the site that identifies the site in Ryax

  • site.type: the type of the site (can be SLURM_SSH or KUBERNETES)

  • site.spec.partitions: the partition definitions. Ryax only supports partitions with homogeneous nodes for now. Each resource value is given per node.

    • name: name of the partition.

    • cpu: number of allocatable CPU cores per node.

    • memory: amount of allocatable memory per node, in bytes or with a unit suffix (e.g. 24G).

    • gpu: number of allocatable GPUs per node.

    • time: maximum walltime allowed on the partition.

  • site.spec.credentials: the credentials used to SSH to the HPC cluster login node.
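If you prefer to gather these values by hand before using the helper script below, standard Slurm commands on the login node can provide them (the node name is a placeholder):

# Per-partition summary: CPUs per node, memory per node (MB), GRES and time limit
sinfo -o "%P %c %m %G %l"
# Detailed view of a single node, including its allocatable resources
scontrol show node <node-name>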

To extract the partition information we provide a helper script to run on the Slurm login node:

wget "https://gitlab.com/ryax-tech/ryax/ryax-runner/-/raw/master/slurm-ryax-config.py"
chmod +x ./slurm-ryax-config.py
./slurm-ryax-config.py > worker.yaml

Now you can edit the file to add your credentials so that Ryax is able to SSH to the HPC login node:

config:
  site:
    spec:
      # ...
      credentials:
        username: ryax
        server: hpc.example.com

The private key will be injected during the installation phase.
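If you do not yet have a dedicated keypair for this user, here is a minimal sketch; the file name matches the --set-file flag used in the installation step below, and the host and user come from the example configuration:

# Generate a dedicated keypair for Ryax, without a passphrase
ssh-keygen -t ed25519 -f ./my-ssh-private-key -N ""
# Authorize the corresponding public key for the ryax user on the HPC login node
ssh-copy-id -i ./my-ssh-private-key.pub ryax@hpc.example.com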

For more details about the Worker configuration, please see the Worker reference documentation.

Installation

Now you can install the Worker on the Ryax main site. To do so, we will use the configuration defined above.

We will also inject the SSH private key required to access the HPC cluster.

helm upgrade --install ryax \
  oci://registry.ryax.org/release-charts/worker  \
  --namespace ryaxns \
  --values worker.yaml \
  --set-file hpcPrivateKeyFile=./my-ssh-private-key
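You can check that the Worker started correctly before heading to the UI; the exact pod name depends on the Helm release, so we simply list the namespace:

kubectl -n ryaxns get pods
# The worker pod should reach the Running state after a short while.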

Once the worker is up and running, you should see a new site available in the UI, in the workflow editor, in the Deploy tab of each action. Now you just have to select the SLURM_SSH site in the Deploy configuration to tell Ryax to execute your Action there on the next run.

If you want more control over the way Slurm deploys your action (e.g. to run parallel jobs), add the HPC addon to your action. See the HPC offloading reference for more details.

Kubernetes Worker

Requirements

First, you’ll need a Kubernetes cluster, of course! Be sure that your cluster is able to provision Persistent Volumes (most Kubernetes clusters do, by default).

kubectl get storageclass

This command should show you at least one storage class. If this is not the case, you should install one; for a simple example, you can use the Local Provisioner.
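For instance, one commonly used option for test or single-node clusters is Rancher's local-path provisioner; the manifest URL below is the upstream one at the time of writing, so double-check it against the project documentation:

kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
# Optionally mark it as the default storage class
kubectl patch storageclass local-path -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'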

To install a Ryax Worker on Kubernetes we will use Helm.

Supported versions:

  • Kubernetes > 1.19

  • Helm > 3.x

Hardware:

  • At least 2 CPU cores

  • 2GB of memory

  • 1GB of disk available

Note that resource requirements really depend on your usage of the cluster.

Configuration

In order to configure your Worker, you will need to select one or more node pools (sets of homogeneous nodes) and give the Worker some information about the nodes.

Note

Why do we use node pools? Because they allow Ryax to leverage Kubernetes node autoscaling with scale-to-zero!

Here is a simple example of configuration:

config:
  site:
    name: Azure-1
    spec:
      nodePools:
      - name: small
        cpu: 2
        memory: 4G
        selector:
          kubernetes.azure.com/agentpool: default

Let’s explain each field:

  • site.name: the name of the site that identifies the site in Ryax

  • site.spec.nodePools: the node pool definitions (a node pool is a set of homogeneous nodes; each resource value is given per node).

    • name: name of the node pool.

    • cpu: number of allocatable CPU cores per node.

    • memory: amount of allocatable memory per node, in bytes or with a unit suffix (e.g. 4G).

    • selector: the Kubernetes node label(s) that allow Ryax to select the defined node pool to allocate your actions.

All node pool information can be obtained using a simple:

kubectl describe nodes

To obtain resource values, look for the Allocatable fields. Regarding the selector, you should find the label(s) that uniquely refer to your node pool (little hint: there is often pool in it ☺️).
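In practice, two kubectl commands are usually enough; the node name below is a placeholder:

# List node labels to spot the one identifying your node pool
kubectl get nodes --show-labels
# Show the allocatable CPU and memory of a given node
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'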

For more details about the Worker configuration, please see the Worker reference documentation.

Once your configuration is ready, save it to a worker.yaml file for later.

Installation

For the Worker to communicate securely with the main Ryax site, we need to create a secure connection between the two Kubernetes clusters. In this How-To we will use Skupper, but many other multi-cluster networking technologies could do the job.

Install Skupper (macOS or Linux only):

curl https://skupper.io/install.sh | sh

To initialize Skupper, on the main site run:

skupper init -n ryaxns --site-name main

On the new site run:

skupper init -n ryaxns --site-name site-1 # Change this name for something meaningful!
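Depending on your Skupper version, the two sites may also need to be linked explicitly with a token before services can be exposed across them. A minimal sketch (the token file name is arbitrary):

# On the main site: create a connection token
skupper token create ./main-site-token.yaml -n ryaxns
# On the new site: use the token to link back to the main site
skupper link create ./main-site-token.yaml -n ryaxns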

For the Worker to connect with Ryax, we need to give it access to the following services:

  • registry: to pull action images

  • filestore: to read and write actions IO

  • broker: to authenticate and communicate with the Runner

Let’s expose these services with Skupper. On the main site run:

skupper -n ryaxns expose service minio --address minio-ext
skupper -n ryaxns expose service ryax-broker --address ryax-broker-ext
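Once the link is established, you can check from the new site that the exposed addresses show up as services (these services are created by Skupper):

kubectl -n ryaxns get services minio-ext ryax-broker-ext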

The registry is already exposed on the main site, so only the secrets are required.

You can copy the secrets from the Ryax main site and apply them to your new Kubernetes cluster. We provide a small helper script to copy the secrets and inject the right service names inside (with the “-ext” suffix).

# On the main site
wget "https://gitlab.com/ryax-tech/ryax/ryax-runner/-/raw/master/k8s-ryax-config.py" 
chmod +x ./k8s-ryax-config.py
./k8s-ryax-config.py

Now apply them to the new site:

# On the new site
kubectl apply -f ./secrets

Now that we have the configuration, a secure connection, and the credentials in place, we can use Helm to install the Worker:

helm upgrade --install ryax oci://registry.ryax.org/release-charts/worker --values worker.yaml -n ryaxns

That’s it! Once the worker is up and running, you should see a new site available in the UI, in the workflow editor, in the Deploy tab of each action. Now you just have to select the KUBERNETES site in the Deploy configuration to tell Ryax to execute your Action there on the next run.