Ryax VPA-Pilot

What is Ryax VPA-Pilot

Ryax VPA-Pilot provides the AI-powered resource management optimizations within Ryax.

Ryax VPA-Pilot is a Vertical Pod Autoscaler designed for the actions of Ryax workflows. For each workflow, it dynamically produces a container size recommendation for each resource type of each action, based on utilization metrics that it automatically collects from the containers.

Ryax VPA-Pilot is implemented so that different recommendation algorithms can be plugged in easily. Currently, the following choices are available:

  • Rule-based algorithm: the recommendation is computed from pre-defined statistics over historical consumption. This algorithm is lightweight, but its recommendations fit actual usage less tightly.

  • ML-driven algorithm: the recommendation is computed by machine learning over historical resource consumption. This algorithm gives more suitable and adaptive recommendations, but is heavier than the rule-based one.

Only one of these algorithms can be configured as the global recommendation method, chosen when the Ryax VPA-Pilot service starts.

How to install and configure Ryax VPA-Pilot

Ryax VPA-Pilot is a Deployment with one pod running in the kube-system namespace.

After pulling the Ryax VPA-Pilot code with git, install it with:

cd deployment
kubectl apply -f vpa-authorization.yaml
# Then choose one deployment YAML from the next two, depending on which algorithm you want to use:
kubectl apply -f vpa-deployment-rule.yaml # Rule-based algorithm
# OR
kubectl apply -f vpa-deployment-ml.yaml # ML-driven algorithm

These two deployment YAMLs are samples. If the deployment does not succeed, check and modify them for your cluster, for example the nodeSelector.
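
To check that the deployment succeeded, one way (assuming nothing beyond the kube-system namespace used above) is:

kubectl -n kube-system get deployments
kubectl -n kube-system get pods # the VPA-Pilot pod should reach the Running state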

The full list of arguments that can be set in the deployment YAML:

Common args:

Argument name (default value): Description

-v, --v Level (0): Log verbosity level

--kube-api-burst float (10): QPS burst limit when making requests to the Kubernetes apiserver (keep the default unless it does not work)

--kube-api-qps float (5): QPS limit when making requests to the Kubernetes apiserver (keep the default unless it does not work)

--kubeconfig string (""): Path to a kubeconfig. Only required if running out-of-cluster (keep the default unless it does not work)

--namespace-consider string (""): Namespace where the scaled actions are located ("" to consider all namespaces)

--enable-monitoring (false): Enable VPA to export Prometheus metrics. Once enabled, a Prometheus server can fetch metrics from port 8080. NOTICE: spec.template.spec.containers[0].ports.containerPort: 8080 is needed in the YAML when monitoring is enabled, as in the snippet below
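
A minimal sketch of the port declaration that the NOTICE above refers to, following the standard Deployment schema:

# Under spec.template.spec.containers[0] add
ports:
  - containerPort: 8080 # port that Prometheus scrapes metrics from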

--vpa-instance-expire-duration duration (48h0m0s): Expiration duration for an unused VPA instance (one per action). If an action has no new active pods for this duration, its VPA instance is deleted to save memory

--recommend-interval duration (5m0s): Interval of aggregation and recommendation

--record-interval duration (1s): Interval of recording data

--bumpup-ratio float (1.2): Memory recommendation bump-up ratio when OOM happens

--bumpup-timeout duration (10m0s): Timeout for finishing an OOM bump-up. If no other OOM happens within this duration, the memory recommendation reverts to the raw recommendation

--ap-cpu-coldstart-n int (0): Cold start: no CPU recommendation before this number of aggregations (time is N*5min)

--ap-memory-coldstart-n int (0): Cold start: no memory recommendation before this number of aggregations (time is N*5min)

--ap-cpu-histogram-bucket-num int (400): Number of buckets in the linear CPU histogram

--ap-memory-histogram-bucket-num int (800): Number of buckets in the linear memory histogram

--vpa-algorithm string ("rule"): Recommendation algorithm: 'rule' for Rule-based, 'ml' for ML-driven

If the Rule-based algorithm is chosen (--vpa-algorithm=rule), the following arguments are also needed.

Specific args to configure the rule-based algorithm (these should be appended to the common args; see the sketch after this list):

--ap-cpu-histogram-decay-half-life duration (48h0m0s): Time for a historical CPU sample to lose half its weight

--ap-memory-histogram-decay-half-life duration (24h0m0s): Time for a historical memory sample to lose half its weight

--ap-cpu-lastsample-n int (5): Sliding window length N for CPU (used to calculate the maximum and decaying). See the Autopilot paper

--ap-memory-lastsample-n int (5): Sliding window length N for memory (used to calculate the maximum and decaying). See the Autopilot paper

--ap-cpu-recommend-policy string ("sp_90"): CPU recommendation policy (see Autopilot Paper): 'avg', 'max', 'sp_xx', or 'spike'

--ap-memory-recommend-policy string ("sp_98"): Memory recommendation policy (see Autopilot Paper): 'avg', 'max', 'sp_xx', or 'spike'

--ap-fluctuation-reducer-duration duration (1h0m0s): Sliding window length of fluctuation reducer. See Autopilot Paper
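
As a sketch of how the specific args are appended to the common args, a hypothetical args stanza in vpa-deployment-rule.yaml could look like this (the flag values are illustrative defaults, not tuned recommendations):

# Under spec.template.spec.containers[0] add
args:
  - --vpa-algorithm=rule             # common arg: select the rule-based algorithm
  - --recommend-interval=5m          # common arg
  - --ap-cpu-recommend-policy=sp_90  # rule-specific args
  - --ap-memory-recommend-policy=sp_98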

If the ML-driven algorithm is chosen (--vpa-algorithm=ml), the following arguments are also needed.

Specific args to configure the ML algorithm (these should be appended to the common args):

--hparam-cpu-d float (0.5): Hyper-parameter d of the ML model for CPU

--hparam-cpu-wdl float (0.5): Hyper-parameter wdeltal of the ML model for CPU

--hparam-cpu-wdm float (0.5): Hyper-parameter wdeltam of the ML model for CPU

--hparam-cpu-wo float (0.5): Hyper-parameter wo of the ML model for CPU

--hparam-cpu-wu float (0.5): Hyper-parameter wu of the ML model for CPU

--hparam-memory-d float (0.5): Hyper-parameter d of the ML model for memory

--hparam-memory-wdl float (0.5): Hyper-parameter wdeltal of the ML model for memory

--hparam-memory-wdm float (0.5): Hyper-parameter wdeltam of the ML model for memory

--hparam-memory-wo float (0.5): Hyper-parameter wo of the ML model for memory

--hparam-memory-wu float (0.5): Hyper-parameter wu of the ML model for memory

--ml-cpu-num-dm int (50): Number of different d_m values in models for CPU. Total number of models = dm * mm. See Yuqiang's master report.

--ml-cpu-num-mm int (400): Number of different m_m values in models for CPU. Total number of models = dm * mm. See Yuqiang's master report.

--ml-cpu-size-buckets-mm int (1): How many bucket sizes one safety margin value is aligned to. Usually 1. See Yuqiang's master report.

--ml-memory-num-dm int (50): Number of different d_m values in models for memory. Total number of models = dm * mm.

--ml-memory-num-mm int (400): Number of different m_m values in models for memory. Total number of models = dm * mm.

--ml-memory-size-buckets-mm int (1): How many bucket sizes one safety margin value is aligned to. Usually 1.

If the ML algorithm is enabled (--vpa-algorithm=ml), the online iteration functionality can be enabled with the config below (these should be appended to the common args and the ML-specific args; a combined sketch follows the list below):

--iteration (false): Enable online iteration. VPA will automatically train the hyper-parameters of the ML algorithm. (The initial hyper-parameters still need to be declared in the config)

--iteration-interval duration (24h0m0s): Interval between two executions of the online iteration.

--iteration-trace-path string (""): Path that stores temporary traces for online iteration
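
Putting the layers together, a hypothetical args stanza for vpa-deployment-ml.yaml with online iteration enabled might look like the sketch below; the hyper-parameter values are just the documented defaults, and the trace path matches the NOTE that follows:

# Under spec.template.spec.containers[0] add
args:
  - --vpa-algorithm=ml                 # common arg: select the ML-driven algorithm
  - --hparam-cpu-d=0.5                 # ML-specific args (initial hyper-parameters)
  - --hparam-memory-d=0.5
  - --iteration=true                   # enable online iteration
  - --iteration-trace-path=/tmptraces  # see the volume NOTE below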

NOTE: If online iteration is enabled (--iteration=true; assume the trace path is set with --iteration-trace-path=/tmptraces), an emptyDir volume must be configured for VPA:

# Under spec.template.spec.containers[0] add
volumeMounts:
  - mountPath: /tmptraces
    name: tmptraces-volume

# Under spec.template.spec add
volumes:
  - name: tmptraces-volume
    emptyDir:
      sizeLimit: 1Gi

API

Ryax VPA-Pilot communicates with the Ryax worker by reading and writing annotations on the target workload Deployments of actions.

To let a Deployment be auto-scaled, we (or the Ryax worker) should set these two annotations on the Deployment. VPA automatically detects the Deployments that carry both annotations:

# The unique key for an action. Inside VPA, all deployments (pods) with the same recommender_input are autoscaled together and receive the same recommendation values.
ryax.tech/recommender_input: 'sample_key'
# The initial amount of resources set by the user. This value is used as the container size while VPA is in cold start, or when VPA temporarily fails to recommend a value.
ryax.tech/user_resources_request: '{"cpu": 0.1, "memory": 2147483648}'

The recommendation results are patched into an annotation on every Deployment of the action. The format is:

ryax.tech/recommendation: '{"requests":{"cpu":"0.1","memory":"2147483648"}}'

The Ryax VPA-Pilot component does not apply the recommendation to the pods. The Ryax platform reads the recommendation and applies it when scheduling.
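
For manual testing, the same annotations can be set and read back with kubectl. This is only a sketch, assuming a hypothetical Deployment named my-action in the current namespace:

kubectl annotate deployment my-action \
  ryax.tech/recommender_input='sample_key' \
  ryax.tech/user_resources_request='{"cpu": 0.1, "memory": 2147483648}'
# After the next recommendation cycle (up to 5 minutes), read the result back:
kubectl get deployment my-action \
  -o jsonpath='{.metadata.annotations.ryax\.tech/recommendation}'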

Ryax VPA-Pilot behaviour and architecture

After Ryax VPA-Pilot is deployed, we can directly apply the Deployments of the workloads to be scaled, with the annotations given above (a sample manifest is sketched after the list below). They are automatically detected and processed as follows:

  • Every 1 second, Ryax VPA-Pilot fetches the currently existing workload instances and their resource consumption data.

  • Every 5 minutes after Ryax VPA-Pilot starts, the recommendation (with either the Rule or the ML algorithm) is calculated and patched to each corresponding Deployment. So when a workload to be scaled is deployed, the first recommendation arrives within at most 5 minutes.

  • For the ML-driven algorithm, if online iteration is enabled, the model retrains itself online every 24 hours with the consumption data collected since the previous retraining. This yields much more adaptive recommendations than with iteration disabled (in which case the model is fixed, pre-trained on general workloads).
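
A minimal sketch of such an annotated manifest, assuming the annotations go in the Deployment's top-level metadata (as the API section suggests) and using hypothetical names my-action and my-image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-action
  annotations:
    ryax.tech/recommender_input: 'sample_key'
    ryax.tech/user_resources_request: '{"cpu": 0.1, "memory": 2147483648}'
spec:
  selector:
    matchLabels:
      app: my-action
  template:
    metadata:
      labels:
        app: my-action
    spec:
      containers:
        - name: my-action
          image: my-image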

More detailed architecture and implementation are shown in the figure and described below. The intervals above can all be configured, as described in the temporal description below.

Architecture diagram. The parts in yellow only run when online iteration is enabled in the ML algorithm.

Temporal description of the architecture

Main thread

Runs every 1 second to collect metrics from the Kubernetes API (configurable with --record-interval, but modifying it is not suggested).

Every 5 minutes (configurable with --recommend-interval, but modifying it is not suggested), the algorithm aggregates the histograms over the past 5-minute interval, runs the algorithm, and updates the recommendation on all Deployments.

Iteration thread (only runs when online iteration is enabled)

The iterator is an online learning mechanism that dynamically updates the hyper-parameters of the ML model. It runs at a 24h interval (configurable with --iteration-interval). It records the online metrics since the previous iteration, then runs a Bayesian Optimization algorithm that optimizes the hyper-parameters against a simulator replaying those records. After each iteration, all existing ML model instances are restarted with the new hyper-parameters.

If iteration-interval < 24h, to guarantee the quality of the online optimization, the trace is duplicated and scattered with some randomization to generate synthetic 24h records for the optimization.

Spatial description of the architecture

The dataio module communicates with the Kubernetes API at the level of Kubernetes objects (Deployments, pods…).

The vpainstance module is the logical model of an auto-scaled unit. One vpainstance serves one action, identified by the annotated key on the Deployment. A vpainstance holds the algorithm instances for each resource type, which can be configured as Rule-based or ML.

Downstream of the recommendation (whether Rule-based or ML), an OOM processor handles the memory resource. It takes the OOM event and the last pod size, and temporarily bumps it up as the new recommendation value. For example, with the default --bumpup-ratio of 1.2, a pod that OOMs at 2 GiB gets a temporary memory recommendation of 2.4 GiB. After bumpup-timeout, the bumped-up value expires and the memory recommendation reverts to the raw output of the algorithm.

The vpainstance pool holds the VPA instances, manages the pool (allocation, garbage collection), and maintains the mapping between Kubernetes objects and VPA objects.

In the iterator, the trace parser generates synthetic records when the iteration interval is less than 24h. HP Sweep is the core Bayesian optimization algorithm (together with the simulator). These components communicate through asynchronous files to save memory.