Workflows

Ryax is more than a tool: it is a complete framework to develop, deploy, and maintain data-driven applications. It proposes its own paradigm, and understanding the framework requires the concepts of workflows, modules, stream operators, and executions, all of which are explained in detail in this section. For didactic purposes, the content is presented in a rather theoretical way that explains the design choices and the constraints of the framework. If you are in a hurry, we strongly advise you to go directly to our Crash Course; you can always come back here afterwards.

We present the Ryax framework top-down, starting with workflows and then moving to modules, stream operators, and executions. This makes it easy to go from the big picture, the data application as a whole, to the details of how to implement pieces of software on Ryax. This order works for some people but may not suit everyone. If you are curious, feel free to visit the Modules section first, or follow whatever order suits you best.

Workflow definition

Ryax lets you define an application in the form of a data pipeline. This data-driven application is called a workflow; put simply, it is a graph that explicitly shows how data flows from its origin to its outcomes. Each box on this graph is called a module. Modules are stateless pieces of software that receive some data as input and produce some data as output, see below.

Data flow from Module A to module B

By definition, a workflow links modules in a graph where each node is a module and each edge defines a relationship based on data flow. An arrow from Module A to Module B, as the workflow above shows, indicates that Module A produces data that is needed by Module B. In other words, Module B requires as input the data that Module A will output. There is an important temporal dependency here: when new data arrives at Module A, it triggers the data handler of Module A, but Module B has to wait for the output of Module A.

Transitivity is present, but only implicitly, in the graphical representation of a workflow. In the workflow below, Module A feeds Module B, which feeds Module C. This means that C receives data only once both A and B have finished, so C must execute after A and B. In this case C can collect the output of A, of B, or of both. By transitivity, C also depends on A even though no arrow directly connects A to C.
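The transitive dependency described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the `upstream` helper are assumptions made for this example, not part of Ryax.

```python
# Hypothetical in-memory representation (not a Ryax format):
# each module maps to the list of modules it streams to.
workflow = {"A": ["B"], "B": ["C"], "C": []}


def upstream(workflow, target):
    """Return every module that must finish before `target` can run."""
    # Invert the edges so we can walk from a module back to its parents.
    parents = {name: set() for name in workflow}
    for name, children in workflow.items():
        for child in children:
            parents[child].add(name)

    seen = set()
    stack = [target]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream(workflow, "C")))  # ['A', 'B']
```

C depends on A although the dictionary contains no direct A to C entry, which is exactly the implicit transitivity of the graph.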

Module types

Sources, processors, and publishers are three roles found in many data analysis applications. For example, we may retrieve data from a data lake and a data warehouse, then organize the data and produce valuable visualizations and KPIs, and finally export those to an external service or a dashboard, or perhaps send them by e-mail.

Associating a role with each module makes it easy to reuse modules across applications and to focus on the data analysis itself. By following this framework, users can completely ignore system administration aspects. Deploying applications is transparent. Listening for events or running an application periodically is easy: just plug in the right source. Exporting data to the outside world is a breeze: to send an e-mail with an alert, just reuse the send e-mail module. Another example is triggering the application on an HTTP request.

A workflow always has at least one source. Sources are the entry points of the workflow, where data arrives from outside Ryax. They can be identified graphically as the boxes with no incoming arrows: there are only arrows going out of them. Examples are a source that listens for HTTP requests or one associated with a timer. Module A is a source in the example above.

At the opposite end, modules that only have incoming arrows are publishers; they push data outside of Ryax. Publishing data could mean sending an e-mail with an attachment, updating an image on a dashboard, or pushing a file to external storage.

A workflow can have multiple sources and multiple publishers. The modules between sources and publishers are called processors. Processors are the typical building blocks of data-driven applications. They execute operations such as organizing raw data into dataframes, plotting images, or making animations.
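As a rough illustration of how the three roles relate to a module's position in the graph, the sketch below labels each module from its arrows alone. The module names, the dictionary layout, and the `classify` helper are hypothetical; in Ryax the role belongs to the module itself, so this only mirrors the graphical rule of thumb given above.

```python
# Hypothetical workflow: each module maps to the modules it streams to.
workflow = {
    "timer": ["clean"],
    "http": ["clean"],
    "clean": ["plot"],
    "plot": ["email", "dashboard"],
    "email": [],
    "dashboard": [],
}


def classify(workflow):
    """Label modules by position: no incoming arrows, no outgoing arrows, or both."""
    has_incoming = {child for children in workflow.values() for child in children}
    roles = {}
    for name, children in workflow.items():
        if name not in has_incoming:
            roles[name] = "source"
        elif not children:
            roles[name] = "publisher"
        else:
            roles[name] = "processor"
    return roles


for name, role in classify(workflow).items():
    print(f"{name}: {role}")
```

Here `timer` and `http` come out as sources, `clean` and `plot` as processors, and `email` and `dashboard` as publishers.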

Workflow

The workflow graph has constraints. Formally, it is a DAG (Directed Acyclic Graph): directed because edges are always arrows that explicitly show the flow of data, acyclic because cycles are not allowed. Below is a more complex valid workflow with various modules.

Workflow
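The acyclicity constraint can be checked mechanically. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the dictionary layout and the `execution_order` helper are assumptions made for this example and say nothing about how Ryax validates workflows internally.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(workflow):
    """Return one order that respects every arrow, or raise CycleError."""
    sorter = TopologicalSorter()
    for name, children in workflow.items():
        sorter.add(name)  # register modules that have no arrows at all
        for child in children:
            sorter.add(child, name)  # `child` depends on `name`
    return list(sorter.static_order())


# Hypothetical workflow: B feeds both C and C'.
valid = {"A": ["B"], "B": ["C", "C'"], "C": [], "C'": []}
order = execution_order(valid)
print(order)

# A cycle between A and B is rejected.
cyclic = {"A": ["B"], "B": ["A"]}
try:
    execution_order(cyclic)
    is_dag = True
except CycleError:
    is_dag = False
print(is_dag)  # False
```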

Replicating a module's output to feed several modules is directly supported. This corresponds to multiple arrows going out of a module, like Module B, which feeds Module C and Module C' above. In this case each new output produced by B is replicated as input of both C and C'.

Workflow

Joining arrows from multiple flows is more complex than forking a flow, and it requires making choices. For instance, in the workflow above, B and C both need to feed their outputs to D, so D must know how to combine data from B and C. If B and C are in complete sync this is not a problem, because D's inputs are available at exactly the same time. This synchronism may occur in a few cases, but most commonly modules are asynchronous. For instance, if B produces data twice as fast as C, what should D receive as input? Say B produces output b1 while we are still waiting for output from C, and B produces a new output b2 before c1 arrives. Should D receive the pair (b1, c1) or (b2, c1)? As a matter of fact, either solution is correct, but we need a special module role that implements the design of our choice. For this reason, there are also modules with the role of stream operators, which merge multiple flows together.
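The two possible answers to the (b1, c1) versus (b2, c1) question can be sketched as two small join policies. These functions are illustrative assumptions only; they are not the stream operators that Ryax provides.

```python
from collections import deque

# Arrival order from the example: B emits twice before C emits once.
events = [("B", "b1"), ("B", "b2"), ("C", "c1")]


def join_in_order(events):
    """Pair the n-th output of B with the n-th output of C."""
    queues = {"B": deque(), "C": deque()}
    pairs = []
    for origin, value in events:
        queues[origin].append(value)
        if queues["B"] and queues["C"]:
            pairs.append((queues["B"].popleft(), queues["C"].popleft()))
    return pairs


def join_latest(events):
    """Pair the most recent output of B with the most recent output of C."""
    latest = {}
    pairs = []
    for origin, value in events:
        latest[origin] = value
        if "B" in latest and "C" in latest:
            pairs.append((latest["B"], latest["C"]))
    return pairs


print(join_in_order(events))  # [('b1', 'c1')]
print(join_latest(events))    # [('b2', 'c1')]
```

Both results are legitimate; which one is wanted depends on the application, which is why the choice is delegated to a dedicated module.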

YAML

The textual workflow format is used by the Ryax API to create the internal representation of a workflow. Below is an example of a file that describes a workflow in YAML. The first line, apiVersion: "ryax.tech/v1alpha5", defines the API version. The second line, kind: Workflows, defines the kind of the entity as a workflow.

apiVersion: "ryax.tech/v1alpha5"
kind: Workflows
spec:
  human_name: Print a parameter on the logs
  description: >
      Simply print a parameter on the logs.
  functions:
    - id: one-run
      from: one-run
      version: "1.0"
      position:
        x: 0
        y: 0
      inputs_values: {}
      streams_to: ["print-param"]
    - id: print-param
      from: print-param
      version: "1.0"
      position:
        x: 1
        y: 1
      inputs_values:
        input_str: "hello world"

The spec holds all the practical information about a workflow. human_name is the name displayed on the UI next to the content of description. Then comes functions:, the list of modules, with one item per module present in the workflow. In this case the first module, one-run, is defined by:

    - id: one-run
      from: one-run
      version: "1.0"
      position:
        x: 0
        y: 0
      inputs_values: {}
      streams_to: ["print-param"]

The id of the function is its internal name, and from is the link to the module’s definition. version is used to keep multiple versions of the same module; it is safe to pin it to a fixed value at first. For UI purposes, position defines, through x and y, where the one-run module is placed on the UI grid: x: 0 and y: 0 mean that the one-run box is displayed in the first row and first column of the grid. The inputs_values dictionary defines static or dynamic values for all the input parameters the module expects; one-run is a gateway, that is, a source module, and does not have any input parameters. Finally, streams_to is the list of modules that will be fed with the output of one-run. Even though one-run has no output parameters, the streams_to field triggers an execution of print-param once the source module produces data.

    - id: print-param
      from: print-param
      version: "1.0"
      position:
        x: 1
        y: 1
      inputs_values:
        input_str: "hello world"

The workflow reference to the print-param module is very similar. It differs in its position on the grid, x: 1 and y: 1, and it has one entry in inputs_values: input_str, which is set to the static string hello world.
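A simple consistency check on this structure is that every name listed in streams_to matches the id of a function in the same workflow. The sketch below is an assumption-laden illustration: the dictionary mirrors the YAML above (position entries omitted for brevity) so that no YAML parser is required, and the `dangling_targets` helper is hypothetical, not part of the Ryax API.

```python
# Python dictionary mirroring the YAML workflow shown above.
workflow = {
    "apiVersion": "ryax.tech/v1alpha5",
    "kind": "Workflows",
    "spec": {
        "human_name": "Print a parameter on the logs",
        "functions": [
            {
                "id": "one-run",
                "from": "one-run",
                "version": "1.0",
                "inputs_values": {},
                "streams_to": ["print-param"],
            },
            {
                "id": "print-param",
                "from": "print-param",
                "version": "1.0",
                "inputs_values": {"input_str": "hello world"},
            },
        ],
    },
}


def dangling_targets(workflow):
    """List (function id, target) pairs whose target is not a known id."""
    functions = workflow["spec"]["functions"]
    known = {function["id"] for function in functions}
    return [
        (function["id"], target)
        for function in functions
        for target in function.get("streams_to", [])  # streams_to may be absent
        if target not in known
    ]


print(dangling_targets(workflow))  # []
```

An empty list means every arrow points to a module that exists in the workflow.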

Modules

Modules are the boxes in the workflow representation, i.e. the nodes of the DAG. They are the core abstraction of Ryax. Modules are stateless applications that receive input data, do some computation, and produce output data. Even a source or a publisher can have inputs, so this definition covers all module types in Ryax. To normalize data, modules use the Ryax data types explained next.

I/Os

To transmit data between modules, every module has inputs and outputs (I/Os).

Workflow

Each input or output carries one kind of data, expressed with one of the Ryax I/O types.

Type         Description
string       String of characters
password     String hidden on the UI
integer      64-bit integer
float        Floating-point number
longstring   Use for long text
enum         Enumeration with a list of possible values
file         File (imported and exported by Ryax)
directory    Directory containing a set of files (imported and exported by Ryax)

Note that the file and directory I/O types are fully managed by Ryax. You do not have to manage file transfers between modules yourself; Ryax takes care of them for you.

Note

Using a file or directory as a static value is of course possible, but the upload size is currently limited to 1 GB.
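To make the I/O types more concrete, here is a hypothetical check of a Python value against a type name. The `matches_type` function, the signed 64-bit range, and the choice to represent files and directories as paths are all assumptions made for this sketch; they do not describe how Ryax validates data.

```python
import os


def matches_type(value, io_type, choices=()):
    """Return True if `value` is plausible for the given I/O type name."""
    if io_type in ("string", "password", "longstring"):
        return isinstance(value, str)
    if io_type == "integer":
        # Assumption: signed 64-bit range. bool is excluded on purpose,
        # because Python treats True and False as integers.
        return (
            isinstance(value, int)
            and not isinstance(value, bool)
            and -(2**63) <= value < 2**63
        )
    if io_type == "float":
        return isinstance(value, float)
    if io_type == "enum":
        return value in choices
    if io_type in ("file", "directory"):
        # Assumption: files and directories are handed over as paths.
        return isinstance(value, (str, os.PathLike))
    raise ValueError(f"unknown I/O type: {io_type}")


print(matches_type("hello world", "string"))          # True
print(matches_type(2**63, "integer"))                 # False
print(matches_type("red", "enum", ("red", "green")))  # True
```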

Modules’ inputs contain either raw static values or references to the outputs of their parent modules. Reference values are how the results of one module’s data processing are transmitted to another module.

Workflow

Types

Currently, there are 4 types of modules: Source, Processor, Publisher, and Stream Operator.

  • Sources ingest data from external resources (like a database, a clock, a platform, …). When data arrives, a source module acts as an entry point that collects the data and converts it to the Ryax format.

  • Processors process data.

  • Publishers publish data to external services.

  • Stream Operators perform complex operations in between modules: synchronizing, buffering, or any other operation specific to stream computing. They are the only modules that can receive data from multiple modules, as they are the only ones capable of merging different streams.

The following sections give more detail on each module type.

Sources

Sources ingest external data into Ryax. They are long-lived programs that watch the outside world and dispatch executions inside Ryax based on what they observe. A source might be a service watching changes in a database, a web form, or a simple timer.

Processors

Data processors are executable modules (scripts, binaries, containers) that take some inputs, perform some computation, and produce meaningful outputs. Unlike sources, processors run only when a new event is received and are stopped afterwards.

The purer a processor is, the better. What do we mean by pure? In functional programming, a pure function is a function that has no side effects (no mutation of global variables, no I/O) and that always generates the same outputs from the same inputs. We know that some use cases fit within these restrictions while others do not; Ryax therefore supports a broader range of purity:

  • a processor may perform requests to external services, however, keep in mind that if the execution of a processor fails, we may restart it. This means that a request may be performed multiple times (if the execution fails after the request). We recommend all “external” requests to be idempotent.

  • we do not need your outputs to be the same for the same inputs. For example, you can use random generators.

  • However, you should not rely on state kept locally. A processor in Ryax is stateless. There are several ways of overcoming this limitation. If you need a common state shared between multiple modules, you can use inputs and outputs to pass on this state. If you need to share a common state between multiple executions of the same module, use an external database to store and consult it. Be careful: in some corner cases, the same request may be performed multiple times. Of course, if this database is unreachable, the module will not work.
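The recommendation to keep external requests idempotent can be illustrated with a small sketch. Everything here is hypothetical: `FakeAlertService` stands in for an external service that deduplicates on a request key, and `handle` stands in for the body of a stateless processor; neither is part of Ryax.

```python
class FakeAlertService:
    """Stand-in for an external service that deduplicates on a request key."""

    def __init__(self):
        self.sent = {}

    def send(self, key, message):
        # Repeating a call with the same key has no further effect.
        self.sent.setdefault(key, message)


def handle(service, execution_id, reading):
    """Stateless processor body: everything it needs comes from its arguments."""
    if reading > 100:
        service.send(f"alert-{execution_id}", f"reading {reading} above threshold")
    return {"reading": reading}


service = FakeAlertService()
first = handle(service, "exec-42", 120)
retry = handle(service, "exec-42", 120)  # simulated restart after a failure

print(len(service.sent))  # 1
print(first == retry)     # True
```

Because the key is derived from the execution rather than from local state, running the processor again after a failure does not send a second alert.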

Publishers

Publishers push data to external services, databases, platforms: any system that can receive data, generally related to persistent storage. They are conceptually similar to processors and differ mainly in their usage. They are also generally terminal nodes of a workflow: publishers commonly have no output arrows, although they can have some.

Stream operators

Stream operators perform complex operations between the modules of a workflow: synchronizing streams, buffering data, and running other operations specific to stream computing.

Executions

In Ryax, links between a workflow’s modules are not a mapping of the outputs and the inputs. They are not electrical links in an electronic circuit where you map an output of an AND gate to input #3 of a seven segment display.

When you link a module to another in Ryax, you are just saying “this module will be executed after this one”. This means that a given module may access all outputs of all the previous modules.

We call stream flow or tempo the timings at which data is output from a given module. In theory, it is possible to have two tempos that are exactly the same, for example 2 sensors that emit data at exactly the same time. However, in practice it rarely happens this way. There is always a small jitter.

We have stated that, in practice, two streams almost never have exactly the same tempo, so how can we merge them? The difficulty of performing such an action comes from the definition of merging, which changes depending on the situation. This is why we provide various stream operators to cover most needs in terms of stream merging.

We do not provide any guarantee on how concurrent executions of a workflow are interleaved. For example, in a workflow composed of 2 modules “A=>B”, if 2 executions 0 and 1 start at the same time, the modules may be executed in any of the following orders:

(A0, A1, B0, B1)
(A0, A1, B1, B0)
(A1, A0, B0, B1)
(A1, A0, B1, B0)
(A1, B1, A0, B0)
(A0, B0, A1, B1)
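The six orders listed above are exactly the interleavings in which, within each execution, A still runs before B. A short enumeration confirms this; the `respects_links` helper is only an illustration and is not part of Ryax.

```python
from itertools import permutations

steps = ["A0", "A1", "B0", "B1"]


def respects_links(order):
    """Within each execution, module A must run before module B."""
    return (
        order.index("A0") < order.index("B0")
        and order.index("A1") < order.index("B1")
    )


valid_orders = [order for order in permutations(steps) if respects_links(order)]
for order in valid_orders:
    print(order)
print(len(valid_orders))  # 6
```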