Workflows¶
Ryax is more than just a tool: it is a complete framework to develop, deploy, and maintain data-driven applications. To do so, Ryax proposes a paradigm; to fully understand the framework we must go through the concepts of workflows, modules, stream operators, and executions, all of which are explained in detail in this section. For didactic purposes, the content here is presented in a rather theoretical way, explaining the choices and the framework constraints. However, if you are in a hurry, we strongly advise you to go directly to our Crash Course; you can always come back here afterwards.
We chose to present the Ryax framework in a top-down approach, starting with workflows and then shifting to modules, stream operators, and executions. This makes it easy to go from the big picture, the data application as a whole, to a zoomed-in view of how to implement pieces of software on Ryax. While this works for some people, it might not be the case for everyone. If you feel a surge of curiosity, feel free to visit the Modules section first; follow whatever order best suits you.
Workflow definition¶
Ryax lets you define an application in the form of a data pipeline. This data-driven application is called a workflow and can be simply put as a graph that explicitly shows how data flows from origin to outcomes. Each box on this graph is called a module. Modules are stateless software components that receive some data as input and produce some data as output, see below.
By definition, workflows link modules in a graph where each node is a module and edges define relationships based on data flow. An arrow from Module A to Module B, as the workflow above shows, indicates that Module A produces data that is needed by Module B. In other words, Module B requires as input the data that Module A will output. There is an important temporal dependency here: if new data arrives at Module A, this triggers the data handler on Module A, but Module B will have to wait for the output of Module A.
Transitivity is present, but implicitly, in the workflow's graphical representation. In the workflow below, Module A feeds Module B, which feeds Module C. This means that C will receive data only when both A and B have finished, so C must execute after A and B. In this case C can collect the output from A, from B, or from both. Transitivity means that C also depends on A even though there is no arrow connecting A directly to C.
Sources, processors, and publishers are three roles found in many data analysis applications. For example, we may retrieve data from a data lake or data warehouse, then organize the data and produce valuable visualizations and KPIs, and finally export those to an external service, a dashboard, or perhaps an e-mail.
Associating a role with each module makes it easy to reuse applications and to focus on the data analysis itself. By following this framework, users can completely ignore system administration aspects. Deploying applications is transparent. Listening for events or running an application periodically is easy: just plug in the right source and you have it. Exporting data to the outside world is a breeze: want to send an alert e-mail? Just reuse the send-e-mail module. Another example is triggering the application on an HTTP request.
Workflows always have at least one source. These are the boxes without any incoming arrows. Sources are the entry point of the workflow, where data arrives from outside Ryax. Source modules can be graphically identified because only arrows go out of them. Examples are listening for an HTTP request, or being associated with a timer. Module A is a source in the example above.
On the opposite end, modules that only have incoming arrows are publishers; these push data outside of Ryax. Publishing data could be sending an e-mail with an attachment, updating an image on a dashboard, or pushing a file to an external storage.
A workflow can have multiple sources and multiple publishers. The modules between sources and publishers are called processors. Processors are the typical building blocks of data-driven applications. They execute operations like organizing raw data into dataframes, plotting images, making animations, and so on.
The workflow graph has constraints. Formally, it is a DAG (Directed Acyclic Graph): directed, because edges are always arrows that explicitly show the flow of data; acyclic, because cycles are not allowed. Below is a more complex valid workflow with various modules.
Replicating a module's output to feed several modules is directly supported. In other words, multiple arrows can go out of a module, like Module B, which feeds Module C and Module C' above. In this case each new output produced by B is replicated as input to C and C'.
Joining arrows from multiple flows together is rather complex compared to forking flows, and it requires making choices. For instance, in the workflow above, B and C both need to feed their outputs to D, so D must know how to combine data from B and C. If B and C are in complete sync this is not a problem, because their outputs would be available at the exact same time. Perhaps in a few cases this synchrony occurs, but most commonly modules are asynchronous. For instance, if B produces data twice as fast as C, then what should D receive as input? Let us say B produces output data b1 while we are still waiting for output from C, and imagine that before data c1 arrives, B produces new data b2. What should D receive as input: the pair (b1, c1) or (b2, c1)? As a matter of fact, any of these solutions can be correct, but we need a special module role that implements the design of our choice. For this reason, we also have modules with the role of stream operators, which merge multiple flows together.
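As a hedged sketch, again in the workflow YAML format described later in this section (all module ids are hypothetical, including the `merge-streams` stream operator), such a merge could be declared by routing both flows through a stream operator that implements the chosen pairing policy:

```yaml
# Hypothetical fragment: B and C both stream to a stream operator,
# which decides how to pair (b_i, c_j) before feeding D.
- id: module-b
  from: module-b
  version: "1.0"
  inputs_values: {}
  streams_to: ["merge-streams"]
- id: module-c
  from: module-c
  version: "1.0"
  inputs_values: {}
  streams_to: ["merge-streams"]
- id: merge-streams       # stream operator: merges the two flows
  from: merge-streams
  version: "1.0"
  inputs_values: {}
  streams_to: ["module-d"]
- id: module-d
  from: module-d
  version: "1.0"
  inputs_values: {}
```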
YAML¶
The workflow textual format is used by the Ryax API to create the internal representation of the workflow. Below is an example of a file that defines a workflow in YAML. The first line, apiVersion: "ryax.tech/v1alpha5", defines the API version. The second line, kind: Workflows, defines the kind of the entity as a workflow.
apiVersion: "ryax.tech/v1alpha5"
kind: Workflows
spec:
  human_name: Print a parameter on the logs
  description: >
    Simply print a parameter on the logs.
  functions:
    - id: one-run
      from: one-run
      version: "1.0"
      position:
        x: 0
        y: 0
      inputs_values: {}
      streams_to: ["print-param"]
    - id: print-param
      from: print-param
      version: "1.0"
      position:
        x: 1
        y: 1
      inputs_values:
        input_str: "hello world"
The spec has entries for all the practical information of a workflow. human_name is the information that will be displayed on the UI next to the description content. Then follows the list of modules, functions:, with one item per module. In this case, the first module, one-run, is defined by:
- id: one-run
  from: one-run
  version: "1.0"
  position:
    x: 0
    y: 0
  inputs_values: {}
  streams_to: ["print-param"]
The id of the function is its internal name, and from is the link to the module's definition. version is used to keep multiple versions of the same module; it is safe to pin it to a given value at first. For UI purposes, position with x and y defines the position of the one-run module on the UI grid: x: 0 and y: 0 means the one-run box will be displayed in the first row and first column of the grid. The dictionary inputs_values defines static or dynamic values for all expected input parameters of the module; one-run is a source and does not have any input parameters. Finally, streams_to is a list of the modules that will be fed with one-run's output. Even though one-run has no output parameters, the streams_to field will trigger an execution of print-param once the source module produces data.
- id: print-param
  from: print-param
  version: "1.0"
  position:
    x: 1
    y: 1
  inputs_values:
    input_str: "hello world"
The print-param module's workflow reference is very similar. The differences are its position on the grid, x: 1 and y: 1, and that it has one entry in inputs_values: input_str, which is set to the static string value hello world.
Modules¶
Modules are the boxes in the workflow representation, i.e. the nodes of the DAG. They are the core abstraction of Ryax. Modules are stateless applications that receive input data, do some computation, and produce output data. Even a source module or a publisher can have inputs, so this definition covers all module types in Ryax. To normalize data, modules use the Ryax data types explained next.
I/Os¶
To transmit data between modules, every module has inputs and outputs (I/Os).
Each input and output is typed with one of the Ryax I/O types below.
| Type | Description |
|---|---|
| string | String of characters |
| password | String hidden on the UI |
| integer | 64-bit integer |
| float | Floating-point number |
| longstring | Used for long text |
| enum | Enumeration with a list of possible values |
| file | File (imported and exported by Ryax) |
| directory | Directory containing a set of files (imported and exported by Ryax) |
Note that the file and directory I/O types are fully managed by Ryax. You don't have to manage file transfers between modules yourself; Ryax takes care of it for you.
Ryax silently ignores symlinks and non-readable files and directories placed inside a directory that is transferred via a directory I/O.
Note
Using a file or directory as a static value is of course possible, but the upload size is currently limited to 1 GB.
Modules’ inputs contain either raw static values or references to their parent modules’ outputs. Using reference values is how you transmit the data processing results of one module to another.
Types¶
Currently, there are 4 types of modules: Source, Processor, Publisher, and Stream Operator.
Sources ingest data from external resources (like a database, a clock, a platform, …). When data arrives, a source module acts as an entry point that collects the data and converts it to the Ryax format.
Processors process data
Publishers publish data to external services
Stream Operators perform complex operations in-between modules, to synchronize, buffer, or do any other operation specific to stream computing. Thus, they are the only modules that can receive data from multiple modules, as they are the only ones capable of merging different streams.
The following sections give more detail on each module type.
Sources¶
Sources ingest external data inside Ryax. They are long-lived programs that are able to look outside and dispatch executions inside Ryax from their observations. A source might be a service watching changes in a database, a web form, or a simple timer.
Processors¶
Data processors are executable modules (scripts, binaries, containers) that take some inputs, perform some computation, and produce meaningful outputs. Contrary to Sources, Processors only run when a new event is received; they are stopped after that.
The purer a processor is, the better. What do we mean by pure? In functional programming, a pure function is a function that has no side effects (no mutation of some global variables, no I/O) and that from the same inputs always generates the same outputs. We know that some use cases fit into these restrictions however others do not, thus, Ryax supports a broader range of purity:
a processor may perform requests to external services, however, keep in mind that if the execution of a processor fails, we may restart it. This means that a request may be performed multiple times (if the execution fails after the request). We recommend all “external” requests to be idempotent.
we do not need your outputs to be the same for the same inputs. For example, you can use random generators.
However, you should not rely on state kept locally: a processor in Ryax is stateless. There are several ways of overcoming this limitation. If you need a common state shared between multiple modules, you can use inputs and outputs to pass this state along. If you need to share a common state between multiple executions of the same module, use an external database to store and consult it. Be careful: in some corner cases, the same request may be performed multiple times. And of course, if this database is unreachable, the module will not work.
Publishers¶
Publishers push data to external services, databases, platforms – any system that can receive data, generally related to persistent storage. They are conceptually similar to processors and mainly differ in their usage. They also diverge in the sense that they are generally terminal nodes of a workflow: having no outgoing arrows is common practice for publishers, although they can have them.
Stream operators¶
Stream operators perform complex operations between the modules of a workflow: synchronizing streams, buffering data, and other operations specific to stream computing.
Executions¶
An execution is a run of a module. Thus, it has at least some input data (the inputs of the module, filled with static data and the outputs of the previous modules) and a status. The status may be running, error, done, and so on.
The logs of a run are stored with its execution; Ryax keeps the last 1 MB of logs.
Execution order¶
In Ryax, links between a workflow’s modules are not a mapping of outputs to inputs. They are not like electrical links in an electronic circuit, where you map the output of an AND gate to input #3 of a seven-segment display.
When you link a module to another in Ryax, you are just saying “this module will be executed after this one”. This means that a given module may access all outputs of all the previous modules.
We call stream flow or tempo the timings at which data is output from a given module. In theory, it is possible to have two tempos that are exactly the same, for example 2 sensors that emit data at exactly the same time. However, in practice it rarely happens this way. There is always a small jitter.
We have stated that two streams almost never have exactly the same tempo, so how can we merge them? The difficulty of performing such an action comes from the definition of merging, which changes depending on the situation. This is why we provide various stream operators to cover most needs in terms of stream merging.
We do not provide any guarantee on the order of execution of the modules in a workflow. For example, in a workflow composed of 2 modules “A=>B”, if 2 executions 0 and 1 start at the same time, we may execute the modules in any of the following orders:
(A0, A1, B0, B1)
(A0, A1, B1, B0)
(A1, A0, B0, B1)
(A1, A0, B1, B0)
(A1, B1, A0, B0)
(A0, B0, A1, B1)