A scalable open source scheduling and orchestration platform

Kestra, a new open source orchestration and scheduling platform, helps developers build, run, schedule, and monitor complex pipelines.

It is built on well-known tools such as Apache Kafka and Elasticsearch. Kafka provides the scalability: each worker in the Kestra cluster is implemented as a Kafka consumer, and the state of a workflow's execution is managed by an executor implemented with Kafka Streams. Elasticsearch serves as the database, making all data available for display, search, and aggregation.
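
The division of labor between workers and executor can be illustrated with a toy model. The sketch below is not Kestra code and does not use Kafka: an in-memory queue stands in for a Kafka topic, threads stand in for workers, and a dictionary stands in for the execution state kept by the executor. The names `run_workers`, `pending`, and `state` are hypothetical.

```python
import queue
import threading


def run_workers(tasks, worker_count=3):
    """Distribute tasks across workers that all pull from one shared queue."""
    pending = queue.Queue()  # stand-in for a Kafka topic of tasks (assumption)
    for task in tasks:
        pending.put(task)

    state = {}  # stand-in for the execution state tracked by the executor
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                task_id, func = pending.get_nowait()
            except queue.Empty:
                return  # no work left, so this worker stops
            try:
                record = {"status": "SUCCESS", "result": func()}
            except Exception as error:
                record = {"status": "FAILED", "error": str(error)}
            record["worker"] = name
            with lock:
                state[task_id] = record

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state


# Five independent tasks, each squaring its own number.
tasks = [(f"task-{n}", lambda n=n: n * n) for n in range(5)]
state = run_workers(tasks)
```

Because every worker pulls from the same queue, adding workers increases throughput without changing the tasks themselves, which is the property the consumer-based design is meant to provide.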

The concept of a workflow, called a Flow in Kestra, is at the heart of the platform. A Flow is a list of tasks defined in a declarative language based on YAML. It can describe simple workflows, but it also supports more complex scenarios such as dynamic tasks and flow dependencies.
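
To make the idea of "a flow is a list of tasks" concrete, here is a minimal sketch in Python rather than in Kestra's YAML syntax. It assumes tasks run in order and that each task can read the outputs of earlier ones; the classes `Task` and `Flow`, their fields, and the example flow are all hypothetical and are not Kestra's data model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Task:
    id: str
    # The callable receives the outputs of the tasks that ran before it.
    run: Callable[[Dict[str, Any]], Any]


@dataclass
class Flow:
    id: str
    namespace: str
    tasks: List[Task] = field(default_factory=list)

    def execute(self) -> Dict[str, Any]:
        """Run the tasks in declaration order and collect their outputs."""
        outputs: Dict[str, Any] = {}
        for task in self.tasks:
            outputs[task.id] = task.run(outputs)
        return outputs


# A hypothetical three-step flow: extract, transform, load.
flow = Flow(
    id="daily-report",
    namespace="demo",
    tasks=[
        Task("extract", lambda out: [3, 1, 2]),
        Task("transform", lambda out: sorted(out["extract"])),
        Task("load", lambda out: f"loaded {len(out['transform'])} rows"),
    ],
)
outputs = flow.execute()
```

A declarative format such as YAML expresses the same structure as data instead of code, which is what lets a platform validate, display, and edit flows without executing them.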

Flows can be triggered by events, such as the results of other flows, the detection of files in Google Cloud Storage, or the results of an SQL query. Flows can also be scheduled at regular intervals based on a cron expression. Additionally, Kestra exposes an API to trigger a workflow from any application, and a flow can simply be started directly from the web UI.
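
For readers unfamiliar with cron expressions, the following sketch shows how a scheduler might decide whether a five-field expression matches a given moment. It is a simplified illustration and not Kestra's scheduler: the functions `field_matches` and `cron_matches` are hypothetical, only `*`, `*/n`, lists, and ranges are supported, and all five fields must match.

```python
from datetime import datetime


def field_matches(expr: str, value: int) -> bool:
    """Check one cron field against a value; supports *, */n, a-b, and a,b."""
    for part in expr.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # Simplification: steps count from 0, which suits minutes and hours.
            if value % int(part[2:]) == 0:
                return True
        elif "-" in part:
            low, high = (int(x) for x in part.split("-"))
            if low <= value <= high:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if the moment satisfies all five fields of the expression."""
    minute, hour, day, month, weekday = expression.split()
    cron_weekday = (moment.weekday() + 1) % 7  # cron counts Sunday as 0
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day)
        and field_matches(month, moment.month)
        and field_matches(weekday, cron_weekday)
    )


every_quarter_hour = "*/15 * * * *"  # minutes 0, 15, 30, 45 of every hour
weekday_mornings = "0 6 * * 1-5"     # 06:00, Monday through Friday
```

A scheduler built on this idea would evaluate each flow's expression once per minute and start an execution whenever it matches.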

Kestra also provides a rich web interface that lets developers edit, run, and monitor flows in real time.

[Image: the Kestra web interface]

Kestra can be used as a data orchestrator to manage complex workflows that move, transform, and load large data sets (ETL or ELT); as a distributed crontab to schedule work across multiple workers and monitor all of these processes; or as an event-driven workflow engine that reacts to external events such as API calls.

It can be deployed anywhere, for example on Kubernetes, on cloud compute instances, with Docker, or on-premises. Thanks to its pluggable architecture, functionality can be extended with plugins, such as integrations with Amazon S3, Apache Avro, Google BigQuery, and MongoDB.

The Kestra platform is similar to Apache Airflow, but Airflow relies on workflows written in Python rather than YAML.

[Image: an example flow written in YAML]

The latest version improves overall performance by reducing CPU usage and latency, and it introduces a new JDBC plugin that allows bulk queries.

The software is still relatively new: the team announced the first public release in February 2022. The latest version, 0.4.2, is available in the GitHub repository. Despite its youth, Kestra is already used in production by Leroy Merlin, one of the leading retailers in Europe.


Steven L. Nielsen