Concepts
This page explains the key concepts behind AkôFlow: how workflows are modeled, how tasks are scheduled, how execution state is managed, and how provenance is captured. Understanding these concepts helps you write better workflows and make informed decisions about scheduling and deployment.
The workflow model
Workflows as DAGs
AkôFlow represents workflows as Directed Acyclic Graphs (DAGs). Formally, a workflow is a graph G = (V, A, a, ω) where:
- `V = N ∪ D` — the node set, composed of transformations `N` (tasks) and data items `D` (files/artifacts)
- `A` — directed edges encoding data dependencies between transformations and data items
- `a_i` — computational cost associated with each transformation `i ∈ N`
- `ω_ij` — communication cost associated with edge `(i, j) ∈ A`
In practice, this means:
- Every task in the graph is always preceded and succeeded by a data item, reflecting a structured dataflow
- A task cannot start until all its input data items are available (i.e., all upstream tasks have finished producing them)
- Tasks with no dependencies between them can run in parallel
YAML representation
You express this DAG in a YAML file. The `activities` list defines the transformation nodes, and the optional `dependsOn` field encodes the edges:
```yaml
name: etl-pipeline
spec:
  image: "python:3.11"
  namespace: "akoflow"
  mountPath: "/data"
  activities:
    - name: "ingest"
      cpuLimit: 0.5
      memoryLimit: 512Mi
      run: |
        python ingest.py --source /data/raw --output /data/ingested
    - name: "deduplicate"
      dependsOn: ["ingest"]
      cpuLimit: 1.0
      memoryLimit: 1Gi
      run: |
        python dedup.py --input /data/ingested --output /data/clean
    - name: "predict-us"
      dependsOn: ["deduplicate"]
      cpuLimit: 2.0
      memoryLimit: 4Gi
      run: |
        python predict.py --region us --input /data/clean --output /data/results-us
    - name: "predict-eu"
      dependsOn: ["deduplicate"]
      cpuLimit: 2.0
      memoryLimit: 4Gi
      run: |
        python predict.py --region eu --input /data/clean --output /data/results-eu
    - name: "aggregate"
      dependsOn: ["predict-us", "predict-eu"]
      cpuLimit: 0.5
      memoryLimit: 512Mi
      run: |
        python aggregate.py --inputs /data/results-us /data/results-eu --output /data/final
```
In this example:
- `ingest` has no dependencies — it runs immediately
- `deduplicate` runs after `ingest` finishes
- `predict-us` and `predict-eu` both depend on `deduplicate` but not on each other — they run in parallel (see the sketch after this list)
- `aggregate` waits for both predictions to complete
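To make the dependency rule concrete, here is a minimal Go sketch (illustrative only, not AkôFlow's source) that derives the set of runnable tasks from the example's `dependsOn` edges:

```go
package main

import "fmt"

// The dependency edges of the etl-pipeline example above.
var deps = map[string][]string{
	"ingest":      {},
	"deduplicate": {"ingest"},
	"predict-us":  {"deduplicate"},
	"predict-eu":  {"deduplicate"},
	"aggregate":   {"predict-us", "predict-eu"},
}

// readyTasks returns every unfinished task whose upstream tasks have all
// finished, i.e. whose input data items are available.
func readyTasks(finished map[string]bool) []string {
	var ready []string
	for task, upstream := range deps {
		if finished[task] {
			continue
		}
		ok := true
		for _, dep := range upstream {
			if !finished[dep] {
				ok = false
				break
			}
		}
		if ok {
			ready = append(ready, task)
		}
	}
	return ready
}

func main() {
	finished := map[string]bool{"ingest": true, "deduplicate": true}
	// Prints predict-us and predict-eu (in either order): they are
	// independent of each other, so both become Ready at the same time.
	fmt.Println(readyTasks(finished))
}
```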
Task lifecycle
Every task in AkôFlow moves through a well-defined sequence of states:
```
Pending
   │
   │ (all dependencies finished)
   ▼
Ready ──→ (enters scheduling queue)
   │
   │ (AkôScore selects a node)
   ▼
In Execution ──→ (container running on a Worker)
   │
   ├──→ Finished (outputs available for downstream tasks)
   │
   └──→ Failed (fault-handling may apply)
```
| State | Meaning |
|---|---|
| Pending | Task is waiting for one or more upstream tasks to finish |
| Ready | All dependencies are satisfied; task is in the scheduling queue |
| In Execution | Task has been assigned to a node and its container is running |
| Finished | Task completed successfully; its outputs are available to downstream tasks |
| Failed | Task encountered an error during execution |
The Orchestrator runs in a continuous loop, checking for newly Ready tasks and dispatching them to Workers via the AkôScore scheduler.
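The lifecycle can be read as a small state machine. The sketch below (a hedged illustration, not AkôFlow's implementation) encodes the legal transitions from the table above:

```go
package main

import "fmt"

// State models the task lifecycle described above.
type State int

const (
	Pending State = iota
	Ready
	InExecution
	Finished
	Failed
)

func (s State) String() string {
	return [...]string{"Pending", "Ready", "In Execution", "Finished", "Failed"}[s]
}

// next lists the legal transitions out of each state.
func next(s State) []State {
	switch s {
	case Pending:
		return []State{Ready} // all dependencies finished
	case Ready:
		return []State{InExecution} // AkôScore selected a node
	case InExecution:
		return []State{Finished, Failed} // container exited
	default:
		return nil // Finished and Failed are terminal
	}
}

func main() {
	for s := Pending; s <= InExecution; s++ {
		fmt.Println(s, "→", next(s))
	}
}
```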
Execution strategies
AkôFlow supports two execution strategies that control when tasks become eligible for scheduling:
First-Data-First (FDF)
Tasks are dispatched as soon as their input data becomes available. This enables pipelined execution — downstream tasks can start processing partial results while upstream tasks are still running on other data partitions. FDF maximizes resource utilization and is well-suited for streaming or data-parallel workflows.
First-Activity-First (FAF)
AkôFlow enforces synchronization at each level of the workflow. All tasks at a given DAG depth must complete before any task at the next depth starts. FAF provides simpler execution semantics and is useful when strict data consistency between levels is required.
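The difference is easiest to see on a concrete DAG. The sketch below (illustrative; it extends the earlier example with a hypothetical `report-us` task that consumes only the US predictions) shows a situation where FDF would dispatch a task that FAF still holds back:

```go
package main

import "fmt"

// The example DAG, extended with a hypothetical report-us task.
var deps = map[string][]string{
	"ingest":      {},
	"deduplicate": {"ingest"},
	"predict-us":  {"deduplicate"},
	"predict-eu":  {"deduplicate"},
	"report-us":   {"predict-us"},
}

// depsDone reports whether all of a task's own inputs are available.
func depsDone(t string, finished map[string]bool) bool {
	for _, d := range deps[t] {
		if !finished[d] {
			return false
		}
	}
	return true
}

// level is the task's depth in the DAG (0 for roots).
func level(t string) int {
	max := 0
	for _, d := range deps[t] {
		if l := level(d) + 1; l > max {
			max = l
		}
	}
	return max
}

// FDF: eligible as soon as the task's own inputs exist.
func eligibleFDF(t string, finished map[string]bool) bool {
	return !finished[t] && depsDone(t, finished)
}

// FAF: additionally, every task at a shallower level must have finished.
func eligibleFAF(t string, finished map[string]bool) bool {
	if !eligibleFDF(t, finished) {
		return false
	}
	for u := range deps {
		if level(u) < level(t) && !finished[u] {
			return false
		}
	}
	return true
}

func main() {
	// predict-us is done, but predict-eu (same DAG level) is still running.
	finished := map[string]bool{"ingest": true, "deduplicate": true, "predict-us": true}
	fmt.Println(eligibleFDF("report-us", finished)) // true: its input is ready
	fmt.Println(eligibleFAF("report-us", finished)) // false: level not yet drained
}
```

Under FDF the US branch keeps flowing while the EU prediction runs; under FAF the whole level must drain before anything downstream starts.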
AkôScore — the scheduling function
Why scheduling matters for containerized workflows
Containers executing on the same host share its CPU and memory. Poor scheduling leads to two failure modes:
- Underutilization — resources sit idle, making executions slower and more expensive (especially in pay-as-you-go cloud environments)
- Overutilization — too many memory-hungry tasks land on the same node, causing performance degradation or task failures
Scheduling is especially hard because the memory consumption of a task can vary substantially depending on the characteristics of its input data — making static, pre-computed schedules unreliable.
The AkôScore formulation
AkôScore is evaluated at scheduling time, for each (task, node) pair. For a task i ready to run, AkôFlow computes a score for every available node j:
```
S(i,j) = A(i,j) × [ α × (M_free(j) − M_req(i)) / M_max
                  + (1−α) × 1 / T(i,j) ]
```
Where:
- `A(i,j)` — feasibility indicator: `1` if node `j` has enough free CPU and memory for task `i`, `0` otherwise
- `α` — user-defined weight (`0` to `1`): controls the trade-off between memory and speed
- `M_free(j)` — memory currently available on node `j`
- `M_req(i)` — memory required by task `i`
- `M_max` — maximum memory capacity among all nodes (used for normalization)
- `T(i,j)` — estimated execution time of task `i` on node `j`, based on the node's processing capacity
The node with the highest AkôScore is selected. If all nodes score 0 (none has sufficient resources), the task stays in the Ready queue until the next scheduling cycle.
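Transcribing the formula into code can make the trade-off tangible. The sketch below uses hypothetical `Task` and `Node` types (not AkôFlow's API) and evaluates S(i,j) for two nodes at several α values:

```go
package main

import "fmt"

// Minimal illustrative types; AkôFlow's real task/node records differ.
type Task struct {
	ReqCPU, ReqMem float64 // cores, GiB
	Work           float64 // abstract work units
}

type Node struct {
	Name             string
	FreeCPU, FreeMem float64 // currently free resources
	Capacity         float64 // relative processing speed (work units/s)
}

// akoScore transcribes S(i,j): zero when the node is infeasible, otherwise
// a weighted sum of the normalized memory headroom and the inverse of the
// estimated execution time T(i,j) = Work / Capacity.
func akoScore(t Task, n Node, alpha, mMax float64) float64 {
	if n.FreeCPU < t.ReqCPU || n.FreeMem < t.ReqMem {
		return 0 // A(i,j) = 0
	}
	memTerm := (n.FreeMem - t.ReqMem) / mMax
	timeTerm := n.Capacity / t.Work // = 1 / T(i,j)
	return alpha*memTerm + (1-alpha)*timeTerm
}

func main() {
	task := Task{ReqCPU: 2, ReqMem: 4, Work: 10}
	nodes := []Node{
		{Name: "big-slow", FreeCPU: 8, FreeMem: 32, Capacity: 5},
		{Name: "small-fast", FreeCPU: 4, FreeMem: 8, Capacity: 10},
	}
	mMax := 32.0 // largest memory capacity in the pool
	for _, alpha := range []float64{0.0, 0.5, 1.0} {
		for _, n := range nodes {
			fmt.Printf("α=%.1f  %-10s  S=%.3f\n", alpha, n.Name, akoScore(task, n, alpha, mMax))
		}
	}
	// α=0.0 picks small-fast (pure speed); α=1.0 picks big-slow (memory term).
}
```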
Tuning α
| α value | Effect |
|---|---|
| `0.0` | Minimize makespan — assign tasks to the fastest available node |
| `1.0` | Maximize memory fit — prefer nodes where memory availability closely matches task requirements |
| `0.5` | Balanced — equal weight to speed and memory |
Per-environment scheduling policies
A key feature of AkôScore is that it can be configured per environment. This means the same workflow can use different scheduling strategies across its deployment targets:
- In a cloud environment: optimize for monetary cost
- In an HPC cluster: optimize for energy consumption
- Locally: optimize for makespan (to finish quickly during development)
Custom scheduling via plugins
AkôScore is extensible. The Engine loads scheduling functions at runtime as Go plugins (compiled .so files). To implement a custom policy:
- Write a Go function with the signature `func AkoScore(input any) float64`
- Compile it as a shared library
- Register it in AkôFlow and reference it by name in your workflow YAML:

```yaml
spec:
  schedule: "my-cost-aware-policy"
```
The Engine will call your function with a map containing the task's resource requirements and the node's current state, and use the returned score for placement decisions.
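For orientation, here is what such a plugin might look like. The function signature matches the one above, but the map keys (`memFree`, `memReq`, `costPerHour`) are assumptions for illustration; consult the plugin input contract for the names AkôFlow actually passes:

```go
// Build as a Go plugin:  go build -buildmode=plugin -o my-cost-aware-policy.so
//
// A hedged sketch of a custom cost-aware policy. The map keys below are
// illustrative assumptions, not AkôFlow's documented contract.
package main

// AkoScore prefers the cheapest node that still fits the task in memory.
func AkoScore(input any) float64 {
	m, ok := input.(map[string]any)
	if !ok {
		return 0
	}
	memFree, _ := m["memFree"].(float64)
	memReq, _ := m["memReq"].(float64)
	cost, _ := m["costPerHour"].(float64)
	if memFree < memReq || cost <= 0 {
		return 0 // infeasible or unpriced node
	}
	return 1.0 / cost // cheaper nodes score higher
}
```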
Provenance
Why provenance matters
Provenance answers questions like:
- Which transformation produced this output file?
- What input data was used, and where did it come from?
- Which node ran this task, at what time, and with what resource utilization?
- If an output is wrong, which tasks and data artifacts are implicated?
AkôFlow treats provenance as a first-class component — not an optional logging feature. It is captured continuously at runtime, across all environments, and queryable at any point during or after execution.
The provenance data model
AkôFlow's provenance is stored in a five-table relational schema (SQLite per Engine instance, aggregated by the Control Plane):
| Table | What it records |
|---|---|
| `Workflow` | Submission metadata: namespace, spec file path, execution state, timestamps |
| `Activity` | Each task execution: parent workflow, assigned node, state, scheduling metadata |
| `Metrics` | Resource utilization samples collected every 15 seconds per running task: CPU %, memory, wall-clock time |
| `Errors` | Standard output and error streams for each task, preserving the full execution trace |
| `Files` | All data artifacts produced during execution |
Data lineage graph
From the `Files` and `Activity` tables, AkôFlow constructs a data lineage graph — a W3C PROV-compliant directed graph:
```
[input-file] ──used──▶ [task-A] ──wasGeneratedBy──▶ [output-file-A]
                                                           │
                                                          used
                                                           ▼
                                                        [task-B] ──wasGeneratedBy──▶ [final-output]
```
The graph is built by diffing the file system state before and after each task runs:
- Files present at the end but not at the start → produced by the task (`wasGeneratedBy`)
- Files present at the start → consumed by the task (`used`)
This approach requires no instrumentation of user code — provenance is captured transparently at the storage layer.
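A minimal sketch of this diffing idea, assuming a plain directory walk (AkôFlow's storage-layer capture may differ):

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

// snapshot records every regular file under root.
func snapshot(root string) map[string]bool {
	files := map[string]bool{}
	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err == nil && !d.IsDir() {
			files[path] = true
		}
		return nil
	})
	return files
}

// lineage derives the two edge types from snapshots taken around a task run.
func lineage(task string, before, after map[string]bool) {
	for f := range before {
		fmt.Printf("%s ──used──▶ %s\n", f, task) // present at start: consumed
	}
	for f := range after {
		if !before[f] {
			fmt.Printf("%s ──wasGeneratedBy──▶ %s\n", task, f) // new: produced
		}
	}
}

func main() {
	before := snapshot("/data")
	// ... the task's container runs here ...
	after := snapshot("/data")
	lineage("task-A", before, after)
}
```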
Three levels of provenance
AkôFlow captures provenance at three levels:
| Level | What it tracks |
|---|---|
| Data-level | Relationships between data artifacts and transformations (lineage graph) |
| Execution-level | Runtime behavior: task states, container placement, scheduling decisions, resource utilization |
| Environment-level | Infrastructure context: node configuration, runtime type, scheduling parameters used |
Querying provenance
Provenance data can be explored through:
- The web UI — visual lineage graphs with filtering by task or artifact
- SQL queries — direct queries against the SQLite provenance database for targeted inspection (see the sketch after this list)
- W3C PROV export — export the full graph in PROV-compliant format for use with external tools
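As a sketch of the SQL route, the Go program below opens the Engine's SQLite database and asks where each task of a workflow ran. The file path and column names are illustrative assumptions, not the documented schema:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // any SQLite driver will do
)

func main() {
	// Database path and column names are illustrative guesses; check your
	// Engine instance's actual provenance schema before running.
	db, err := sql.Open("sqlite3", "akoflow.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Which node ran each task of a given workflow, and in what state?
	rows, err := db.Query(
		`SELECT name, node, state FROM Activity WHERE workflow = ?`,
		"etl-pipeline",
	)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var name, node, state string
		if err := rows.Scan(&name, &node, &state); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-12s ran on %-10s state=%s\n", name, node, state)
	}
}
```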
The computing continuum
AkôFlow is designed for execution across the computing continuum — the full spectrum of computational infrastructure from edge devices to supercomputers:
Local machine → On-premise cluster → HPC system → Private cloud → Public cloud (AWS, GCP, ...)
Most workflow engines are optimized for one point on this spectrum. AkôFlow abstracts infrastructure-specific dependencies through containerization and Terraform-based provisioning, allowing the same workflow to run across any combination of these environments.
A practical example: a workflow processing user data under GDPR must keep European data in EU-based infrastructure. AkôFlow can distribute the same DAG across environments — EU-region cloud for European data, US-region cloud for American data — with the Control Plane coordinating execution and aggregating results.
Supported environments
| Environment | Runtime |
|---|---|
| Local machine | Docker |
| On-premise cluster | Docker, Kubernetes |
| HPC system | Singularity (where Docker is unavailable) |
| AWS | Kubernetes (EKS), Docker |
| GCP | Kubernetes (GKE), Docker |
| Any Kubernetes cluster | Kubernetes |