Running jobs on Kubernetes
Setup
We will be using a virtual machine in the faculty's cloud.
When creating a virtual machine in the Launch Instance window:
- Name your VM using the following convention: cc_lab<no>_<username>, where <no> is the lab number and <username> is your institutional account.
- Select Boot from image in Instance Boot Source section
- Select CC Template in Image Name section
- Select the g.large flavor.
Creating a Kubernetes cluster
As in the previous laboratories, we will create a cluster on the lab machine, using the kind create cluster command:
student@lab-kubernetes:~$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.23.4) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Thanks for using kind! 😊
It is recommended that you use port-forwarding instead of X11 forwarding to interact with the UI.
Kubernetes Jobs
Introduction to Batch Workloads
In the context of cloud computing, up until now we have only interacted with applications or services that run indefinitely: once started, they are never stopped unless an error occurs.
However, this does not cover most use cases in distributed computing. Many processing tasks are batch workloads - discrete units of work that:
- Run to completion
- Process a specific dataset or task
- Exit when finished
- Should not be automatically restarted after successful completion
Examples of batch workloads include:
- Data processing: ETL (Extract, Transform, Load) pipelines
- Machine learning: Training models, batch inference
- Report generation: Periodic analytics and exports
- Backup and archival: Database backups, log aggregation
- Video/image processing: Transcoding, thumbnail generation
- Scientific computing: Simulations, numerical analysis
At its core, Kubernetes is a scheduler that places workloads on nodes, which makes it well suited for running batch processing jobs.
Jobs vs Pods
A Kubernetes Job should be used instead of a Pod when:
- The workload has a defined start and end
- You expect the action to finish successfully
- You don't want resources lingering in the cluster after completion
- You need guarantees about completion and retry behavior
Key differences:
| Feature | Pod | Job |
|---|---|---|
| Lifecycle | Long-running | Run-to-completion |
| Restart behavior | Restarts indefinitely on failure | Controlled retry with backoffLimit |
| Completion tracking | N/A | Tracks successful completions |
| Resource cleanup | Runs forever unless deleted | Can be automatically cleaned up |
| Use case | Services, daemons | Batch processing, one-time tasks |
The object that manages a discrete work item in Kubernetes is called a Job. It contains a specification for a container, as we are used to from Pod specifications.
The example below defines a job which prints a debug message:
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-world-job
spec:
  template:
    spec:
      containers:
        - name: hello-world
          image: ghcr.io/containerd/busybox
          command: ["echo", "Hello from Kubernetes batch job!"]
      restartPolicy: Never
  backoffLimit: 4
When applying the above manifest, we can see that the Job is created, and we can inspect its output as follows:
student@lab-jobs:~/$ kubectl apply -f hello-world.yaml
job.batch/hello-world-job created
student@lab-jobs:~/$ kubectl get jobs
NAME COMPLETIONS DURATION AGE
hello-world-job 0/1 0s 0s
student@lab-jobs:~/$ kubectl logs job/hello-world-job
Hello from Kubernetes batch job!
The above example is useful for quick and dirty jobs, but when running in an actual batch environment there are other factors to take into account:
- add resource requests and limits, to increase scheduling accuracy and system cohesion;
- use a custom job script;
- add fail conditions;
- limit job duration.
The following example creates a more complex job which runs a custom Python script, limits its resources and requests a restart if the application fails:
apiVersion: batch/v1
kind: Job
metadata:
  name: matrix-multiplication-job
spec:
  template:
    spec:
      containers:
        - name: matrix-multiply
          image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/python:3.9-slim
          command: ["bash", "-c"]
          args:
            - |
              pip install numpy && python /scripts/matrix_multiply.py
          volumeMounts:
            - name: script-volume
              mountPath: /scripts
            - name: pip-local
              mountPath: /.local
            - name: pip-local
              mountPath: /.cache
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
      volumes:
        - name: script-volume
          configMap:
            name: matrix-multiplication-script
        - name: pip-local
          emptyDir: {}
      restartPolicy: OnFailure
  backoffLimit: 2
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: matrix-multiplication-script
data:
  matrix_multiply.py: |
    import numpy as np
    import time
    import os
    # Create large matrices
    size = 5000
    print(f'Creating {size}x{size} matrices...')
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    # Perform CPU-intensive matrix multiplication
    print('Starting matrix multiplication...')
    start_time = time.time()
    result = np.matmul(a, b)
    duration = time.time() - start_time
    print(f'Matrix multiplication complete in {duration:.2f} seconds')
    print(f'Result matrix shape: {result.shape}')
The requests dict is used for scheduling: it states the minimum resources the container needs, and the scheduler uses it when choosing a node for placement.
The limits dict specifies the hard limits imposed on the container, which it cannot exceed.
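To make the role of requests more concrete, the following is a minimal Python sketch of the kind of comparison involved in placement. The function names (`parse_cpu`, `parse_memory`, `fits`) are hypothetical, and the sketch only understands the quantity forms used in this lab (`"2"`, `"500m"`, `"4Gi"`); the real scheduler considers many more factors, such as taints, affinity and allocatable capacity.

```python
# Binary suffixes used by Kubernetes memory quantities (subset, for illustration).
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity to cores: '500m' is half a core, '2' is two cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '4Gi' to bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def fits(node_free: dict, requests: dict) -> bool:
    """Return True if the free resources of a node cover the container's requests."""
    return (
        parse_cpu(requests["cpu"]) <= parse_cpu(node_free["cpu"])
        and parse_memory(requests["memory"]) <= parse_memory(node_free["memory"])
    )


if __name__ == "__main__":
    requests = {"cpu": "2", "memory": "4Gi"}
    print(fits({"cpu": "4", "memory": "8Gi"}, requests))
    print(fits({"cpu": "1", "memory": "8Gi"}, requests))
```

Only the requests take part in this check; the limits are enforced later, at runtime, on the node where the container runs.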
As with a regular Pod, ConfigMaps, Secrets and other Kubernetes objects can be mounted into the container.
Let's run it and see its output:
student@lab-jobs:~/$ kubectl logs job/matrix-multiplication-job
Collecting numpy
Downloading numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
βββββββββββββββββββββββββββββββββββββββ 19.5/19.5 MB 101.7 MB/s eta 0:00:00
Installing collected packages: numpy
Successfully installed numpy-2.0.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 23.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Creating 5000x5000 matrices...
Starting matrix multiplication...
Matrix multiplication complete in 14.20 seconds
Result matrix shape: (5000, 5000)
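As a rough sanity check of the resource requests against the workload, the sketch below estimates the memory and floating-point work of the script. It assumes the three matrices (a, b and the result) are the dominant memory consumers and counts a multiply and an add per inner-loop step; temporary buffers used internally by NumPy are ignored.

```python
size = 5000                 # same value as in matrix_multiply.py
bytes_per_float64 = 8       # np.random.rand returns float64 values

matrix_bytes = size * size * bytes_per_float64
total_bytes = 3 * matrix_bytes      # a, b and result
flops = 2 * size ** 3               # one multiply and one add per inner step

print(f"One matrix: {matrix_bytes / 1e6:.0f} MB")
print(f"Three matrices: {total_bytes / 1e6:.0f} MB")
print(f"Floating-point operations: {flops:.2e}")
```

Under these assumptions the data occupies well under 1 GB, so the 4Gi request leaves generous headroom for the interpreter and the pip installation step.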
Job Configuration Options
Jobs provide several configuration options to control their behavior:
completions
Specifies the number of successful pod completions needed for the job to be considered complete.
spec:
  completions: 5 # Job completes after 5 successful pod runs
parallelism
Controls how many pods run simultaneously. Useful for processing large datasets in parallel.
spec:
  completions: 10
  parallelism: 3 # Run 3 pods at a time until 10 completions
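The interaction between completions and parallelism can be sketched in a few lines of Python. This is a simplified model with a hypothetical `batches` function: it assumes every pod succeeds and that all pods in a batch finish together, whereas the real Job controller starts a replacement pod as soon as any running pod finishes.

```python
def batches(completions: int, parallelism: int) -> list:
    """Return how many pods run in each round until all completions are reached."""
    rounds = []
    remaining = completions
    while remaining > 0:
        running = min(parallelism, remaining)  # never start more pods than still needed
        rounds.append(running)
        remaining -= running
    return rounds


if __name__ == "__main__":
    print(batches(10, 3))
```

With 10 completions and a parallelism of 3, the model needs four rounds, the last of which runs a single pod.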
activeDeadlineSeconds
Sets a timeout for the entire job. If the job doesn't complete within this time, it's terminated.
spec:
  activeDeadlineSeconds: 3600 # Job fails if not done in 1 hour
backoffLimit
Number of retries before marking the job as failed. Default is 6.
spec:
  backoffLimit: 3 # Retry up to 3 times on failure
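Retries are not immediate: according to the Kubernetes documentation, failed pods are recreated with an exponential back-off delay (10s, 20s, 40s, ...) capped at six minutes. The sketch below computes that sequence; the function name `backoff_delays` is hypothetical and the model ignores the jitter and timing details of the real controller.

```python
def backoff_delays(backoff_limit: int, base: int = 10, cap: int = 360) -> list:
    """Return the approximate delay in seconds before each retry."""
    return [min(base * 2 ** attempt, cap) for attempt in range(backoff_limit)]


if __name__ == "__main__":
    for limit in (3, 6):
        delays = backoff_delays(limit)
        print(f"backoffLimit={limit}: delays={delays}, total wait={sum(delays)}s")
```

This is useful when choosing a value: with the default of 6, a persistently failing job may spend around ten minutes waiting between retries before it is marked as failed.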
ttlSecondsAfterFinished
Automatically cleans up the job after completion or failure.
spec:
  ttlSecondsAfterFinished: 86400 # Delete job 24 hours after completion
Job Patterns
Kubernetes Jobs support several common patterns for batch processing:
Pattern 1: Single Completion Job
The simplest pattern - run one pod to completion.
apiVersion: batch/v1
kind: Job
metadata:
  name: single-task
spec:
  template:
    spec:
      containers:
        - name: task
          image: busybox
          command: ["sh", "-c", "echo Processing task && sleep 10"]
      restartPolicy: Never
Pattern 2: Parallel Jobs with Fixed Completion Count
Process multiple items by running multiple pods in parallel.
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-processing
spec:
  completions: 10 # Need 10 successful completions
  parallelism: 3 # Run 3 pods at a time
  template:
    spec:
      containers:
        - name: processor
          image: busybox
          command: ["sh", "-c", "echo Processing item $RANDOM && sleep 5"]
      restartPolicy: Never
Use case: Processing a known set of tasks (e.g., generating 10 reports, processing 100 images in batches).
Pattern 3: Work Queue Pattern
Multiple workers processing tasks from a shared queue. Workers continue until the queue is empty.
apiVersion: batch/v1
kind: Job
metadata:
  name: work-queue
spec:
  parallelism: 5 # 5 workers processing in parallel
  # No completions - workers exit when queue is empty
  template:
    spec:
      containers:
        - name: worker
          image: my-worker:latest
          env:
            - name: QUEUE_URL
              value: "redis://queue:6379"
      restartPolicy: Never
Use case: Processing an unknown number of tasks from a message queue (RabbitMQ, Redis, SQS).
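The worker logic behind this pattern can be sketched in Python. To keep the example self-contained, the standard library `queue.Queue` stands in for the external queue (Redis, RabbitMQ, SQS) and threads stand in for pods; the names `worker` and `run_workers`, and the squaring "task", are assumptions made for illustration only.

```python
import queue
import threading


def worker(tasks: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Process items until the queue is empty, then exit (like a pod terminating)."""
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return  # nothing left to do: the worker exits successfully
        with lock:
            results.append(item * item)  # placeholder for real processing


def run_workers(items, parallelism: int = 5) -> list:
    """Fill the queue, start `parallelism` workers and wait for all of them."""
    tasks = queue.Queue()
    for item in items:
        tasks.put(item)
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(parallelism)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(sorted(run_workers(range(10))))
```

The important property is that no worker knows the total number of tasks in advance: each one simply exits once it finds the queue empty.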
Best Practices for Jobs
- Set resource limits: Always specify requests and limits to prevent resource starvation
- Use ttlSecondsAfterFinished: Automatically clean up completed jobs to avoid clutter
- Choose appropriate backoffLimit: Balance between retry attempts and fast failure
- Monitor job status: Use kubectl get jobs and kubectl describe job to track progress
- Use init containers: Separate setup (downloading data) from processing
- Consider parallelism: Use parallel jobs for independent tasks that can run simultaneously
- Handle failures gracefully: Ensure your container exits with proper exit codes
Case study: zip cracking
Let's look at a real world example of cracking a password using fcrackzip and jobs in Kubernetes.
The decrypt-zip.yaml is the basis for our job.
It contains the commands used for cracking the password for a zip file.
The fcrackzip tool can brute-force a ZIP archive's password.
Our task is to download the archive, and crack its password.
The following manifest defines our job and its volumes:
apiVersion: batch/v1
kind: Job
metadata:
  name: zip-decryption-job
  labels:
    app: zip-decryption
spec:
  ttlSecondsAfterFinished: 86400 # Automatically delete job 24h after completion
  backoffLimit: 2 # Number of retries before considering job failed
  template:
    metadata:
      labels:
        app: zip-decryption
    spec:
      restartPolicy: OnFailure
      initContainers:
        - name: download-zip
          image: ghcr.io/curl/curl-container/curl:master # Lightweight curl image
          command: ["/bin/sh", "-c"]
          volumeMounts:
            - name: data-volume
              mountPath: /data
          args:
            - >
              echo "Downloading ZIP file from remote source..." &&
              curl http://swarm.cs.pub.ro/~sweisz/encrypted.zip -o /data/encrypted.zip
      containers:
        - name: hashcat-container
          image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/fcrackzip # Replace with appropriate hashcat image
          command: ["/bin/sh"]
          args:
            - "-c"
            - >
              cd /data &&
              fcrackzip -v -b -c a -l 5-5 -u encrypted.zip > results_lowercase.txt &&
              cat results_lowercase.txt
          volumeMounts:
            - name: data-volume
              mountPath: /data
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
      volumes:
        - name: data-volume
          emptyDir: {}
        - name: wordlist-volume
          configMap:
            name: zip-decrypt-config
We know that the file has a password made up of 5 letters, which led us to use the -l 5-5 option, together with -b to do brute-forcing.
We use the initContainer to download the archive and the main container to run fcrackzip.
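To get a feeling for what brute-forcing a 5-letter lowercase password involves, here is a small Python sketch of the same idea. It is not how fcrackzip works internally: a SHA-256 hash stands in for the check against the encrypted archive, and the names `keyspace` and `brute_force` are hypothetical.

```python
import hashlib
import itertools
import string


def keyspace(length: int, alphabet: str = string.ascii_lowercase) -> int:
    """Number of candidate passwords of exactly `length` characters."""
    return len(alphabet) ** length


def brute_force(target_hash: str, length: int, alphabet: str = string.ascii_lowercase):
    """Try every candidate of the given length until one matches the target hash."""
    for candidate in itertools.product(alphabet, repeat=length):
        password = "".join(candidate)
        if hashlib.sha256(password.encode()).hexdigest() == target_hash:
            return password
    return None  # no candidate matched


if __name__ == "__main__":
    print(f"5 lowercase letters: {keyspace(5)} candidates")
    # A 3-letter secret keeps the demonstration fast.
    secret_hash = hashlib.sha256(b"cab").hexdigest()
    print(brute_force(secret_hash, 3))
```

The search space grows by a factor of 26 with every extra letter, which is why knowing the exact length (the -l 5-5 option) matters so much for the running time.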
Exercise: Crack using wordlist
Change the above job in order to run fcrackzip using the wordlist from the following link: http://swarm.cs.pub.ro/~sweisz/wordlist.txt.
You can attach the wordlist as a ConfigMap as you've seen in the matrix multiplication example.
You can see how to configure fcrackzip to use wordlists in the following link: https://sohvaxus.github.io/content/fcrackzip-bruteforce-tutorial.html.
CronJobs
While regular Jobs are useful from a scheduling point of view, they cannot be set to run periodically or on a set timer. CronJobs are a mechanism implemented in Kubernetes to extend the regular Jobs feature. A CronJob is managed by Kubernetes and creates Jobs at specific times, based on a user-defined schedule.
Some use cases which we can define for CronJobs are:
- scheduling regular data exports or backups to off-site facilities
- periodic environment cleanup jobs, for example deleting temporary files or files which have been generated and haven't been used for some time
- crawling endpoints for new data or information
The following is an example manifest for a CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: first-job
spec:
  schedule: "0 2 8 * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: first-job
              image: busybox
              command: ["echo", "First job"]
          restartPolicy: OnFailure
The jobTemplate field holds a regular Job specification, in which we describe the job that should be run.
The schedule value is specified using the following convention from the cron manual:
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
This means that the above job will run on the 8th day of every month at 2:00 AM. If we want a job that runs every minute, we could make the following change:
- schedule: "0 2 8 * *"
+ schedule: "*/1 * * * *"
In the minute field, */x means the job will run every x minutes.
For an easy way to define the cron schedule, you can use https://crontab.guru/.
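To check your understanding of the five fields, the following Python sketch tests whether a given moment matches a schedule. It is a deliberately reduced model with hypothetical function names: it supports only `*`, `*/n` and single numbers (no lists, ranges or names), and it requires all five fields to match, which is equivalent to cron's behaviour only when the day-of-month or the day-of-week field is `*`.

```python
from datetime import datetime

# Lowest legal value of: minute, hour, day of month, month, day of week.
_FIELD_STARTS = (0, 0, 1, 1, 0)


def _field_matches(field: str, value: int, start: int) -> bool:
    """Match one cron field against one component of the timestamp."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return (value - start) % int(field[2:]) == 0
    return value == int(field)


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on the schedule described by `expr`."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five cron fields")
    cron_dow = (when.weekday() + 1) % 7  # cron: 0 is Sunday; Python: 0 is Monday
    values = (when.minute, when.hour, when.day, when.month, cron_dow)
    return all(
        _field_matches(field, value, start)
        for field, value, start in zip(fields, values, _FIELD_STARTS)
    )


if __name__ == "__main__":
    print(cron_matches("0 2 8 * *", datetime(2024, 5, 8, 2, 0)))
    print(cron_matches("0 2 8 * *", datetime(2024, 5, 9, 2, 0)))
```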
Case study: Database backup
For this case study we will be running a PostgreSQL database defined by the following manifest:
# PostgreSQL Pod
apiVersion: v1
kind: Pod
metadata:
  name: postgres-db
  labels:
    app: postgres
spec:
  containers:
    - name: postgres
      image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/postgres:14-alpine
      ports:
        - containerPort: 5432
          name: postgres
      env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pg/
        - name: POSTGRES_USER
          valueFrom:
            secretKeyRef:
              name: postgres-credentials
              key: username
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-credentials
              key: password
        - name: POSTGRES_DB
          valueFrom:
            secretKeyRef:
              name: postgres-credentials
              key: database
      volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data/
  volumes:
    - name: postgres-data
      emptyDir: {}
---
# Service for PostgreSQL
apiVersion: v1
kind: Service
metadata:
  name: postgres-service
spec:
  ports:
    - port: 5432
      targetPort: 5432
  selector:
    app: postgres
The pgsql.yaml file deploys a database server.
For this database server we need to create backups, which will be stored in another volume and then shipped off-site.
To prepare the setup, we first need to create the database that we will be backing up. Run the following command in the lab directory to set up the database pod and service:
kubectl apply -f pgsql.yaml
We will start from the following already created CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup-container
              image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/postgres:14-alpine
              command:
                - /bin/sh
                - -c
                - |
                  # Set date format for backup filename
                  BACKUP_DATE=$(date +\%Y-\%m-\%d-\%H\%M)
                  # Create backup
                  echo "Starting PostgreSQL backup at $(date)"
                  mkdir /tmp/backups
                  pg_dump \
                    -h ${DB_HOST} \
                    -U ${DB_USER} \
                    -d ${DB_NAME} \
                    -F custom \
                    -Z 9 \
                    -f /tmp/backups/${DB_NAME}-${BACKUP_DATE}.pgdump
              env:
                - name: DB_HOST
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: host
                - name: DB_USER
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: username
                - name: DB_NAME
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: database
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
          restartPolicy: OnFailure
---
# Secret for database credentials
apiVersion: v1
kind: Secret
metadata:
  name: postgres-credentials
type: Opaque
data:
  host: cG9zdGdyZXMtc2VydmljZQ== # postgres-service (base64 encoded)
  username: YmFja3VwX3VzZXI= # backup_user (base64 encoded)
  password: c2VjdXJlUGFzc3dvcmQxMjM= # securePassword123 (base64 encoded)
  database: cHJvZHVjdGlvbl9kYg== # production_db (base64 encoded)
The above CronJob creates a backup of the database using pg_dump and puts it in a temporary location.
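The values under data in the Secret are base64 encoded, which is an encoding and not a form of encryption. A short Python sketch, with hypothetical helper names, shows how such values can be produced and read back:

```python
import base64


def encode_secret_value(plain: str) -> str:
    """Encode a plain-text value the way the Secret's data field expects it."""
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Recover the plain-text value from a base64-encoded Secret entry."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    print(encode_secret_value("postgres-service"))
    print(decode_secret_value("YmFja3VwX3VzZXI="))
```

Because anyone who can read the Secret can decode it this way, access to Secret objects should be restricted just like access to the credentials themselves.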
Apply them so we can see the backup in action.
student@lab-jobs:~/ocp/upgrade$ kubectl get cronjobs
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
postgres-backup */1 * * * * False 0 35s 39m
The issue with the above CronJob is that although it creates a backup file, it doesn't add it to any kind of persistent storage.
Create an emptyDir volume mount, mount it to the /backup path and change the backup script so that it copies the backup files to the backup volume.
Change the backup schedule so that it only does a backup every hour.
Change the policy so that only one backup job can run at a time. Consult the documentation to find out how to disallow concurrent jobs: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/.
Argo Workflows
What is Argo Workflows?
Argo Workflows is a cloud-native workflow engine for Kubernetes that orchestrates parallel jobs. It's designed for compute-intensive workflows where each step is performed by a container.
Key features:
- Native Kubernetes CRDs: Workflows are defined as Custom Resource Definitions
- DAG-based workflows: Define complex dependencies between tasks
- Container-native: Each step runs in its own container
- Artifact management: Pass files and data between workflow steps
- Parameter passing: Share variables between workflow steps
- Parallel execution: Run multiple tasks simultaneously
- Web UI: Visualize workflow execution in real-time
- Scalable: Leverages Kubernetes for scheduling and resource management
Why Use Argo Workflows?
Kubernetes Jobs are great for simple batch workloads, but they have limitations:
| Feature | Kubernetes Jobs | Argo Workflows |
|---|---|---|
| Multi-step workflows | Manual orchestration | Built-in DAG support |
| Dependencies | No native support | Declare dependencies easily |
| Parameter passing | Manual (ConfigMaps/Secrets) | Native input/output parameters |
| File passing | Manual (volumes) | Native artifact management |
| Parallel execution | Limited | Advanced parallelism patterns |
| Conditional logic | Not supported | Conditionals, loops, recursion |
| Visualization | Basic kubectl output | Rich web UI |
| Retry logic | Job-level only | Step-level with custom strategies |
Use Argo Workflows when you need:
- Multi-step data pipelines
- Complex dependencies between tasks
- Passing data (files, parameters) between steps
- Parallel processing with aggregation
- Machine learning pipelines
- CI/CD workflows
- Data science workflows (ETL, training, inference)
Installation
Install Argo Workflows Server
Create the Argo namespace and install the server components:
$ kubectl create namespace argo
namespace/argo created
$ kubectl apply --server-side -n argo -f "https://github.com/argoproj/argo-workflows/releases/download/v3.7.14/quick-start-minimal.yaml"
customresourcedefinition.apiextensions.k8s.io/clusterworkflowtemplates.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/cronworkflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workfloweventbindings.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtaskresults.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtasksets.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtemplates.argoproj.io created
serviceaccount/argo created
serviceaccount/argo-server created
role.rbac.authorization.k8s.io/argo-role created
...
deployment.apps/workflow-controller created
deployment.apps/argo-server created
Verify the installation:
$ kubectl get pods -n argo
NAME READY STATUS RESTARTS AGE
argo-server-65f9588cf6-jgtj7 1/1 Running 0 51s
workflow-controller-7df5f5d5c8-vrk85 1/1 Running 0 51s
Both pods should be in Running status.
Install Argo CLI
The Argo CLI makes it easier to submit and manage workflows:
$ curl -sLO "https://github.com/argoproj/argo-workflows/releases/download/v3.7.14/argo-linux-amd64.gz"
$ gunzip argo-linux-amd64.gz
$ chmod +x argo-linux-amd64
$ sudo mv argo-linux-amd64 /usr/local/bin/argo
$ argo version
argo: v3.7.14
Argo Workflows provides a dashboard to interact with the workflows on localhost:2746.
There are two options for connecting to the Argo user interface: SSH tunneling or Chrome Remote Desktop.
Option 1: SSH tunneling
Follow this tutorial to configure the SSH service to bind and forward the 2746 port to your machine:
ssh -J fep -L 2746:127.0.0.1:2746 -i ~/.ssh/id_fep student@10.9.X.Y
Option 2: Chrome Remote Desktop
An alternative to SSH tunneling or X11 forwarding is Chrome Remote Desktop, which allows you to connect to the graphical interface of your VM.
If you want to use this method, follow the steps from here.
Start a kubectl port-forward on the VM:
$ kubectl -n argo port-forward deployment/argo-server 2746:2746
Forwarding from 127.0.0.1:2746 -> 2746
Open your browser to https://localhost:2746 (accept the self-signed certificate warning).
To authenticate to the webserver you must run the following commands and paste the resulting token on the login screen.
$ kubectl -n argo create sa argo-admin
$ kubectl -n argo create clusterrolebinding argo-admin \
--clusterrole=cluster-admin \
--serviceaccount=argo:argo-admin
$ kubectl -n argo create token argo-admin
Prefix the token with Bearer when pasting it in the login screen, i.e. paste Bearer <token>.
The Argo UI is extremely useful for:
- Visualizing workflow DAGs
- Monitoring workflow execution in real-time
- Viewing logs from each step
- Debugging failed workflows
- Downloading artifacts
Argo Workflow Concepts
An Argo Workflow is a Kubernetes resource that defines a sequence of steps to execute.
It uses templates as reusable components that define what to execute. Templates can be:
- Container template: Runs a container
- Script template: Runs a script in a container
- Steps template: Defines a sequence of sub-templates
- DAG template: Defines tasks with dependencies
A workflow receives input parameters that allow you to pass values into templates. Output parameters allow you to pass values out of templates and on to the next steps in a workflow.
Instead of using parameters for output handling, you can use artifacts, which are files that are passed between workflow steps. Argo manages uploading and downloading artifacts automatically.
Working Examples
Let's explore working examples that demonstrate Argo Workflows capabilities. Run each of these to understand how workflows work before attempting the exercises.
Example 1: Hello World Workflow
The simplest workflow - run a single container.
Create a file hello-world.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: hello
  serviceAccountName: argo-admin
  templates:
    - name: hello
      container:
        image: busybox
        command: [echo]
        args: ["Hello World from Argo Workflows!"]
We notice the following:
- generateName creates a unique workflow name
- entrypoint specifies which template to start with
- The workflow runs a single container
Submit the workflow:
$ argo submit -n argo hello-world.yaml --watch
Name: hello-world-xxxxx
Namespace: argo
ServiceAccount: unset
Status: Succeeded
Created: Mon Jan 01 12:00:00 +0000 (10 seconds ago)
Started: Mon Jan 01 12:00:00 +0000 (10 seconds ago)
Finished: Mon Jan 01 12:00:05 +0000 (5 seconds ago)
Duration: 5 seconds
STEP TEMPLATE PODNAME DURATION MESSAGE
 ✔ hello-world-xxxxx hello hello-world-xxxxx 3s
View the logs:
$ argo logs -n argo hello-world-xxxxx
hello-world-xxxxx: Hello World from Argo Workflows!
We see that there was a pod that ran in the argo namespace:
student@lab-jobs:~$ kubectl get pods -n argo
NAME READY STATUS RESTARTS AGE
argo-server-5549677b6-f5hm6 1/1 Running 0 4m47s
hello-world-x8m96 0/2 Completed 0 66s
httpbin-f5ccc9c6-t47d6 1/1 Running 0 4m47s
minio-5877d79784-zph9x 1/1 Running 0 4m47s
workflow-controller-7df5f5d5c8-qj8vd 1/1 Running 0 4m47s
Example 2: Sequential Multi-Step Workflow
Run multiple steps in sequence, passing parameters between them.
Create sequential-steps.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sequential-
spec:
  entrypoint: main
  serviceAccountName: argo-admin
  templates:
    - name: main
      steps:
        - - name: step1
            template: print-message
            arguments:
              parameters:
                - name: message
                  value: "Step 1: Starting workflow"
        - - name: step2
            template: print-message
            arguments:
              parameters:
                - name: message
                  value: "Step 2: Processing data"
        - - name: step3
            template: print-message
            arguments:
              parameters:
                - name: message
                  value: "Step 3: Workflow complete"
    - name: print-message
      inputs:
        parameters:
          - name: message
      container:
        image: busybox
        command: [sh, -c]
        args: ["echo '{{inputs.parameters.message}}' && date"]
We see that we have defined a template for a container. It receives a parameter called message and runs a container to print it. We then define three steps that run the print-message template.
Notice the following:
- The steps template defines sequential execution
- Each step is an array, introduced by - - (double dash)
- Parameters are passed to templates via arguments
- The {{inputs.parameters.message}} syntax accesses parameters
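Conceptually, the {{inputs.parameters.message}} placeholder is replaced with the value supplied in arguments before the container starts. The Python sketch below imitates that substitution with a regular expression; it is only an illustration with a hypothetical `render` function, not the template engine Argo actually uses.

```python
import re

# Matches placeholders of the form {{inputs.parameters.<name>}}.
_PLACEHOLDER = re.compile(r"\{\{\s*inputs\.parameters\.([\w-]+)\s*\}\}")


def render(template: str, parameters: dict) -> str:
    """Replace every input-parameter placeholder with its (string) value."""
    return _PLACEHOLDER.sub(lambda match: parameters[match.group(1)], template)


if __name__ == "__main__":
    command = "echo '{{inputs.parameters.message}}' && date"
    print(render(command, {"message": "Step 1: Starting workflow"}))
```

A missing parameter raises a KeyError in this sketch, which mirrors the fact that a step must supply every parameter its template declares without a default.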
Submit and watch:
$ argo submit -n argo sequential-steps.yaml --watch
STEP TEMPLATE PODNAME DURATION
 ✔ sequential-xxxxx main
 ├─✔ step1 print-message sequential-xxxxx-step1 5s
 ├─✔ step2 print-message sequential-xxxxx-step2 4s
 └─✔ step3 print-message sequential-xxxxx-step3 4s
We see the logs where each step prints the message:
student@lab-jobs:~$ argo logs -n argo sequential-fq6mr
sequential-fq6mr-print-message-1642373302: time="2026-05-12T19:38:56.328Z" level=info msg="capturing logs" argo=true
sequential-fq6mr-print-message-1642373302: Step 1: Starting workflow
sequential-fq6mr-print-message-1642373302: Tue May 12 19:38:56 UTC 2026
sequential-fq6mr-print-message-1642373302: time="2026-05-12T19:38:57.329Z" level=info msg="sub-process exited" argo=true error="<nil>"
sequential-fq6mr-print-message-700546422: time="2026-05-12T19:39:06.138Z" level=info msg="capturing logs" argo=true
sequential-fq6mr-print-message-700546422: Step 2: Processing data
sequential-fq6mr-print-message-700546422: Tue May 12 19:39:06 UTC 2026
sequential-fq6mr-print-message-700546422: time="2026-05-12T19:39:07.139Z" level=info msg="sub-process exited" argo=true error="<nil>"
sequential-fq6mr-print-message-1865731698: time="2026-05-12T19:39:16.157Z" level=info msg="capturing logs" argo=true
sequential-fq6mr-print-message-1865731698: Step 3: Workflow complete
sequential-fq6mr-print-message-1865731698: Tue May 12 19:39:16 UTC 2026
sequential-fq6mr-print-message-1865731698: time="2026-05-12T19:39:17.158Z" level=info msg="sub-process exited" argo=true error="<nil>"
Example 3: Parallel Execution
Run multiple tasks simultaneously and wait for all to complete.
Create parallel-tasks.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallel-
spec:
  entrypoint: main
  serviceAccountName: argo-admin
  templates:
    - name: main
      steps:
        # All three tasks run in parallel (single dash means parallel)
        - - name: task-a
            template: process-task
            arguments:
              parameters:
                - name: task-name
                  value: "Task A"
                - name: duration
                  value: "10"
          - name: task-b
            template: process-task
            arguments:
              parameters:
                - name: task-name
                  value: "Task B"
                - name: duration
                  value: "15"
          - name: task-c
            template: process-task
            arguments:
              parameters:
                - name: task-name
                  value: "Task C"
                - name: duration
                  value: "12"
        # This step runs after all parallel tasks complete
        - - name: summary
            template: print-message
            arguments:
              parameters:
                - name: message
                  value: "All parallel tasks completed!"
    - name: process-task
      inputs:
        parameters:
          - name: task-name
          - name: duration
      container:
        image: busybox
        command: [sh, -c]
        args: ["echo 'Processing {{inputs.parameters.task-name}}'; sleep {{inputs.parameters.duration}}; echo '{{inputs.parameters.task-name}} done'"]
    - name: print-message
      inputs:
        parameters:
          - name: message
      container:
        image: busybox
        command: [echo]
        args: ["{{inputs.parameters.message}}"]
Notice the following:
- Single dash - name: (within one - - block) means parallel execution
- All three tasks start simultaneously
- The summary step waits for all parallel tasks to complete
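The steps structure is a list of lists: the outer list runs in order, while the entries of each inner list run at the same time. A minimal Python sketch of that execution model follows, using threads in place of pods and durations scaled down to fractions of a second; `run_task` and `run_steps` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(name: str, duration: float) -> str:
    """Stand-in for a step's container: wait, then report completion."""
    time.sleep(duration)
    return f"{name} done"


def run_steps(steps: list) -> list:
    """Run the outer list sequentially and each inner list in parallel."""
    log = []
    for group in steps:
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            futures = [pool.submit(run_task, name, duration) for name, duration in group]
            # Collecting every result is what makes the next group wait.
            log.append([future.result() for future in futures])
    return log


if __name__ == "__main__":
    steps = [
        [("Task A", 0.10), ("Task B", 0.15), ("Task C", 0.12)],
        [("summary", 0.0)],
    ]
    started = time.perf_counter()
    print(run_steps(steps))
    print(f"elapsed: {time.perf_counter() - started:.2f}s")
```

The elapsed time is governed by the slowest task in the parallel group rather than by the sum of all three durations.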
We'll notice the tasks run in parallel when we watch them run:
$ argo submit -n argo parallel-tasks.yaml --watch
STEP TEMPLATE PODNAME DURATION
 ✔ parallel-xxxxx main
 ├─✔ task-a process-task parallel-xxxxx-taska 12s
 ├─✔ task-b process-task parallel-xxxxx-taskb 17s
 ├─✔ task-c process-task parallel-xxxxx-taskc 14s
 └─✔ summary print-message parallel-xxxxx-sum 2s
Let's look at the logs:
student@lab-jobs:~$ argo logs -n argo parallel-724fr
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:11.579Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2419718903: Processing Task A
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:12.332Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2453274141: Processing Task C
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:13.059Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2436496522: Processing Task B
parallel-724fr-process-task-2419718903: Task A done
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:22.581Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2453274141: Task C done
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:24.338Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2436496522: Task B done
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:28.066Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:40.653Z" level=info msg="capturing logs" argo=true
parallel-724fr-print-message-2792551925: All parallel tasks completed!
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:41.654Z" level=info msg="sub-process exited" argo=true error="<nil>"
Example 4: Passing Artifacts (Files) Between Steps
Generate a file in one step and consume it in another.
Create artifact-passing.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-passing-
spec:
  entrypoint: main
  serviceAccountName: argo-admin
  templates:
    - name: main
      steps:
        - - name: generate-data
            template: generate-artifact
        - - name: process-data
            template: process-artifact
            arguments:
              artifacts:
                - name: input-file
                  from: "{{steps.generate-data.outputs.artifacts.result}}"
        - - name: analyze-data
            template: analyze-artifact
            arguments:
              artifacts:
                - name: input-file
                  from: "{{steps.process-data.outputs.artifacts.result}}"
    - name: generate-artifact
      container:
        image: busybox
        command: [sh, -c]
        args:
          - |
            echo "Generating data at $(date)" > /tmp/data.txt
            echo "Line 1: Sample data" >> /tmp/data.txt
            echo "Line 2: More data" >> /tmp/data.txt
            echo "Line 3: Final data" >> /tmp/data.txt
            cat /tmp/data.txt
      outputs:
        artifacts:
          - name: result
            path: /tmp/data.txt
    - name: process-artifact
      inputs:
        artifacts:
          - name: input-file
            path: /tmp/input.txt
      container:
        image: busybox
        command: [sh, -c]
        args:
          - |
            echo "Processing input file:"
            cat /tmp/input.txt
            echo "---"
            echo "Processed at $(date)" > /tmp/output.txt
            cat /tmp/input.txt | tr '[:lower:]' '[:upper:]' >> /tmp/output.txt
            cat /tmp/output.txt
      outputs:
        artifacts:
          - name: result
            path: /tmp/output.txt
    - name: analyze-artifact
      inputs:
        artifacts:
          - name: input-file
            path: /tmp/final.txt
      container:
        image: busybox
        command: [sh, -c]
        args:
          - |
            echo "Final analysis:"
            cat /tmp/final.txt
            echo "---"
            wc -l /tmp/final.txt
Notice the following points:
- outputs.artifacts defines files to pass to next steps
- inputs.artifacts defines where to receive files
- Argo automatically handles file transfer between steps
- Use from: "{{steps.XXX.outputs.artifacts.YYY}}", where XXX is the step name and YYY is the artifact name, to reference artifacts
- Each step can read, transform, and output new artifacts
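The data flow of this workflow can be mimicked locally in Python, with a temporary directory playing the role of the artifact store. This is only a sketch of the generate, process, analyze chain: the function names are hypothetical, timestamps are left out, and Argo itself moves artifacts through its configured artifact repository rather than through a shared directory.

```python
import tempfile
from pathlib import Path


def generate(out: Path) -> None:
    """First step: write the initial data file (the output artifact)."""
    out.write_text(
        "Generating data\nLine 1: Sample data\nLine 2: More data\nLine 3: Final data\n"
    )


def process(src: Path, out: Path) -> None:
    """Second step: add a header and upper-case the input, like the tr command."""
    out.write_text("Processed\n" + src.read_text().upper())


def analyze(src: Path) -> int:
    """Final step: count the lines of the processed file, like wc -l."""
    return len(src.read_text().splitlines())


def run_pipeline() -> int:
    with tempfile.TemporaryDirectory() as directory:
        store = Path(directory)  # stand-in for the artifact store
        generate(store / "data.txt")
        process(store / "data.txt", store / "output.txt")
        return analyze(store / "output.txt")


if __name__ == "__main__":
    print(f"{run_pipeline()} lines in the final artifact")
```

Each function only knows its own input and output paths, in the same way that each template only declares the artifact paths inside its own container.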
Submit and watch:
$ argo submit -n argo artifact-passing.yaml --watch
STEP                       TEMPLATE           PODNAME                      DURATION
 ✔ artifact-passing-xxxxx  main
 ├─✔ generate-data         generate-artifact  artifact-passing-xxxxx-gen   5s
 ├─✔ process-data          process-artifact   artifact-passing-xxxxx-proc  4s
 └─✔ analyze-data          analyze-artifact   artifact-passing-xxxxx-anal  3s
View logs from the final step:
$ argo logs -n argo artifact-passing-j84t8 artifact-passing-j84t8-analyze-artifact-99164579
artifact-passing-j84t8-analyze-artifact-99164579: time="2026-05-12T20:01:28.155Z" level=info msg="capturing logs" argo=true
artifact-passing-j84t8-analyze-artifact-99164579: Final analysis:
artifact-passing-j84t8-analyze-artifact-99164579: Processed at Tue May 12 20:01:18 UTC 2026
artifact-passing-j84t8-analyze-artifact-99164579: GENERATING DATA AT TUE MAY 12 20:01:08 UTC 2026
artifact-passing-j84t8-analyze-artifact-99164579: LINE 1: SAMPLE DATA
artifact-passing-j84t8-analyze-artifact-99164579: LINE 2: MORE DATA
artifact-passing-j84t8-analyze-artifact-99164579: LINE 3: FINAL DATA
artifact-passing-j84t8-analyze-artifact-99164579: ---
artifact-passing-j84t8-analyze-artifact-99164579: 5 /tmp/final.txt
artifact-passing-j84t8-analyze-artifact-99164579: time="2026-05-12T20:01:29.156Z" level=info msg="sub-process exited" argo=true error="<nil>"
Understanding Workflow Execution
When you submit a workflow:
- Workflow Controller watches for new Workflow resources
- Scheduler creates pods for each step based on dependencies
- Executor runs containers and manages artifacts
- Outputs are collected (parameters, artifacts)
- Next steps are triggered based on dependencies
- Status is updated continuously
You can monitor workflows using:
# List workflows
$ argo list -n argo
NAME STATUS AGE DURATION PRIORITY MESSAGE
artifact-passing-j84t8 Succeeded 15m 30s 0
parallel-724fr Succeeded 26m 40s 0
sequential-fq6mr Succeeded 37m 30s 0
hello-world-x8m96 Succeeded 39m 10s 0
# Get workflow details
$ argo get -n argo parallel-724fr
Name: parallel-724fr
Namespace: argo
ServiceAccount: unset (will run with the default ServiceAccount)
Status: Succeeded
Conditions:
PodRunning False
Completed True
Created: Tue May 12 19:50:07 +0000 (27 minutes ago)
Started: Tue May 12 19:50:07 +0000 (27 minutes ago)
Finished: Tue May 12 19:50:47 +0000 (26 minutes ago)
Duration: 40 seconds
Progress: 4/4
ResourcesDuration: 1m6s*(100Mi memory),3s*(1 cpu)
STEP               TEMPLATE       PODNAME                                  DURATION  MESSAGE
 ✔ parallel-724fr  main
 ├─┬─✔ task-a      process-task   parallel-724fr-process-task-2419718903   15s
 │ ├─✔ task-b      process-task   parallel-724fr-process-task-2436496522   21s
 │ └─✔ task-c      process-task   parallel-724fr-process-task-2453274141   17s
 └───✔ summary     print-message  parallel-724fr-print-message-2792551925  4s
# Watch workflow execution
$ argo watch -n argo <workflow-name>
# View logs
$ argo logs -n argo parallel-724fr
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:11.579Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2419718903: Processing Task A
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:12.332Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2453274141: Processing Task C
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:13.059Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2436496522: Processing Task B
parallel-724fr-process-task-2419718903: Task A done
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:22.581Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2453274141: Task C done
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:24.338Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2436496522: Task B done
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:28.066Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:40.653Z" level=info msg="capturing logs" argo=true
parallel-724fr-print-message-2792551925: All parallel tasks completed!
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:41.654Z" level=info msg="sub-process exited" argo=true error="<nil>"
# Delete workflow
$ argo delete -n argo parallel-724fr
Argo Workflows Practice
Now that you've learned how Argo Workflows work by running the examples, it's time to build your own workflows! You'll create two multi-step workflows that demonstrate real-world batch processing scenarios.
In these exercises, you'll create the Argo Workflow YAML yourself. We provide ready-made containerized applications - your task is to orchestrate them using Argo Workflows.
Exercise 1: Image Processing Pipeline
Objective
Create an Argo Workflow that orchestrates a multi-step image processing pipeline:
- Download an image from a URL
- Convert the image to black and white (grayscale)
- Detect faces in the image and draw rectangles around them
Your Task: Create the Workflow
You will start from the image-processing.yaml file below:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: image-processing-
spec:
  entrypoint: image-pipeline
  serviceAccountName: argo-admin
  templates:
    - name: image-pipeline
      steps:
        #TODO-1: Call the download-image template with the following parameter: https://raw.githubusercontent.com/opencv/opencv/master/samples/data/lena.jpg
        - - name: download
            template: download-image
        #TODO-2: Call the convert-grayscale template
        #- - name: grayscale
        #    template: convert-grayscale
        #    arguments:
        #TODO-2: Add input artifacts
        #TODO-3: Call the detect-faces template
        #- - name: detect
        #    template: detect-faces
        #TODO-3: Add input artifacts
    - name: download-image
      container:
        #TODO-1: Add a container that downloads the image from the parameter
      outputs:
        artifacts:
          - name: image
            path: /tmp/image.jpg
    #TODO-2: Uncomment the following lines
    #- name: convert-grayscale
    #  inputs:
    #TODO-2: Add correct inputs
    #  container:
    #    image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/image-processor:latest
    #TODO-2: Call the application with the correct command and args
    #TODO-2: Add an output artifact that gets the grayscale output image
    #TODO-3: Uncomment the following lines
    #- name: detect-faces
    #  inputs:
    #TODO-3: Add input paths
    #  container:
    #    image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/image-processor:latest
    #TODO-3: Call the application with the correct command and args
    #TODO-3: Add an output artifact that gets the face-detection output image
Use the image-processing.yaml file as a starting point to accomplish the following:
- Use the gitlab.cs.pub.ro:5050/scgc/cloud-courses/image-processor:latest container image. In it you can run the app.py application as such:
  - app.py grayscale /tmp/input.jpg /tmp/gray.jpg reads input.jpg and writes a grayscale version to gray.jpg
  - app.py detect-faces /tmp/input.jpg /tmp/argo-results/result.jpg writes a copy of the image with a face detection algorithm applied
- The workflow will run three steps based on three templates:
  - download: Downloads image from URL using operation "download"
  - grayscale: Converts image to grayscale using operation "grayscale"
  - detect: Detects faces using operation "detect-faces"
- Pass the image file as an artifact between steps
- Use this test image URL: https://raw.githubusercontent.com/opencv/opencv/master/samples/data/lena.jpg
Follow the TODOs marked in the file for a step-by-step implementation.
- Review Example 4 (Artifact Passing) from the Argo Workflows guide
- Each template should use container with the image-processor image or busybox
- Use command: [python, /app.py] and args: [...] to specify the operation
- Remember to define outputs.artifacts to pass files to the next step
- Remember to define inputs.artifacts to receive files from the previous step
- You will be able to see the artifacts from the Argo UI dashboard to check on your work
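Phase 1: TODO-1
- The step is already created; you have to write a container spec that downloads the input file and passes it on as an artifact
- Follow Example 4 for Artifact Passing and use the busybox image to download the file inside a container (the stock busybox image ships wget rather than curl; if you prefer curl, use an image that includes it, such as curlimages/curl)
- Check that the output file of the download command matches the artifact path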
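Example 4 only passes artifacts, while TODO-1 also asks you to pass a parameter. The fragment below is a minimal sketch of how a step can hand a parameter to a template. The names param-demo, show, show-url, url, and url-file are illustrative assumptions and not part of the lab files, and the container only echoes the value instead of downloading anything.

```yaml
# Hypothetical sketch: all step, template, and parameter names are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: param-demo-
spec:
  entrypoint: main
  serviceAccountName: argo-admin
  templates:
    - name: main
      steps:
        - - name: show
            template: show-url
            arguments:
              parameters:
                # Value handed to the template's "url" input parameter
                - name: url
                  value: "https://raw.githubusercontent.com/opencv/opencv/master/samples/data/lena.jpg"
    - name: show-url
      inputs:
        parameters:
          - name: url
      container:
        image: busybox
        command: [sh, -c]
        # Argo substitutes {{inputs.parameters.url}} before the container starts
        args: ["echo 'Received URL: {{inputs.parameters.url}}' | tee /tmp/url.txt"]
      outputs:
        artifacts:
          - name: url-file
            path: /tmp/url.txt
```

The same inputs.parameters reference can be placed inside whatever download command you choose, as long as the command writes to the path declared under outputs.artifacts.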
Phase 2: TODO-2
- Uncomment the YAML lines
- Add the input artifacts for the step
- Add the input artifact paths for the pipeline
- Call the application with the correct parameters
- Add an output artifact to the step template
Phase 3: TODO-3
- Uncomment the YAML lines
- Add the input artifacts for the step
- Add the input artifact paths for the pipeline
- Call the application with the correct parameters
- Add an output artifact to the step template
- Download the resulting image from the Argo dashboard
Exercise 2: Web Scraping and Link Analysis
Objective
Create an Argo Workflow that:
- Scrapes two websites in parallel (Hacker News and Reddit)
- Aggregates the scraped links and ranks them by frequency
- Outputs the top 10 most frequently linked pages
Your Task: Create the Workflow
Create an Argo Workflow file named web-scraping.yaml.
The container image embeds a Python application that works as follows:
- app.py scrape <url> <output.json> scrapes a URL and saves the output to a JSON file
- app.py aggregate <links1.json> <links2.json> <aggregate.json> aggregates the two JSON files and counts the top linked pages
What you have to do:
- Use the gitlab.cs.pub.ro:5050/scgc/cloud-courses/web-scraper:latest container image
- Run two steps:
  - Phase 1 (parallel): Two tasks that scrape websites concurrently:
    - Scrape https://news.ycombinator.com
    - Scrape https://old.reddit.com/r/programming
  - Phase 2 (after phase 1 completes): One task that aggregates the results
- Pass scraped links as artifacts (JSON files) between steps
- The aggregate step receives both scraped link files and outputs a result
Hints:
- Review Example 3 (Parallel Execution) from the Argo Workflows guide
- Review Example 4 (Artifact Passing) from the Argo Workflows guide
- Use single dash for parallel steps: - - name: scrape1 and - name: scrape2 (both under one - -)
- Use double dash for the next sequential step: - - name: aggregate
- Remember that the aggregate operation takes 2 input files and 1 output file
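The single-dash and double-dash hints can be sketched as the fragment below. It shows only the steps layout of the entrypoint template. The template names scrape-site and aggregate-links, the artifact names links, links-a, and links-b, and the assumption that each scrape step exposes an output artifact called links are all hypothetical; template bodies and any arguments for the scrape steps are left out.

```yaml
# Hypothetical sketch of the step layout only; names are illustrative.
steps:
  # First group: both entries sit under one outer list item, so they run in parallel
  - - name: scrape1
      template: scrape-site
    - name: scrape2
      template: scrape-site
  # Second group: starts only after every step in the first group has finished
  - - name: aggregate
      template: aggregate-links
      arguments:
        artifacts:
          # Assumes each scrape step declares an output artifact named "links"
          - name: links-a
            from: "{{steps.scrape1.outputs.artifacts.links}}"
          - name: links-b
            from: "{{steps.scrape2.outputs.artifacts.links}}"
```

Each name listed under arguments.artifacts has to match an entry in the inputs.artifacts of the aggregate template, which is where the two file paths given to app.py aggregate would be declared.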