
Running jobs on Kubernetes

Setup​

We will be using a virtual machine in the faculty's cloud.

When creating a virtual machine in the Launch Instance window:

  • Name your VM using the following convention: cc_lab<no>_<username>, where <no> is the lab number and <username> is your institutional account.
  • Select Boot from image in the Instance Boot Source section
  • Select CC Template in the Image Name section
  • Select the g.large flavor.

Creating a Kubernetes cluster​

As in the previous laboratories, we will create a cluster on the lab machine, using the kind create cluster command:

student@lab-kubernetes:~$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.23.4) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Thanks for using kind! 😊
note

It is recommended that you use port-forwarding instead of X11 forwarding to interact with the UI.

Kubernetes Jobs​

Introduction to Batch Workloads​

So far, in the context of cloud computing, we have only interacted with long-running applications and services: once started, they keep running until they are stopped or an error occurs.

However, this does not cover most use cases in distributed computing. Many processing tasks are batch workloads - discrete units of work that:

  • Run to completion
  • Process a specific dataset or task
  • Exit when finished
  • Should not be automatically restarted after successful completion

Examples of batch workloads include:

  • Data processing: ETL (Extract, Transform, Load) pipelines
  • Machine learning: Training models, batch inference
  • Report generation: Periodic analytics and exports
  • Backup and archival: Database backups, log aggregation
  • Video/image processing: Transcoding, thumbnail generation
  • Scientific computing: Simulations, numerical analysis

Kubernetes is, at its core, a scheduler for containers, which makes it well suited for running batch processing jobs.

Jobs vs Pods​

A Kubernetes Job should be used instead of a Pod when:

  • The workload has a defined start and end
  • You expect the action to finish successfully
  • You don't want resources lingering in the cluster after completion
  • You need guarantees about completion and retry behavior

Key differences:

| Feature | Pod | Job |
|---|---|---|
| Lifecycle | Long-running | Run-to-completion |
| Restart behavior | Restarts indefinitely on failure | Controlled retry with backoffLimit |
| Completion tracking | N/A | Tracks successful completions |
| Resource cleanup | Runs forever unless deleted | Can be automatically cleaned up |
| Use case | Services, daemons | Batch processing, one-time tasks |

The object which manages a discrete work item in Kubernetes is called a Job and it contains a specification for a container, as we are used to from Pod specifications.

The example below shows a job which prints a debug message:

apiVersion: batch/v1
kind: Job
metadata:
  name: hello-world-job
spec:
  template:
    spec:
      containers:
      - name: hello-world
        image: ghcr.io/containerd/busybox
        command: ["echo", "Hello from Kubernetes batch job!"]
      restartPolicy: Never
  backoffLimit: 4

When applying the above manifest, we can see that the Job is created, and we can inspect its output as follows:

student@lab-jobs:~/$ kubectl apply -f hello-world.yaml
job.batch/hello-world-job created
student@lab-jobs:~/$ kubectl get jobs
NAME COMPLETIONS DURATION AGE
hello-world-job 0/1 0s 0s
student@lab-jobs:~/$ kubectl logs job/hello-world-job
Hello from Kubernetes batch job!

The above example is enough for quick one-off jobs, but a real batch environment has additional requirements:

  • add resource requests and limits to improve scheduling accuracy and system cohesion;
  • use a custom job script;
  • add fail conditions;
  • limit job duration.

The following example creates a more complex job which runs a custom Python script, limits its resources and restarts the container if the application fails:

apiVersion: batch/v1
kind: Job
metadata:
  name: matrix-multiplication-job
spec:
  template:
    spec:
      containers:
      - name: matrix-multiply
        image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/python:3.9-slim
        command: ["bash", "-c"]
        args:
        - |
          pip install numpy && python /scripts/matrix_multiply.py
        volumeMounts:
        - name: script-volume
          mountPath: /scripts
        - name: pip-local
          mountPath: /.local
        - name: pip-local
          mountPath: /.cache
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
      volumes:
      - name: script-volume
        configMap:
          name: matrix-multiplication-script
      - name: pip-local
        emptyDir: {}
      restartPolicy: OnFailure
  backoffLimit: 2
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: matrix-multiplication-script
data:
  matrix_multiply.py: |
    import numpy as np
    import time
    import os

    # Create large matrices
    size = 5000
    print(f'Creating {size}x{size} matrices...')
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)

    # Perform CPU-intensive matrix multiplication
    print('Starting matrix multiplication...')
    start_time = time.time()
    result = np.matmul(a, b)
    duration = time.time() - start_time

    print(f'Matrix multiplication complete in {duration:.2f} seconds')
    print(f'Result matrix shape: {result.shape}')

The requests dict is used for scheduling: it states the minimum resources that a node must have available for the container to be placed on it. The limits dict sets the maximum resources that the container is allowed to use. As with a regular Pod, ConfigMaps, Secrets and other Kubernetes objects can be mounted into the container.
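To make the role of requests more concrete, the sketch below compares the requests from the manifest against the free resources of a node. It is a simplified model, not the kube-scheduler's actual code: the node sizes are made up, and only a few quantity suffixes are handled.

```python
# Simplified model of a "does this container fit on this node?" check.
# Not the real scheduler logic; node values below are hypothetical.

MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "2" or "500m" to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "4Gi" to bytes (binary suffixes only)."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def fits(requests: dict, free: dict) -> bool:
    """Return True when the free node resources cover the container requests."""
    return (
        parse_cpu(requests["cpu"]) <= parse_cpu(free["cpu"])
        and parse_memory(requests["memory"]) <= parse_memory(free["memory"])
    )


job_requests = {"cpu": "2", "memory": "4Gi"}    # values from the manifest
small_node = {"cpu": "1500m", "memory": "8Gi"}  # hypothetical node
large_node = {"cpu": "4", "memory": "16Gi"}     # hypothetical node

print(fits(job_requests, small_node))  # False: only 1.5 cores are free
print(fits(job_requests, large_node))  # True
```

Limits do not take part in this check; they only cap what the container may use once it is running.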

Let's run it and see its output:

student@lab-jobs:~/$ kubectl logs job/matrix-multiplication-job
Collecting numpy
Downloading numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.5/19.5 MB 101.7 MB/s eta 0:00:00
Installing collected packages: numpy
Successfully installed numpy-2.0.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Creating 5000x5000 matrices...
Starting matrix multiplication...
Matrix multiplication complete in 14.20 seconds
Result matrix shape: (5000, 5000)

Job Configuration Options​

Jobs provide several configuration options to control their behavior:

completions​

Specifies the number of successful pod completions needed for the job to be considered complete.

spec:
  completions: 5 # Job completes after 5 successful pod runs

parallelism​

Controls how many pods run simultaneously. Useful for processing large datasets in parallel.

spec:
  completions: 10
  parallelism: 3 # Run 3 pods at a time until 10 completions
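As a rough illustration of how completions and parallelism interact, the sketch below groups pods into waves. It assumes that every pod succeeds and that all pods take the same time; the real Job controller simply keeps up to parallelism pods running and starts a new one whenever a pod finishes.

```python
import math


def plan_waves(completions: int, parallelism: int) -> list:
    """Return how many pods run in each wave under the assumptions above."""
    waves = []
    remaining = completions
    while remaining > 0:
        running = min(parallelism, remaining)  # never start more than needed
        waves.append(running)
        remaining -= running
    return waves


waves = plan_waves(completions=10, parallelism=3)
print(waves)                              # [3, 3, 3, 1]
print(len(waves) == math.ceil(10 / 3))    # True
```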

activeDeadlineSeconds​

Sets a timeout for the entire job. If the job doesn't complete within this time, it's terminated.

spec:
  activeDeadlineSeconds: 3600 # Job fails if not done in 1 hour

backoffLimit​

Number of retries before marking the job as failed. Default is 6.

spec:
  backoffLimit: 3 # Retry up to 3 times on failure
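Retries are not immediate. The Kubernetes documentation describes an exponential back-off between retries (10s, 20s, 40s, ...) capped at six minutes. The sketch below only reproduces that sequence of delays; the function name and its parameters are chosen for illustration.

```python
def retry_delays(backoff_limit: int, base: int = 10, cap: int = 360) -> list:
    """Delay in seconds before each retry: doubles every time, capped at `cap`."""
    return [min(base * 2**attempt, cap) for attempt in range(backoff_limit)]


print(retry_delays(3))  # [10, 20, 40]
print(retry_delays(7))  # [10, 20, 40, 80, 160, 320, 360]
```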

ttlSecondsAfterFinished​

Automatically cleans up the job after completion or failure.

spec:
  ttlSecondsAfterFinished: 86400 # Delete job 24 hours after completion

Job Patterns​

Kubernetes Jobs support several common patterns for batch processing:

Pattern 1: Single Completion Job​

The simplest pattern - run one pod to completion.

apiVersion: batch/v1
kind: Job
metadata:
  name: single-task
spec:
  template:
    spec:
      containers:
      - name: task
        image: busybox
        command: ["sh", "-c", "echo Processing task && sleep 10"]
      restartPolicy: Never

Pattern 2: Parallel Jobs with Fixed Completion Count​

Process multiple items by running multiple pods in parallel.

apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-processing
spec:
  completions: 10 # Need 10 successful completions
  parallelism: 3 # Run 3 pods at a time
  template:
    spec:
      containers:
      - name: processor
        image: busybox
        command: ["sh", "-c", "echo Processing item $RANDOM && sleep 5"]
      restartPolicy: Never

Use case: Processing a known set of tasks (e.g., generating 10 reports, processing 100 images in batches).

Pattern 3: Work Queue Pattern​

Multiple workers processing tasks from a shared queue. Workers continue until the queue is empty.

apiVersion: batch/v1
kind: Job
metadata:
  name: work-queue
spec:
  parallelism: 5 # 5 workers processing in parallel
  # No completions - workers exit when queue is empty
  template:
    spec:
      containers:
      - name: worker
        image: my-worker:latest
        env:
        - name: QUEUE_URL
          value: "redis://queue:6379"
      restartPolicy: Never

Use case: Processing an unknown number of tasks from a message queue (RabbitMQ, Redis, SQS).
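The worker image in the manifest is not shown, so the following is only a sketch of what a worker loop could look like. It uses an in-process queue and threads from the Python standard library as stand-ins for the external queue and the five pods; the item count and the processing step are made up.

```python
import queue
import threading

tasks = queue.Queue()
for item in range(20):  # 20 hypothetical work items
    tasks.put(item)

processed = []  # results collected from all workers
lock = threading.Lock()


def worker() -> None:
    """Take items until the queue is empty, then return (a pod would exit 0)."""
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return
        with lock:
            processed.append(item * item)  # stand-in for real processing


workers = [threading.Thread(target=worker) for _ in range(5)]  # parallelism: 5
for thread in workers:
    thread.start()
for thread in workers:
    thread.join()

print(len(processed))  # 20
```

The important property is that each worker exits successfully once it finds the queue empty, which is how the Job can finish without a completions count.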

Best Practices for Jobs​

  1. Set resource limits: Always specify requests and limits to prevent resource starvation
  2. Use ttlSecondsAfterFinished: Automatically clean up completed jobs to avoid clutter
  3. Choose appropriate backoffLimit: Balance between retry attempts and fast failure
  4. Monitor job status: Use kubectl get jobs and kubectl describe job to track progress
  5. Use init containers: Separate setup (downloading data) from processing
  6. Consider parallelism: Use parallel jobs for independent tasks that can run simultaneously
  7. Handle failures gracefully: Ensure your container exits with proper exit codes

Case study: zip cracking​

Let's look at a real-world example: cracking a password using fcrackzip and Kubernetes Jobs. The fcrackzip tool can brute-force the password of a ZIP archive. The decrypt-zip.yaml file is the basis for our job and contains the commands used for cracking the password.

Our task is to download the archive and crack its password.

The following manifest defines our job and its volumes:

apiVersion: batch/v1
kind: Job
metadata:
  name: zip-decryption-job
  labels:
    app: zip-decryption
spec:
  ttlSecondsAfterFinished: 86400 # Automatically delete job 24h after completion
  backoffLimit: 2 # Number of retries before considering job failed
  template:
    metadata:
      labels:
        app: zip-decryption
    spec:
      restartPolicy: OnFailure
      initContainers:
      - name: download-zip
        image: ghcr.io/curl/curl-container/curl:master # Lightweight curl image
        command: ["/bin/sh", "-c"]
        volumeMounts:
        - name: data-volume
          mountPath: /data
        args:
        - >
          echo "Downloading ZIP file from remote source..." &&
          curl http://swarm.cs.pub.ro/~sweisz/encrypted.zip -o /data/encrypted.zip
      containers:
      - name: hashcat-container
        image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/fcrackzip # Image that provides fcrackzip
        command: ["/bin/sh"]
        args:
        - "-c"
        - >
          cd /data &&
          fcrackzip -v -b -c a -l 5-5 -u encrypted.zip > results_lowercase.txt &&
          cat results_lowercase.txt
        volumeMounts:
        - name: data-volume
          mountPath: /data
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
      volumes:
      - name: data-volume
        emptyDir: {}
      - name: wordlist-volume
        configMap:
          name: zip-decrypt-config

We know that the file has a password made up of 5 letters, which led us to use the -l 5-5 option, together with -b to do brute-forcing. We use the initContainer to download the archive and the main container to run fcrackzip.
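To get a feel for what the brute-force options ask for, the sketch below enumerates fixed-length lowercase candidates in Python. It is not how fcrackzip is implemented: the check function and the three-letter secret are hypothetical and kept short so that the demo finishes quickly.

```python
import itertools
import string


def candidates(length: int, alphabet: str = string.ascii_lowercase):
    """Yield every string of exactly `length` characters from `alphabet`."""
    for letters in itertools.product(alphabet, repeat=length):
        yield "".join(letters)


def brute_force(check, length: int):
    """Return the first candidate accepted by `check`, or None."""
    for guess in candidates(length):
        if check(guess):
            return guess
    return None


# Number of candidates for 5 lowercase letters
print(26**5)  # 11881376

# Hypothetical 3-letter secret standing in for the real archive password
found = brute_force(lambda guess: guess == "pod", length=3)
print(found)  # pod
```

Each extra character multiplies the search space by 26, which is why knowing the exact length in advance matters.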

Exercise: Crack using wordlist​

Change the above job in order to run fcrackzip using the wordlist from the following link: http://swarm.cs.pub.ro/~sweisz/wordlist.txt. You can attach the wordlist as a ConfigMap as you've seen in the matrix multiplication example. You can see how to configure fcrackzip to use wordlists in the following link: https://sohvaxus.github.io/content/fcrackzip-bruteforce-tutorial.html.

CronJobs

Regular Jobs run once when they are created; they cannot be set to run periodically or at a set time. CronJobs extend the Jobs feature: a CronJob is an object managed by Kubernetes that creates Jobs at specific times, based on a user-defined schedule.

Some use cases which we can define for CronJobs are:

  • scheduling regular data exports or backups to off-site facilities
  • periodic environment cleanup jobs, for example deleting temporary files or files which have been generated and haven't been used for some time
  • polling an endpoint for new data or information

The following is an example manifest for a CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: first-job
spec:
  schedule: "0 2 8 * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: first-job
            image: busybox
            command: ["echo", "First job"]
          restartPolicy: OnFailure

The jobTemplate specification works as a job specification field, in which we add the requirements for a job.

The schedule value is specified using the following convention from the cron manual:

# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').

This means that the above job will run on the 8th day of every month at 2:00 AM. If we want a job that runs every minute, we could make the following change:

- schedule: "0 2 8 * *"
+ schedule: "*/1 * * * *"

In the minute field, */x means the job will run every x minutes.

tip

For an easy way to define the cron schedule, you can use https://crontab.guru/.
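The five-field convention can also be checked with a small script. The matcher below is a deliberately limited sketch: it understands only *, */n and single numbers, and it requires every field to match, while real cron implementations also support lists and ranges and treat the two day fields specially when both are restricted.

```python
from datetime import datetime


def field_matches(field: str, value: int, start: int = 0) -> bool:
    """Match one cron field; supports '*', '*/n' and a single number only."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return (value - start) % int(field[2:]) == 0
    return int(field) == value


def cron_matches(schedule: str, when: datetime) -> bool:
    """Return True if `when` falls on the given five-field schedule."""
    minute, hour, dom, month, dow = schedule.split()
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(dom, when.day, start=1)
        and field_matches(month, when.month, start=1)
        # cron counts Sunday as 0, isoweekday() counts it as 7
        and field_matches(dow, when.isoweekday() % 7)
    )


print(cron_matches("0 2 8 * *", datetime(2025, 3, 8, 2, 0)))      # True
print(cron_matches("0 2 8 * *", datetime(2025, 3, 8, 2, 1)))      # False
print(cron_matches("*/1 * * * *", datetime(2025, 3, 9, 17, 42)))  # True
```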

Case study: Database backup​

For this case study we will be running a PostgreSQL database defined by the following manifest:

# PostgreSQL Pod
apiVersion: v1
kind: Pod
metadata:
  name: postgres-db
  labels:
    app: postgres
spec:
  containers:
  - name: postgres
    image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/postgres:14-alpine
    ports:
    - containerPort: 5432
      name: postgres
    env:
    - name: PGDATA
      value: /var/lib/postgresql/data/pg/
    - name: POSTGRES_USER
      valueFrom:
        secretKeyRef:
          name: postgres-credentials
          key: username
    - name: POSTGRES_PASSWORD
      valueFrom:
        secretKeyRef:
          name: postgres-credentials
          key: password
    - name: POSTGRES_DB
      valueFrom:
        secretKeyRef:
          name: postgres-credentials
          key: database
    volumeMounts:
    - name: postgres-data
      mountPath: /var/lib/postgresql/data/
  volumes:
  - name: postgres-data
    emptyDir: {}
---
# Service for PostgreSQL
apiVersion: v1
kind: Service
metadata:
  name: postgres-service
spec:
  ports:
  - port: 5432
    targetPort: 5432
  selector:
    app: postgres

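The pgsql.yaml file deploys a database server. For this database server we need to create backups, which will be stored in another volume that will then be shipped off-site.

To prepare the setup, we first need to create the database that we will be backing up. Note that the Pod reads its credentials from the postgres-credentials Secret, which is defined together with the CronJob further down, so its container will only start once that Secret has been applied. Run the following command in the lab directory to set up the database Pod and Service: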

kubectl apply -f pgsql.yaml

We will start from the following already created CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup-container
            image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/postgres:14-alpine
            command:
            - /bin/sh
            - -c
            - |
              # Set date format for backup filename
              BACKUP_DATE=$(date +\%Y-\%m-\%d-\%H\%M)

              # Create backup
              echo "Starting PostgreSQL backup at $(date)"
              mkdir /tmp/backups
              pg_dump \
                -h ${DB_HOST} \
                -U ${DB_USER} \
                -d ${DB_NAME} \
                -F custom \
                -Z 9 \
                -f /tmp/backups/${DB_NAME}-${BACKUP_DATE}.pgdump

            env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: host
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: username
            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: database
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
          restartPolicy: OnFailure
---
# Secret for database credentials
apiVersion: v1
kind: Secret
metadata:
  name: postgres-credentials
type: Opaque
data:
  host: cG9zdGdyZXMtc2VydmljZQ== # postgres-service (base64 encoded)
  username: YmFja3VwX3VzZXI= # backup_user (base64 encoded)
  password: c2VjdXJlUGFzc3dvcmQxMjM= # securePassword123 (base64 encoded)
  database: cHJvZHVjdGlvbl9kYg== # production_db (base64 encoded)

The above CronJob creates a backup of the database using pg_dump and puts it in a temporary location.
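The values under data in the Secret are base64-encoded, not encrypted. As a quick check, the following snippet decodes the four values from the manifest using only the Python standard library, and shows the reverse direction for one of them.

```python
import base64

# Values copied from the Secret manifest
secret_data = {
    "host": "cG9zdGdyZXMtc2VydmljZQ==",
    "username": "YmFja3VwX3VzZXI=",
    "password": "c2VjdXJlUGFzc3dvcmQxMjM=",
    "database": "cHJvZHVjdGlvbl9kYg==",
}

decoded = {
    key: base64.b64decode(value).decode("utf-8")
    for key, value in secret_data.items()
}
for key, value in decoded.items():
    print(f"{key}: {value}")

# Encoding a new value works the other way around
print(base64.b64encode(b"postgres-service").decode("ascii"))
```

Anyone who can read the Secret can recover the plain values this way, so access to Secrets should be restricted accordingly.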

Apply both manifests so we can see the backup in action.

student@lab-jobs:~/ocp/upgrade$ kubectl get cronjobs
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
postgres-backup */1 * * * * False 0 35s 39m

The issue with the above CronJob is that although it creates a backup file, it only writes it to the container's temporary filesystem and never copies it to a volume.

Create an emptyDir volume, mount it at the /backup path and change the backup script so that it copies the backup files to the backup volume.

Change the backup schedule so that a backup runs only once every hour.

Change the policy so that only one backup job can run at a time. Look into the documentation to find out how to disallow concurrent jobs: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/.

Argo Workflows​

What is Argo Workflows?​

Argo Workflows is a cloud-native workflow engine for Kubernetes that orchestrates parallel jobs. It's designed for compute-intensive workflows where each step is performed by a container.

Key features:

  • Native Kubernetes CRDs: Workflows are defined as Custom Resource Definitions
  • DAG-based workflows: Define complex dependencies between tasks
  • Container-native: Each step runs in its own container
  • Artifact management: Pass files and data between workflow steps
  • Parameter passing: Share variables between workflow steps
  • Parallel execution: Run multiple tasks simultaneously
  • Web UI: Visualize workflow execution in real-time
  • Scalable: Leverages Kubernetes for scheduling and resource management

Why Use Argo Workflows?​

Kubernetes Jobs are great for simple batch workloads, but they have limitations:

| Feature | Kubernetes Jobs | Argo Workflows |
|---|---|---|
| Multi-step workflows | Manual orchestration | Built-in DAG support |
| Dependencies | No native support | Declare dependencies easily |
| Parameter passing | Manual (ConfigMaps/Secrets) | Native input/output parameters |
| File passing | Manual (volumes) | Native artifact management |
| Parallel execution | Limited | Advanced parallelism patterns |
| Conditional logic | Not supported | Conditionals, loops, recursion |
| Visualization | Basic kubectl output | Rich web UI |
| Retry logic | Job-level only | Step-level with custom strategies |

Use Argo Workflows when you need:

  • Multi-step data pipelines
  • Complex dependencies between tasks
  • Passing data (files, parameters) between steps
  • Parallel processing with aggregation
  • Machine learning pipelines
  • CI/CD workflows
  • Data science workflows (ETL, training, inference)

Installation​

Install Argo Workflows Server​

Create the Argo namespace and install the server components:

$ kubectl create namespace argo
namespace/argo created

$ kubectl apply --server-side -n argo -f "https://github.com/argoproj/argo-workflows/releases/download/v3.7.14/quick-start-minimal.yaml"
customresourcedefinition.apiextensions.k8s.io/clusterworkflowtemplates.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/cronworkflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workfloweventbindings.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtaskresults.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtasksets.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtemplates.argoproj.io created
serviceaccount/argo created
serviceaccount/argo-server created
role.rbac.authorization.k8s.io/argo-role created
...
deployment.apps/workflow-controller created
deployment.apps/argo-server created

Verify the installation:

$ kubectl get pods -n argo
NAME READY STATUS RESTARTS AGE
argo-server-65f9588cf6-jgtj7 1/1 Running 0 51s
workflow-controller-7df5f5d5c8-vrk85 1/1 Running 0 51s

Both pods should be in Running status.

Install Argo CLI​

The Argo CLI makes it easier to submit and manage workflows:

$ curl -sLO "https://github.com/argoproj/argo-workflows/releases/download/v3.7.14/argo-linux-amd64.gz"
$ gunzip argo-linux-amd64.gz
$ chmod +x argo-linux-amd64
$ sudo mv argo-linux-amd64 /usr/local/bin/argo

$ argo version
argo: v3.7.14
info

Argo Workflows provides a dashboard to interact with the workflows on localhost:2746.

There are two options for connecting to the Argo user interface: SSH tunneling or Chrome Remote Desktop.

info

Option 1: SSH tunneling

Follow this tutorial to configure the SSH service to bind and forward the 2746 port to your machine:

ssh -J fep -L 2746:127.0.0.1:2746 -i ~/.ssh/id_fep student@10.9.X.Y
info

Option 2: Chrome Remote Desktop

An alternative to SSH tunneling or X11 forwarding is Chrome Remote Desktop, which allows you to connect to the graphical interface of your VM.

If you want to use this method, follow the steps from here.

tip

Start a kubectl port-forward on the VM:

$ kubectl -n argo port-forward deployment/argo-server 2746:2746
Forwarding from 127.0.0.1:2746 -> 2746

Open your browser to https://localhost:2746 (accept the self-signed certificate warning).

To authenticate to the webserver you must run the following commands and paste the resulting token on the login screen.

$ kubectl -n argo create sa argo-admin
$ kubectl -n argo create clusterrolebinding argo-admin \
    --clusterrole=cluster-admin \
    --serviceaccount=argo:argo-admin
$ kubectl -n argo create token argo-admin
warning

Prefix the token with Bearer when pasting it in the login screen, that is, paste Bearer <token>.

The Argo UI is extremely useful for:

  • Visualizing workflow DAGs
  • Monitoring workflow execution in real-time
  • Viewing logs from each step
  • Debugging failed workflows
  • Downloading artifacts

Argo Workflow Concepts​

An Argo Workflow is a Kubernetes resource that defines a sequence of steps to execute.

It uses templates as a reusable component that defines what to execute. Templates can be:

  • Container template: Runs a container
  • Script template: Runs a script in a container
  • Steps template: Defines a sequence of sub-templates
  • DAG template: Defines tasks with dependencies

A workflow receives input parameters that allow you to pass values into templates. Output parameters allow you to pass values out of templates and into the next steps of a workflow.

Instead of using parameters for output handling, you can use artifacts, which are files that are passed between workflow steps. Argo manages uploading and downloading artifacts automatically.

Working Examples​

Let's explore working examples that demonstrate Argo Workflows capabilities. Run each of these to understand how workflows work before attempting the exercises.

Example 1: Hello World Workflow​

The simplest workflow - run a single container.

Create a file hello-world.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: hello
  serviceAccountName: argo-admin
  templates:
  - name: hello
    container:
      image: busybox
      command: [echo]
      args: ["Hello World from Argo Workflows!"]

We notice the following:

  • generateName creates a unique workflow name
  • entrypoint specifies which template to start with
  • The workflow runs a single container

Submit the workflow:

$ argo submit -n argo hello-world.yaml --watch
Name: hello-world-xxxxx
Namespace: argo
ServiceAccount: unset
Status: Succeeded
Created: Mon Jan 01 12:00:00 +0000 (10 seconds ago)
Started: Mon Jan 01 12:00:00 +0000 (10 seconds ago)
Finished: Mon Jan 01 12:00:05 +0000 (5 seconds ago)
Duration: 5 seconds

STEP TEMPLATE PODNAME DURATION MESSAGE
✔ hello-world-xxxxx hello hello-world-xxxxx 3s

View the logs:

$ argo logs -n argo hello-world-xxxxx
hello-world-xxxxx: Hello World from Argo Workflows!

We see that a pod ran in the argo namespace:

student@lab-jobs:~$ kubectl get pods -n argo
NAME READY STATUS RESTARTS AGE
argo-server-5549677b6-f5hm6 1/1 Running 0 4m47s
hello-world-x8m96 0/2 Completed 0 66s
httpbin-f5ccc9c6-t47d6 1/1 Running 0 4m47s
minio-5877d79784-zph9x 1/1 Running 0 4m47s
workflow-controller-7df5f5d5c8-qj8vd 1/1 Running 0 4m47s

Example 2: Sequential Multi-Step Workflow​

Run multiple steps in sequence, passing parameters between them.

Create sequential-steps.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sequential-
spec:
  entrypoint: main
  serviceAccountName: argo-admin
  templates:
  - name: main
    steps:
    - - name: step1
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "Step 1: Starting workflow"

    - - name: step2
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "Step 2: Processing data"

    - - name: step3
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "Step 3: Workflow complete"

  - name: print-message
    inputs:
      parameters:
      - name: message
    container:
      image: busybox
      command: [sh, -c]
      args: ["echo '{{inputs.parameters.message}}' && date"]

We see that we have defined a template for a container. It receives a parameter called message and runs a container to print it. We then define three steps that run the print-message template.

Notice the following:

  • steps template defines sequential execution
  • Each sequential step starts a new inner list, written - - (double dash)
  • Parameters are passed to templates via arguments
  • The {{inputs.parameters.message}} syntax accesses parameters
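The placeholder syntax can be illustrated with a small substitution function. This is only a sketch of the idea, not Argo's template engine: it handles the inputs.parameters form alone, and the function and variable names are made up for the example.

```python
import re

# Matches {{inputs.parameters.NAME}} and captures NAME
PLACEHOLDER = re.compile(r"\{\{\s*inputs\.parameters\.([\w-]+)\s*\}\}")


def render(text: str, parameters: dict) -> str:
    """Replace each placeholder with the value of the named parameter."""
    return PLACEHOLDER.sub(lambda match: parameters[match.group(1)], text)


template_args = "echo '{{inputs.parameters.message}}' && date"
steps = [
    {"message": "Step 1: Starting workflow"},
    {"message": "Step 2: Processing data"},
    {"message": "Step 3: Workflow complete"},
]

for arguments in steps:
    print(render(template_args, arguments))
```

Each step reuses the same template text and only the argument values differ, which is what makes print-message reusable.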

Submit and watch:

$ argo submit -n argo sequential-steps.yaml --watch
STEP TEMPLATE PODNAME DURATION
✔ sequential-xxxxx main
├─✔ step1 print-message sequential-xxxxx-step1 5s
├─✔ step2 print-message sequential-xxxxx-step2 4s
└─✔ step3 print-message sequential-xxxxx-step3 4s

We see the logs where each step prints the message:

student@lab-jobs:~$ argo logs -n argo sequential-fq6mr
sequential-fq6mr-print-message-1642373302: time="2026-05-12T19:38:56.328Z" level=info msg="capturing logs" argo=true
sequential-fq6mr-print-message-1642373302: Step 1: Starting workflow
sequential-fq6mr-print-message-1642373302: Tue May 12 19:38:56 UTC 2026
sequential-fq6mr-print-message-1642373302: time="2026-05-12T19:38:57.329Z" level=info msg="sub-process exited" argo=true error="<nil>"
sequential-fq6mr-print-message-700546422: time="2026-05-12T19:39:06.138Z" level=info msg="capturing logs" argo=true
sequential-fq6mr-print-message-700546422: Step 2: Processing data
sequential-fq6mr-print-message-700546422: Tue May 12 19:39:06 UTC 2026
sequential-fq6mr-print-message-700546422: time="2026-05-12T19:39:07.139Z" level=info msg="sub-process exited" argo=true error="<nil>"
sequential-fq6mr-print-message-1865731698: time="2026-05-12T19:39:16.157Z" level=info msg="capturing logs" argo=true
sequential-fq6mr-print-message-1865731698: Step 3: Workflow complete
sequential-fq6mr-print-message-1865731698: Tue May 12 19:39:16 UTC 2026
sequential-fq6mr-print-message-1865731698: time="2026-05-12T19:39:17.158Z" level=info msg="sub-process exited" argo=true error="<nil>"

Example 3: Parallel Execution​

Run multiple tasks simultaneously and wait for all to complete.

Create parallel-tasks.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallel-
spec:
  entrypoint: main
  serviceAccountName: argo-admin
  templates:
  - name: main
    steps:
    # All three tasks run in parallel (single dash means parallel)
    - - name: task-a
        template: process-task
        arguments:
          parameters:
          - name: task-name
            value: "Task A"
          - name: duration
            value: "10"

      - name: task-b
        template: process-task
        arguments:
          parameters:
          - name: task-name
            value: "Task B"
          - name: duration
            value: "15"

      - name: task-c
        template: process-task
        arguments:
          parameters:
          - name: task-name
            value: "Task C"
          - name: duration
            value: "12"

    # This step runs after all parallel tasks complete
    - - name: summary
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "All parallel tasks completed!"

  - name: process-task
    inputs:
      parameters:
      - name: task-name
      - name: duration
    container:
      image: busybox
      command: [sh, -c]
      args: ["echo 'Processing {{inputs.parameters.task-name}}'; sleep {{inputs.parameters.duration}}; echo '{{inputs.parameters.task-name}} done'"]

  - name: print-message
    inputs:
      parameters:
      - name: message
    container:
      image: busybox
      command: [echo]
      args: ["{{inputs.parameters.message}}"]

Notice the following:

  • Single dash - name: (within one - - block) means parallel execution
  • All three tasks start simultaneously
  • The summary step waits for all parallel tasks to complete
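The list-of-lists structure can be modelled directly in Python to estimate how long the workflow takes. The sketch below counts only the sleep durations from the manifest and ignores pod start-up overhead, so real runs will be somewhat slower.

```python
# Outer list: groups that run one after another.
# Inner list: (step name, sleep seconds) pairs that run at the same time.
workflow = [
    [("task-a", 10), ("task-b", 15), ("task-c", 12)],
    [("summary", 0)],
]


def parallel_duration(groups) -> int:
    """Each group lasts as long as its slowest step; groups run in sequence."""
    return sum(max(seconds for _, seconds in group) for group in groups)


def sequential_duration(groups) -> int:
    """Duration if every step ran one after another instead."""
    return sum(seconds for group in groups for _, seconds in group)


print(parallel_duration(workflow))    # 15
print(sequential_duration(workflow))  # 37
```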

We'll notice the tasks run in parallel when we watch them run:

$ argo submit -n argo parallel-tasks.yaml --watch
STEP TEMPLATE PODNAME DURATION
✔ parallel-xxxxx main
├─✔ task-a process-task parallel-xxxxx-taska 12s
├─✔ task-b process-task parallel-xxxxx-taskb 17s
├─✔ task-c process-task parallel-xxxxx-taskc 14s
└─✔ summary print-message parallel-xxxxx-sum 2s

Let's look at the logs:

student@lab-jobs:~$ argo logs -n argo parallel-724fr
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:11.579Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2419718903: Processing Task A
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:12.332Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2453274141: Processing Task C
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:13.059Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2436496522: Processing Task B
parallel-724fr-process-task-2419718903: Task A done
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:22.581Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2453274141: Task C done
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:24.338Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2436496522: Task B done
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:28.066Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:40.653Z" level=info msg="capturing logs" argo=true
parallel-724fr-print-message-2792551925: All parallel tasks completed!
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:41.654Z" level=info msg="sub-process exited" argo=true error="<nil>"

Example 4: Passing Artifacts (Files) Between Steps​

Generate a file in one step and consume it in another.

Create artifact-passing.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-passing-
spec:
  entrypoint: main
  serviceAccountName: argo-admin
  templates:
  - name: main
    steps:
    - - name: generate-data
        template: generate-artifact

    - - name: process-data
        template: process-artifact
        arguments:
          artifacts:
          - name: input-file
            from: "{{steps.generate-data.outputs.artifacts.result}}"

    - - name: analyze-data
        template: analyze-artifact
        arguments:
          artifacts:
          - name: input-file
            from: "{{steps.process-data.outputs.artifacts.result}}"

  - name: generate-artifact
    container:
      image: busybox
      command: [sh, -c]
      args:
      - |
        echo "Generating data at $(date)" > /tmp/data.txt
        echo "Line 1: Sample data" >> /tmp/data.txt
        echo "Line 2: More data" >> /tmp/data.txt
        echo "Line 3: Final data" >> /tmp/data.txt
        cat /tmp/data.txt
    outputs:
      artifacts:
      - name: result
        path: /tmp/data.txt

  - name: process-artifact
    inputs:
      artifacts:
      - name: input-file
        path: /tmp/input.txt
    container:
      image: busybox
      command: [sh, -c]
      args:
      - |
        echo "Processing input file:"
        cat /tmp/input.txt
        echo "---"
        echo "Processed at $(date)" > /tmp/output.txt
        cat /tmp/input.txt | tr '[:lower:]' '[:upper:]' >> /tmp/output.txt
        cat /tmp/output.txt
    outputs:
      artifacts:
      - name: result
        path: /tmp/output.txt

  - name: analyze-artifact
    inputs:
      artifacts:
      - name: input-file
        path: /tmp/final.txt
    container:
      image: busybox
      command: [sh, -c]
      args:
      - |
        echo "Final analysis:"
        cat /tmp/final.txt
        echo "---"
        wc -l /tmp/final.txt

Notice the following points:

  • outputs.artifacts defines files to pass to next steps
  • inputs.artifacts defines where to receive files
  • Argo automatically handles file transfer between steps
  • Use from: "{{steps.XXX.outputs.artifacts.YYY}}", where XXX is the step name and YYY is the artifact name, to reference artifacts
  • Each step can read, transform, and output new artifacts
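
To get a feel for what moves between the steps, the three container scripts can be replayed locally, with a plain directory standing in for Argo's artifact storage. This is only a sketch: the run_pipeline function and the temporary directory are illustrative assumptions, and the cp calls merely imitate the hand-off that Argo performs between an output artifact and the next step's input path.

```shell
#!/bin/sh
# Local replay of Example 4: cp plays the role of Argo's artifact transfer.
# The function name and the working directory are hypothetical.
run_pipeline() {
    work=$(mktemp -d)

    # Step 1 (generate-artifact): write the file declared under outputs.artifacts
    {
        echo "Generating data at $(date)"
        echo "Line 1: Sample data"
        echo "Line 2: More data"
        echo "Line 3: Final data"
    } > "$work/data.txt"

    # Hand-off: output artifact "result" becomes input artifact "input-file"
    cp "$work/data.txt" "$work/input.txt"

    # Step 2 (process-artifact): add a timestamp line, then upper-case the input
    echo "Processed at $(date)" > "$work/output.txt"
    tr '[:lower:]' '[:upper:]' < "$work/input.txt" >> "$work/output.txt"

    # Hand-off to the last step
    cp "$work/output.txt" "$work/final.txt"

    # Step 3 (analyze-artifact): count the lines of the received file
    wc -l < "$work/final.txt" | tr -d ' '

    rm -rf "$work"
}

run_pipeline
```

The script prints 5: the four generated lines plus the "Processed at" line added by the second step, which is the same count the analyze step reports in the workflow logs.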

Submit and watch:

$ argo submit -n argo artifact-passing.yaml --watch
STEP                        TEMPLATE           PODNAME                      DURATION
 ✔ artifact-passing-xxxxx   main
 ├─✔ generate-data          generate-artifact  artifact-passing-xxxxx-gen   5s
 ├─✔ process-data           process-artifact   artifact-passing-xxxxx-proc  4s
 └─✔ analyze-data           analyze-artifact   artifact-passing-xxxxx-anal  3s

View logs from the final step:

$ argo logs -n argo artifact-passing-j84t8 artifact-passing-j84t8-analyze-artifact-99164579
artifact-passing-j84t8-analyze-artifact-99164579: time="2026-05-12T20:01:28.155Z" level=info msg="capturing logs" argo=true
artifact-passing-j84t8-analyze-artifact-99164579: Final analysis:
artifact-passing-j84t8-analyze-artifact-99164579: Processed at Tue May 12 20:01:18 UTC 2026
artifact-passing-j84t8-analyze-artifact-99164579: GENERATING DATA AT TUE MAY 12 20:01:08 UTC 2026
artifact-passing-j84t8-analyze-artifact-99164579: LINE 1: SAMPLE DATA
artifact-passing-j84t8-analyze-artifact-99164579: LINE 2: MORE DATA
artifact-passing-j84t8-analyze-artifact-99164579: LINE 3: FINAL DATA
artifact-passing-j84t8-analyze-artifact-99164579: ---
artifact-passing-j84t8-analyze-artifact-99164579: 5 /tmp/final.txt
artifact-passing-j84t8-analyze-artifact-99164579: time="2026-05-12T20:01:29.156Z" level=info msg="sub-process exited" argo=true error="<nil>"

Understanding Workflow Execution​

When you submit a workflow:

  1. The Workflow Controller watches for new Workflow resources
  2. The controller creates a pod for each step whose dependencies are satisfied, and the Kubernetes scheduler places it on a node
  3. The executor runs the step's container and manages its artifacts
  4. Outputs (parameters, artifacts) are collected
  5. Next steps are triggered once the steps they depend on have completed
  6. The Workflow status is updated continuously
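
The ordering rule behind items 2 and 5 can be imitated in a plain shell script: tasks of one group start together, and the next group starts only after all of them have exited. The sketch below mirrors the shape of the parallel example (three tasks, then a summary). The task and run_workflow functions and the sleep durations are made up for illustration, and background jobs stand in for pods; this is not how Argo is implemented.

```shell
#!/bin/sh
# Fan-out/fan-in with background jobs: a stand-in for a group of parallel
# steps that must all finish before the next step is triggered.
task() {
    # $1 = task name, $2 = simulated duration in seconds (hypothetical values)
    sleep "$2"
    echo "Task $1 done"
}

run_workflow() {
    # Group 1: start all tasks at once
    task A 1 &
    task B 2 &
    task C 1 &

    # Barrier: block until every background job of this group has exited
    wait

    # Group 2: runs only after the whole previous group has completed
    echo "All parallel tasks completed!"
}

run_workflow
```

The three "done" lines may appear in any order, but the summary line always comes last, just as the summary step in the workflow only starts after task-a, task-b and task-c have finished.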

You can monitor workflows using:

# List workflows
$ argo list -n argo
NAME STATUS AGE DURATION PRIORITY MESSAGE
artifact-passing-j84t8 Succeeded 15m 30s 0
parallel-724fr Succeeded 26m 40s 0
sequential-fq6mr Succeeded 37m 30s 0
hello-world-x8m96 Succeeded 39m 10s 0

# Get workflow details
$ argo get -n argo parallel-724fr
Name: parallel-724fr
Namespace: argo
ServiceAccount: unset (will run with the default ServiceAccount)
Status: Succeeded
Conditions:
PodRunning False
Completed True
Created: Tue May 12 19:50:07 +0000 (27 minutes ago)
Started: Tue May 12 19:50:07 +0000 (27 minutes ago)
Finished: Tue May 12 19:50:47 +0000 (26 minutes ago)
Duration: 40 seconds
Progress: 4/4
ResourcesDuration: 1m6s*(100Mi memory),3s*(1 cpu)

STEP                TEMPLATE       PODNAME                                  DURATION  MESSAGE
 ✔ parallel-724fr   main
 ├─┬─✔ task-a       process-task   parallel-724fr-process-task-2419718903   15s
 │ ├─✔ task-b       process-task   parallel-724fr-process-task-2436496522   21s
 │ └─✔ task-c       process-task   parallel-724fr-process-task-2453274141   17s
 └───✔ summary      print-message  parallel-724fr-print-message-2792551925  4s


# Watch workflow execution
$ argo watch -n argo <workflow-name>

# View logs
$ argo logs -n argo parallel-724fr
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:11.579Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2419718903: Processing Task A
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:12.332Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2453274141: Processing Task C
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:13.059Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2436496522: Processing Task B
parallel-724fr-process-task-2419718903: Task A done
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:22.581Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2453274141: Task C done
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:24.338Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2436496522: Task B done
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:28.066Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:40.653Z" level=info msg="capturing logs" argo=true
parallel-724fr-print-message-2792551925: All parallel tasks completed!
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:41.654Z" level=info msg="sub-process exited" argo=true error="<nil>"

# Delete workflow
$ argo delete -n argo parallel-724fr

Argo Workflows Practice​

Now that you've learned how Argo Workflows work by running the examples, it's time to build your own workflows! You'll create two multi-step workflows that demonstrate real-world batch processing scenarios.

important

In these exercises, you'll create the Argo Workflow YAML yourself. We provide ready-made containerized applications - your task is to orchestrate them using Argo Workflows.


Exercise 1: Image Processing Pipeline​

Objective​

Create an Argo Workflow that orchestrates a multi-step image processing pipeline:

  1. Download an image from a URL
  2. Convert the image to black and white (grayscale)
  3. Detect faces in the image and draw rectangles around them

Your Task: Create the Workflow​

You will start from the image-processing.yaml file below:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: image-processing-
spec:
  entrypoint: image-pipeline
  serviceAccountName: argo-admin
  templates:
  - name: image-pipeline
    steps:
    #TODO-1: Call the download-image template with the following parameter: https://raw.githubusercontent.com/opencv/opencv/master/samples/data/lena.jpg
    - - name: download
        template: download-image

    #TODO-2: Call the convert-grayscale template
    #- - name: grayscale
    #    template: convert-grayscale
    #    arguments:
    #TODO-2: Add input artifacts

    #TODO-3: Call the detect-faces template
    #- - name: detect
    #    template: detect-faces
    #TODO-3: Add input artifacts

  - name: download-image
    container:
    #TODO-1: Add a container that downloads the image from the parameter
    outputs:
      artifacts:
      - name: image
        path: /tmp/image.jpg

  #TODO-2: Uncomment the following lines
  #- name: convert-grayscale
  #  inputs:
  #TODO-2: Add correct inputs
  #  container:
  #    image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/image-processor:latest
  #TODO-2: Call the application with the correct command and args
  #TODO-2: Add an output artifact that gets the grayscale output image

  #TODO-3: Uncomment the following lines
  #- name: detect-faces
  #  inputs:
  #TODO-3: Add input paths
  #  container:
  #    image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/image-processor:latest
  #TODO-3: Call the application with the correct command and args
  #TODO-3: Add an output artifact that gets the face-detection output image

Use the image-processing.yaml file as a starting point to accomplish the following:

  1. Use the gitlab.cs.pub.ro:5050/scgc/cloud-courses/image-processor:latest container image. Inside it, you can run the app.py application as follows:
    • app.py grayscale /tmp/input.jpg /tmp/gray.jpg converts /tmp/input.jpg to grayscale and writes the result to /tmp/gray.jpg
    • app.py detect-faces /tmp/input.jpg /tmp/argo-results/result.jpg runs face detection on /tmp/input.jpg and writes the annotated image to /tmp/argo-results/result.jpg
  2. The workflow will run three steps based on three templates:
    • download: Downloads image from URL using operation "download"
    • grayscale: Converts image to grayscale using operation "grayscale"
    • detect: Detects faces using operation "detect-faces"
  3. Pass the image file as an artifact between steps
  4. Use this test image URL: https://raw.githubusercontent.com/opencv/opencv/master/samples/data/lena.jpg
tip

Follow the TODOs marked in the file for a step-by-step implementation.

info
  • Review Example 4 (Artifact Passing) from the Argo Workflows guide
  • Each template should use a container with either the image-processor image or busybox
  • Use command: [python, /app.py] and args: [...] to specify the operation
  • Remember to define outputs.artifacts to pass files to the next step
  • Remember to define inputs.artifacts to receive files from the previous step
  • You will be able to see the artifacts from the Argo UI dashboard to check on your work

Phase 1: TODO-1​

  • The step is already created; you have to write a container spec that downloads the input file and exposes it as an artifact
  • Follow Example 4 for artifact passing and run the download inside a container. Note that the stock busybox image provides wget rather than curl, so either use wget or pick an image that ships curl
  • Check that the file written by the download command matches the artifact path

Phase 2: TODO-2​

  • Uncomment the YAML lines
  • Add the input artifacts (arguments.artifacts) to the step in the pipeline
  • Add the input artifact paths (inputs.artifacts) to the step template
  • Call the application with the correct parameters
  • Add an output artifact to the step template

Phase 3: TODO-3​

  • Uncomment the YAML lines
  • Add the input artifacts (arguments.artifacts) to the step in the pipeline
  • Add the input artifact paths (inputs.artifacts) to the step template
  • Call the application with the correct parameters
  • Add an output artifact to the step template
  • Download the resulting image from the Argo dashboard

Exercise 2: Web Scraping Pipeline

Objective

Create an Argo Workflow that:

  1. Scrapes two websites in parallel (Hacker News and Reddit)
  2. Aggregates the scraped links and ranks them by frequency
  3. Outputs the top 10 most frequently linked pages
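
Ranking by frequency boils down to counting how often each link appears across both sources and sorting by that count. A minimal sketch of the idea with standard shell tools is shown below. The sample links, the plain-text input files and the rank_links function are invented for illustration; the provided app.py works on JSON files and may well implement the aggregation differently.

```shell
#!/bin/sh
# Count link occurrences across two hypothetical scrape results and print the
# most frequent ones first. Illustration only; this is not the lab's app.py.
rank_links() {
    # $1, $2 = files with one link per line; $3 = number of entries to keep
    cat "$1" "$2" | sort | uniq -c | sort -k1,1nr -k2 | head -n "$3" |
        awk '{ print $1, $2 }'
}

dir=$(mktemp -d)
printf '%s\n' https://example.com/a https://example.com/b https://example.com/a > "$dir/links1.txt"
printf '%s\n' https://example.com/a https://example.com/c https://example.com/b > "$dir/links2.txt"

rank_links "$dir/links1.txt" "$dir/links2.txt" 10
rm -rf "$dir"
```

With the sample data this prints "3 https://example.com/a", "2 https://example.com/b" and "1 https://example.com/c", one per line. In the workflow, the two input files correspond to the artifacts produced by the two scrape tasks.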

Your Task: Create the Workflow​

Create an Argo Workflow file named web-scraping.yaml.

The container image embeds a Python application that works as follows:

  • app.py scrape <url> <output.json> scrapes a URL and saves the extracted links to a JSON file
  • app.py aggregate <links1.json> <links2.json> <aggregate.json> merges the two JSON files and counts the most frequently linked pages

What you have to do:

  1. Use the gitlab.cs.pub.ro:5050/scgc/cloud-courses/web-scraper:latest container image
  2. Run the workflow in two phases:
    • Phase 1 (parallel): Two tasks that scrape websites concurrently:
      • Scrape https://news.ycombinator.com
      • Scrape https://old.reddit.com/r/programming
    • Phase 2 (after phase 1 completes): One task that aggregates the results
  3. Pass scraped links as artifacts (JSON files) between steps
  4. The aggregate step receives both scraped link files and outputs a result

Hints:

  • Review Example 3 (Parallel Execution) from the Argo Workflows guide
  • Review Example 4 (Artifact Passing) from the Argo Workflows guide
  • Parallel steps share one outer list item: start the first with a double dash (- - name: scrape1) and add the second with a single dash (- name: scrape2), both under the same - -
  • Start the next sequential step with a new double dash: - - name: aggregate
  • Remember that the aggregate operation takes 2 input files and 1 output file