Running jobs on Kubernetes

Setup

We will be using a virtual machine in the faculty's cloud.

When creating a virtual machine in the Launch Instance window:

Name your VM using the following convention: cc_lab<no>_<username>, where <no> is the lab number and <username> is your institutional account.
Select Boot from image in the Instance Boot Source section.
Select CC Template in the Image Name section.
Select the g.large flavor.

Creating a Kubernetes cluster

As in the previous laboratories, we will create a cluster on the lab machine, using the kind create cluster command:

student@lab-kubernetes:~$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.23.4) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Thanks for using kind! 😊

note

It is recommended that you use port-forwarding instead of X11 forwarding to interact with the UI.

Kubernetes Jobs

Introduction to Batch Workloads

So far, we have only worked with applications and services that are meant to run indefinitely: once started, they keep running until they are stopped or an error occurs.

However, this does not cover most use cases in distributed computing. Many processing tasks are batch workloads - discrete units of work that:

Run to completion
Process a specific dataset or task
Exit when finished
Should not be automatically restarted after successful completion

Examples of batch workloads include:

Data processing: ETL (Extract, Transform, Load) pipelines
Machine learning: Training models, batch inference
Report generation: Periodic analytics and exports
Backup and archival: Database backups, log aggregation
Video/image processing: Transcoding, thumbnail generation
Scientific computing: Simulations, numerical analysis

Kubernetes is, at its core, a scheduler, which makes it well suited to running batch processing jobs.

Jobs vs Pods

A Kubernetes Job should be used instead of a Pod when:

The workload has a defined start and end
You expect the action to finish successfully
You don't want resources lingering in the cluster after completion
You need guarantees about completion and retry behavior

Key differences:

Feature	Pod	Job
Lifecycle	Long-running	Run-to-completion
Restart behavior	Restarts indefinitely on failure	Controlled retry with backoffLimit
Completion tracking	N/A	Tracks successful completions
Resource cleanup	Runs forever unless deleted	Can be automatically cleaned up
Use case	Services, daemons	Batch processing, one-time tasks

The object that manages a discrete work item in Kubernetes is called a Job. It contains a Pod template, so its container specification looks just like the ones we are used to from Pod manifests.

The example below shows a Job that prints a debug message:

apiVersion: batch/v1
kind: Job
metadata:
  name: hello-world-job
spec:
  template:
    spec:
      containers:
      - name: hello-world
        image: ghcr.io/containerd/busybox
        command: ["echo", "Hello from Kubernetes batch job!"]
      restartPolicy: Never
  backoffLimit: 4

When applying the above manifest, we can see that the Job is created, and we can inspect its output as follows:

student@lab-jobs:~/$ kubectl apply -f hello-world.yaml
job.batch/hello-world-job created
student@lab-jobs:~/$ kubectl get jobs
NAME              COMPLETIONS   DURATION   AGE
hello-world-job   0/1           0s         0s
student@lab-jobs:~/$ kubectl logs job/hello-world-job
Hello from Kubernetes batch job!

The above example is enough for quick, throwaway jobs, but in a real batch environment there are other factors to consider:

add resource requests and limits to improve scheduling accuracy and protect the other workloads on the node;
use a custom job script;
add failure conditions;
limit job duration.

The following example creates a more complex Job that runs a custom Python script, limits its resources, and restarts the container if the application fails:

apiVersion: batch/v1
kind: Job
metadata:
  name: matrix-multiplication-job
spec:
  template:
    spec:
      containers:
      - name: matrix-multiply
        image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/python:3.9-slim
        command: ["bash", "-c"]
        args:
        - |
          pip install numpy && python /scripts/matrix_multiply.py
        volumeMounts:
        - name: script-volume
          mountPath: /scripts
        - name: pip-local
          mountPath: /.local
        - name: pip-local
          mountPath: /.cache
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
      volumes:
      - name: script-volume
        configMap:
          name: matrix-multiplication-script
      - name: pip-local
        emptyDir: {}
      restartPolicy: OnFailure
  backoffLimit: 2
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: matrix-multiplication-script
data:
  matrix_multiply.py: |
    import numpy as np
    import time
    import os

    # Create large matrices
    size = 5000
    print(f'Creating {size}x{size} matrices...')
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)

    # Perform CPU-intensive matrix multiplication
    print('Starting matrix multiplication...')
    start_time = time.time()
    result = np.matmul(a, b)
    duration = time.time() - start_time

    print(f'Matrix multiplication complete in {duration:.2f} seconds')
    print(f'Result matrix shape: {result.shape}')

The requests section is used for scheduling: it states the minimum resources that must be available on a node for the container to be placed there. The limits section specifies the maximum the container may use at runtime: CPU usage above the limit is throttled, and a container that exceeds its memory limit is terminated. As with a regular Pod, ConfigMaps, Secrets and other Kubernetes objects can be mounted into the container.
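
To make the role of requests more concrete, here is a minimal Python sketch of the kind of check a scheduler performs before placing a pod. It is a simplified model, not the Kubernetes scheduler itself: the helper names (parse_cpu, parse_memory, fits) and the two nodes are hypothetical, and only the quantity formats used in this lab are handled.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "2" or "500m" (millicores) to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


# Binary suffixes only; other suffixes are out of scope for this sketch.
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "4Gi" to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain bytes


def fits(node_free: dict, requests: dict) -> bool:
    """A node is a candidate only if its free resources cover the requests."""
    return (
        parse_cpu(requests["cpu"]) <= parse_cpu(node_free["cpu"])
        and parse_memory(requests["memory"]) <= parse_memory(node_free["memory"])
    )


requests = {"cpu": "2", "memory": "4Gi"}        # values from the manifest above
small_node = {"cpu": "1500m", "memory": "8Gi"}  # hypothetical node
large_node = {"cpu": "4", "memory": "16Gi"}     # hypothetical node

print(fits(small_node, requests))  # False: 1.5 free cores cannot cover 2 cores
print(fits(large_node, requests))  # True
```

Note that limits play no part in this check: they only matter once the container is running.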

Let's run it and see its output:

student@lab-jobs:~/$ kubectl logs job/matrix-multiplication-job
Collecting numpy
  Downloading numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.5/19.5 MB 101.7 MB/s eta 0:00:00
Installing collected packages: numpy
Successfully installed numpy-2.0.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Creating 5000x5000 matrices...
Starting matrix multiplication...
Matrix multiplication complete in 14.20 seconds
Result matrix shape: (5000, 5000)

Job Configuration Options

Jobs provide several configuration options to control their behavior:

completions

Specifies the number of successful pod completions needed for the job to be considered complete.

spec:
  completions: 5  # Job completes after 5 successful pod runs

parallelism

Controls how many pods run simultaneously. Useful for processing large datasets in parallel.

spec:
  completions: 10
  parallelism: 3  # Run 3 pods at a time until 10 completions

activeDeadlineSeconds

Sets a timeout for the entire job. If the job doesn't complete within this time, it's terminated.

spec:
  activeDeadlineSeconds: 3600  # Job fails if not done in 1 hour

backoffLimit

Number of retries before marking the job as failed. Default is 6.

spec:
  backoffLimit: 3  # Retry up to 3 times on failure
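
Retries are not immediate: the Kubernetes documentation describes an exponential back-off between them (10s, 20s, 40s, ...) capped at six minutes. The following Python sketch only reproduces that arithmetic; the function name backoff_delays is made up for illustration, and the exact timing in a real cluster may differ.

```python
def backoff_delays(backoff_limit: int, base: int = 10, cap: int = 360) -> list:
    """Approximate delay (in seconds) before each retry: the delay doubles
    every time and never exceeds the cap of six minutes."""
    return [min(base * 2**attempt, cap) for attempt in range(backoff_limit)]


print(backoff_delays(3))      # [10, 20, 40]
print(backoff_delays(6))      # [10, 20, 40, 80, 160, 320]
print(backoff_delays(8)[-1])  # 360
```

With the default backoffLimit of 6, a job that keeps failing can therefore take several minutes before it is finally marked as failed.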

ttlSecondsAfterFinished

Automatically cleans up the job after completion or failure.

spec:
  ttlSecondsAfterFinished: 86400  # Delete job 24 hours after completion

Job Patterns

Kubernetes Jobs support several common patterns for batch processing:

Pattern 1: Single Completion Job

The simplest pattern - run one pod to completion.

apiVersion: batch/v1
kind: Job
metadata:
  name: single-task
spec:
  template:
    spec:
      containers:
      - name: task
        image: busybox
        command: ["sh", "-c", "echo Processing task && sleep 10"]
      restartPolicy: Never

Pattern 2: Parallel Jobs with Fixed Completion Count

Process multiple items by running multiple pods in parallel.

apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-processing
spec:
  completions: 10      # Need 10 successful completions
  parallelism: 3       # Run 3 pods at a time
  template:
    spec:
      containers:
      - name: processor
        image: busybox
        command: ["sh", "-c", "echo Processing item $RANDOM && sleep 5"]
      restartPolicy: Never

Use case: Processing a known set of tasks (e.g., generating 10 reports, processing 100 images in batches).
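
To get a feeling for how completions and parallelism interact, the Python sketch below computes the batches of pods for the job above. It is a deliberately simplified model: it assumes every pod succeeds and that all pods in a batch finish together, whereas the real Job controller starts a replacement pod as soon as any pod finishes. The function name pod_waves is hypothetical.

```python
def pod_waves(completions: int, parallelism: int) -> list:
    """Sizes of the successive batches of pods needed to reach
    the requested number of completions."""
    waves = []
    remaining = completions
    while remaining > 0:
        batch = min(parallelism, remaining)  # never start more pods than needed
        waves.append(batch)
        remaining -= batch
    return waves


print(pod_waves(10, 3))  # [3, 3, 3, 1]
```

Under these assumptions the job needs four batches, so with a 5-second task it runs for roughly 20 seconds plus pod startup time.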

Pattern 3: Work Queue Pattern

Multiple workers processing tasks from a shared queue. Workers continue until the queue is empty.

apiVersion: batch/v1
kind: Job
metadata:
  name: work-queue
spec:
  parallelism: 5       # 5 workers processing in parallel
  # No completions - workers exit when queue is empty
  template:
    spec:
      containers:
      - name: worker
        image: my-worker:latest
        env:
        - name: QUEUE_URL
          value: "redis://queue:6379"
      restartPolicy: Never

Use case: Processing an unknown number of tasks from a message queue (RabbitMQ, Redis, SQS).
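
The manifest above does not show what the worker image does. As a rough sketch, each worker could follow the loop below: take an item, process it, and exit once nothing is left. To keep the example self-contained, Python's in-process queue.Queue stands in for the external queue (Redis in the manifest), threads stand in for the 5 parallel pods, and squaring a number is a placeholder for the real work.

```python
import queue
import threading


def worker(tasks: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Pull items until the queue is empty, then return.
    In a pod, returning here corresponds to exiting with code 0."""
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        with lock:
            results.append(item * item)  # placeholder for real processing


tasks = queue.Queue()
for n in range(20):
    tasks.put(n)

results = []
lock = threading.Lock()
threads = [
    threading.Thread(target=worker, args=(tasks, results, lock))
    for _ in range(5)  # mirrors parallelism: 5
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 20
```

Each item is processed exactly once, regardless of which worker picks it up.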

Best Practices for Jobs

Set resource limits: Always specify requests and limits to prevent resource starvation
Use ttlSecondsAfterFinished: Automatically clean up completed jobs to avoid clutter
Choose appropriate backoffLimit: Balance between retry attempts and fast failure
Monitor job status: Use kubectl get jobs and kubectl describe job to track progress
Use init containers: Separate setup (downloading data) from processing
Consider parallelism: Use parallel jobs for independent tasks that can run simultaneously
Handle failures gracefully: Ensure your container exits with proper exit codes
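
The last point matters because a Job decides whether a pod succeeded or failed based on the container's exit code: 0 means success, anything else is a failure that counts towards backoffLimit. A minimal Python sketch of a job script written with this in mind follows; the process function and its inputs are hypothetical.

```python
import sys


def process(items: list) -> int:
    """Return 0 on success and a non-zero code on failure,
    so the caller can pass it on as the process exit code."""
    try:
        total = sum(int(x) for x in items)
    except ValueError as err:
        print(f"invalid input: {err}", file=sys.stderr)
        return 1
    print(f"processed {len(items)} items, total={total}")
    return 0


exit_code = process(["1", "2", "3"])
print(exit_code)  # 0

# In a real entry point, the code would be handed to the operating system:
#     sys.exit(process(sys.argv[1:]))
```

Catching the error and returning a non-zero code, instead of printing a message and exiting normally, is what lets Kubernetes apply the retry policy.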

Case study: zip cracking

Let's look at a real-world example: cracking a password using Kubernetes Jobs and fcrackzip, a tool that can brute-force a ZIP archive's password. The decrypt-zip.yaml file is the basis for our job; it contains the commands used for cracking the password of a ZIP file.

Our task is to download the archive, and crack its password.

The following manifest defines our Job and the volumes it uses:

apiVersion: batch/v1
kind: Job
metadata:
  name: zip-decryption-job
  labels:
    app: zip-decryption
spec:
  ttlSecondsAfterFinished: 86400  # Automatically delete job 24h after completion
  backoffLimit: 2  # Number of retries before considering job failed
  template:
    metadata:
      labels:
        app: zip-decryption
    spec:
      restartPolicy: OnFailure
      initContainers:
      - name: download-zip
        image: ghcr.io/curl/curl-container/curl:master   # Lightweight curl image
        command: ["/bin/sh", "-c"]
        volumeMounts:
        - name: data-volume
          mountPath: /data
        args:
        - >
          echo "Downloading ZIP file from remote source..." &&
          curl http://swarm.cs.pub.ro/~sweisz/encrypted.zip -o /data/encrypted.zip
      containers:
      - name: hashcat-container
        image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/fcrackzip  # Course image used to run fcrackzip
        command: ["/bin/sh"]
        args:
        - "-c"
        - >
          cd /data &&
          fcrackzip -v -b -c a -l 5-5 -u encrypted.zip > results_lowercase.txt &&
          cat results_lowercase.txt
        volumeMounts:
        - name: data-volume
          mountPath: /data
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
      volumes:
      - name: data-volume
        emptyDir: {}
      - name: wordlist-volume
        configMap:
          name: zip-decrypt-config

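We know that the password consists of 5 lowercase letters, which is why we use -b (brute-force mode) together with -c a (lowercase letters only) and -l 5-5 (minimum and maximum length of 5). The -u option checks each candidate by actually trying to unzip the archive, which filters out false positives. We use the init container to download the archive and the main container to run fcrackzip.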
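
To see why these restrictions matter, we can compute how many candidates the brute-force search has to cover. The Python sketch below only illustrates the size and shape of the search space; the order in which fcrackzip actually enumerates candidates is not specified here.

```python
import itertools
import string

ALPHABET = string.ascii_lowercase  # the character set selected by "-c a"
LENGTH = 5                         # the length selected by "-l 5-5"

# Every position can hold any of the 26 letters.
print(len(ALPHABET) ** LENGTH)  # 11881376

# Lazily generate candidates; only the first three are materialized.
candidates = ("".join(p) for p in itertools.product(ALPHABET, repeat=LENGTH))
print(list(itertools.islice(candidates, 3)))  # ['aaaaa', 'aaaab', 'aaaac']
```

Each additional character multiplies the search space by 26, which is why knowing the exact length and character set makes the job practical.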

Exercise: Crack using wordlist

Change the above job in order to run fcrackzip using the wordlist from the following link: http://swarm.cs.pub.ro/~sweisz/wordlist.txt. You can attach the wordlist as a ConfigMap as you've seen in the matrix multiplication example. You can see how to configure fcrackzip to use wordlists in the following link: https://sohvaxus.github.io/content/fcrackzip-bruteforce-tutorial.html.

CronJobs

While regular Jobs are useful from a scheduling point of view, they cannot be set to run periodically or at a set time. CronJobs are a Kubernetes mechanism that builds on regular Jobs: a CronJob creates Jobs at specific times, based on a user-defined schedule.

Some use cases which we can define for CronJobs are:

scheduling regular data exports or backups to off-site facilities
periodic environment cleanup jobs, for example deleting temporary files or files which have been generated and haven't been used for some time
polling an endpoint for new data or information

The following is an example manifest for a CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: first-job
spec:
  schedule: "0 2 8 * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: first-job
            image: busybox
            command: ["echo", "First job"]
          restartPolicy: OnFailure

The jobTemplate field has the same structure as a Job specification; it describes the Job that is created at each scheduled time.

The schedule value is specified using the following convention from the cron manual:

# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').

This means that the above job will run on the 8th day of every month at 2:00 AM. If we wanted a job that runs every minute, we could make the following change:

-  schedule: "0 2 8 * *"
+  schedule: "*/1 * * * *"

In the minute field, */x means that the job runs every x minutes.
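
To check your understanding of the five fields, here is a small Python sketch that tests whether a given moment matches a schedule. It is a teaching aid under stated assumptions, not a full cron implementation: it supports only '*', '*/n' and single numbers, and it requires all five fields to match, whereas standard cron treats day of month and day of week specially when both are restricted. The function names are hypothetical.

```python
from datetime import datetime


def field_matches(expr: str, value: int, lowest: int) -> bool:
    """Match one cron field. `lowest` is the smallest legal value of the
    field (0 for minute/hour/day of week, 1 for day of month/month)."""
    if expr == "*":
        return True
    if expr.startswith("*/"):
        return (value - lowest) % int(expr[2:]) == 0
    return value == int(expr)


def cron_matches(schedule: str, when: datetime) -> bool:
    minute, hour, dom, month, dow = schedule.split()
    cron_dow = (when.weekday() + 1) % 7  # cron counts Sunday as 0
    return (
        field_matches(minute, when.minute, 0)
        and field_matches(hour, when.hour, 0)
        and field_matches(dom, when.day, 1)
        and field_matches(month, when.month, 1)
        and field_matches(dow, cron_dow, 0)
    )


print(cron_matches("0 2 8 * *", datetime(2024, 3, 8, 2, 0)))    # True
print(cron_matches("0 2 8 * *", datetime(2024, 3, 8, 2, 1)))    # False
print(cron_matches("*/1 * * * *", datetime(2024, 3, 8, 2, 1)))  # True
```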

tip

For an easy way to define the cron schedule, you can use https://crontab.guru/.

Case study: Database backup

For this case study we will be running a PostgreSQL database defined by the following manifest:

# PostgreSQL Pod
apiVersion: v1
kind: Pod
metadata:
  name: postgres-db
  labels:
    app: postgres
spec:
  containers:
  - name: postgres
    image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/postgres:14-alpine
    ports:
    - containerPort: 5432
      name: postgres
    env:
    - name: PGDATA
      value: /var/lib/postgresql/data/pg/
    - name: POSTGRES_USER
      valueFrom:
        secretKeyRef:
          name: postgres-credentials
          key: username
    - name: POSTGRES_PASSWORD
      valueFrom:
        secretKeyRef:
          name: postgres-credentials
          key: password
    - name: POSTGRES_DB
      valueFrom:
        secretKeyRef:
          name: postgres-credentials
          key: database
    volumeMounts:
    - name: postgres-data
      mountPath: /var/lib/postgresql/data/
  volumes:
  - name: postgres-data
    emptyDir: {}
---
# Service for PostgreSQL
apiVersion: v1
kind: Service
metadata:
  name: postgres-service
spec:
  ports:
  - port: 5432
    targetPort: 5432
  selector:
    app: postgres

The pgsql.yaml file deploys a database server. For this database server we need to create backups, which will be stored in another volume and then shipped off-site.

To prepare the setup, we first need to create the database that we will be backing up. Run the following command in the lab directory to set up the database Pod and Service:

kubectl apply -f pgsql.yaml

We will start from the following already created CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup-container
            image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/postgres:14-alpine
            command:
            - /bin/sh
            - -c
            - |
              # Set date format for backup filename
              BACKUP_DATE=$(date +%Y-%m-%d-%H%M)

              # Create backup
              echo "Starting PostgreSQL backup at $(date)"
              mkdir -p /tmp/backups
              pg_dump \
                -h ${DB_HOST} \
                -U ${DB_USER} \
                -d ${DB_NAME} \
                -F custom \
                -Z 9 \
                -f /tmp/backups/${DB_NAME}-${BACKUP_DATE}.pgdump

            env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: host
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: username
            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: database
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
          restartPolicy: OnFailure
---
# Secret for database credentials
apiVersion: v1
kind: Secret
metadata:
  name: postgres-credentials
type: Opaque
data:
  host: cG9zdGdyZXMtc2VydmljZQ==  # postgres-service (base64 encoded)
  username: YmFja3VwX3VzZXI=        # backup_user (base64 encoded)
  password: c2VjdXJlUGFzc3dvcmQxMjM= # securePassword123 (base64 encoded)
  database: cHJvZHVjdGlvbl9kYg==     # production_db (base64 encoded)

The above CronJob creates a backup of the database using pg_dump and puts it in a temporary location.
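
Note that the values under data in the Secret are only base64-encoded, not encrypted, so anyone who can read the Secret can recover them. The short Python sketch below decodes the values from the manifest and shows how such a value can be produced; it uses only the standard library.

```python
import base64

# Values copied from the Secret manifest above.
encoded = {
    "host": "cG9zdGdyZXMtc2VydmljZQ==",
    "username": "YmFja3VwX3VzZXI=",
    "password": "c2VjdXJlUGFzc3dvcmQxMjM=",
    "database": "cHJvZHVjdGlvbl9kYg==",
}

decoded = {key: base64.b64decode(value).decode() for key, value in encoded.items()}
print(decoded["host"])      # postgres-service
print(decoded["database"])  # production_db

# Encoding a new value for use in a Secret:
print(base64.b64encode(b"postgres-service").decode())  # cG9zdGdyZXMtc2VydmljZQ==
```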

Apply both manifests so we can see the backup in action.

student@lab-jobs:~/ocp/upgrade$ kubectl get cronjobs
NAME              SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
postgres-backup   */1 * * * *   False     0        35s             39m

The issue with the above CronJob is that although it creates a backup file, it doesn't add it to any kind of persistent storage.

Create an emptyDir volume, mount it at the /backup path, and change the backup script so that it copies the backup files to the backup volume.

Change the backup schedule so that a backup runs only once every hour.

Change the policy so that at most one backup job can run at a time. Look into the documentation to find out how to disallow concurrent jobs: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/.

Argo Workflows

What is Argo Workflows?

Argo Workflows is a cloud-native workflow engine for Kubernetes that orchestrates parallel jobs. It's designed for compute-intensive workflows where each step is performed by a container.

Key features:

Native Kubernetes CRDs: Workflows are defined as Custom Resource Definitions
DAG-based workflows: Define complex dependencies between tasks
Container-native: Each step runs in its own container
Artifact management: Pass files and data between workflow steps
Parameter passing: Share variables between workflow steps
Parallel execution: Run multiple tasks simultaneously
Web UI: Visualize workflow execution in real-time
Scalable: Leverages Kubernetes for scheduling and resource management

Why Use Argo Workflows?

Kubernetes Jobs are great for simple batch workloads, but they have limitations:

Feature	Kubernetes Jobs	Argo Workflows
Multi-step workflows	Manual orchestration	Built-in DAG support
Dependencies	No native support	Declare dependencies easily
Parameter passing	Manual (ConfigMaps/Secrets)	Native input/output parameters
File passing	Manual (volumes)	Native artifact management
Parallel execution	Limited	Advanced parallelism patterns
Conditional logic	Not supported	Conditionals, loops, recursion
Visualization	Basic kubectl output	Rich web UI
Retry logic	Job-level only	Step-level with custom strategies

Use Argo Workflows when you need:

Multi-step data pipelines
Complex dependencies between tasks
Passing data (files, parameters) between steps
Parallel processing with aggregation
Machine learning pipelines
CI/CD workflows
Data science workflows (ETL, training, inference)

Installation

Install Argo Workflows Server

Create the Argo namespace and install the server components:

$ kubectl create namespace argo
namespace/argo created

$ kubectl apply --server-side -n argo -f "https://github.com/argoproj/argo-workflows/releases/download/v3.7.14/quick-start-minimal.yaml"
customresourcedefinition.apiextensions.k8s.io/clusterworkflowtemplates.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/cronworkflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workfloweventbindings.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtaskresults.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtasksets.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtemplates.argoproj.io created
serviceaccount/argo created
serviceaccount/argo-server created
role.rbac.authorization.k8s.io/argo-role created
...
deployment.apps/workflow-controller created
deployment.apps/argo-server created

Verify the installation:

$ kubectl get pods -n argo
NAME                                   READY   STATUS    RESTARTS   AGE
argo-server-65f9588cf6-jgtj7           1/1     Running   0          51s
workflow-controller-7df5f5d5c8-vrk85   1/1     Running   0          51s

Both pods should be in Running status.

Install Argo CLI

The Argo CLI makes it easier to submit and manage workflows:

$ curl -sLO "https://github.com/argoproj/argo-workflows/releases/download/v3.7.14/argo-linux-amd64.gz"
$ gunzip argo-linux-amd64.gz
$ chmod +x argo-linux-amd64
$ sudo mv argo-linux-amd64 /usr/local/bin/argo

$ argo version
argo: v3.7.14

info

Argo Workflows provides a dashboard to interact with the workflows on localhost:2746.

There are two options for connecting to the Argo user interface: SSH tunneling or Chrome Remote Desktop.

info

Option 1: SSH tunneling

Follow this tutorial to configure the SSH service to bind and forward the 2746 port to your machine:

ssh -J fep -L 2746:127.0.0.1:2746 -i ~/.ssh/id_fep  student@10.9.X.Y

info

Option 2: Chrome Remote Desktop

An alternative to SSH tunneling or X11 forwarding is Chrome Remote Desktop, which allows you to connect to the graphical interface of your VM.

If you want to use this method, follow the steps from here.

tip

Start a kubectl port-forward on the VM:

$ kubectl -n argo port-forward deployment/argo-server 2746:2746
Forwarding from 127.0.0.1:2746 -> 2746

Open your browser to https://localhost:2746 (accept the self-signed certificate warning).

To authenticate to the webserver you must run the following commands and paste the resulting token on the login screen.

$ kubectl -n argo create sa argo-admin
$ kubectl -n argo create clusterrolebinding argo-admin \
    --clusterrole=cluster-admin \
    --serviceaccount=argo:argo-admin
$ kubectl -n argo create token argo-admin

warning

Prefix the token with the word Bearer when pasting it in the login screen, that is, paste Bearer <token>.

The Argo UI is extremely useful for:

Visualizing workflow DAGs
Monitoring workflow execution in real-time
Viewing logs from each step
Debugging failed workflows
Downloading artifacts

Argo Workflow Concepts

An Argo Workflow is a Kubernetes resource that defines a sequence of steps to execute.

It is built from templates, which are reusable components that define what to execute. Templates can be:

Container template: Runs a container
Script template: Runs a script in a container
Steps template: Defines a sequence of sub-templates
DAG template: Defines tasks with dependencies

A workflow receives input parameters, which allow you to pass values into templates. Output parameters allow you to pass values out of templates and on to the next steps in a workflow.

Instead of using parameters for output handling, you can use artifacts, which are files that are passed between workflow steps. Argo manages uploading and downloading artifacts automatically.

Working Examples

Let's explore working examples that demonstrate Argo Workflows capabilities. Run each of these to understand how workflows work before attempting the exercises.

Example 1: Hello World Workflow

The simplest workflow - run a single container.

Create a file hello-world.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: hello
  serviceAccountName: argo-admin
  templates:
  - name: hello
    container:
      image: busybox
      command: [echo]
      args: ["Hello World from Argo Workflows!"]

We notice the following:

generateName creates a unique workflow name
entrypoint specifies which template to start with
The workflow runs a single container

Submit the workflow:

$ argo submit -n argo hello-world.yaml --watch
Name:                hello-world-xxxxx
Namespace:           argo
ServiceAccount:      unset
Status:              Succeeded
Created:             Mon Jan 01 12:00:00 +0000 (10 seconds ago)
Started:             Mon Jan 01 12:00:00 +0000 (10 seconds ago)
Finished:            Mon Jan 01 12:00:05 +0000 (5 seconds ago)
Duration:            5 seconds

STEP                  TEMPLATE  PODNAME             DURATION  MESSAGE
 ✔ hello-world-xxxxx  hello     hello-world-xxxxx   3s

View the logs:

$ argo logs -n argo hello-world-xxxxx
hello-world-xxxxx: Hello World from Argo Workflows!

We see that a pod ran in the argo namespace:

student@lab-jobs:~$ kubectl get pods -n argo
NAME                                   READY   STATUS      RESTARTS   AGE
argo-server-5549677b6-f5hm6            1/1     Running     0          4m47s
hello-world-x8m96                      0/2     Completed   0          66s
httpbin-f5ccc9c6-t47d6                 1/1     Running     0          4m47s
minio-5877d79784-zph9x                 1/1     Running     0          4m47s
workflow-controller-7df5f5d5c8-qj8vd   1/1     Running     0          4m47s

Example 2: Sequential Multi-Step Workflow

Run multiple steps in sequence, passing parameters between them.

Create sequential-steps.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sequential-
spec:
  entrypoint: main
  serviceAccountName: argo-admin
  templates:
  - name: main
    steps:
    - - name: step1
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "Step 1: Starting workflow"

    - - name: step2
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "Step 2: Processing data"

    - - name: step3
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "Step 3: Workflow complete"

  - name: print-message
    inputs:
      parameters:
      - name: message
    container:
      image: busybox
      command: [sh, -c]
      args: ["echo '{{inputs.parameters.message}}' && date"]

We see that we have defined a template for a container. It receives a parameter called message and runs a container to print it. We then define three steps that run the print-message template.

Notice the following:

steps template defines sequential execution
Each sequential step starts a new inner list, written as - - (double dash)
Parameters are passed to templates via arguments
The {{inputs.parameters.message}} syntax accesses parameters
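
The effect of the placeholder syntax can be modelled in a few lines of Python. This is only a sketch of the observable behaviour (the placeholder is replaced by the argument value before the container starts), not Argo's actual implementation; the render function is hypothetical.

```python
import re

# Matches {{inputs.parameters.NAME}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*inputs\.parameters\.([\w-]+)\s*\}\}")


def render(template: str, parameters: dict) -> str:
    """Replace every parameter placeholder with its value."""
    return PLACEHOLDER.sub(lambda match: parameters[match.group(1)], template)


args = "echo '{{inputs.parameters.message}}' && date"
print(render(args, {"message": "Step 1: Starting workflow"}))
# echo 'Step 1: Starting workflow' && date
```

Because the substitution is purely textual, the same print-message template can be reused by all three steps with different arguments.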

Submit and watch:

$ argo submit -n argo sequential-steps.yaml --watch
STEP                      TEMPLATE       PODNAME                 DURATION
 ✔ sequential-xxxxx       main
 ├─✔ step1                print-message  sequential-xxxxx-step1  5s
 ├─✔ step2                print-message  sequential-xxxxx-step2  4s
 └─✔ step3                print-message  sequential-xxxxx-step3  4s

We see the logs where each step prints the message:

student@lab-jobs:~$ argo logs -n argo sequential-fq6mr
sequential-fq6mr-print-message-1642373302: time="2026-05-12T19:38:56.328Z" level=info msg="capturing logs" argo=true
sequential-fq6mr-print-message-1642373302: Step 1: Starting workflow
sequential-fq6mr-print-message-1642373302: Tue May 12 19:38:56 UTC 2026
sequential-fq6mr-print-message-1642373302: time="2026-05-12T19:38:57.329Z" level=info msg="sub-process exited" argo=true error="<nil>"
sequential-fq6mr-print-message-700546422: time="2026-05-12T19:39:06.138Z" level=info msg="capturing logs" argo=true
sequential-fq6mr-print-message-700546422: Step 2: Processing data
sequential-fq6mr-print-message-700546422: Tue May 12 19:39:06 UTC 2026
sequential-fq6mr-print-message-700546422: time="2026-05-12T19:39:07.139Z" level=info msg="sub-process exited" argo=true error="<nil>"
sequential-fq6mr-print-message-1865731698: time="2026-05-12T19:39:16.157Z" level=info msg="capturing logs" argo=true
sequential-fq6mr-print-message-1865731698: Step 3: Workflow complete
sequential-fq6mr-print-message-1865731698: Tue May 12 19:39:16 UTC 2026
sequential-fq6mr-print-message-1865731698: time="2026-05-12T19:39:17.158Z" level=info msg="sub-process exited" argo=true error="<nil>"

Example 3: Parallel Execution

Run multiple tasks simultaneously and wait for all to complete.

Create parallel-tasks.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallel-
spec:
  entrypoint: main
  serviceAccountName: argo-admin
  templates:
  - name: main
    steps:
    # All three tasks run in parallel (single dash means parallel)
    - - name: task-a
        template: process-task
        arguments:
          parameters:
          - name: task-name
            value: "Task A"
          - name: duration
            value: "10"

      - name: task-b
        template: process-task
        arguments:
          parameters:
          - name: task-name
            value: "Task B"
          - name: duration
            value: "15"

      - name: task-c
        template: process-task
        arguments:
          parameters:
          - name: task-name
            value: "Task C"
          - name: duration
            value: "12"

    # This step runs after all parallel tasks complete
    - - name: summary
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "All parallel tasks completed!"

  - name: process-task
    inputs:
      parameters:
      - name: task-name
      - name: duration
    container:
      image: busybox
      command: [sh, -c]
      args: ["echo 'Processing {{inputs.parameters.task-name}}'; sleep {{inputs.parameters.duration}}; echo '{{inputs.parameters.task-name}} done'"]

  - name: print-message
    inputs:
      parameters:
      - name: message
    container:
      image: busybox
      command: [echo]
      args: ["{{inputs.parameters.message}}"]

Notice the following:

Single dash - name: (within one - - block) means parallel execution
All three tasks start simultaneously
The summary step waits for all parallel tasks to complete
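
The list-of-lists structure of a steps template can be mimicked in Python to make the execution order explicit: the outer list runs group by group, while the entries of an inner list run at the same time. In this sketch threads stand in for pods and the sleep durations are scaled down from seconds to fractions of a second; the run_task function is hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(name: str, seconds: float) -> str:
    time.sleep(seconds)  # placeholder for the container's work
    return f"{name} done"


# Outer list: groups run one after another.
# Inner list: tasks within a group run in parallel.
steps = [
    [("Task A", 0.10), ("Task B", 0.15), ("Task C", 0.12)],
    [("summary", 0.0)],
]

log = []
start = time.monotonic()
for group in steps:
    with ThreadPoolExecutor(max_workers=len(group)) as pool:
        # map() keeps the input order; leaving the "with" block
        # waits for every task in the group to finish.
        log.extend(pool.map(lambda task: run_task(*task), group))
elapsed = time.monotonic() - start

print(log)
print(f"{elapsed:.2f}s")  # roughly the slowest task of the first group, not the sum
```

The summary entry is always last in the log because its group starts only after the whole first group has finished.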

We'll notice the tasks run in parallel when we watch them run:

$ argo submit -n argo parallel-tasks.yaml --watch
STEP                    TEMPLATE      PODNAME               DURATION
 ✔ parallel-xxxxx       main
 ├─✔ task-a             process-task  parallel-xxxxx-taska  12s
 ├─✔ task-b             process-task  parallel-xxxxx-taskb  17s
 ├─✔ task-c             process-task  parallel-xxxxx-taskc  14s
 └─✔ summary            print-message parallel-xxxxx-sum    2s

Let's look at the logs:

student@lab-jobs:~$ argo logs -n argo parallel-724fr
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:11.579Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2419718903: Processing Task A
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:12.332Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2453274141: Processing Task C
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:13.059Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2436496522: Processing Task B
parallel-724fr-process-task-2419718903: Task A done
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:22.581Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2453274141: Task C done
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:24.338Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2436496522: Task B done
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:28.066Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:40.653Z" level=info msg="capturing logs" argo=true
parallel-724fr-print-message-2792551925: All parallel tasks completed!
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:41.654Z" level=info msg="sub-process exited" argo=true error="<nil>"
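
The benefit of the parallel group can be estimated from the sleep values alone. The calculation below is only a back-of-the-envelope sketch: it compares the total sleep time if the three tasks ran one after another with the time spent when they run side by side. It ignores everything except the sleep itself. In the watch output above, each task takes about two seconds longer than its sleep value, presumably because of pod start-up and shutdown.

```shell
#!/bin/sh
# Sleep durations (seconds) passed to task-a, task-b and task-c above.
durations="10 15 12"

sum=0   # time if the tasks ran one after another
max=0   # time if the tasks run in parallel (the slowest task dominates)
for d in $durations; do
    sum=$((sum + d))
    if [ "$d" -gt "$max" ]; then
        max=$d
    fi
done

echo "sequential sleep time: ${sum}s"
echo "parallel sleep time:   ${max}s"
```

With these values the script reports 37s for the sequential case and 15s for the parallel case.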

Example 4: Passing Artifacts (Files) Between Steps

Generate a file in one step and consume it in another.

Create artifact-passing.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-passing-
spec:
  entrypoint: main
  serviceAccountName: argo-admin
  templates:
  - name: main
    steps:
    - - name: generate-data
        template: generate-artifact

    - - name: process-data
        template: process-artifact
        arguments:
          artifacts:
          - name: input-file
            from: "{{steps.generate-data.outputs.artifacts.result}}"

    - - name: analyze-data
        template: analyze-artifact
        arguments:
          artifacts:
          - name: input-file
            from: "{{steps.process-data.outputs.artifacts.result}}"

  - name: generate-artifact
    container:
      image: busybox
      command: [sh, -c]
      args:
        - |
          echo "Generating data at $(date)" > /tmp/data.txt
          echo "Line 1: Sample data" >> /tmp/data.txt
          echo "Line 2: More data" >> /tmp/data.txt
          echo "Line 3: Final data" >> /tmp/data.txt
          cat /tmp/data.txt
    outputs:
      artifacts:
      - name: result
        path: /tmp/data.txt

  - name: process-artifact
    inputs:
      artifacts:
      - name: input-file
        path: /tmp/input.txt
    container:
      image: busybox
      command: [sh, -c]
      args:
        - |
          echo "Processing input file:"
          cat /tmp/input.txt
          echo "---"
          echo "Processed at $(date)" > /tmp/output.txt
          cat /tmp/input.txt | tr '[:lower:]' '[:upper:]' >> /tmp/output.txt
          cat /tmp/output.txt
    outputs:
      artifacts:
      - name: result
        path: /tmp/output.txt

  - name: analyze-artifact
    inputs:
      artifacts:
      - name: input-file
        path: /tmp/final.txt
    container:
      image: busybox
      command: [sh, -c]
      args:
        - |
          echo "Final analysis:"
          cat /tmp/final.txt
          echo "---"
          wc -l /tmp/final.txt

Notice the following points:

outputs.artifacts defines files to pass to next steps
inputs.artifacts defines where to receive files
Argo automatically handles file transfer between steps
Use from: "{{steps.XXX.outputs.artifacts.YYY}}", where XXX is the step name and YYY is the artifact name, to reference artifacts
Each step can read, transform, and output new artifacts

Submit and watch:

$ argo submit -n argo artifact-passing.yaml --watch
STEP                           TEMPLATE          PODNAME                      DURATION
 ✔ artifact-passing-xxxxx      main
 ├─✔ generate-data             generate-artifact artifact-passing-xxxxx-gen   5s
 ├─✔ process-data              process-artifact  artifact-passing-xxxxx-proc  4s
 └─✔ analyze-data              analyze-artifact  artifact-passing-xxxxx-anal  3s

View logs from the final step:

$ argo logs -n argo artifact-passing-j84t8 artifact-passing-j84t8-analyze-artifact-99164579
artifact-passing-j84t8-analyze-artifact-99164579: time="2026-05-12T20:01:28.155Z" level=info msg="capturing logs" argo=true
artifact-passing-j84t8-analyze-artifact-99164579: Final analysis:
artifact-passing-j84t8-analyze-artifact-99164579: Processed at Tue May 12 20:01:18 UTC 2026
artifact-passing-j84t8-analyze-artifact-99164579: GENERATING DATA AT TUE MAY 12 20:01:08 UTC 2026
artifact-passing-j84t8-analyze-artifact-99164579: LINE 1: SAMPLE DATA
artifact-passing-j84t8-analyze-artifact-99164579: LINE 2: MORE DATA
artifact-passing-j84t8-analyze-artifact-99164579: LINE 3: FINAL DATA
artifact-passing-j84t8-analyze-artifact-99164579: ---
artifact-passing-j84t8-analyze-artifact-99164579: 5 /tmp/final.txt
artifact-passing-j84t8-analyze-artifact-99164579: time="2026-05-12T20:01:29.156Z" level=info msg="sub-process exited" argo=true error="<nil>"
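
The final line count of 5 can be checked without a cluster. The sketch below replays the three container scripts in a scratch directory and uses cp to stand in for the artifact transfer that Argo performs between pods. The run_pipeline function name and the directory layout are assumptions for illustration only; they are not part of the workflow.

```shell
#!/bin/sh
# Replays the three template scripts locally; cp stands in for Argo's
# artifact hand-off between pods. All files live in a scratch directory.
run_pipeline() {
    work="$(mktemp -d)"

    # generate-artifact: writes 4 lines to data.txt
    {
        echo "Generating data at $(date)"
        echo "Line 1: Sample data"
        echo "Line 2: More data"
        echo "Line 3: Final data"
    } > "$work/data.txt"

    # Hand-off: output artifact "result" becomes input artifact "input-file".
    cp "$work/data.txt" "$work/input.txt"

    # process-artifact: 1 header line + the 4 input lines in upper case
    echo "Processed at $(date)" > "$work/output.txt"
    tr '[:lower:]' '[:upper:]' < "$work/input.txt" >> "$work/output.txt"

    # Hand-off to the last step.
    cp "$work/output.txt" "$work/final.txt"

    # analyze-artifact: print the file, then count its lines
    cat "$work/final.txt"
    echo "---"
    wc -l < "$work/final.txt" | tr -d ' '

    rm -rf "$work"
}

run_pipeline
```

The count is 5 because the process step prepends one "Processed at" line to the four generated lines.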

Understanding Workflow Execution

When you submit a workflow:

Workflow Controller watches for new Workflow resources
The controller creates pods for each step based on dependencies, and Kubernetes schedules them onto nodes
Executor runs containers and manages artifacts
Outputs are collected (parameters, artifacts)
Next steps are triggered based on dependencies
Status is updated continuously

You can monitor workflows using:

# List workflows
$ argo list -n argo
NAME                     STATUS      AGE   DURATION   PRIORITY   MESSAGE
artifact-passing-j84t8   Succeeded   15m   30s        0
parallel-724fr           Succeeded   26m   40s        0
sequential-fq6mr         Succeeded   37m   30s        0
hello-world-x8m96        Succeeded   39m   10s        0

# Get workflow details
$ argo get -n argo parallel-724fr
Name:                parallel-724fr
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Tue May 12 19:50:07 +0000 (27 minutes ago)
Started:             Tue May 12 19:50:07 +0000 (27 minutes ago)
Finished:            Tue May 12 19:50:47 +0000 (26 minutes ago)
Duration:            40 seconds
Progress:            4/4
ResourcesDuration:   1m6s*(100Mi memory),3s*(1 cpu)

STEP               TEMPLATE       PODNAME                                  DURATION  MESSAGE
 ✔ parallel-724fr  main
 ├─┬─✔ task-a      process-task   parallel-724fr-process-task-2419718903   15s
 │ ├─✔ task-b      process-task   parallel-724fr-process-task-2436496522   21s
 │ └─✔ task-c      process-task   parallel-724fr-process-task-2453274141   17s
 └───✔ summary     print-message  parallel-724fr-print-message-2792551925  4s


# Watch workflow execution
$ argo watch -n argo <workflow-name>

# View logs
$ argo logs -n argo parallel-724fr
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:11.579Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2419718903: Processing Task A
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:12.332Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2453274141: Processing Task C
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:13.059Z" level=info msg="capturing logs" argo=true
parallel-724fr-process-task-2436496522: Processing Task B
parallel-724fr-process-task-2419718903: Task A done
parallel-724fr-process-task-2419718903: time="2026-05-12T19:50:22.581Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2453274141: Task C done
parallel-724fr-process-task-2453274141: time="2026-05-12T19:50:24.338Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-process-task-2436496522: Task B done
parallel-724fr-process-task-2436496522: time="2026-05-12T19:50:28.066Z" level=info msg="sub-process exited" argo=true error="<nil>"
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:40.653Z" level=info msg="capturing logs" argo=true
parallel-724fr-print-message-2792551925: All parallel tasks completed!
parallel-724fr-print-message-2792551925: time="2026-05-12T19:50:41.654Z" level=info msg="sub-process exited" argo=true error="<nil>"

# Delete workflow
$ argo delete -n argo parallel-724fr

Argo Workflows Practice

Now that you've learned how Argo Workflows work by running the examples, it's time to build your own workflows! You'll create two multi-step workflows that demonstrate real-world batch processing scenarios.

important

In these exercises, you'll create the Argo Workflow YAML yourself. We provide ready-made containerized applications - your task is to orchestrate them using Argo Workflows.

Exercise 1: Image Processing Pipeline

Objective

Create an Argo Workflow that orchestrates a multi-step image processing pipeline:

Download an image from a URL
Convert the image to black and white (grayscale)
Detect faces in the image and draw rectangles around them

Your Task: Create the Workflow

You will start from the image-processing.yaml file below:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: image-processing-
spec:
  entrypoint: image-pipeline
  serviceAccountName: argo-admin
  templates:
  - name: image-pipeline
    steps:
    #TODO-1: Call the download-image template with the following parameter: https://raw.githubusercontent.com/opencv/opencv/master/samples/data/lena.jpg
    - - name: download
        template: download-image

    #TODO-2: Call the convert-grayscale template
    #- - name: grayscale
    #    template: convert-grayscale
    #    arguments:
    #TODO-2: Add input artifacts

    #TODO-3: Call the detect-faces template
    #- - name: detect
    #    template: detect-faces
    #TODO-3: Add input artifacts

  - name: download-image
    container:
    #TODO-1: Add a container that downloads the image from the parameter
    outputs:
      artifacts:
      - name: image
        path: /tmp/image.jpg

  #TODO-2: Uncomment the following lines
  #- name: convert-grayscale
  #  inputs:
  #TODO-2: Add correct inputs
  #  container:
  #    image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/image-processor:latest
  #TODO-2: Call the application with the correct command and args
  #TODO-2: Add an output artifact that gets the grayscale output image

  #TODO-3: Uncomment the following lines
  #- name: detect-faces
  #  inputs:
  #TODO-3: Add input paths
  #  container:
  #    image: gitlab.cs.pub.ro:5050/scgc/cloud-courses/image-processor:latest
  #TODO-3: Call the application with the correct command and args
  #TODO-3: Add an output artifact that gets the face-detection output image

Use the image-processing.yaml file as a starting point to accomplish the following:

Use the gitlab.cs.pub.ro:5050/scgc/cloud-courses/image-processor:latest container image. Inside it, you can run the app.py application as follows:
- app.py grayscale /tmp/input.jpg /tmp/gray.jpg reads input.jpg and writes a grayscale copy to gray.jpg
- app.py detect-faces /tmp/input.jpg /tmp/argo-results/result.jpg writes a copy of the input image with rectangles drawn around the detected faces
The workflow runs three steps, based on three templates:
- download: Downloads the image from a URL using operation "download"
- grayscale: Converts the image to grayscale using operation "grayscale"
- detect: Detects faces using operation "detect-faces"
Pass the image file as an artifact between steps
Use this test image URL: https://raw.githubusercontent.com/opencv/opencv/master/samples/data/lena.jpg

tip

Follow the TODOs marked in the file for a step-by-step implementation.

info

Review Example 4 (Artifact Passing) from the Argo Workflows guide
Each template should use a container based on the image-processor image or busybox
Use command: [python, /app.py] and args: [...] to specify the operation
Remember to define outputs.artifacts to pass files to the next step
Remember to define inputs.artifacts to receive files from the previous step
You can inspect the artifacts in the Argo UI dashboard to check your work

Phase 1: TODO-1

The step is already created; you have to write a container spec that downloads the input file and passes it on as an artifact
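Follow Example 4 (Artifact Passing) and use the busybox image to run curl inside a container (the stock busybox image typically provides wget rather than curl, so use whichever download tool is present)
Check that the output file written by the download command matches the artifact path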

Phase 2: TODO-2

Uncomment the YAML lines
Add input artifacts for the step
Add input artifact paths for the pipeline
Call the application with the correct parameters
Add an output artifact to the step template

Phase 3: TODO-3

Uncomment the YAML lines
Add input artifacts for the step
Add input artifact paths for the pipeline
Call the application with the correct parameters
Add an output artifact to the step template
Download the image from the Argo dashboard

Exercise 2: Web Scraping and Link Analysis

Objective

Create an Argo Workflow that:

Scrapes two websites in parallel (Hacker News and Reddit)
Aggregates the scraped links and ranks them by frequency
Outputs the top 10 most frequently linked pages

Your Task: Create the Workflow

Create an Argo Workflow file named web-scraping.yaml.

We will be using a Python application embedded in the image, which works as follows:

app.py scrape <url> <output.json> scrapes a URL and saves the output to a JSON file
app.py aggregate <links1.json> <links2.json> <aggregate.json> aggregates the two JSON files and counts the top linked pages

What you have to do:

Use the gitlab.cs.pub.ro:5050/scgc/cloud-courses/web-scraper:latest container image
Run two steps:
- Phase 1 (parallel): Two tasks that scrape websites concurrently:
  - Scrape https://news.ycombinator.com
  - Scrape https://old.reddit.com/r/programming
- Phase 2 (after phase 1 completes): One task that aggregates the results
Pass scraped links as artifacts (JSON files) between steps
The aggregate step receives both scraped link files and outputs a result

Hints:

Review Example 3 (Parallel Execution) from the Argo Workflows guide
Review Example 4 (Artifact Passing) from the Argo Workflows guide
Use a single dash for steps that run in parallel: start the group with - - name: scrape1 and add - name: scrape2 beneath it (both inside the same - - group)
Use a double dash to start the next sequential step: - - name: aggregate
Remember that the aggregate operation takes 2 input files and 1 output file
