Parallel Processing using Expansions

    For this example there are only three items: apple, banana, and cherry. The sample Jobs process each item by printing a string then pausing.

    See "Using Jobs in real workloads" to learn how this pattern fits more realistic use cases.

    You should be familiar with the basic, non-parallel use of a Job.

    You need a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with it. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube, or you can use a Kubernetes playground.

    For basic templating you need the command-line utility sed.

    To follow the advanced templating example, you need a working installation of Python and the Jinja2 template library for Python.

    Once you have Python set up, you can install Jinja2 with pip (for example, pip install --user jinja2).

    Create Jobs based on a template

    First, download the following template of a Job to a file called job-tmpl.yaml. Here’s what you’ll download:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: process-item-$ITEM
      labels:
        jobgroup: jobexample
    spec:
      template:
        metadata:
          name: jobexample
          labels:
            jobgroup: jobexample
        spec:
          containers:
          - name: c
            image: busybox
            command: ["sh", "-c", "echo Processing item $ITEM && sleep 5"]
          restartPolicy: Never

    # Use curl to download job-tmpl.yaml

    The file you downloaded is not yet a valid Kubernetes manifest. Instead that template is a YAML representation of a Job object with some placeholders that need to be filled in before it can be used. The $ITEM syntax is not meaningful to Kubernetes.

    The following shell snippet uses sed to replace the string $ITEM with the loop variable, writing into a temporary directory named jobs. Run this now:

    # Expand the template into multiple files, one for each item to be processed.
    mkdir ./jobs
    for i in apple banana cherry
    do
      cat job-tmpl.yaml | sed "s/\$ITEM/$i/" > ./jobs/job-$i.yaml
    done
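
    If you prefer Python to sed, the same expansion can be sketched with the standard library's string.Template, which happens to use the same $ITEM placeholder syntax. This is only an illustration: the inline template is an abbreviated stand-in for job-tmpl.yaml, and expand_template is a hypothetical helper name, not part of any Kubernetes tooling.

```python
from string import Template

# Abbreviated stand-in for job-tmpl.yaml (assumption: only the fields
# around the placeholder are shown here).
JOB_TEMPLATE = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: process-item-$ITEM
  labels:
    jobgroup: jobexample
"""


def expand_template(template_text, items):
    """Return a dict mapping an output filename to the expanded manifest text."""
    template = Template(template_text)
    return {f"job-{item}.yaml": template.substitute(ITEM=item) for item in items}


manifests = expand_template(JOB_TEMPLATE, ["apple", "banana", "cherry"])
for filename in sorted(manifests):
    print(filename)
```

    Unlike the sed loop, this sketch keeps the manifests in memory; writing each value to ./jobs/<filename> would reproduce the files that the shell snippet creates.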

    Check if it worked:

    ls jobs/

    The output is similar to this:

    job-apple.yaml
    job-banana.yaml
    job-cherry.yaml

    Next, create all the Jobs with one kubectl command by pointing it at the jobs directory (for example, kubectl create -f ./jobs).

    The output is similar to this:

    job.batch/process-item-apple created
    job.batch/process-item-banana created
    job.batch/process-item-cherry created

    Now, check on the jobs:

    kubectl get jobs -l jobgroup=jobexample

    The output is similar to this:

    NAME                  COMPLETIONS   DURATION   AGE
    process-item-apple    1/1           14s        22s
    process-item-banana   1/1           12s        21s
    process-item-cherry   1/1           12s        20s

    Using the -l option to kubectl selects only the Jobs that are part of this group of jobs (there might be other unrelated jobs in the system).

    You can check on the Pods as well, using the same label selector:

    kubectl get pods -l jobgroup=jobexample

    The output is similar to:

    NAME                        READY   STATUS      RESTARTS   AGE
    process-item-apple-kixwv    0/1     Completed   0          4m
    process-item-banana-wrsf7   0/1     Completed   0          4m
    process-item-cherry-dnfu9   0/1     Completed   0          4m

    You can check the output of all the Jobs at once with a single command that selects their Pods by label (for example, kubectl logs -l jobgroup=jobexample).

    The output should be:

    Processing item apple
    Processing item banana
    Processing item cherry

    # Remove the Jobs you created
    # Your cluster automatically cleans up their Pods
    kubectl delete job -l jobgroup=jobexample

    In the first example, each instance of the template had one parameter, and that parameter was also used in the Job’s name. However, names are restricted to contain only certain characters.
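
    To make the naming restriction concrete, here is a minimal sketch that checks candidate Job names before you generate manifests. It assumes the stricter DNS label rules (lowercase letters, digits, and '-', starting and ending with a letter or digit, at most 63 characters), which the Kubernetes documentation recommends for best compatibility; is_safe_job_name is a hypothetical helper, not a Kubernetes API.

```python
import re

# Assumption: DNS label rules, the conservative choice for Job names.
DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_safe_job_name(name):
    """Return True if name fits the DNS label rules assumed above."""
    return len(name) <= 63 and DNS_LABEL.fullmatch(name) is not None


for item in ["apple", "Red Apple", "http://dbpedia.org/resource/Apple"]:
    name = f"process-item-{item}"
    print(name, is_safe_job_name(name))
```

    A parameter such as a URL fails this check, which is one reason to keep the value used in the Job's name separate from the other parameters.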

    This slightly more complex example uses the Jinja template language to generate manifests, and then objects from those manifests, with multiple parameters for each Job.

    For this part of the task, you are going to use a one-line Python script to convert the template to a set of manifests.

    First, copy and paste the following template of a Job object into a file called job.yaml.jinja2:

    {% set params = [{ "name": "apple", "url": "http://dbpedia.org/resource/Apple", },
                      { "name": "banana", "url": "http://dbpedia.org/resource/Banana", }]
    %}
    {% for p in params %}
    {% set name = p["name"] %}
    {% set url = p["url"] %}
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: jobexample-{{ name }}
      labels:
        jobgroup: jobexample
    spec:
      template:
        metadata:
          name: jobexample
          labels:
            jobgroup: jobexample
        spec:
          containers:
          - name: c
            image: busybox
            command: ["sh", "-c", "echo Processing URL {{ url }} && sleep 5"]
          restartPolicy: Never
    {% endfor %}

    This example relies on a feature of YAML. One YAML file can contain multiple documents (Kubernetes manifests, in this case), separated by --- on a line by itself. You can pipe the output directly to kubectl to create the Jobs.
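
    To see the multi-document structure concretely, the following sketch splits a YAML stream on its --- separator lines using only the Python standard library. It is a naive split for illustration (a real YAML parser handles more cases), and both split_documents and the sample Job names are hypothetical.

```python
def split_documents(stream_text):
    """Naively split a multi-document YAML stream on '---' separator lines."""
    documents = []
    current = []
    for line in stream_text.splitlines():
        if line.rstrip() == "---":
            # A separator closes the document collected so far, if any.
            if any(part.strip() for part in current):
                documents.append("\n".join(current).strip("\n"))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        documents.append("\n".join(current).strip("\n"))
    return documents


# Hypothetical rendered output containing two abbreviated manifests.
rendered = """
---
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job-a
---
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job-b
"""

docs = split_documents(rendered)
print(len(docs))
```

    Each element of docs is one manifest, which is how a single stream can describe several Jobs.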

    Next, use this one-line Python program to expand the template:

    alias render_template='python -c "from jinja2 import Template; import sys; print(Template(sys.stdin.read()).render());"'

    Use render_template to convert the parameters and template into a single YAML file containing Kubernetes manifests:

    # This requires the alias you defined earlier
    cat job.yaml.jinja2 | render_template > jobs.yaml

    You can view jobs.yaml to verify that the render_template script worked correctly.

    Once you are happy that render_template is working how you intend, you can pipe its output into kubectl (for example, cat job.yaml.jinja2 | render_template | kubectl apply -f -).

    Kubernetes accepts and runs the Jobs you created.

    # Remove the Jobs you created
    # Your cluster automatically cleans up their Pods
    kubectl delete job -l jobgroup=jobexample

    Using Jobs in real workloads

    In a real use case, each Job performs some substantial computation, such as rendering a frame of a movie, or processing a range of rows in a database. If you were rendering a movie you would set $ITEM to the frame number. If you were processing rows from a database table, you would set $ITEM to represent the range of database rows to process.
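
    As a rough illustration of the database case, the sketch below splits a table into fixed-size row ranges, each of which could become one $ITEM value. The row counts and the row_ranges helper are hypothetical, and the "start-end" format is chosen only because it stays valid inside a Job name.

```python
def row_ranges(total_rows, chunk_size):
    """Yield inclusive 'start-end' row ranges that cover total_rows rows."""
    for start in range(0, total_rows, chunk_size):
        end = min(start + chunk_size, total_rows) - 1
        yield f"{start}-{end}"


items = list(row_ranges(total_rows=2500, chunk_size=1000))
print(items)
```

    Each string in items can then be substituted into the template the same way apple, banana, and cherry were.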

    In the task, you ran a command to collect the output from Pods by fetching their logs. In a real use case, each Pod for a Job writes its output to durable storage before completing. You can use a PersistentVolume for each Job, or an external storage service. For example, if you are rendering frames for a movie, use HTTP to PUT the rendered frame data to a URL, using a different URL for each frame.
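
    The per-frame upload could look roughly like the sketch below, which builds an HTTP PUT request with a distinct URL for each frame using Python's standard library. The storage endpoint, file naming scheme, and build_frame_upload helper are all assumptions for illustration; the request is only constructed here, not sent.

```python
import urllib.request

# Hypothetical storage endpoint; replace with your own service.
BASE_URL = "https://storage.example.com/frames"


def build_frame_upload(frame_number, frame_bytes):
    """Build (but do not send) a PUT request for one rendered frame."""
    url = f"{BASE_URL}/frame-{frame_number:06d}.png"
    return urllib.request.Request(
        url,
        data=frame_bytes,
        method="PUT",
        headers={"Content-Type": "image/png"},
    )


request = build_frame_upload(42, b"rendered frame bytes go here")
print(request.get_method(), request.full_url)

# To actually upload (requires a reachable server):
# urllib.request.urlopen(request)
```

    Deriving the URL from the frame number means that a retried Pod overwrites its own earlier output instead of creating a duplicate.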

    After you create a Job, Kubernetes automatically adds additional labels that distinguish one Job’s pods from another Job’s pods.

    In this example, each Job and its Pod template have a label: jobgroup=jobexample.

    Kubernetes itself pays no attention to labels named jobgroup. Setting a label for all the Jobs you create from a template makes it convenient to operate on all those Jobs at once. In the first example you used a template to create several Jobs. The template ensures that each Pod also gets the same label, so you can check on all Pods for these templated Jobs with a single command.

    Note: The label key jobgroup is not special or reserved. You can pick your own labelling scheme. There are recommended labels that you can use if you wish.

    Alternatives

    If you plan to create a large number of Job objects, you may find that:

    • Even using labels, managing so many Jobs is cumbersome.
    • You are limited by a resource quota on Jobs: the API server permanently rejects some of your requests when you create a great deal of work in one batch.

    You could also consider writing your own controller to manage Job objects automatically.