Shell Pipeline
The shell pipeline executes an arbitrary command, typically a shell or Python script.
The pipeline definition file is shown below (it can be obtained by typing owl pdef get shell
– see Running Pipelines).
version: 1.2
name: shell
# Directory where the command is writing data to (optional)
# output_dir: /tmp/output
command: ["sleep", "300"]
# use dask (optional, see docs)
# use_dask: false
resources:
  workers: 1
  memory: 8
  threads: 1
The output_dir parameter is optional; if specified, the scheduler will save to that directory a log file, the configuration used to run the pipeline, and a list of environment variables.
The command parameter defines which command or script to run. The example above simply waits for 5 minutes. Other command examples:
# Execute a Python script
command: ["/opt/conda/bin/python", "script.py"]
# Execute a script that takes two arguments as input
command: ["/home/eglez/scripts/script.sh", "100", "200"]
Note that the path to the executable must be specified in full and that the command is a list containing all parts of the command, i.e. the command ls -la would be written as ["ls", "-la"].
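Because the command is given as a list, it is presumably executed directly rather than through a shell (an assumption based on the list form, not confirmed by the documentation), so shell features such as pipes or wildcards would need an explicit shell invocation:

# Hypothetical example: run a shell pipeline through an explicit shell
command: ["/bin/bash", "-c", "ls -la /tmp | wc -l"]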
The use_dask parameter needs a bit of explanation. In the default mode (false) the script runs in a single worker with access to all the memory and cores requested. Internally the (Python) script can use Dask, multiprocessing, multithreading or any other mechanism, but the resources will be fixed to one worker.
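For instance, with use_dask set to false a script could still parallelise internally with the standard multiprocessing module while staying within the single worker's allotment. A minimal sketch (the workload and pool size are illustrative, not part of the pipeline distribution):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # Size the pool to the cores requested under resources (threads);
    # 4 is just an illustrative value here
    with Pool(processes=4) as pool:
        results = pool.map(square, range(100))
    print(sum(results))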
If use_dask is true then it makes sense to request more workers. The Python script is then required to connect to the Dask scheduler as follows:
import os
from distributed import Client

# Address of the running Dask scheduler, provided via the
# DASK_SCHEDULER_ADDRESS environment variable
DASK_SCHEDULER = os.getenv("DASK_SCHEDULER_ADDRESS")
client = Client(DASK_SCHEDULER)
and performs calculations using the Dask API.
The script dask_pipeline.py demonstrates a complete pipeline using Dask.
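A minimal sketch of such a script, assuming only the DASK_SCHEDULER_ADDRESS environment variable shown above, might look like this (illustrative only; it is not the actual dask_pipeline.py):

import os

from distributed import Client

def square(x):
    return x * x

def main():
    # Connect to the Dask scheduler started for this pipeline
    client = Client(os.getenv("DASK_SCHEDULER_ADDRESS"))

    # Distribute the work across the requested workers and gather results
    futures = client.map(square, range(1000))
    total = sum(client.gather(futures))
    print("Sum of squares:", total)

    client.close()

if __name__ == "__main__":
    main()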