Shell Pipeline
The shell pipeline executes an arbitrary command, typically a shell or Python script.
The pipeline definition file is shown below (it can be obtained by typing owl pdef get shell
– see Running Pipelines).
version: 1.2
name: shell
# Directory where the command is writing data to (optional)
# output_dir: /tmp/output
command: ["sleep", "300"]
# use dask (optional, see docs)
# use_dask: false
resources:
  workers: 1
  memory: 8
  threads: 1
The output_dir parameter is optional; if specified, the scheduler will save to that directory a log file, the configuration used to run the pipeline, and a list of environment variables.
The command parameter defines which command or script to run. The example above simply waits for 5 minutes. Other command examples:
# Execute a Python script
command: ["/opt/conda/bin/python", "script.py"]
# Execute a script that takes two arguments as input
command: ["/home/eglez/scripts/script.sh", "100", "200"]
Note that the path to the executable must be specified in full and that the command is a list containing all parts of the command, i.e. the command ls -la would be written as ["ls", "-la"].
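Because the command is given as a list, it is presumably executed directly rather than through a shell (an assumption based on the list form, not confirmed by the documentation), so shell features such as pipes or wildcards would need an explicit shell invocation:

# Hypothetical example: run a shell pipeline through an explicit shell
command: ["/bin/bash", "-c", "ls -la /tmp | wc -l"]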
The use_dask parameter needs a bit of explanation. In the default mode (false) the script runs in a single worker with access to all the memory and cores requested. Internally the (Python) script can use Dask, multiprocessing, multithreading or any other mechanism, but the resources will be fixed to one worker.
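For instance, with use_dask set to false a script could still parallelise internally with the standard multiprocessing module while staying within the single worker's allotment. A minimal sketch (the workload and pool size are illustrative, not part of the pipeline distribution):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # Size the pool to the cores requested under resources (threads);
    # 4 is just an illustrative value here
    with Pool(processes=4) as pool:
        results = pool.map(square, range(100))
    print(sum(results))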
If use_dask is true then it makes sense to request more workers. The Python script is then required to connect to the Dask scheduler as follows:
import os
from distributed import Client

# Address of the running Dask scheduler, provided via the
# DASK_SCHEDULER_ADDRESS environment variable
DASK_SCHEDULER = os.getenv("DASK_SCHEDULER_ADDRESS")
client = Client(DASK_SCHEDULER)
and performs calculations using the Dask API.
The script dask_pipeline.py demonstrates a complete pipeline using Dask.
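A minimal sketch of such a script, assuming only the DASK_SCHEDULER_ADDRESS environment variable shown above, might look like this (illustrative only; it is not the actual dask_pipeline.py):

import os

from distributed import Client

def square(x):
    return x * x

def main():
    # Connect to the Dask scheduler started for this pipeline
    client = Client(os.getenv("DASK_SCHEDULER_ADDRESS"))

    # Distribute the work across the requested workers and gather results
    futures = client.map(square, range(1000))
    total = sum(client.gather(futures))
    print("Sum of squares:", total)

    client.close()

if __name__ == "__main__":
    main()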