.. _shellpipe:

Shell Pipeline
==============

The shell pipeline executes an arbitrary command, typically a shell or a Python
script. The pipeline definition file is shown below (it can also be obtained by
typing ``owl pdef get shell`` -- see :ref:`piperun`).

.. code-block:: yaml

    version: 1.2
    name: shell

    # Directory where the command is writing data to (optional)
    # output_dir: /tmp/output

    command: ["sleep", "300"]

    # use dask (optional, see docs)
    # use_dask: false

    resources:
      workers: 1
      memory: 8
      threads: 1

The ``output_dir`` parameter is optional; if specified, the scheduler will save
there a log file, the configuration used to run the pipeline and a list of
environment variables.

The ``command`` parameter defines which command or script to run. The example
above just waits for 5 minutes. Other command examples:

.. code-block:: yaml

    # Execute a Python script
    command: ["/opt/conda/bin/python", "script.py"]

    # Execute a script that takes two arguments as input
    command: ["/home/eglez/scripts/script.sh", "100", "200"]

Note that the path to the executable must be given in full and that the command
is a list containing all parts of the command, i.e. ``ls -la`` would be written
as ``["ls", "-la"]``.

The ``use_dask`` parameter needs a bit of explanation. In the default mode
(``false``) the script runs in a single worker with access to the memory and
cores requested. Internally the (Python) script can use Dask, multiprocessing,
multithreading or any other mechanism, but the resources are fixed to that one
worker.

If ``use_dask`` is ``true`` it makes sense to request more workers. The Python
script is then required to connect to the Dask scheduler as follows:

.. code-block:: python

    import os

    from distributed import Client

    # The scheduler address is provided via an environment variable
    DASK_SCHEDULER = os.getenv("DASK_SCHEDULER_ADDRESS")
    client = Client(DASK_SCHEDULER)

and to perform its calculations using the Dask API. The script
:download:`dask_pipeline.py <./dask_pipeline.py>` demonstrates a full script
using Dask.
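
For illustration, a minimal sketch of such a script is shown below. The
``square`` function and the computation are hypothetical and are not taken from
``dask_pipeline.py``; the only assumption carried over from above is that the
scheduler address is available in ``DASK_SCHEDULER_ADDRESS``.

.. code-block:: python

    import os

    from distributed import Client


    def square(x):
        """Toy computation standing in for real work."""
        return x * x


    def main():
        # Connect to the Dask scheduler started by the pipeline runner
        client = Client(os.getenv("DASK_SCHEDULER_ADDRESS"))

        # Distribute the work across the requested workers and gather results
        futures = client.map(square, range(100))
        results = client.gather(futures)
        print(sum(results))


    if __name__ == "__main__":
        main()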
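
A matching pipeline definition for a Dask-based script could look like the
sketch below. The worker, memory and thread values are only illustrative, not
recommendations.

.. code-block:: yaml

    version: 1.2
    name: shell

    command: ["/opt/conda/bin/python", "dask_pipeline.py"]

    use_dask: true

    resources:
      workers: 4
      memory: 8
      threads: 2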