.. _shellpipe:

Shell Pipeline
==============

The shell pipeline executes an arbitrary command, typically a shell or a Python
script. The pipeline definition file is shown below (it can also be obtained by
typing ``owl pdef get shell`` -- see :ref:`piperun`).

.. code-block:: yaml

    version: 1.2
    name: shell

    # Directory where the command is writing data to (optional)
    # output_dir: /tmp/output

    command: ["sleep", "300"]

    # use dask (optional, see docs)
    # use_dask: false

    resources:
      workers: 1
      memory: 8
      threads: 1

The ``output_dir`` parameter is optional; if specified, the scheduler will save
there a log file, the configuration used to run the pipeline and a list of
environment variables.

The ``command`` parameter defines which command or script to run. The example
above just waits for 5 minutes. Other command examples:

.. code-block:: yaml

    # Execute a Python script
    command: ["/opt/conda/bin/python", "script.py"]

    # Execute a script that takes two arguments as input
    command: ["/home/eglez/scripts/script.sh", "100", "200"]

Note that the path to the executable must be given in full and that the command
is a list containing all parts of the command, i.e. ``ls -la`` would be written
as ``["ls", "-la"]``.

The ``use_dask`` parameter needs a bit of explanation. In the default mode
(``false``) the script runs in a single worker with access to the memory and
cores requested. Internally the (Python) script can use Dask, multiprocessing,
multithreading or any other mechanism, but the resources are fixed to that one
worker.

If ``use_dask`` is ``true`` it makes sense to request more workers. The Python
script is then required to connect to the Dask scheduler as follows:

.. code-block:: python

    import os

    from distributed import Client

    # The scheduler address is provided via an environment variable
    DASK_SCHEDULER = os.getenv("DASK_SCHEDULER_ADDRESS")
    client = Client(DASK_SCHEDULER)

and to perform its calculations using the Dask API. The script
:download:`dask_pipeline.py <./dask_pipeline.py>` demonstrates a full script
using Dask.
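
For illustration, a minimal sketch of such a script is shown below. The
``square`` function and the computation are hypothetical and are not taken from
``dask_pipeline.py``; the only assumption carried over from above is that the
scheduler address is available in ``DASK_SCHEDULER_ADDRESS``.

.. code-block:: python

    import os

    from distributed import Client


    def square(x):
        """Toy computation standing in for real work."""
        return x * x


    def main():
        # Connect to the Dask scheduler started by the pipeline runner
        client = Client(os.getenv("DASK_SCHEDULER_ADDRESS"))

        # Distribute the work across the requested workers and gather results
        futures = client.map(square, range(100))
        results = client.gather(futures)
        print(sum(results))


    if __name__ == "__main__":
        main()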
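
A matching pipeline definition for a Dask-based script could look like the
sketch below. The worker, memory and thread values are only illustrative, not
recommendations.

.. code-block:: yaml

    version: 1.2
    name: shell

    command: ["/opt/conda/bin/python", "dask_pipeline.py"]

    use_dask: true

    resources:
      workers: 4
      memory: 8
      threads: 2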