This two-part blog aims to help users understand Nextflow’s powerful caching mechanism. Part one describes how it works, whilst part two focuses on execution provenance and troubleshooting. You can read part two here.
Task execution caching and checkpointing is an essential feature of any modern workflow manager, and Nextflow provides an automated caching mechanism with every workflow execution. When using the -resume flag, successfully completed tasks are skipped and the previously cached results are used in downstream tasks. However, understanding exactly how it works, and debugging situations where the behaviour is not as expected, is a common source of frustration.
The mechanism works by assigning a unique ID to each task. This unique ID is used to create a separate execution directory, called the working directory, where the tasks are executed and the results stored. A task’s unique ID is generated as a 128-bit hash number obtained from a composition of the task’s:
The -resume command line option allows for the continuation of a workflow execution. It can be used in its most basic form with:
$ nextflow run nextflow-io/hello -resume
In practice, every execution starts from the beginning. However, when using resume, before launching a task, Nextflow uses the unique ID to check if:
If these conditions are satisfied, the task execution is skipped and the previously computed outputs are reused. When a task requires recomputation, i.e. the conditions above are not fulfilled, the downstream tasks are automatically invalidated.
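The lookup described above (task ID, then working directory, then a decision to skip or rerun) can be sketched in Python. This is an illustration, not Nextflow's implementation: MD5 stands in for the 128-bit hash, and the directory layout, the `.exitcode` marker file name and the three checks in `can_skip` are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def task_id(*parts):
    """Return a 128-bit hex ID for a task (MD5 used as a stand-in hash)."""
    digest = hashlib.md5()
    for part in parts:
        digest.update(str(part).encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()  # 32 hex characters = 128 bits


def work_dir_for(base, tid):
    """Map an ID to a directory; the 2/30 character split is an assumed layout."""
    return Path(base) / tid[:2] / tid[2:]


def can_skip(work_dir, expected_outputs):
    """Assumed checks: directory exists, exit code is 0, outputs are present."""
    exit_file = work_dir / ".exitcode"  # assumed marker file name
    if not work_dir.is_dir() or not exit_file.is_file():
        return False
    if exit_file.read_text().strip() != "0":
        return False
    return all((work_dir / name).exists() for name in expected_outputs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        # Hypothetical task: one input file name plus its command string.
        wd = work_dir_for(base, task_id("sample.fq", "fastqc sample.fq"))
        print(can_skip(wd, ["report.html"]))  # nothing cached yet
        wd.mkdir(parents=True)
        (wd / ".exitcode").write_text("0")
        (wd / "report.html").write_text("ok")
        print(can_skip(wd, ["report.html"]))  # cached result is reusable
```

Because the directory is derived from the ID, any change to what goes into the ID points the task at a different, empty directory, which is why it has to run again.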
By default, the task work directories are created in the directory from which the pipeline is launched. This is often a scratch storage area that can be cleaned up once the computation is completed. A different location for the execution work directory can be specified using the command line option -w, e.g.:
$ nextflow run <script> -w /some/scratch/dir
Note that if you delete or move the pipeline work directory, you will not be able to use the resume feature in subsequent runs.
Also note that the pipeline work directory is intended to be used as a temporary scratch area. The final workflow outputs are expected to be stored in a different location specified using the publishDir directive.
The hash provides a convenient way for Nextflow to determine if a task requires recomputation. For each input file, the hash code is computed with:
Therefore, even just performing a touch on a file will invalidate the task execution.
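To see why a touch is enough, here is a minimal Python sketch of a metadata-based file key. It assumes the key is built from the file's absolute path, size and last-modified timestamp, and it uses MD5 as a stand-in hash; the function names are hypothetical and this is not Nextflow's own code.

```python
import hashlib
import os
import tempfile


def file_key(path):
    """Hash of assumed metadata: absolute path, size and modification time."""
    info = os.stat(path)
    meta = f"{os.path.abspath(path)}|{info.st_size}|{info.st_mtime_ns}"
    return hashlib.md5(meta.encode("utf-8")).hexdigest()


def touch_changes_key():
    """Return True if bumping a file's timestamp changes its key."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "reads.fq")
        with open(path, "w") as handle:
            handle.write("ACGT\n")
        before = file_key(path)
        info = os.stat(path)
        # Emulate `touch`: same content, modification time moved forward by 60 s.
        os.utime(path, (info.st_atime, info.st_mtime + 60))
        return before != file_key(path)


if __name__ == "__main__":
    print(touch_changes_key())
```

The file content is never read here, so the key is cheap to compute even for very large inputs. The trade-off is that a change in metadata alone looks like a change in the input.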
It is good practice to organize each experiment in its own folder. An experiment’s input parameters can be specified using a Nextflow config file which also makes it simple to track and replicate an experiment over time. Note that you should avoid launching two (or more) Nextflow instances in the same directory concurrently.
The nextflow log command lists the executions run in the current folder:
$ nextflow log
TIMESTAMP            DURATION  RUN NAME          STATUS  REVISION ID  SESSION ID                            COMMAND
2019-05-06 12:07:32  1.2s      focused_carson    ERR     a9012339ce   7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run hello
2019-05-06 12:08:33  21.1s     mighty_boyd       OK      a9012339ce   7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run rnaseq-nf -with-docker
2019-05-06 12:31:15  1.2s      insane_celsius    ERR     b9aefc67b4   4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf
2019-05-06 12:31:24  17s       stupefied_euclid  OK      b9aefc67b4   4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf -resume -with-docker
You can use the -resume option with the session ID to recover a specific execution. For example:
$ nextflow run rnaseq-nf -resume 4dc656d2-c410-44c8-bc32-7dd0ea87bebf
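If you would rather not copy session IDs by hand, a small helper can pull one out of the log listing by run name. The sketch below is a hypothetical convenience script, not part of Nextflow. It assumes the listing looks like the one above, with whitespace-separated columns and session IDs in UUID format.

```python
import re

# Matches a session ID in the UUID format used in the log listing.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)


def session_id_for(log_text, run_name):
    """Return the session ID on the log line that mentions run_name, or None."""
    for line in log_text.splitlines():
        if run_name in line.split():
            match = UUID_RE.search(line)
            if match:
                return match.group(0)
    return None


# Two rows copied from the listing, used as sample input.
SAMPLE_LOG = """\
2019-05-06 12:07:32  1.2s  focused_carson    ERR  a9012339ce  7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run hello
2019-05-06 12:31:24  17s   stupefied_euclid  OK   b9aefc67b4  4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf -resume -with-docker
"""

if __name__ == "__main__":
    sid = session_id_for(SAMPLE_LOG, "stupefied_euclid")
    print(f"nextflow run rnaseq-nf -resume {sid}")
```

Matching on the UUID pattern rather than on a column position keeps the helper working even though the RUN NAME and TIMESTAMP headers themselves contain spaces.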
Stay tuned for part two where we will discuss resume in more detail with respect to provenance and troubleshooting techniques!