Scientific workflows have become the lingua franca of scientific research for orchestrating the execution of processes and tasks. For instance, workflows are commonly used to document analyses in publications, automate tedious and repetitive tasks, share data analysis protocols, abstract analysis complexity from users, or provide training. Despite their increasing popularity, pipelines remain difficult to create given the complexity of the processes they implement, as well as the intricacies of the underlying platforms on which they run.
Workflows often require combining tools that were designed with rigid input and output formats and requirements. For example, some programs will require that the input file be named using a predetermined literal (ex.: “input.txt”), while other programs will only take input from standard input stream (stdin). As such, being able to combine programs requires not only sufficient familiarity with their dependencies, but also requires understanding of the underlying operating system and the commands needed to comply with the programs’ input, output and execution requirements.
Workflows are most commonly written using general-purpose programming languages (GPPLs). As an example, in bioinformatics, Python and Bash are particularly popular languages for implementing workflows. While GPPLs allow for great flexibility, their lack of support for workflow-specific constructs requires that that the programmer (or domain scientist) manually implements ancillary functionality, such as task synchronization, error control, logging, etc. Furthermore, the lack of portability across platforms and architectures means that a workflow written to run on a server can rarely be run on a High Performance Computing (HPC) cluster without substantial modifications to the code. These limitations have fueled a plethora of new projects which either extend the functionality of GPPLs, or which provide new, domain-specific programming languages (DSLs) that were specifically designed for the implementation of workflows.
Implementing a Pipeline in Pseudocode
In what follows, we will see how a simple workflow using two programs can be implemented using pseudocode: a human-readable notation resembling a GPPL. We will then build upon this simple program by iteratively adding pseudocode to expand the workflow’s scope. We will conclude this blogpost by recognizing (1) the complexity associated with using GPPLs to implement workflows, and (2) the benefits of using a domain-specific language (such as NextFlow). A more advanced video tutorial of NextFlow will be provided in a follow-up to this blog.
The workflow we will implement uses two made up programs: estimateDistribution and combineDistributionEsimates. The first program, estimateDistribution, takes as input a text file of raw data for a geographical location, ex. HAWAII.TXT, and produces another text file of processed data, ex. HAWAII_PROCESSED.TXT. An example call of estimateDistribution could look like the following:
The second program, combineDistributionEstimates, takes any number of processed data files, ex. HAWAII_PROCESSED.TXT, TAIWAN_PROCESSED.TXT and AUSTRALIA_PROCESSED.TXT, and generate a final output FINAL_OUTPUT.TXT. An example call of combineDistributionEsimates could look like the following:
The pseudocode describing a workflow for running estimateDistribution and combineDistributionEsimates on four input files: HAWAII_PROCESSED.TXT, TAIWAN_PROCESSED.TXT, AUSTRALIA_PROCESSED.TXT and JAPAN_PROCESSED.TXT is given in Example 1. A graphical representation of the workflow is given in Figure 1.
In lines 1, 2 and 3, we declare the list of inputs to and outputs of estimateDistribution. In lines 6, 7 and 8, we run estimateDistribution over each of the input files in Input_names_list and generate an output according to the naming scheme in Output_names_list. Line 10 runs the second program, combineDistributionEsimates, on all the files described in Output_names_list.
Implementing Workflow-Specific Functionality
While the workflow described in Example 1 can be viable, there are some enhancements which are not strictly necessary, but which need to be addressed to avoid runtime errors, improve efficiency and enhance cross-platform portability.
These enhancements are addressed as follows:
1- First, we need to ascertain that the input files do exist; otherwise, we need to stop the execution of the program. This functionality can be implemented in pseudocode as shown in lines 5-7 of Example 2.
2- Workflows written in traditional programming languages cannot by default resume in case of errors. For instance, in the Example 1, an error during the execution of combineDistributionEsimates would require re-running the pipeline from the beginning, even if estimateDistribution complete successfully. In a large workflow comprising dozens of steps and hundreds of input files, requiring a complete rerun of a workflow due to an error that occurred in the last step would be an utter waste of time and resources. Luckily, a check to only rerun programs that did complete successfully can be implemented using pseudocode, as shown in line 19 of Example 2.
3- When using a traditional programming language, any assumptions about the underlying execution environment need to be explicit. For instance, additional pseudocode is required for running the workflow on an HPC cluster where the inputs to estimateDistribution can be run in parallel on different nodes. Furthermore, to achieve a high level of portability, one needs to explicitly account for different HPC job schedulers, such as SLURM, PBS, SGE, and LSF. This additional functionality is illustrated in lines 8-13 of Example 2.
4- We need to manually synchronize execution of the program to ascertain that combineDistributionEsimates does not start before all four executions of estimateDistribution have completed. We can achieve this by adding pseudocode that will pause the workflow execution until estimateDistribution has completed running on all inputs. The code is illustrated in line 23 of Example 2, Updated Workflow from Example 1: the new pseudocode accounts for synchronization.
Workflow Frameworks: Facilitating the Implementation and Execution of Workflows
These four simple enhancements make it abundantly clear that despite the simplicity of the original workflow, the complexity associated with ancillary workflow tasks such as synchronization, fault-tolerance, cross-platform portability, parallelization, etc. can quickly bloat workflows with technical and extraneous functionality that renders the workflows difficult to understand, extend and/or maintain by third-party users. For these reasons, numerous efforts have undertaken the provision of libraries to facilitate the implementation of workflows. There are in fact dozens, if not hundreds of such libraries; see for instance https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems for a non-exhaustive list. Workflow Libraries differ principally on: 1- programming languages used, 2- philosophy, and 3- comprehensiveness of the provided functionality. We tackle these three points in what follows.
1- Languages supported.
Although workflow libraries exist for a plethora of programming languages, they are particularly popular for interpreted languages, such as Python, Ruby and Groovy. Preferences for frameworks are often biased by one’s programming expertise.
To make workflow implementation accessible to non-programmers who do not wish to learn a programming language for the sole purpose of writing a workflow, an increasing number of frameworks use domain-specific languages (DSL) instead. In short, DSL-based frameworks define their own languages, which comprise easy to use language constructs built on small syntax and specifically tailored for describing common workflow tasks. A particularly interesting DSL-based framework is the BigDataScript which, in my opinion, elegantly combines language expressiveness and power in an easy to master DSL.
The term philosophy is used here as a portmanteau for various characteristics. For example: Are the inputs implicitly or explicitly determined? Is the workflow declaration coupled with its execution (as in Ruffus) or is the workflow described in configuration files (as in NSF’s Pegasus framework)? For more info on philosophy-related issues, see Jeremy Liepzig’s review of “bioinformatic pipeline frameworks”.
Another key difference between workflow frameworks is the extent of the functionality that is supported “out of the box.” For instance, does the framework support easy parallelization of independent tasks? Is it portable across architectures such as docker, cloud and HPC environments? Does it gracefully handle errors and does it keep track of what tasks were successfully run?
As mentioned above, there are over a hundred workflow frameworks, with over a dozen for Python alone. Identifying the perfect framework for a project or an organization is no easy feat and depends on various parameters, not the least of which is the user’s (implementer’s) programming or scripting expertise.
A Sample Workflow Using NextFlow
In what follows I have chosen to introduce NextFlow (https://www.nextflow.io/), an expressive, versatile and particularly comprehensive DSL framework for composing and executing workflows. I chose to focus on NextFlow since it is a DSL which also supports the full syntax and semantics of Groovy, a dynamic language that runs on the Java platform. NextFlow’s support for a full-fledged programming language provides it with an extra edge that other DSL frameworks are missing. From a software engineering perspective, NextFlow seems to be very well thought out and very well supported. Furthermore, NextFlow uses the UNIX pipe concept, which simplifies writing parallel, scalable and portable pipelines.
The workflow in Example 2 can be described in NextFlow as follows:
In lines 1-6, we declare our input files which are loaded in a Channel – think of it as a queue – which we call inputFilesChan. Note that each of the items in the queue is of type file. This informs NextFlow to “automagically” stage files as necessary under different architectures.
A less verbose declaration can be achieved using the following syntax:
In lines 8-17, we declare a NextFlow process: a workflow unit. Line 8 declares the process name. Lines 9 and 10 indicate that the list of inputs to this process will be read from the Channel named inputFilesChan.
Just as we can build Channels, or queues, to store input files, we can also do the same for output files. Here, lines 11 and 12 create a channel of outputs that store the list of files generated from the run of the process. Note that since the inputs are of type file, we can use Groovy to request the baseName of each file (base name of “HAWAII.TXT” is “HAWAII”) as is done in line 12. We can then use the base name to build an output name on the fly by appending “_PROCESSED.TXT” to it.
For each file, we run the estimateDistribution program. The command line executed is defined in line 15, enclosed by “”” as the syntax of NextFlow’s process requires.
Lines 19 to 26 declare the second process of our pipeline which is concerned with the execution combineDistributionEsimates. Since the latter requires all input to be passed as list, we use Groovy to convert our output Channel from the previous process. This is done using the toList() Groovy method.
The output of the NextFlow script in Example 3 is given below.
NextFlow submits four runs of the estimateDistribution process and one run of the combineDistributionEsimates process. Although we did not explicitly implement any of the ancillary functionality (pseudocode in red from Example 2), handling and resuming from errors, synchronization, and running on different environments is automatically processed by NextFlow. For instance, let’s erroneously misspell the input HAWAII.TXT to KAWAII.TXT. After rerunning NextFlow on our input workflow, we obtain the output below.
As was the case with the first example, NextFlow schedules and submits four runs of the estimateDistribution process. Each of these runs is associated with one of the four input files. Immediately after submitting the jobs, NextFlow reports an ERROR ~ ‘Error executing process > ‘estimateDistribution (1)‘. The “Command error” section of the output shows that the program estimateDistribution could not locate the file KAWAII.TXT. Note however that although the error occurred on the first element of the list, NextFlow continued the execution for the remaining inputs of estimateDistribution. After correcting the erroneous file name form KAWAII.TXT to HAWAII.TXT and rerunning the workflow using the –resume option, NextFlow output is:
We see from the output that estimateDistribution (1) – the (1) here refers to the first element of the input files’ Channel – that the first input was submitted whereas the other inputs (2,3 and 4), which were already cached were not resubmitted. We did not need to add code to explicitly check for existing files as NextFlow automatically handles that for us.
A comprehensive workflow DSL can also facilitate portability of a workflow across platforms. For instance, when using NextFlow, only a few lines of code are required to run a pipeline on a HPC cluster environment. For example, the following lines allow us to configure our workflow for execution on the University of Hawaii’s HPC.
This configuration file instructs NextFlow to run both estimateDistribution and combineDistributionEsimates using SLURM on the cluster’s “community.q” queue on the cluster. The estimateDistribution and combineDistributionEsimates processes use 4 CPUs and 1 CPU respectively. It’s that simple!
The output looks identical to that obtained when running locally. However, retrieving the history of jobs on the cluster shows that two jobs “nf-estima+” and “nf=combin+” were both executed by the (SLURM) job scheduler.
The goal here is not to publicize NextFlow, but rather to show how a workflow DSL can streamline the implementation of complex functionality. The choice of infrastructure and products plays a critical role in helping scientists implement best practices of research transparency and reproducibility, particularly under the new paradigm of data-intensive science. Using appropriate scientific workflow frameworks can alleviate this burden by allowing researchers to abstract, manage and share complex scientific analyses with minimum effort.
The next installment of this two-part blogpost will give a brief tutorial on using NextFlow for the implementation of workflows, with a specific emphasis on bioinformatics and the Docker platform.
Thanks enormously to Dr. Mahdi Belcaid for this careful exposition and insight into how new tools can be used to streamline workflow production. It’s a major long-term goal of EarthCube, CRESCYNT, and many researchers to be able to assemble computational workflows more easily and efficiently, and we’re grateful for the vision Mahdi has shared here. Look for his follow-up screencast tutorial in a subsequent blogpost.