Using RStudio with GitHub for Securing Your Work, Versioning, and Collaboration


RStudio and GitHub go together like two wheels of a bicycle. Together, they form a low-overhead yet powerful open source workbench – a lean machine that can help take your data to far places.

In a recent informal evaluation of coral reef related research articles that included simultaneous publication of code and data, by far the most popular language used was R, and RStudio is the most popular interface for working with R.

In a CRESCYNT Data Science for Coral Reefs: Data Integration and Team Science workshop held earlier this year at NCEAS, the most powerful skill introduced was using RStudio with GitHub: writing data and code to GitHub from RStudio.

Once the link is set up, work continues in RStudio much as before; you simply make commits and push them to GitHub periodically to save your work and pave the way for collaboration. The basic setup:

  1. Download and install R
  2. Download and install RStudio
  3. Create a GitHub account
  4. Connect a repository in the GitHub account to RStudio. This takes multiple steps; here are some good options to work through the process (one scripted route is also sketched just after this list).
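If you prefer to script the setup rather than point and click through it, the usethis package (not covered in the tutorials above, so treat this as one possible route) can do most of the work from the R console. A minimal sketch, assuming usethis is installed and your GitHub credentials are already configured; the repository name is a placeholder:

    # install.packages("usethis")   # once, if needed
    library(usethis)

    # Option A: start from an existing GitHub repository
    # (clones it and opens it as a new RStudio project)
    create_from_github("your-github-username/your-repo")

    # Option B: start from a local RStudio project
    use_git()      # put the current project under Git version control
    use_github()   # create a matching GitHub repository and link it as the remote

Either way, RStudio's Git pane then shows Commit, Pull, and Push buttons for day-to-day use.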

You can use sections of NCEAS’s long tutorial on Introduction to Open Data Science, initially developed by their Ocean Health Index group. Use the sections on the R and RStudio overview, Markdown, and the intro to GitHub, then skip down to the section on collaboration in GitHub.

A number of other tutorials show how to set up and use these tools together; a beautifully clean and clear step-by-step tutorial is from Resources at GitHub; another excellent one is from Support at RStudio.

Also available to you: Hadley Wickham on Git and GitHub, a Study Group’s Version Control with RStudio and GitHub Simply Explained, R Class’s An Introduction to Git and How to Use it with RStudio, U Chicago’s Using Git within RStudio, and Happy Git’s Connect RStudio to Git and GitHub. You may prefer the style of one of these over the others.

If later you want to go further, come back for these tutorials hosted at R-Bloggers: R Blogdown Setup in GitHub, and Migrating from GitHub to GitLab with RStudio. And good news – you can now archive a snapshot of a GitHub repository to preserve and even publish a particular version of your RStudio work – plus get a DOI to share – at Zenodo.

Summary: Many research scientists use RStudio as their primary analytical and visualization tool. RStudio can now connect to a GitHub repository and make commits to it directly. This permits critical core functions for a simplified workbench: documenting workflows (R Markdown), preserving code and provenance, producing repeatable results, creating flexible pipelines, sharing data and code, and allowing collaboration among members of a team. Versioning and teamwork are simplified by making commits frequently and always doing a fresh pull before committing (rather than focusing on branch development). The process is valuable for individual researchers, for documenting project work, and for collaborating in teams.
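For the pull-then-commit rhythm described above, the buttons in RStudio's Git pane are all you need. If you would rather script the same cycle from the R console, the gert package (a newer package, not mentioned in the tutorials above) offers one way to do it. A minimal sketch, assuming the project is already a Git repository connected to GitHub:

    library(gert)

    git_pull()                                    # always start from the latest shared copy
    git_add(".")                                  # stage everything that changed
    git_commit("Describe what changed and why")   # commit with a meaningful message
    git_push()                                    # send the commit up to GitHub

Committing small, frequent changes with clear messages is what makes the GitHub history useful later.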

Related blogposts: Learning to Love R More and R Resources for Graphing and Visualization.



CRESCYNT Data Science For Coral Reefs Workshop 2 – Data Integration and Team Science


We’re extremely pleased to be able to offer two workshops in March 2018 at NCEAS. The second is CRESCYNT Data Science for Coral Reefs Workshop 2: Data Modeling, Data Integration and Team Science. Apply here.

When: March 12-15, 2018
Where: NCEAS, Santa Barbara, CA

Workshop description:

This workshop is recommended for early to mid-career and senior scientists with an interest in applying technical skills to collaborative research questions and committed to subsequently sharing what they learn. Participants will learn how to structure and combine heterogeneous data sets relevant to coral reef scientists in a collaborative way. Days 1 and 2 of the workshop will cover reproducible workflows using R/RStudio and RMarkdown, collaborative coding with GitHub, strategies for team research, data modeling and data wrangling, and advanced data integration and visualization tools. Participants will also spend 2 days working in small teams to integrate various coral reef datasets to practice the skills learned and develop workflows for data tidying and integration.

The workshop is limited to 20 participants. We encourage you to apply via this form. Workshop costs will be covered with support from the NSF EarthCube CRESCYNT RCN. We anticipate widely sharing workshop outcomes, including workflows and recommendations. Anticipate some significant pre-workshop prep effort.

Related posts: Learning to Love R More and R Resources for Visualization

UPDATE: Here is the agenda for the workshop, with training links.



CRESCYNT Toolbox – Data Cleaning


Data cleaning. Data cleansing. Data preparation. Data wrangling. Data munging.

Garbage In, Garbage Out.

If you’re like most people, your data is self-cleaning, meaning: you clean it yourself! We often hear that 80% of our “data time” is spent in data cleaning to enable 20% in analysis. Wouldn’t it be great to work through data prep faster and keep more of our data time for analysis, exploration, visualization, and next steps?

Here we look over the landscape of tools to consider, then come back to where our feet may be right now to offer specific suggestions for workbook users – lessons learned the hard way over a long time.

The end goal is for our data to be accurate, human-readable, machine-readable, and calculation-ready.

Software for data cleaning:

RapidMiner may be the best free (for academia) non-coding tool available right now. It was built for data mining, which doesn’t have to be your purpose for it to work hard for you. It has a diagram interface that’s very helpful. It almost facilitates a “workflow discovery” process as you incrementally try, tweak, build, and re-use workflow paths that grow during the process of data cleaning. It makes quick work of plotting histograms for each data column to instantly SEE distributions, zeros, outliers, and number of valid entries. It also records and tracks commands (like a baby Jupyter notebook). When pulling in raw datasets, it automatically keeps the originals intact: RapidMiner makes changes only to a copy of the raw data, and then one can export the finished files to use with other software. It’s really helpful in joining data from multiple sources, and pulling subsets for output data files. Rapid Miner Studio: Data Prep.

R is popular in domain sciences and has a number of powerful packages that help with data cleaning. Make use of RStudio as you clean and manipulate data with dplyr and tidyr. New packages are frequently released, such as assertr, janitor, and dataMaid. A great thing about R is its active community that supports learning. Check out this swirl tutorial on Getting and Cleaning Data – or access it through DataCamp. The most comprehensive list of courses on R for data cleaning is here via R-bloggers. There’s lovely guidance for data wrangling in R by Hadley Wickham – useful even outside of R.
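To make that concrete, here is a minimal sketch of a typical cleaning pass with readr, dplyr, tidyr, and janitor. The file name and column names are invented for illustration; substitute your own:

    library(readr)
    library(dplyr)
    library(tidyr)
    library(janitor)

    reef <- read_csv("site_survey_2017.csv") %>%   # hypothetical raw export
      clean_names() %>%                            # consistent, machine-friendly column names
      mutate(temp_c = as.numeric(temp_c)) %>%      # enforce one data type per column
      filter(!is.na(site_id)) %>%                  # drop rows with no site identifier
      distinct()                                   # remove exact duplicate rows

    # Reshape from wide to long so each row is one observation
    reef_long <- reef %>%
      pivot_longer(starts_with("cover_"),
                   names_to = "taxon", values_to = "percent_cover")

    summary(reef_long)   # quick check of ranges, NAs, and obvious outliers

Keeping steps like these in a script (or an R Markdown document) makes the cleaning documented and repeatable, which is what the spreadsheet Changes page described below tries to approximate by hand.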

Data cleaning tool recommendations by KDnuggets, Quora, and Varonis are a little dated and business-oriented, but these survivors may be worth investigating:

  • Trifacta Wrangler was built for desktop use, and designed for many steps of data wrangling: cleaning and beyond. See intro video, datasheet, demo with Tableau.
  • DataCleaner – community or commercial versions; can use SQL databases. Mostly designed for business applications; videos show what it can do.
  • OpenRefine gets the legacy spotlight (was Google Refine… now community held). Free, open source, and still in use. Here’s a recent walkthrough. Helps fix messy text and categorical data; less useful for other science research data.

There are some great tools to steal (er, borrow) that started in data journalism:

  • Tabula is “a tool for liberating data tables trapped inside PDF files” – extracts text-based pdfs (not scans) to data tables.
  • csvkit is “a suite of command-line tools for converting to and working with CSV, the king of tabular file formats.” Helpful for converting Excel to csv cleanly, csv to json, json to csv, working with sql, and more.
  • agate is “a Python data analysis library that is optimized for humans instead of machines…. an alternative to numpy and pandas that solves real-world problems with readable code.” Here’s the cookbook.

Finally, Python itself is clearly a very powerful open source tool available for data cleaning. Look into it with this DataCamp course, pandas and other Python libraries, or this Kaggle competition walkthrough.

Manual Data Munging. If you’re using Excel, Open Office, or Google Sheets to clean your data (e.g., small complex datasets common to many kinds of research), you may know all the tricks you need. For those newer to data editing, here are some tips.

  • To start: save a copy of your original file with a new name (e.g., tack on “initials-mod” plus the current date: YYYYMMDD). Then make your original file read-only to protect it. Pretend it’s in an untouchable vault. Use only your modifiable copy.
  • Create a Changes page where you record the edits you make in the order you make them. This also lets you scribble notes for changes you plan to make or items you need to track down but haven’t yet executed (Done and To-Do lists).
  • First edit: if you don’t have a unique ID for each row, add a first column with a simple numeric sequence before doing anything else.
  • Create a copy of that spreadsheet page, leave the original intact, and make modifications only to the newly copied page. If each new page is created on the left, the older pages are allowed to accumulate to the right (less vulnerable to accidental editing). Name each tab usefully.
  • Second edit: if your column headings take up more than one row, consolidate that information to one row. Do not Merge cells. Include units but no special characters or spaces: use only letters, numbers, dashes and underlines.
  • Add a Data Definitions page to record your old column headings, your new column headings, and explain what each column heading means. Include units here and also in column headings where possible.
  • In cells with text entries, do not use bare commas. Either use semicolons and dashes instead of commas in your text, or enclose text entries in quotation marks (otherwise stray commas create havoc when exporting to and importing from CSV).
  • Add a Comments column, usually at the end of other columns, to record any notes that apply to individual rows or a subset of rows. Hit Save, now and often.
  • Now you’re free to sort each column to find data entry typos (e.g., misplaced decimals), inconsistent formats, or missing values. The danger here is failing to select the entire spreadsheet before sorting – always select the square northwest of cell A1 (or convert the spreadsheet to a table). This is where you’ll be glad you numbered each row at the start: to compare with the original.
  • If there’s a short note like data source credit that MUST accompany the page and must not get sorted, park it in the column header row to the right of the meaningful headers so it won’t get sorted, lost, or confused with actual data.
  • If you use formulas, document the formulas in your Data Definitions page (replace cells with column_names), and copy-paste as value-only as soon as practical.
  • Make sure there is only one kind of data in each column: do not mix numeric and text entries. Instead, create extra columns if needed.
  • Workbooks should be saved each day of editing with that day’s date (as YYYYMMDD) as part of the filename so you can get back to an older copy. At the end of your session clean up your Changes page, moving To-Do to Done and planning next steps.

Find more spreadsheet guidance here (a set of guidelines recently developed for participants in another project – good links to more resources at its end).

Beyond Workbooks. If you can execute and document your data cleaning workflows in a workbook like Excel, Open Office, or Google Sheets, then you can take your data cleaning to the next level. Knowing steps and sequences appropriate for your specific kinds of datasets will help enormously when you want to convert to using tools such as RapidMiner, R, or Python that can help with some automation and much bigger datasets.
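As a small illustration of that next level, the checks you would otherwise do by eye in a spreadsheet (required columns, missing values, plausible ranges) can be written down once and rerun whenever the data change. Here is a minimal sketch using the assertr package mentioned above; the file and column names are again hypothetical:

    library(dplyr)     # for the pipe
    library(readr)
    library(assertr)

    read_csv("site_survey_2017_clean.csv") %>%             # the cleaned export from your workbook
      verify(has_all_names("site_id", "date", "percent_cover", "temp_c")) %>%
      assert(not_na, site_id, date) %>%                     # key columns must be complete
      assert(within_bounds(0, 100), percent_cover) %>%      # percent cover must stay in range
      insist(within_n_sds(3), temp_c)                       # flag extreme temperature outliers

If any check fails, assertr stops with a report of the offending rows, so problems surface before analysis rather than after.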

Want more depth? Check out Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders. “If you are not good at data preparation, you are NOT a good data scientist…. The validity of any analysis is resting almost completely on the preparation.” – Claudia Perlich

Happy scrubbing! Email or comment with your own favorite tips. Cheers, Ouida Meier

 



CRESCYNT Toolbox – Discovery of Online Datasets

[Screenshot: data discovery at cinergi.sdsc.edu]

Announcing recent progress for data discovery in support of coral reef research!

Take advantage of this valuable community resource: a data discovery search engine with a special nose for locating coral reef research data sources: cinergi.sdsc.edu.

A major way CRESCYNT has made progress is by serving as a collective coral reef use case for EarthCube groups that are building great new software tools. One of those is a project called CINERGI. It registers resources – especially online repositories and individual online datasets, plus documents and software tools – and then enriches the descriptors to make the resources more searchable. The datasets themselves stay in place: a record of each dataset’s location and description is registered and augmented for better finding and filtering. Registered datasets and other resources, of course, keep whatever access and use license their authors have given them.

CINERGI already has over a million data sources registered, and over 11,000 of these are specifically coral reef datasets and data repositories. The interface now also features a geoportal to support spatial search options.

The CINERGI search tool is now able to incorporate ANY online resources you wish, so if you don’t find your favorite resources or want to connect your own publications, data, data products, software, code, and other resources, please contribute. If it’s a coral-related resource, be sure to include the word “coral” somewhere in your title or description so it can be retrieved that way later as well. (Great retrieval starts with great metadata!)

To add new resources: Go to cinergi.sdsc.edu, and click on CONTRIBUTE. Fill in ESPECIALLY the first fields – title, description, and URL – then as much of the rest as you can.

Try it out!

Thanks to EarthCube, the CINERGI Data Discovery Hub, and the great crew at the San Diego Supercomputer Center and partners for making this valuable tool possible for coral reef research and other geoscience communities. Here are slides and a video to learn more.

 



A Brief Introduction to Scientific Workflows – by Mahdi Belcaid

Introduction

Scientific workflows have become the lingua franca of scientific research for orchestrating the execution of processes and tasks. For instance, workflows are commonly used to document analyses in publications, automate tedious and repetitive tasks, share data analysis protocols, abstract analysis complexity from users, or provide training. Despite their increasing popularity, pipelines remain difficult to create given the complexity of the processes they implement, as well as the intricacies of the underlying platforms on which they run.

Workflows often require combining tools that were designed with rigid input and output formats and requirements. For example, some programs require that the input file be named using a predetermined literal (e.g., “input.txt”), while other programs only take input from the standard input stream (stdin). As such, combining programs requires not only sufficient familiarity with their dependencies, but also an understanding of the underlying operating system and the commands needed to comply with the programs’ input, output, and execution requirements.

Workflows are most commonly written using general-purpose programming languages (GPPLs). As an example, in bioinformatics, Python and Bash are particularly popular languages for implementing workflows. While GPPLs allow for great flexibility, their lack of support for workflow-specific constructs requires that the programmer (or domain scientist) manually implement ancillary functionality, such as task synchronization, error control, logging, etc. Furthermore, the lack of portability across platforms and architectures means that a workflow written to run on a server can rarely be run on a High Performance Computing (HPC) cluster without substantial modifications to the code. These limitations have fueled a plethora of new projects which either extend the functionality of GPPLs, or which provide new, domain-specific programming languages (DSLs) that were specifically designed for the implementation of workflows.

Implementing a Pipeline in Pseudocode

In what follows, we will see how a simple workflow using two programs can be implemented using pseudocode: a human-readable notation resembling a GPPL. We will then build upon this simple program by iteratively adding pseudocode to expand the workflow’s scope. We will conclude this blogpost by recognizing (1) the complexity associated with using GPPLs to implement workflows, and (2) the benefits of using a domain-specific language (such as NextFlow). A more advanced video tutorial of NextFlow will be provided in a follow-up to this blog.

The workflow we will implement uses two made-up programs: estimateDistribution and combineDistributionEstimates. The first program, estimateDistribution, takes as input a text file of raw data for a geographical location, e.g., HAWAII.TXT, and produces another text file of processed data, e.g., HAWAII_PROCESSED.TXT. An example call of estimateDistribution could look like the following:

[Figure: example call of estimateDistribution]

The second program, combineDistributionEstimates, takes any number of processed data files, e.g., HAWAII_PROCESSED.TXT, TAIWAN_PROCESSED.TXT and AUSTRALIA_PROCESSED.TXT, and generates a final output, FINAL_OUTPUT.TXT. An example call of combineDistributionEstimates could look like the following:

[Figure: example call of combineDistributionEstimates]

The pseudocode describing a workflow for running estimateDistribution and combineDistributionEstimates on four input files (HAWAII.TXT, TAIWAN.TXT, AUSTRALIA.TXT and JAPAN.TXT) is given in Example 1. A graphical representation of the workflow is given in Figure 1.

Figure 1. Graphical representation of the workflow described in the pseudocode of Example 1. Programs are represented with a rectangle and inputs are represented with a curved-border rectangle. While our example only pertains to four input files representing four geographical regions, the process represented in dashed lines depicts additional input(s) that could be processed by estimateDistribution and added to the combineDistributionEstimates computation.
Example 1. Workflow for running estimateDistribution and combineDistributionEstimates on four input files.

In lines 1, 2 and 3, we declare the list of inputs to and outputs of estimateDistribution. In lines 6, 7 and 8, we run estimateDistribution over each of the input files in Input_names_list and generate an output according to the naming scheme in Output_names_list. Line 10 runs the second program, combineDistributionEstimates, on all the files described in Output_names_list.

Implementing Workflow-Specific Functionality

While the workflow described in Example 1 is viable, there are some enhancements, not strictly part of the analysis itself, that need to be addressed to avoid runtime errors, improve efficiency and enhance cross-platform portability.

These enhancements are addressed as follows:

1- First, we need to ascertain that the input files do exist; otherwise, we need to stop the execution of the program. This functionality can be implemented in pseudocode as shown in lines 5-7 of Example 2.

2- Workflows written in traditional programming languages cannot, by default, resume in case of errors. For instance, in Example 1, an error during the execution of combineDistributionEstimates would require re-running the pipeline from the beginning, even if estimateDistribution completed successfully. In a large workflow comprising dozens of steps and hundreds of input files, requiring a complete rerun of a workflow due to an error that occurred in the last step would be an utter waste of time and resources. Luckily, a check to rerun only those programs that did not complete successfully can be implemented in pseudocode, as shown in line 19 of Example 2.

3- When using a traditional programming language, any assumptions about the underlying execution environment need to be explicit. For instance, additional pseudocode is required for running the workflow on an HPC cluster where the inputs to estimateDistribution can be run in parallel on different nodes. Furthermore, to achieve a high level of portability, one needs to explicitly account for different HPC job schedulers, such as SLURM, PBS, SGE, and LSF. This additional functionality is illustrated in lines 8-13 of Example 2.

4- We need to manually synchronize execution of the program to ascertain that combineDistributionEstimates does not start before all four executions of estimateDistribution have completed. We can achieve this by adding pseudocode that will pause the workflow execution until estimateDistribution has completed running on all inputs. The code is illustrated in line 23 of Example 2, the updated workflow from Example 1: the new pseudocode accounts for synchronization.

Example 2. Final extended workflow for running estimateDistribution and combineDistributionEstimates on four input files.
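To give a feel for how much of this ancillary code accumulates in a general-purpose language, here is a rough R sketch of the Example 2 logic: input checks, skip-if-already-done, and a final combine step. The two programs are the made-up ones from above, invoked as external commands; their exact command-line syntax was shown in the original example calls, so the argument order here is only a guess, and real parallel or cluster execution would need considerably more code.

    # Hypothetical raw inputs and the processed files they should produce
    inputs  <- c("HAWAII.TXT", "TAIWAN.TXT", "AUSTRALIA.TXT", "JAPAN.TXT")
    outputs <- sub("\\.TXT$", "_PROCESSED.TXT", inputs)

    # 1. Stop early if any input file is missing
    missing <- inputs[!file.exists(inputs)]
    if (length(missing) > 0) {
      stop("Missing input file(s): ", paste(missing, collapse = ", "))
    }

    # 2. Run estimateDistribution on each input, skipping outputs that already exist
    for (i in seq_along(inputs)) {
      if (file.exists(outputs[i])) next                      # crude "resume" behaviour
      status <- system2("estimateDistribution", c(inputs[i], outputs[i]))  # argument order assumed
      if (status != 0) stop("estimateDistribution failed on ", inputs[i])
    }

    # 3. Combine all processed files into the final output
    # (the loop above runs jobs one after another, so all outputs exist by now;
    #  true parallelization and HPC submission would require yet more code)
    system2("combineDistributionEstimates", c(outputs, "FINAL_OUTPUT.TXT"))

Even this stripped-down version is mostly bookkeeping rather than science, which is exactly the burden workflow frameworks aim to remove.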

Workflow Frameworks: Facilitating the Implementation and Execution of Workflows

These four simple enhancements make it abundantly clear that, despite the simplicity of the original workflow, the complexity associated with ancillary workflow tasks such as synchronization, fault-tolerance, cross-platform portability, parallelization, etc. can quickly bloat workflows with technical and extraneous functionality that renders them difficult to understand, extend and/or maintain by third-party users. For these reasons, numerous projects now provide libraries that facilitate the implementation of workflows. There are in fact dozens, if not hundreds, of such libraries; see for instance https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems for a non-exhaustive list. Workflow libraries differ principally in: 1- the programming languages used, 2- philosophy, and 3- the comprehensiveness of the provided functionality. We tackle these three points in what follows.

1- Languages supported.

Although workflow libraries exist for a plethora of programming languages, they are particularly popular for interpreted languages, such as Python, Ruby and Groovy. Preferences for frameworks are often biased by one’s programming expertise.

To make workflow implementation accessible to non-programmers who do not wish to learn a programming language for the sole purpose of writing a workflow, an increasing number of frameworks use domain-specific languages (DSLs) instead. In short, DSL-based frameworks define their own languages, which comprise easy-to-use constructs built on a small syntax and specifically tailored for describing common workflow tasks. A particularly interesting DSL-based framework is BigDataScript, which, in my opinion, elegantly combines expressiveness and power in an easy-to-master DSL.

2- Philosophy.

The term philosophy is used here as a portmanteau for various characteristics. For example: Are the inputs implicitly or explicitly determined? Is the workflow declaration coupled with its execution (as in Ruffus) or is the workflow described in configuration files (as in NSF’s Pegasus framework)? For more on philosophy-related issues, see Jeremy Leipzig’s review of “bioinformatic pipeline frameworks”.

3- Comprehensiveness.

Another key difference between workflow frameworks is the extent of the functionality that is supported “out of the box.” For instance, does the framework support easy parallelization of independent tasks? Is it portable across architectures such as Docker, cloud, and HPC environments? Does it gracefully handle errors, and does it keep track of which tasks were successfully run?

As mentioned above, there are over a hundred workflow frameworks, with over a dozen for Python alone. Identifying the perfect framework for a project or an organization is no easy feat and depends on various parameters, not the least of which is the user’s (implementer’s) programming or scripting expertise.

A Sample Workflow Using NextFlow

In what follows I have chosen to introduce NextFlow (https://www.nextflow.io/), an expressive, versatile and particularly comprehensive DSL framework for composing and executing workflows. I chose to focus on NextFlow since it is a DSL which also supports the full syntax and semantics of Groovy, a dynamic language that runs on the Java platform. NextFlow’s support for a full-fledged programming language provides it with an extra edge that other DSL frameworks are missing. From a software engineering perspective, NextFlow seems to be very well thought out and very well supported. Furthermore, NextFlow uses the UNIX pipe concept, which simplifies writing parallel, scalable and portable pipelines.

The workflow in Example 2 can be described in NextFlow as follows:

Example 3. NextFlow script for running the workflow described using the pseudocode in Example 2.

In lines 1-6, we declare our input files, which are loaded into a Channel – think of it as a queue – that we call inputFilesChan. Note that each of the items in the queue is of type file. This informs NextFlow to “automagically” stage files as necessary under different architectures.

A less verbose declaration can be achieved using the following syntax:

[Code snippet: a more compact Channel declaration]

In lines 8-17, we declare a NextFlow process: a workflow unit. Line 8 declares the process name. Lines 9 and 10 indicate that the list of inputs to this process will be read from the Channel named inputFilesChan.

Just as we can build Channels, or queues, to store input files, we can also do the same for output files. Here, lines 11 and 12 create a channel of outputs that stores the list of files generated by the run of the process. Note that since the inputs are of type file, we can use Groovy to request the baseName of each file (the base name of “HAWAII.TXT” is “HAWAII”), as is done in line 12. We can then use the base name to build an output name on the fly by appending “_PROCESSED.TXT” to it.

For each file, we run the estimateDistribution program. The command line executed is defined in line 15, enclosed in triple quotes (“””) as NextFlow’s process syntax requires.

Lines 19 to 26 declare the second process of our pipeline, which is concerned with the execution of combineDistributionEstimates. Since the latter requires all inputs to be passed as a list, we use Groovy to convert our output Channel from the previous process. This is done using the toList() Groovy method.

The output of the NextFlow script in Example 3 is given below.

Output 1. Output generated by the execution of the NextFlow script from Example 3.

NextFlow submits four runs of the estimateDistribution process and one run of the combineDistributionEstimates process. Although we did not explicitly implement any of the ancillary functionality (pseudocode in red in Example 2), handling and resuming from errors, synchronization, and running on different environments are taken care of automatically by NextFlow. For instance, let’s erroneously misspell the input HAWAII.TXT as KAWAII.TXT. After rerunning NextFlow on our input workflow, we obtain the output below.

Output 2. Error message produced by the execution of the NextFlow script from Example 3 after introduction of an erroneous input file name.

As was the case with the first example, NextFlow schedules and submits four runs of the estimateDistribution process. Each of these runs is associated with one of the four input files. Immediately after submitting the jobs, NextFlow reports “ERROR ~ Error executing process > ‘estimateDistribution (1)’”. The “Command error” section of the output shows that the program estimateDistribution could not locate the file KAWAII.TXT. Note however that although the error occurred on the first element of the list, NextFlow continued the execution for the remaining inputs of estimateDistribution. After correcting the erroneous file name from KAWAII.TXT to HAWAII.TXT and rerunning the workflow using the -resume option, the NextFlow output is:

Output 3. Successfully completed jobs are cached by NextFlow across runs.

We see from the output that estimateDistribution (1) – the (1) here refers to the first element of the input files’ Channel – was resubmitted, whereas the other inputs (2, 3 and 4), which were already cached, were not. We did not need to add code to explicitly check for existing files, as NextFlow automatically handles that for us.

A comprehensive workflow DSL can also facilitate the portability of a workflow across platforms. For instance, when using NextFlow, only a few lines of code are required to run a pipeline in an HPC cluster environment. For example, the following lines allow us to configure our workflow for execution on the University of Hawaii’s HPC.

Example 4. Configuration file for running the NextFlow script from Example 3 on a compute cluster which uses the SLURM job scheduler.

This configuration file instructs NextFlow to run both estimateDistribution and combineDistributionEstimates using SLURM on the cluster’s “community.q” queue. The estimateDistribution and combineDistributionEstimates processes use 4 CPUs and 1 CPU, respectively. It’s that simple!

The output looks identical to that obtained when running locally. However, retrieving the history of jobs on the cluster shows that two jobs, “nf-estima+” and “nf-combin+”, were both executed by the SLURM job scheduler.

Output 4. Processes described in the NextFlow script from Example 3 are executed by SLURM as two independent jobs.

Conclusion

The goal here is not to publicize NextFlow, but rather to show how a workflow DSL can streamline the implementation of complex functionality. The choice of infrastructure and products plays a critical role in helping scientists implement best practices of research transparency and reproducibility, particularly under the new paradigm of data-intensive science. Using appropriate scientific workflow frameworks can alleviate this burden by allowing researchers to abstract, manage and share complex scientific analyses with minimum effort.

The next installment of this two-part blogpost will give a brief tutorial on using NextFlow for the implementation of workflows, with a specific emphasis on bioinformatics and the Docker platform.


Thanks enormously to Dr. Mahdi Belcaid for this careful exposition and insight into how new tools can be used to streamline workflow production. It’s a major long-term goal of EarthCube, CRESCYNT, and many researchers to be able to assemble computational workflows more easily and efficiently, and we’re grateful for the vision Mahdi has shared here. Look for his follow-up screencast tutorial in a subsequent blogpost.



CRESCYNT Toolbox – Workflows as Collaboration Space and Workbench Blueprint

Scientists need better ways to analyze and integrate their data and collaborate with other scientists; new computing technologies and tools can help with this. However, it’s difficult to overcome the challenge of disparate perspectives and the absence of a common vocabulary: this is true of multidisciplinary science teams, and true when scientists try to talk with computer scientists. Workflows, as a way to help design and implement a workbench, are needed both as a collaboration space and a blueprint for implementation.

Take a look at a recent presentation to the EarthCube science committee (video) or an earlier presentation offered at ASLO 2017 (slides and voice) to see a flexible and low-tech way to simultaneously (1) facilitate necessary sci-tech interactions for your own lab and (2) begin to sketch out a blueprint for work that needs to be done. Subsequent technical implementation is possible with new tools including Common Workflow Language (CWL) as a set of specifications, Docker as a way to package fully developed tools or small pieces of code into modular, sharable containers, and Nextflow as an efficient and highly scalable workflow language to make the computational work happen. Look for a post in the near future by Mahdi Belcaid describing the technical implementation of these workflows.

OPPORTUNITY! We will be hosting one or two in-person skills-training workshops in the coming months, with your expenses covered by our NSF EarthCube CRESCYNT grant. The workshops will focus particularly on training early-career professionals and will work through some challenging coral reef use cases and their cyberinfrastructure needs. We collected some great use cases at ICRS, but would like additional cases to consider, so we invite you to describe your own research challenges through this Google form. Please contact us for more on this, or other issues. Thanks!



CRESCYNT Toolbox – EarthCube Sci-Tech Matchup – Lightning Talks

Get a fast intro to new Ready-for-Science EarthCube Tools!

We’ve helped arrange a series of lightning talks that will feature tools developed by EarthCube “building block” projects for direct use by scientists. Many EarthCube-built tools are designed to serve as internal components of an EarthCube platform. Other tools were built for scientists as direct end users, and a collection of these are now ripe for adoption. Will some of these help you with your research work?
The current collection will be shown over a span of two online sessions. Find log-in details for Wed., Feb. 15 and Fri., Feb. 17 (no RSVP required – just show up).
Join us!

Wednesday, Feb. 15, 2017
4-5pm EST / 1-2pm PST / 11am-12pm HST – login link HERE
GeoDataspace/GeoTrust, Tanu Malik
ECOGEO Virtual Machine, Elisha Wood-Charlson
LinkedEarth, Julien Emile-Geay
OntoSoft, Yolanda Gil
Flyover Country, Amy Myrbo

Friday, Feb. 17, 2017
4-5:30pm EST / 1-2:30pm PST / 11am-12:30pm HST – login link HERE
CHORDS, Mike Daniels
SuAVE, Ilya Zaslavsky
CINERGI, Ilya Zaslavsky
X-DOMES Ontology Registry, Janet Fredericks
X-DOMES SensorML Registry, Janet Fredericks
iSamples / IGSN, Kerstin Lehnert
GeoDeepDive, Shanan Peters
Digital Crust, Shanan Peters
ECITE, Sara Graves
Earth System Bridge, Scott Peckham

UPDATE: All of the videos from the first rounds of talks are now on the EarthCube YouTube channel – here’s the playlist. Slides are now accessible at the EarthCube Tools Inventory, including additional presentations.
