Using RStudio with GitHub for Securing Your Work, Versioning, and Collaboration


RStudio and GitHub go together like two wheels of a bicycle. Together, they form a low-overhead yet powerful open source workbench – a lean machine that can help take your data to far places.

In a recent informal survey of coral-reef-related research articles that published code and data alongside the paper, R was by far the most popular language, and RStudio is the most popular interface for working with R.

At the CRESCYNT Data Science for Coral Reefs: Data Integration and Team Science workshop held earlier this year at NCEAS, the most powerful skill introduced was using RStudio with GitHub: writing data and code to GitHub directly from RStudio.

Once the link is set up, work continues in RStudio in the familiar way, with periodic commits to GitHub to preserve progress and pave the way for collaboration.

  1. Download and install R
  2. Download and install RStudio
  3. Create a GitHub account
  4. Connect a repository in your GitHub account to RStudio. This takes multiple steps; here are some good options for working through the process, plus a scripted sketch below.

You can use sections of NCEAS’s long tutorial on Introduction to Open Data Science, initially developed by their Ocean Health Index group. Use the sections on the overview of R and RStudio, Markdown, and the intro to GitHub, then skip down to collaboration in GitHub.

There are a number of other tutorials available that show how to set up and use these tools together; a beautifully clean and clear step-by-step tutorial comes from Resources at GitHub; another excellent one is from Support at RStudio.

Also available to you: Hadley Wickham on Git and GitHub, a Study Group’s Version Control with RStudio and GitHub Simply Explained, R Class’s An Introduction to Git and How to Use it with RStudio, U Chicago’s Using Git within RStudio, and Happy Git’s Connect RStudio to Git and GitHub. You may prefer the style of one of these over the others.
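
Once you have picked a tutorial, the whole connection can also be scripted from the R console. Here is a minimal sketch, assuming the usethis and gitcreds packages (the tutorials above may take a slightly different route):

```r
# A scripted alternative to RStudio's point-and-click setup.
# Assumes install.packages(c("usethis", "gitcreds")) has already been run
# and that you are working inside an RStudio project.
library(usethis)

# 1. Turn the current project into a local Git repository
use_git()

# 2. Create a GitHub personal access token (opens github.com in a browser),
#    then store it so RStudio can authenticate to GitHub
create_github_token()
gitcreds::gitcreds_set()

# 3. Create a matching repository on your GitHub account and push to it
use_github()
```

Whichever route you take, the end state is the same: an RStudio project with a Git pane that pushes to and pulls from a GitHub repository.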

If later you want to go further, come back for these tutorials hosted at R-Bloggers: R Blogdown Setup in GitHub, and Migrating from GitHub to GitLab with RStudio. And good news – you can now archive a snapshot of a GitHub repository to preserve and even publish a particular version of your RStudio work – plus get a DOI to share – at Zenodo.

Summary: Many research scientists use RStudio as their primary analytical and visualization tool. RStudio can now connect to a GitHub repository and make commits to it directly. This supports the critical core functions of a simplified workbench: documenting workflows (R Markdown), preserving code and provenance, producing repeatable results, creating flexible pipelines, sharing data and code, and allowing collaboration among members of a team. Versioning and teamwork are simplified by committing frequently and always pulling fresh changes before committing (rather than focusing on branch development). The process is valuable for individual researchers, for documenting project work, and for collaborating in teams.
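
As a concrete illustration of that routine, here is a minimal sketch using the gert package (one of several ways to drive Git from R; the RStudio Git pane does the same job with buttons, and the commit message below is made up):

```r
# Pull-then-commit routine, sketched with the gert package.
library(gert)

git_pull()                     # always pull fresh changes from GitHub first
changed <- git_status()$file   # list modified and untracked files
git_add(changed)               # stage them
git_commit("Update analysis and figures")   # hypothetical commit message
git_push()                     # share the commit back to GitHub
```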

Related blogposts: Learning to Love R More and R Resources for Graphing and Visualization.

>>>Go to the blog Masterpost or the CRESCYNT website or NSF EarthCube.<<<


CRESCYNT Data Science For Coral Reefs Workshop 2 – Data Integration and Team Science


We’re extremely pleased to be able to offer two workshops in March 2018 at NCEAS. The second is CRESCYNT Data Science for Coral Reefs Workshop 2: Data Modeling, Data Integration and Team Science. Apply here.

When: March 12-15, 2018
Where: NCEAS, Santa Barbara, CA

Workshop description:

This workshop is recommended for early to mid-career and senior scientists with an interest in applying technical skills to collaborative research questions and a commitment to subsequently sharing what they learn. Participants will learn how to structure and combine heterogeneous data sets relevant to coral reef scientists in a collaborative way. Days 1 and 2 of the workshop will cover reproducible workflows using R/RStudio and R Markdown, collaborative coding with GitHub, strategies for team research, data modeling and data wrangling, and advanced data integration and visualization tools. Participants will then spend two days working in small teams to integrate various coral reef datasets, practicing the skills learned and developing workflows for data tidying and integration.

The workshop is limited to 20 participants. We encourage you to apply via this form. Workshop costs will be covered with support from the NSF EarthCube CRESCYNT RCN. We anticipate widely sharing workshop outcomes, including workflows and recommendations. Expect some significant pre-workshop preparation.

Related posts: Learning to Love R More and R Resources for Visualization

UPDATE: HERE IS THE AGENDA FOR THE WORKSHOP, WITH TRAINING LINKS.

>>>Go to the blog Masterpost or the CRESCYNT website or NSF EarthCube.<<<


CRESCYNT Toolbox – Discovery of Online Datasets

Data discovery at cinergi.sdsc.edu

Announcing recent progress for data discovery in support of coral reef research!

Take advantage of this valuable community resource: a data discovery search engine with a special nose for locating coral reef research data sources: cinergi.sdsc.edu.

A major way CRESCYNT has made progress is by serving as a collective coral reef use case for EarthCube groups that are building great new software tools. One of those is a project called CINERGI. It registers resources – especially online repositories and individual online datasets, plus documents and software tools – and then enriches the descriptors to make the resources more searchable. The datasets themselves stay in place: a record of each dataset’s location and description is registered and augmented for better finding and filtering. Registered datasets and other resources, of course, keep whatever access and use license their authors have given them.

CINERGI already has over a million data sources registered, and over 11,000 of these are specifically coral reef datasets and data repositories. The interface now also features a geoportal to support spatial search options.

The CINERGI search tool can now incorporate ANY online resources you wish, so if you don’t find your favorites, or want to connect your own publications, data, data products, software, code, and other resources, please contribute. If it’s a coral-related resource, be sure to include the word “coral” somewhere in your title or description so it can be retrieved that way later as well. (Great retrieval starts with great metadata!)

To add new resources: Go to cinergi.sdsc.edu, and click on CONTRIBUTE. Fill in ESPECIALLY the first fields – title, description, and URL – then as much of the rest as you can.

Try it out!

Thanks to EarthCube, the CINERGI Data Discovery Hub, and the great crew at the San Diego Supercomputer Center and partners for making this valuable tool possible for coral reef research and other geoscience communities. Here are slides and a video to learn more.

 

>>>Go to NSF EarthCube or the CRESCYNT website or the blog Masterpost.<<<


CRESCYNT Toolbox – Workflows as Collaboration Space and Workbench Blueprint

Scientists need better ways to analyze and integrate their data and collaborate with other scientists; new computing technologies and tools can help with this. However, it’s difficult to overcome the challenge of disparate perspectives and the absence of a common vocabulary: this is true of multidisciplinary science teams, and true when scientists try to talk with computer scientists. Workflows, as a way to help design and implement a workbench, are needed both as a collaboration space and as a blueprint for implementation.

Take a look at a recent presentation to the EarthCube science committee (video) or an earlier presentation offered at ASLO 2017 (slides and voice) to see a flexible and low-tech way to simultaneously (1) facilitate necessary sci-tech interactions for your own lab and (2) begin to sketch out a blueprint for work that needs to be done. Subsequent technical implementation is possible with new tools including Common Workflow Language (CWL) as a set of specifications, Docker containers as modular and shareable packages for either fully developed tools or small pieces of code, and Nextflow as an efficient and highly scalable workflow language to make the computational work happen. Look for a post in the near future by Mahdi Belcaid describing the technical implementation of these workflows.

OPPORTUNITY! We will be hosting one or two in-person skills training workshops in the coming months, with expenses covered by our NSF EarthCube CRESCYNT grant and a particular focus on training early career professionals. The workshops will work through some challenging coral reef use cases and their cyberinfrastructure needs. We collected some great use cases at ICRS, but would like additional cases to consider, so we invite you to describe your own research challenges through this Google form. Please contact us about this, or other issues. Thanks!

>>>Go to the blog Masterpost or the CRESCYNT website or NSF EarthCube.<<<


CoralNet: deploying deep learning in the shallow seas – by Oscar Beijbom


Having dedicated my PhD to automating the annotation of coral reef survey images, I have seen my fair share of surveys and talked to my fair share of coral ecologists. In these conversations, I always heard the same story: collecting survey images is quick, fun and exciting. Annotating them is, on the other hand, slow, boring, and excruciating.

When I started CoralNet (coralnet.ucsd.edu) back in 2012, the main goal was to make the manual annotation work less tedious by deploying automated annotators alongside human experts. These automated annotators were trained on previously annotated data using what was then the state of the art in computer vision and machine learning. Experiments indicated that around 50% of the annotation work could be done automatically without sacrificing the quality of the ecological indicators (Beijbom et al. PLoS ONE 2015).

The Alpha version of CoralNet was thus created and started gaining popularity across the community. I think this was partly due to the promise of reduced annotation burden, but also because it offered a convenient online system for keeping track of and managing the annotation work. By the time we started working on the Beta release this summer, the Alpha site had over 300,000 images with over 5 million point annotations – all provided by the global coral community.

There was, however, a second purpose in creating CoralNet Alpha. Even back in 2012, the machine learning methods of the day were data-hungry: basically, the more data you have, the better the algorithms will perform. So the second purpose was quite simply to let the data come to me rather than me chasing people down to get my hands on their data.

At the same time, the CoralNet Alpha site was starting to buckle under increased usage. Long queues built up in the computer vision backend as power users such as NOAA CREP and the Catlin Seaview Survey uploaded tens of thousands of images to the site for analysis assistance. The time was ripe for an update.

As it turned out, the timing was fortunate. A revolution has happened in the last few years with the development of so-called deep convolutional neural networks. These immensely powerful, very large networks can learn from vast databases to achieve far superior performance compared to methods of the previous generation.

During my postdoc at UC Berkeley last year, I researched ways to adapt this new technology to the coral reef image annotation task in the development of CoralNet Beta. Leaning on the vast database accumulated in CoralNet Alpha, I tuned a net with 14 hidden layers and 150 million parameters to recognize over 1,000 types of coral substrates. The results, which are in preparation for publication, indicate that between 80% and 100% of the annotation work can be automated, depending on the survey. Remarkably, in some situations the classifier is more consistent with the human annotators than those annotators are with themselves. Indeed, we show that the combination of confident machine predictions with human annotations beats both the human and the machine alone!
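
[Editor’s note: the confidence-threshold idea behind combining machine and human annotations can be sketched in a few lines of R; the 0.8 cutoff and data values below are purely illustrative and are not CoralNet’s actual pipeline.]

```r
# Illustrative triage of machine predictions by confidence (hypothetical values).
triage_annotations <- function(points, threshold = 0.8) {
  # points: data frame with a machine 'label' and 'confidence' per survey point
  points$source <- ifelse(points$confidence >= threshold,
                          "machine",        # accept confident predictions as-is
                          "needs human")    # route uncertain points to an expert
  points
}

points <- data.frame(
  label      = c("Porites", "CCA", "Turf", "Acropora"),
  confidence = c(0.95, 0.62, 0.88, 0.41)
)
triage_annotations(points)
```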

Using funding from NOAA CREP and CRCP, I worked together with UCSD alumnus Stephen Chan to develop CoralNet Beta: a major update that includes migration of all hardware to Amazon Web Services and a brand new, highly parallelizable computer vision backend. Using the new backend, the 350,000 images on the site were re-annotated in one week! Software updates include improved search, import, export, and visualization tools.

With the new release in place we are happy to welcome new users to the site; the more data the merrier!

_____________

– Many thanks to Oscar Beijbom for this guest posting as well as significant technological contributions to the analysis and understanding of coral reefs. You can find Dr. Beijbom on GitHub, or see more of his projects and publications here. You can also find a series of video tutorials on using CoralNet (featuring the original Alpha interface) on CoralNet’s vimeo channel, and technical details about the new Beta version in the release notes.

 

>>>Go to NSF EarthCube or the CRESCYNT website or the blog Masterpost.<<<


CRESCYNT Toolbox – Disaster Planning and Recovery

With computers, the question is not whether they will fail, but when.

tl;dr – It’s very practical to have cloud storage backup in addition to still-useful external hard drive backup routines. Here are some secure cloud alternatives.

Personal note. I’ve had hard drive failures due to lightning strike; simultaneous death of mirrored hard drives within a RAID; drenching from an upper floor emergency shower left flowing by a disgruntled chemistry student; and most recently, demise of my laptop by sudden immersion in salt water (don’t ask). By some intersection of luck and diligence, on each occasion recent backups were available for data recovery. In the most recent rebuild, it was a revelation how much work is now backed up via regular entry into the casual cloud.

This latest digital landing was mercifully soft (…cloudlike). Because of work portability, my recent sequential backup habit has been to a paid unshared Dropbox account; $10/mo is a bargain for peace of mind (beyond a certain size, restoration is not drag-and-drop). A surprising number of files these days are embedded in multiple team projects – much on Google Drive – so all of that was available, with revision history. Group conversations and files were on Slack and email. One auxiliary brain (iPhone) was in a waterproof case with cloud backup, and another auxiliary brain (project/task tracking) was in a web app, KanbanFlow. Past years of long-term archives were already on external hard drives in two different cities. GitHub is an amazing place to develop, document, recover and share work in progress and products, but it is not a long-term curated data repository. For valuable datasets, the rule is to simplify formats, attach metadata, and update media periodically.

Thinking about your own locations for data storage and access? Check out this review of more secure alternatives to – and apps on top of – Dropbox. Some, like ownCloud, can serve as both storage and linked access for platforms like Agave. A strength of some current analytical platforms is that they can access multiple data storage locations; for example, Open Science Framework can access Dropbox, Google Drive, GitHub, Box, figshare, and now Dataverse and Amazon Web Services as well.

A collaborator recently pointed out that the expense of any particular type of data storage is really the expense of its backup processes: frequency, automation, security, and combination of archiving media. Justifying the expense can come down to this question: What would it cost to replace these data? Some things are more priceless than others.

Disaster Planning and Recovery tools. To go beyond data recovery in your planning, here’s an online guide for IT disaster recovery planning and cyberattacks. How much of a problem is this really? See Google’s real-time attack map (hit “play”). Better to plan than fear. You did update those default passwords on your devices, yes?

Feel free to share your own digital-disaster-recovery story in the comments.


CRESCYNT Toolbox – Open Science Framework supports reproducible science


The Open Science Framework, or OSF (osf.io), is a free and open source platform for supporting reproducible science. It’s designed more for documenting work than for streamlining work. It’s potentially a useful place to host a messy, spread-out collaborative research project, partly because of the add-ons it can connect with: (1) for storage, Amazon S3, Box, Dataverse, Dropbox, figshare, Google Drive, and GitHub; and (2) for references, Mendeley and Zotero. OSF also comes with a dashboard, a wiki, email notifications for your group, OSF file storage with built-in version control, data licensing background and assignment capability, the ability to apply permission controls, and the ability to make projects and components either private or public. Projects that you choose to make public can be assigned DOIs (which can be transferred if you move your project elsewhere).
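
If you prefer to work from R, the osfr package is one way to drive OSF from a script (an aside from this post, not an official OSF requirement). A minimal sketch, assuming a personal access token created at osf.io/settings/tokens is stored in the OSF_PAT environment variable and that the project name is made up:

```r
# Minimal OSF-from-R sketch using the osfr package.
library(osfr)

osf_auth(Sys.getenv("OSF_PAT"))            # authenticate with a personal access token

project <- osf_create_project(
  title  = "Coral reef data integration",  # hypothetical project title
  public = FALSE                           # keep private until ready to share
)

# Upload a local analysis script to the project's OSF storage
osf_upload(project, path = "analysis/clean_and_merge.R")
```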

Aside from its primary role as a place to host research documentation and collaboration, OSF has also been used to teach classes in open science and reproducibility, and as a location to host conference products such as presentations and posters.

OSF is not a perfect platform for science – that elusive creature does not yet exist – but it’s a robust start given its ability to integrate other resources you may already be using, it gets extra points for being free and open source, and it could well be worth the learning curve on your next project. It continues to improve over time, and how will we know what to ask of a platform if we don’t wrestle a bit with what’s already been built?

Learn more at the Open Science Framework FAQs and OSF Guides
or on YouTube (where everyone seems to learn new software these days):
+ Getting Started with the OSF (2 mins) (start here!)
+ Most recent “OSF 101” intro webinar (1 hour)
+ Deep dive into the OSF (1 hour) (thumbs up!)
+ and more at OSF’s YouTube channel.

If you try it out, please let us know what you think!

Update: OSF now also connects with Bitbucket and ownCloud. See current Add-ons.

>>>Go to the blog Masterpost or the CRESCYNT website or NSF EarthCube.<<<
