Using RStudio with GitHub for Securing Your Work, Versioning, and Collaboration


RStudio and GitHub go together like two wheels of a bicycle. Together, they form a low-overhead yet powerful open source workbench – a lean machine that can help take your data to far places.

In a recent informal evaluation of coral reef-related research articles that included simultaneous publication of code and data, by far the most popular language used was R, and RStudio is the most popular interface for working with R.

In a CRESCYNT Data Science for Coral Reefs: Data Integration and Team Science workshop held earlier this year at NCEAS, the most powerful skill introduced was using RStudio with GitHub: writing data and code to GitHub from RStudio.

Once the link is set up, work can continue in RStudio in the way you may already be familiar with, with periodic commits to GitHub to secure the work and pave the way for collaboration. The basic setup:

  1. Download and install R
  2. Download and install RStudio
  3. Create a GitHub account
  4. Connect a repository in your GitHub account to RStudio.  This takes multiple steps; the resources below are good options for working through the process, and a minimal scripted sketch follows this list.
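The tutorials below walk through the point-and-click setup inside RStudio; for those who prefer to script it, here is a minimal sketch using the usethis package (an option not covered in the post, so treat it as one possible route; it assumes you have a GitHub personal access token configured).

```r
# A minimal, scripted route to steps 2-4 using the usethis package.
# install.packages("usethis") first if you don't have it.
library(usethis)

# Tell Git who you are (only needed once per machine)
use_git_config(user.name = "Your Name", user.email = "you@example.org")

# Put the current RStudio project under Git version control
use_git()

# Create a matching repository on GitHub and connect it as the remote
# (requires a GitHub personal access token; see create_github_token())
use_github()
```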

You can use sections of NCEAS’s long tutorial on Introduction to Open Data Science, initially developed by their Ocean Health Index group. Use the sections on the overview of R and RStudio, Markdown, and the intro to GitHub, then skip down to collaboration in GitHub.

There are a number of other tutorials that show how to set up and use these tools together; a beautifully clean and clear step-by-step tutorial comes from Resources at GitHub; another excellent one is from Support at RStudio.

Also available to you: Hadley Wickham on Git and GitHub, a Study Group’s Version Control with RStudio and GitHub Simply Explained, R Class’s An Introduction to Git and How to Use it with RStudio, U Chicago’s Using Git within RStudio, and Happy Git’s Connect RStudio to Git and GitHub. You may prefer the style of one of these over the others.

If later you want to go further, come back for these tutorials hosted at R-Bloggers: R Blogdown Setup in GitHub, and Migrating from GitHub to GitLab with RStudio. And good news – you can now archive a snapshot of a GitHub repository to preserve and even publish a particular version of your RStudio work – plus get a DOI to share – at Zenodo.

Summary: Many research scientists use RStudio as their primary analytical and visualization tool. RStudio can now connect to a GitHub repository and make commits to it directly. This enables the critical core functions of a simplified workbench: documenting workflows (R Markdown), preserving code and provenance, producing repeatable results, creating flexible pipelines, sharing data and code, and supporting collaboration among members of a team. Versioning and teamwork are simplified by committing frequently and always doing a fresh pull before you commit (rather than focusing on branch development); a scripted sketch of that rhythm follows. The process is valuable for individual researchers, for documenting project work, and for collaborating in teams.
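For readers who like to see that pull-first, commit-often rhythm spelled out, here is a minimal sketch using the gert package; this is an assumption on our part (most people will simply use RStudio’s Git pane), and the file names are hypothetical.

```r
# A minimal sketch of the pull-first, commit-often rhythm described above,
# scripted with the gert package rather than RStudio's Git pane.
library(gert)

git_pull()                                  # always start from the latest version
# ... edit scripts and data in RStudio ...
git_add(c("analysis.R", "data/reef.csv"))   # hypothetical file names
git_commit("Add reef survey cleaning step") # small, frequent commits
git_push()                                  # share the commit on GitHub
```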

Related blogposts: Learning to Love R More and R Resources for Graphing and Visualization.



CoralNet: deploying deep learning in the shallow seas – by Oscar Beijbom


Having dedicated my PhD to automating the annotation of coral reef survey images, I have seen my fair share of surveys and talked to my fair share of coral ecologists. In these conversations, I always heard the same story: collecting survey images is quick, fun and exciting. Annotating them is, on the other hand, slow, boring, and excruciating.

When I started CoralNet (coralnet.ucsd.edu) back in 2012 the main goal was to make the manual annotation work less tedious by deploying automated annotators alongside human experts. These automated annotators were trained on previously annotated data using what was then the state-of-the-art in computer vision and machine learning. Experiments indicated that around 50% of the annotation work could be done automatically without sacrificing the quality of the ecological indicators (Beijbom et al. PLoS ONE 2015).

The Alpha version of CoralNet was thus created and started gaining popularity across the community. I think this was partly due to the promise of reduced annotation burden, but also because it offered a convenient online system for keeping track of and managing the annotation work. By the time we started working on the Beta release this summer, the Alpha site had over 300,000 images with over 5 million point annotations – all provided by the global coral community.

There was, however, a second purpose of creating CoralNet Alpha. Even back in 2012 the machine learning methods of the day were data-hungry. Basically, the more data you have, the better the algorithms will perform. Therefore, the second purpose of creating CoralNet was quite simply to let the data come to me rather than me chasing people down to get my hands on their data.

At the same time, the CoralNet Alpha site was starting to buckle under increased usage. Long queues built up in the computer vision backend as power users such as NOAA CREP and the Catlin Seaview Survey uploaded tens of thousands of images to the site for analysis assistance. The time was ripe for an update.

As it turned out, the timing was fortunate. A revolution has happened in the last few years with the development of so-called deep convolutional neural networks. These immensely powerful, very large networks can learn from vast databases to achieve performance far superior to methods of the previous generation.

During my postdoc at UC Berkeley last year, I researched ways to adapt this new technology to the coral reef image annotation task in the development of CoralNet Beta. Leaning on the vast database accumulated in CoralNet Alpha, I tuned a net with 14 hidden layers and 150 million parameters to recognize over 1,000 types of coral substrate. The results, which are in preparation for publication, indicate that between 80% and 100% of the annotation work can be automated, depending on the survey. Remarkably, in some situations the classifier is more consistent with the human annotators than those annotators are with themselves. Indeed, we show that the combination of confident machine predictions with human annotations beats both the human and the machine alone!

Using funding from NOAA CREP and CRCP, I worked together with UCSD alumnus Stephen Chan to develop CoralNet Beta: a major update that includes migration of all computing to Amazon Web Services and a brand new, highly parallelizable computer vision backend. Using the new backend, the 350,000 images on the site were re-annotated in one week! Software updates include improved search, import, export, and visualization tools.

With the new release in place we are happy to welcome new users to the site; the more data the merrier!

_____________

– Many thanks to Oscar Beijbom for this guest posting as well as significant technological contributions to the analysis and understanding of coral reefs. You can find Dr. Beijbom on GitHub, or see more of his projects and publications here. You can also find a series of video tutorials on using CoralNet (featuring the original Alpha interface) on CoralNet’s vimeo channel, and technical details about the new Beta version in the release notes.

 



CRESCYNT Toolbox – R Resources for Graphing and Visualization

In a previous post we offered some solid supportive resources for learning R – a healthy dinner with lots of great vegetables. Here we offer a dessert cart of rich resources for data visualization and graphing. It’s a powerful motivation for using R.

First up is The New R Graph Gallery – extensive, useful, and actually new. “It contains more than 200 data visualizations categorized by type, along with the R code that created them. You can browse the gallery by types of chart (boxplots, maps, histograms, interactive charts, 3-D charts, etc), or search the chart descriptions. Once you’ve found a chart you like, you can admire it in the gallery (and interact with it, if possible), and also find the R code which you can adapt for your own use. Some entries even include mini-tutorials describing how the chart was made.” (Description by Revolutions.)

Sometimes we want (or need) plain vanilla – something clean and elegant rather than extravagant. Check out A Compendium of Clean Graphs in R, including code. Many examples are especially well-suited for the spartan challenge of conveying information in grayscale. The R Graph Catalog is a similar resource.

If you’re just getting started with R, take a look at the Painless Data Visualization section (p. 17 onward) in this downloadable Beginner’s Guide.

ggplot2, based on the Grammar of Graphics, is perhaps the single most popular R package for data visualization. The R Cookbook’s section on Graphs using ggplot2 is a helpful precursor to the R Graphics Cookbook. DataCamp’s DataVis with ggplot2 has a free segment of intro lessons.
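As a small taste of the grammar-of-graphics style these resources teach, here is a minimal ggplot2 sketch using R’s built-in iris data:

```r
library(ggplot2)

# Scatterplot of the built-in iris data: each layer adds one element of the graphic
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +   # one linear fit per species
  labs(x = "Sepal length (cm)", y = "Petal length (cm)",
       title = "A minimal ggplot2 example")
```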

For more on visualization and other capabilities, check out this recommended list of useful R packages on the RStudio support blog – succinct and terrific.

If you’re already skilled in R and want a new challenge, an indirect method of harnessing some of the power of D3.js for interactive web visualizations is available through plotly for R. Here’s getting started with plotly and ggplot2, plotly and Shiny, and a gallery. The resources offer code and in some cases the chance to open a visualization and modify its data.
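The getting-started guides above cover the details; as a minimal sketch, an existing ggplot can simply be handed to plotly’s ggplotly() to get an interactive version:

```r
library(ggplot2)
library(plotly)

# Build a static ggplot, then hand it to plotly for an interactive web version
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

ggplotly(p)   # renders an interactive chart in the RStudio Viewer or browser
```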

Have a favorite resource? Please share as a comment, or email us!


CRESCYNT Toolbox – Learning to Love R More (or R is for Reproducible)

We are driven to learn like sharks: constantly take in new flows, or die. In a recent workshop, when coral reef scientists were asked, “How many of you use R?” 60% raised a hand. When asked, “How many of you are comfortable with and love using R?” only about 15% kept a hand up.

Here’s where to go to learn to love R more.

You likely already know of the R Project, free and open source software for statistical computing and graphics. You may already know of the reliability of the Comprehensive R Archive Network, or CRAN, repository, favored by many over other potential sources of community-generated code because of its metadata and testing requirements; it now hosts over 9,300 packages (sorted by date and name).

You may also know the elegance of RStudio, the excitement of putting your own interactive code online in RStudio’s Shiny, some great cheat sheets, the most popular R packages, and Stack Overflow as a great place to find answers to your R questions.
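To give a sense of how little code a Shiny app needs, here is a minimal single-file sketch (a histogram whose bin count is set by a slider), following the standard ui/server pattern; consider it illustrative rather than anything from the resources above.

```r
library(shiny)

# Minimal single-file Shiny app: a slider controls the number of histogram bins
ui <- fluidPage(
  titlePanel("A minimal Shiny sketch"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Old Faithful eruption durations", xlab = "Minutes")
  })
}

shinyApp(ui = ui, server = server)
```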

You may not know of the new R course finder, an online directory you can search and filter to find the best online R course for your next step (note there are often free versions or segments even of the paid courses listed). There are also YouTube videos for learning R, like twotorials (two-minute tutorials) and YaRrr! (because pirates), which comes with a book.

A recent book is getting rave reviews from both statistics and programming viewpoints: The Book of R by Tilman Davies (preview it here). The author writes:

“The Book of R …represents the introduction to the language that I wish I’d had when I began exploring R, combined with the first-year fundamentals of statistics as a discipline, implemented in R….   Try not to be afraid of R. It will do exactly what you tell it to – nothing more, nothing less. When something doesn’t work as expected or an error occurs, this literal behavior works in your favor….   Especially in your early stages of learning…try to use R for everything, even for very simple tasks or calculations you might usually do elsewhere. This will force your mind to switch to ‘R mode’ more often, and it’ll get you comfortable with the environment quickly.”
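In that spirit, even quick calculations you might normally do in a spreadsheet or on a calculator can live in the R console. A few throwaway lines (with made-up numbers, purely for illustration) might look like this:

```r
# Everyday tasks in the console, in the spirit of "use R for everything"
(28.5 - 26.1) / 26.1 * 100           # percent change between two readings
temps <- c(26.3, 26.8, 27.1, 26.5, 27.4)   # made-up temperature readings
mean(temps); sd(temps)               # quick summary of a handful of values
seq(as.Date("2016-06-01"), by = "week", length.out = 8)  # candidate sampling dates
```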

Because R is such a stellar example of free and open source software with a very robust community (e.g., great stuff at r-bloggers), it’s remarkable how lucky we are that it IS open source, as you can hear in this interview with ggplot2 creator and RStudio chief scientist Hadley Wickham on the DataStori.es podcast.

We’ll soon host a guest blogpost on some exploratory coral symbiont data analyses, visualizations, and comments generated in R Markdown, which is RStudio’s method for preserving code and output in one running web document. The work is beautiful and useful, and highlights the use of an electronic notebook as a way to capture and share data exploration, analysis and visualization, and to tell a data story. (A major advance to that software was announced this week in the form of R Notebook, which will ship within the next couple of months.)
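For readers who haven’t seen it yet, an R Markdown document is just plain text with fenced code chunks mixed into the narrative; a minimal sketch of such a notebook might look like the following (the file and column names are hypothetical):

````markdown
---
title: "Symbiont data exploration"
output: html_document
---

A sentence or two of narrative explaining what the next chunk does.

```{r load-data}
# hypothetical file and column names
symbionts <- read.csv("symbiont_counts.csv")
summary(symbionts)
```

```{r first-look}
plot(count ~ temperature, data = symbionts)
```
````

Knitting this file produces a single web document that interleaves the narrative, the code, and its output.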

Why is it worth learning to love R more?

R helps make sure your data work is reproducible (such an issue for science), repeatable (valuable for any processing you have to do periodically), and reusable (on other datasets or data versions, or by colleagues or your future self).

A couple of high-level languages, like R and Python, are becoming more popular each year, and are finding their way as general purpose tools into analytical platforms. These will serve as primary sources of flexibility in cyberinfrastructure platforms now available or under development. Our future selves thank us for the learning investment.

[Figure: R, Python Duel As Top Analytics, Data Science Software – KDnuggets 2016 Software Poll results]

Update: speaking of interviews with R makers, here’s an October 2016 interview with JJ Allaire, the founder of RStudio and a co-author of Shiny and R Markdown. His advice for people new to R:

I would suggest that they get a copy of the R for Data Science book written by Hadley Wickham and Garrett Grolemund…. Also, when you have questions or run into problems don’t give up. There’s a lot of great activity around R on stackoverflow and other places and there’s an excellent chance you’re going to find the answers to your questions if you look carefully for them.

Further Update: In January 2018, Kaggle released resources for Hands-On Data Science learning, including lessons for R in data setup, data visualization, and machine learning.


3D Mapping of Coral Reefs – How to Get Started – by John Burns

Rapid technological advancements are providing a suite of new tools that can help advance ecological and biological studies of coral reefs. I’ve studied coral health and disease for the last several years. One large gap in our research approach is the ability to connect changes in coral health to large-scale ecological processes. I knew that when corals died from disease it would alter the fundamental habitat of the system, which in turn would impact associated reef organisms. What I didn’t know was how to effectively document and quantify these changes. Sometimes we just need to alter our perspective to find the answers we are looking for. I started reviewing methods used by terrestrial researchers to measure landscape changes associated with landslides and erosion. In doing so I came across structure-from-motion (SfM) photogrammetry, and it was immediately clear that this technique could improve our understanding of coral reef ecosystems. I spent the next few years developing methods to use this approach underwater, and have since used SfM to detect changes in reef structure associated with disturbances as well as improve our understanding of coral diseases.

The first question I am usually asked is, “How easy is it to use this technique and what does it cost?” The best answer I can provide is that the logistical constraints depend on your research question. If you are interested in accuracy and controlling the parameters of the 3D reconstruction process, then you should use proprietary software like Agisoft PhotoScan or Pix4D. These programs give you full control, yet require more understanding of photogrammetry and substantial computing power. Autodesk ReCap can process images remotely, which reduces the need for a powerful computer, but also reduces your control over the 3D reconstruction process. At the simplest level, you can download the Autodesk 123D Catch app on your phone and create 3D reconstructions in minutes! There are also multiple open-source software options, but they tend to be less powerful and lack a graphical user interface. My advice is to start small. Get started with some simple and free open source tools such as Visual SfM or Bundler. Collect a few sets of images and get some experience with the processing steps to determine if the model outputs are applicable for your research approach.

The second question I receive is, “What is the best way to collect the images?” Unfortunately, the answer is not to use the ‘auto’ setting on your camera and just take a bunch of pictures. Image quality will directly affect the resolution of your model, and is also important for stitching and spatial accuracy. Spend time to understand the principles of underwater photography. A medium aperture (f-stop of 8 to 11) will let in enough light in ambient conditions while not causing blur and distortion associated with depth of field. Since images are taken while moving through a scene, a high enough shutter speed is required to eliminate blur and dark images. Since conditions can be highly variable, one must adapt to changes in light and underwater visibility while in the field. Cameras with auto-ISO can be helpful for dealing with changing light conditions while surveying. I also recommend DSLR or mirrorless cameras with high-quality fixed lenses, as they will minimize distortion and optimize overall resolution and clarity. For large areas I won’t use strobes because I take images from large distances off the reef, and this will typically create shadows in the images. I take images of the reef from both planar and oblique angles to capture as much of the reef scene as possible in order to eliminate ‘black holes’ in the resulting model. There is no ‘perfect approach,’ but you will need 70-80% overlap for accurate reconstruction. I swim in circular or lawn-mower patterns depending on the scene, and swear by the mantra that more is better (you can always throw out images later if there is too much overlap). It is worth investing time in experimenting with methods to develop a technique that works best for your study area and experimental design. SfM is a very flexible and dynamic tool, so don’t be afraid to create your own methods.

The third question is then, “How do you ground-truth the model for spatial accuracy?” This is a critical step that often gets overlooked. In order to achieve mm-scale accuracy, the software must be able to rectify the model to known x,y,z coordinates. I use mailbox reflectors connected by PVC pipe to create ground control points (GCPs) with known distances. The red color and white outline of the reflectors are easily distinguished and identified by the software, which saves a lot of time when optimizing the coordinates of the model. Creating functional GCPs is exceptionally important if spatial accuracy is required for your work. I also use several scale bars throughout my reef plots to check accuracy and scaling. This step of the process is critical for accurately measuring 3D habitat characteristics.

Maybe I’ve taken you too far into technical details at this point, but hopefully this helps for anyone looking to venture into the world of SfM. There is no perfect approach, and we must be adaptable as software continues to improve and new tools are constantly being created. We also need to continue to develop new methods for quantifying structure from 3D models. I export my models into geospatial software to extract structural information, but this step of the process can be improved with methods capable of annotating the true 3D surface of the models. As new software becomes available for annotating 3D surfaces we are entering an exciting phase with endless possibilities for collating and visualizing multiple forms of data. Being open-minded and creative with these techniques may provide new insight into how these environments function, and how we can protect them in the face of global stressors.

– Mahalo to John Burns for this in-depth guest posting. You can see more of his work, simultaneously beautiful and useful, at the Coral Health Atlas, and more of his remarkable 3D coral reef mapping work on Sketchfab.

CRESCYNT Toolbox – Open Science Framework supports reproducible science


The Open Science Framework, or OSF (osf.io), is a free and open source platform for supporting reproducible science. It’s designed more for documenting work than for streamlining work. It’s potentially a useful place to host a messy, spread-out collaborative research project, partly because of the add-ons it can connect with: (1) for storage: Amazon S3, Box, Dataverse, Dropbox, figshare, Google Drive, and GitHub; and (2) for references: Mendeley and Zotero. OSF also comes with a dashboard, a wiki, email notifications for your group, OSF file storage with built-in version control, data licensing background and assignment capability, the ability to apply permission controls, and the ability to make projects and components either private or public. Projects that you choose to make public can be assigned DOIs (which can be transferred if you move your project elsewhere).
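R users can also talk to OSF programmatically through the osfr package on CRAN; this isn’t covered in the guides below, so treat the following as a minimal sketch under the assumption that you have generated an OSF personal access token (the project title and file path are hypothetical).

```r
library(osfr)

# Authenticate with an OSF personal access token (generated in your OSF settings)
osf_auth(token = Sys.getenv("OSF_PAT"))

# Create a project (private by default) and upload a file to its OSF Storage
project <- osf_create_project(title = "Coral reef pilot analysis")  # hypothetical title
osf_upload(project, path = "analysis/reef_summary.csv")             # hypothetical file
```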

Aside from its primary role as a place to host research documentation and collaboration, OSF has also been used to teach classes in open science and reproducibility, and as a location to host conference products such as presentations and posters.

OSF is not a perfect platform for science – that elusive creature does not yet exist – but it’s a robust start: it integrates resources you may already be using, gets extra points for being free and open source, and could well be worth the learning curve on your next project. It continues to improve over time, and how will we know what to ask of a platform if we don’t wrestle a bit with what’s already been built?

Learn more at the Open Science Framework FAQs and OSF Guides
or on YouTube (where everyone seems to learn new software these days):
+ Getting Started with the OSF (2 mins) (start here!)
+ Most recent “OSF 101” intro webinar (1 hour)
+ Deep dive into the OSF (1 hour) (thumbs up!)
+ and more at OSF’s YouTube channel.

If you try it out, please let us know what you think!

Update: OSF now also connects with Bitbucket and ownCloud. See current Add-ons.



WELCOME to CRESCYNT – the Coral Reef Science and Cyberinfrastructure Network

The Coral Reef Science & Cyberinfrastructure Network (CRESCYNT) is a multi-tiered and multidisciplinary network of coral reef researchers, ocean scientists, cyberinfrastructure specialists, and computer scientists, and we invite you to join us.

[Figure: Scope of Sciences within EarthCube]

As an EarthCube Research Coordination Network, our goals are to foster a dynamic, diverse, durable, and creative community; to collectively consider and develop standards and resources for open data, research documentation, and data interoperability while making best use of work already accomplished by others; and to offer input to those groups within EarthCube who will ultimately create the data architecture for all of EarthCube. Along the way CRESCYNT expects to collect and share community resources and tools, and to offer training opportunities in topics prioritized by our members through widely accessible formats such as webinars and their recordings. We will also work to nurture unforeseen collaborative opportunities that emerge from our integrated collective work.

Because the coral reef community has exceptionally diverse data structures and analysis requirements needed to forward integrative science, it is an exemplar for cyberinfrastructure-enabled advances to other geosciences communities. The CRESCYNT network is working to match the data sources, data structures, and analysis needs of the coral reef community with current advances in data science, visualization, and image processing from multiple disciplines to advance coral reef research and meet the increasing challenges of conservation. The network has begun to assemble to coordinate, plan, and prioritize cyberinfrastructure needs within the coral reef community.

[Figure: Workflows within CRESCYNT, from participants to nodes to collective project outputs]

The structure of CRESCYNT is a network of networks, currently including 18 disciplinary nodes and 7 technological nodes, where each node represents an area of coral reef science (disciplinary nodes: e.g., microbial diversity, symbiosis regulation, disease, physiology & fitness, reef ecology, fish & fisheries, conservation & management, biogeochemistry, oceanography, paleontology, geology) or an area of computer science or technical practice (technological nodes: e.g., visualization, geospatial analysis & mapping, image analysis, legacy & dark data, database management). These nodes may expand, coalesce, or divide to meet the needs and interests of the subdisciplinary communities, while maintaining connections to CRESCYNT through node coordinators and ongoing network activities. We invite you to become a member of CRESCYNT, join one or more nodes that would advance your own work, and collaborate on shared resources and tools for the coral reef community. In doing so you will help ensure that the data architecture and cyberinfrastructure of EarthCube meet the needs of the coral reef community, and that broader data interoperability within EarthCube benefits both coral reefs and our ability to answer complex questions.

PLEASE VISIT OUR WEBSITE at http://crescynt.org to enroll in CRESCYNT, join a node, work on tasks, discuss data and research priorities, and help determine the future shape of cyberinfrastructure for supporting coral reef research and other geoscience work. This collaborative work is supported by the NSF EarthCube initiative.  Dr. Ruth D. Gates, Director of the Hawaii Institute of Marine Biology, University of Hawaii, is the Principal Investigator of the CRESCYNT project. The CRESCYNT blog is written by Dr. Ouida Meier, the project’s program manager (crescyntrcn@gmail.com).

This material is based upon work supported by the National Science Foundation under Grant Number 1440342. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

