CRESCYNT.org website comes home


The CRESCYNT Coral Reef Science & Cyberinfrastructure Network was funded as a Research Coordination Network under an NSF EarthCube grant. EarthCube is a large-scale program that brings research scientists, software developers, and data specialists together in one place to build tools that ease the challenges of integrating data and executing analytical workflows, accelerating discovery in the geosciences.

The EarthCube project has a new office and a few more years of funding, and as part of building a new website it is offloading web hosting for the funded projects themselves, so the undertakings and outcomes of the CRESCYNT project will no longer be listed at earthcube.org. For archiving, here is a PDF copy of the CRESCYNT.org front-page content.


In Memory of Ruth D. Gates


Dear Friends,

It has been very hard to say goodbye to the incomparable Ruth D. Gates.

Her ashes were taken aboard the Hōkūleʻa and released into Kāneʻohe Bay, Hawaiʻi in a ceremony on November 18, 2018.

If you miss her voice, as we do, the Gates Coral Lab site has links to her publications and to some videos featuring Ruth. We love this one, where Ruth gave a 30-minute social media interview answering people’s questions. For introductory teaching and outreach we love this one, where Ruth explains corals and coral reefs in an extraordinarily accessible way.

If you wish to read them, here are links to some in memoriam articles and obituaries from multiple perspectives: The Atlantic, the University of Hawaii, The New York Times, the ARCS Foundation, and the Honolulu Civil Beat.

When asked by an audience member at a screening of Chasing Coral, “What should I do to help corals and reduce human impact on reefs?” her response was essentially: 1 – Pick Something, and 2 – Start. She believed in the power of every human being to step up and make a difference in some way, knew the diversity of talents and skills and people essential to moving forward, and through the radiance of her brilliance and passion, inspired us all to do more.

The Gates Coral Lab projects that were already in motion will continue with the students, postdocs and staff  whose growth she fostered. We know we share with you this common yoke and dedication to the work of inquiry and protection for coral reefs and the planet.  Those of us left behind may not have Ruth’s spectacular eloquence, insight, brilliance and spark, but we all can and must do what we are capable of doing to advocate for coral reefs, the planet, and its people.

Ruth had a vision of coral reef science as deeply collaborative, deeply integrative and multidisciplinary, and essentially multiscale. She also loved seeing scientists, managers, data and technical people working together and alongside educators, artists, and citizens for the good of coral reefs and the planet. We are all part of her collective legacy.

We’ll close with Ruth’s own words, taken from an interview posted at Paul G. Allen Philanthropies. In addition to the scientific acuity and perceptiveness of her research, Ruth’s practical and forward-looking philosophy is part of how she inspired so many.

What would you recommend we do in our daily lives to stem the loss or support the survival of corals? 

We can all play a part in a solution to the problem on reefs and the solution for the planet. It really depends on what you want to do. We’re all discussing green energy, the move towards electric cars, the embracing of solar technology that is a much cleaner source to offset the burning of fossil fuels. If you’re somebody who’s politically active you could lobby your local politician to support all action that would advance an agenda that would really help protect the coral reef. Maybe you’re somebody who wants to go out and clean up a beach because everything that we take off that beach will no longer wash onto the reef and potentially damage it. Maybe you’re a boater who drops an anchor, and instead of dropping an anchor you could talk to your local managers and ask for them to put a permanent mooring in so that when you tie up your boat you don’t drag your anchor across a reef. I think what I always say to people is choose a solution that is best for you. And start doing it.

What makes you most optimistic about the work you are doing?

I’m a believer that people can pretty much do anything. Unfortunately, we seem to prefer to fix problems than to stop the problem in the first place. But we have a problem. It’s a challenge and there are many young scientists who are committed to solving the challenge of “can we save the reefs?” I’m really invigorated by the energy from the young scientists I work with. I’m invigorated by the amount of creativity that a problem of this magnitude forces in the scientific communities. I feel we can solve any problem. But to do that we have to collaborate with people that we potentially don’t usually work with. It forces us really to change the way we as scientists do business, I think. To me that’s a very exciting framework.

Ruth D. Gates, March 28, 1962 – October 25, 2018


Using RStudio with GitHub for Securing Your Work, Versioning, and Collaboration


RStudio and GitHub go together like two wheels of a bicycle. Together, they form a low-overhead yet powerful open source workbench – a lean machine that can help take your data to far places.

In a recent informal evaluation of coral reef related research articles that included simultaneous publication of code and data, by far the most popular language used was R, and RStudio is the most popular interface for working with R.

In a CRESCYNT Data Science for Coral Reefs: Data Integration and Team Science workshop held earlier this year at NCEAS, the most powerful skill introduced was using RStudio with GitHub: writing data and code to GitHub from RStudio.

Once the link is set up, work can continue in RStudio just as before, with periodic commits to GitHub to save the work and potentially pave the way for collaboration. The basic setup:

  1. Download and install R
  2. Download and install RStudio
  3. Create a GitHub account
  4. Connect a repository in the GitHub account to RStudio. This takes multiple steps; here are some good options for working through the process, and a rough sketch follows this list.
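
Step 4 is where most people need the most help. As one rough sketch of what a console-based route can look like (an optional path using the usethis and gitcreds R packages, which are not part of the tutorials linked here; the names, email address, and repository below are placeholders):

    # install.packages(c("usethis", "gitcreds"))   # one-time installs
    library(usethis)

    # Tell Git who you are (once per computer)
    use_git_config(user.name = "Jane Reefscientist",
                   user.email = "jane@example.edu")

    # Store a GitHub personal access token so RStudio can push on your behalf
    # (create the token on github.com first, then paste it when prompted)
    gitcreds::gitcreds_set()

    # Option A: clone an existing GitHub repository as a new RStudio project
    create_from_github("your-username/your-repo")

    # Option B: put a current RStudio project under version control
    # and create a matching repository on GitHub
    use_git()
    use_github()

The RStudio menus (File > New Project > Version Control) accomplish the same connection without any typed commands; the tutorials below demonstrate that route step by step.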

You can use sections of NCEAS’s long tutorial on Introduction to Open Data Science, initially developed by their Ocean Health Index group. Use the sections on overview of R and RStudio, Markdown, intro to GitHub, and skip down to collaboration in GitHub.

There are a number of other tutorials that show how to set up and use these tools together; a beautifully clean and clear step-by-step tutorial comes from Resources at GitHub; another excellent one is from Support at RStudio.

Also available to you: Hadley Wickham on Git and GitHub, a Study Group’s Version Control with RStudio and GitHub Simply Explained, R Class’s An Introduction to Git and How to Use it with RStudio, U Chicago’s Using Git within RStudio, and Happy Git’s Connect RStudio to Git and GitHub. You may prefer the style of one of these over the others.

If later you want to go further, come back for these tutorials hosted at R-Bloggers: R Blogdown Setup in GitHub, and Migrating from GitHub to GitLab with RStudio. And good news – you can now archive a snapshot of a GitHub repository to preserve and even publish a particular version of your RStudio work – plus get a DOI to share – at Zenodo.

Summary: Many research scientists use RStudio as their primary analytical and visualization tool. RStudio can connect to a GitHub repository and make commits to it directly. This enables critical core functions for a simplified workbench: documenting workflows (R Markdown), preserving code and provenance, producing repeatable results, creating flexible pipelines, sharing data and code, and allowing collaboration among members of a team. Versioning and teamwork are simplified by committing frequently and always pulling fresh changes before committing (rather than focusing on branch development). The process is valuable for individual researchers, for documenting project work, and for collaborating in teams.
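
For the curious, the routine the summary describes might look roughly like this from the R console, using the gert package as one illustration (an assumption on our part – the Git pane in RStudio offers the same Pull, Commit, and Push buttons with no code at all, and the commit message is only an example):

    library(gert)

    git_pull()                      # always start by pulling the team's latest changes

    changed <- git_status()$file    # list the files you have modified locally
    git_add(changed)                # stage those files
    git_commit("Tidy benthic survey data and update the analysis notebook")

    git_push()                      # share the new commit back to GitHub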

Related blogposts: Learning to Love R More and R Resources for Graphing and Visualization.

>>>Go to the blog Masterpost or the CRESCYNT website or NSF EarthCube.<<<


Metadata Dream Team responds to request for recommendations for coral reef data

CRESCYNT-DDStudio participants at UCSD Scripps, 2018-08-15. Clockwise from top right (back): Ilya Zaslavsky, Zachary Mason, Samantha Clements, Karen Stocks, Ted Habermann, David Valentine, Hannah Ake, Eric Lingerfelt, Gary Hudman, Sarah O’Connor, Ouida Meier. Not pictured: Gastil Gastil-Buhl, Stephen Richard, Tom Whitenack.

When the EarthCube CRESCYNT Coral Reef Science and Cyberinfrastructure Network concluded two major workshops in March 2018, some clear and interesting outcomes emerged in addition to the practical training and data exploration goals accomplished. Both workshops were structured around Data Science for Coral Reefs. At the end of the first, focused on Data Rescue and data management, participants decided that the most important new topic they had learned about was metadata and its uses. At the end of the second, focused on Data Integration and Team Science, participants had come to realize how essential good metadata is for making datasets at disparate scales work well together. These metadata lessons were important emergent outcomes, and participants asked that data and metadata experts get together, draw on the data challenges that arose, and recommend metadata practices and standards that would work for the coral reef community and its very broad range of data types, repositories, and pre-repository needs for research, storage, sharing, and analysis.

We were fortunately able to do exactly that with one final workshop. Through a jointly staged CRESCYNT-DDStudio workshop, we brought together a group of metadata experts, coral reef data managers, representative scientists, and the EarthCube Data Discovery Studio’s scientists and software developers, who focus on metadata enhancement for finding and using data.

Special guests included Ted Habermann (Metadata 2020 project, co-author of “The influence of community recommendations on metadata completeness”); Stephen Richard (experience with schema.org and metadata standards authoring); coral reef data experts Gastil Gastil-Buhl (Moorea Coral Reef LTER), Hannah Ake (BCO-DMO), and Sarah O’Connor and Zachary Mason (NOAA NCEI’s user metadata writing interface and CoRIS), representing the three biggest formal repositories for coral reef research data in the US or sponsored by NSF; Eric Lingerfelt, the EarthCube Technical Officer; guests from Scripps; and DDStudio team members Ilya Zaslavsky, Karen Stocks, Gary Hudman, David Valentine, and Tom Whitenack, with their broad and integrative metadata, software, and domain expertise. Ilya and Karen kindly hosted the group at UCSD’s San Diego Supercomputer Center and Scripps Institution of Oceanography.

Important outcomes from the workshop were mutualistic for the two projects. For CRESCYNT, they included cross-mapping an essential set of metadata (as defined by appropriate community repositories) to web standards and producing a draft ISO metadata profile for coral reef data at two levels of dataset access: (1) discovery and sharing (a simpler form with freeform text entry in many of the fields), and (2) understanding and usability at the workbench level (a more detailed form with options to supply more highly specified fields). We will finish writing these and offer them to the coral reef community for feedback and potential adoption.
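
For a flavor of what the simpler, discovery-and-sharing level might look like when mapped to schema.org’s Dataset vocabulary (one of the web standards considered), here is a small sketch in R using the jsonlite package. This is purely an illustration with invented field values, not the draft profile itself:

    library(jsonlite)

    record <- list(
      "@context"       = "https://schema.org/",
      "@type"          = "Dataset",
      name             = "Example benthic cover survey, Kaneohe Bay",    # placeholder title
      description      = "Freeform summary of what was measured, where, when, and how.",
      creator          = list("@type" = "Person", name = "A. Researcher"),
      keywords         = c("coral reef", "benthic cover", "Hawaii"),
      temporalCoverage = "2016-06/2016-08",
      spatialCoverage  = "Kaneohe Bay, Oahu, Hawaii",
      license          = "https://creativecommons.org/licenses/by/4.0/"
    )

    toJSON(record, pretty = TRUE, auto_unbox = TRUE)   # serialize to JSON-LD text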

For the Data Discovery Studio (formerly known as Data Discovery Hub), important outcomes included exploration of the use of the enhanced metadata at different repositories and in science use cases (including the coral reef use case), a deep dive into focusing the future trajectory of Data Discovery Studio, and some initial planning for an upcoming data science competition that will involve the coral reef data (details to be announced). Read more about DDStudio and its broader work, and be on the alert for a Data Discovery Science Competition in January 2019!

We gratefully acknowledge the generosity of our hosts, workshop travel support from NSF, the active work and engagement of our participants, and the organizations that allowed their employees time to attend and contribute to this collective effort.

>>>Go to the blog Masterpost or the CRESCYNT website or NSF EarthCube.<<<


On Preserving Coral Reef Imagery and Related Data – by James W Porter

James Porter’s coral photo monitoring project in Discovery Bay

In preparation for an upcoming Data Science for Coral Reefs: Data Rescue workshop, Dr. James W. Porter of the University of Georgia spoke eloquently about his own efforts to preserve historic coral reef imagery captured in Discovery Bay, Jamaica, from as early as 1976. It’s a story from the trenches, told from a senior scientist’s perspective, outlining the effort and steps needed to preserve critical data, in this case imagery characterizing a healthy reef more than 40 years ago.

Enjoy this insightful 26-min audio description, recorded on 2018-01-04.

 

Transcript from 2018-01-04 (lightly edited):

This is Dr. Jim Porter from the University of Georgia. I’m talking about the preservation of a data set that is at least 42 years old now and started with a photographic record that I began making in Discovery Bay, Jamaica on the north coast of Jamaica in 1976. I always believed that the information that photographs would reveal would be important specifically because I had tried other techniques of line transecting and those were very ephemeral. They were hard to relocate in exactly the same place. And in addition to that they only captured a line’s worth of data. And yet coral reefs are three dimensional and have a great deal of material on them not well captured in the linear transect. So those data were… I was very consistent about photographing from 1976 to 1986.

But eventually funding ran out and I began focusing on physiological studies. But toward the end of my career I realized that I was sitting on a gold mine. So, the first thing that’s important when considering a dataset and whether it should be preserved or not is the individual’s belief in the material. Now it’s not always necessary for the material to be your own for you to believe in it. For instance, I’m working on Tom Goreau, Sr.’s collection which I have here at the University of Georgia. I neither made it nor in any way contributed to its preservation but I’ve realized that it’s extremely important and therefore I’m going to be spending a lot of time on it. But in both cases, the photographic record from Jamaica, as well as the coral collection itself – those two activities have in common my belief in the importance of the material.

The reason that the belief in the material is so important is that the effort required to capture and preserve it is high, and you’ve got to have a belief in the material in order to take the steps to assure the QA/QC of the data you’re preserving, as well as the many hours required to put it into digital format. And believing in the material then should take another step, which is a very self-effacing review of whether you believe the material to be of real significance to others. There’s nothing wrong with memorabilia. We all keep scrapbooks and photographs that we like – things relating to friends and family, and times that made us who we are as scientists and people. However, the kind of data preservation that we’re talking about here goes beyond that – could have 50 or 100 years’ worth of utility.

Those kinds of data really do require them to be of some kind of value, and the value could either be global, regional, or possibly even local. Many local studies can be of importance in a variety of ways: the specialness of the environment, or the possibility that people will come back to that same special environment in the future. The other thing that then is number two on the list – first is belief in the material – second is you’ve got to understand that the context in which you place your data is much more important to assure its survival and utility than the specificity of the data. Numbers for their own sake are numbers. Numbers in the service of science become science. It is the context in which you place your data that will assure its future utility and preservation.



CRESCYNT Data Science For Coral Reefs Workshop 2 – Data Integration and Team Science


We’re extremely pleased to be able to offer two workshops in March 2018 at NCEAS. The second is CRESCYNT Data Science for Coral Reefs Workshop 2: Data Modeling, Data Integration and Team Science. Apply here.

When: March 12-15, 2018
Where: NCEAS, Santa Barbara, CA

Workshop description:

This workshop is recommended for early-career, mid-career, and senior scientists who are interested in applying technical skills to collaborative research questions and committed to subsequently sharing what they learn. Participants will learn how to structure and combine heterogeneous data sets relevant to coral reef scientists in a collaborative way. Days 1 and 2 of the workshop will cover reproducible workflows using R/RStudio and RMarkdown, collaborative coding with GitHub, strategies for team research, data modeling and data wrangling, and advanced data integration and visualization tools. Participants will then spend 2 days working in small teams to integrate various coral reef datasets, practicing the skills learned and developing workflows for data tidying and integration.

The workshop is limited to 20 participants. We encourage you to apply via this form. Workshop costs will be covered with support from the NSF EarthCube CRESCYNT RCN. We anticipate widely sharing workshop outcomes, including workflows and recommendations. Anticipate some significant pre-workshop preparation effort.

Related posts: Learning to Love R More and R Resources for Visualization

UPDATE: HERE IS THE AGENDA FOR THE WORKSHOP, WITH TRAINING LINKS.

>>>Go to the blog Masterpost or the CRESCYNT website or NSF EarthCube.<<<


CRESCYNT Toolbox – Data Cleaning


Data cleaning. Data cleansing. Data preparation. Data wrangling. Data munging.

Garbage In, Garbage Out.

If you’re like most people, your data is self-cleaning, meaning: you clean it yourself! We often hear that 80% of our “data time” is spent in data cleaning to enable 20% in analysis. Wouldn’t it be great to work through data prep faster and keep more of our data time for analysis, exploration, visualization, and next steps?

Here we look over the landscape of tools to consider, then come back to where our feet may be right now to offer specific suggestions for workbook users – lessons learned the hard way over a long time.

The end goal is for our data to be accurate, human-readable, machine-readable, and calculation-ready.

Software for data cleaning:

RapidMiner may be the best free (for academia) non-coding tool available right now. It was built for data mining, which doesn’t have to be your purpose for it to work hard for you. It has a diagram interface that’s very helpful. It almost facilitates a “workflow discovery” process as you incrementally try, tweak, build, and re-use workflow paths that grow during the process of data cleaning. It makes quick work of plotting histograms for each data column to instantly SEE distributions, zeros, outliers, and number of valid entries. It also records and tracks commands (like a baby Jupyter notebook). When pulling in raw datasets, it automatically keeps the originals intact: RapidMiner makes changes only to a copy of the raw data, and then one can export the finished files to use with other software. It’s really helpful in joining data from multiple sources, and pulling subsets for output data files. Rapid Miner Studio: Data Prep.

R is popular in domain sciences and has a number of powerful packages that help with data cleaning. Make use of RStudio as you clean and manipulate data with dplyr and tidyr. New packages are frequently released, such as assertr, janitor, and dataMaid. A great thing about R is its active community in supporting learning. Check out this swirl tutorial on Getting and Cleaning Data – or access it through DataCamp. The most comprehensive list of courses on R for data cleaning is here via R-bloggers. There’s lovely guidance for data wrangling in R by Hadley Wickham – useful even outside of R.
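
As a small, hypothetical taste of how these packages fit together (the file and column names below are invented, and the right steps will differ for your own data):

    library(dplyr)
    library(tidyr)
    library(janitor)
    library(assertr)

    raw <- read.csv("survey_raw.csv", stringsAsFactors = FALSE)   # hypothetical input file

    clean <- raw %>%
      clean_names() %>%                        # janitor: consistent, machine-readable column names
      mutate(date = as.Date(date)) %>%         # enforce one kind of data per column
      drop_na(site, date) %>%                  # tidyr: drop rows missing key fields
      verify(percent_cover >= 0 & percent_cover <= 100)   # assertr: halt if values are impossible

    write.csv(clean, "survey_clean_20180101.csv", row.names = FALSE)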

Data cleaning tool recommendations by KDnuggets, Quora, and Varonis are a little dated and business-oriented, but these survivors may be worth investigating:

  • Trifacta Wrangler was built for desktop use, and designed for many steps of data wrangling: cleaning and beyond. See intro video, datasheet, demo with Tableau.
  • DataCleaner – community or commercial versions; can use SQL databases. Mostly designed for business applications; videos show what it can do.
  • OpenRefine gets the legacy spotlight (was Google Refine… now community held). Free, open source, and still in use. Here’s a recent walkthrough. Helps fix messy text and categorical data; less useful for other science research data.

There are some great tools to borrow (okay, steal) that started in data journalism:

  • Tabula is “a tool for liberating data tables trapped inside PDF files” – extracts text-based pdfs (not scans) to data tables.
  • csvkit is “a suite of command-line tools for converting to and working with CSV, the king of tabular file formats.” Helpful for converting Excel to csv cleanly, csv to json, json to csv, working with sql, and more.
  • agate is “a Python data analysis library that is optimized for humans instead of machines…. an alternative to numpy and pandas that solves real-world problems with readable code.” Here’s the cookbook.

Finally, Python itself is clearly a very powerful open source tool available for data cleaning. Look into it with this DataCamp course, pandas and other Python libraries, or this kaggle competition walkthrough.

Manual Data Munging. If you’re using Excel, Open Office, or Google Sheets to clean your data (e.g., small complex datasets common to many kinds of research), you may know all the tricks you need. For those newer to data editing, here are some tips.

  • To start: save a copy of your original file with a new name (e.g., tack on “initials-mod” plus the current date: YYYYMMDD). Then make your original file read-only to protect it. Pretend it’s in an untouchable vault. Use only your modifiable copy.
  • Create a Changes page where you record the edits you make in the order you make them. This also lets you scribble notes for changes you plan to make or items you need to track down but haven’t yet executed (Done and To-Do lists).
  • First edit: if you don’t have a unique ID for each row, add a first column with a simple numeric sequence before doing anything else.
  • Create a copy of that spreadsheet page, leave the original intact, and make modifications only to the newly copied page. If each new page is created on the left, the older pages are allowed to accumulate to the right (less vulnerable to accidental editing). Name each tab usefully.
  • Second edit: if your column headings take up more than one row, consolidate that information to one row. Do not Merge cells. Include units but no special characters or spaces: use only letters, numbers, dashes and underlines.
  • Add a Data Definitions page to record your old column headings, your new column headings, and explain what each column heading means. Include units here and also in column headings where possible.
  • In cells with text entries, do not use bare commas. Either use semicolons and dashes instead of commas in your text, or enclose text entries in quotation marks (otherwise stray commas create havoc when exporting to and importing from csv).
  • Add a Comments column, usually at the end of other columns, to record any notes that apply to individual rows or a subset of rows. Hit Save, now and often.
  • Now you’re free to sort each column to find data entry typos (e.g., misplaced decimals), inconsistent formats, or missing values. The danger here is failing to select the entire spreadsheet before sorting – always select the square northwest of cell A1 (or convert the spreadsheet to a table). This is where you’ll be glad you numbered each row at the start: to compare with the original.
  • If there’s a short note like data source credit that MUST accompany the page and must not get sorted, park it in the column header row to the right of the meaningful headers so it won’t get sorted, lost, or confused with actual data.
  • If you use formulas, document the formulas in your Data Definitions page (replace cells with column_names), and copy-paste as value-only as soon as practical.
  • Make sure there is only one kind of data in each column: do not mix numeric and text entries. Instead, create extra columns if needed.
  • Workbooks should be saved each day of editing with that day’s date (as YYYYMMDD) as part of the filename so you can get back to an older copy. At the end of your session clean up your Changes page, moving To-Do to Done and planning next steps.

Find more spreadsheet guidance here (a set of guidelines recently developed for participants in another project – good links to more resources at its end).

Beyond Workbooks. If you can execute and document your data cleaning workflows in a workbook like Excel, Open Office, or Google Sheets, then you can take your data cleaning to the next level. Knowing steps and sequences appropriate for your specific kinds of datasets will help enormously when you want to convert to using tools such as RapidMiner, R, or Python that can help with some automation and much bigger datasets.
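
As one hedged sketch of that transition in R (file, sheet, and column names below are placeholders), several of the manual steps above translate fairly directly into a short, repeatable script:

    library(readxl)
    library(dplyr)
    library(janitor)

    dat <- read_excel("my_data-mod-20180101.xlsx", sheet = "data_v2") %>%
      clean_names()                            # one-row, machine-readable column headings

    dat <- dat %>%
      mutate(row_id = row_number()) %>%        # unique ID per row, as recommended above
      select(row_id, everything())

    summary(dat)        # spot misplaced decimals, zeros, and impossible values
    sapply(dat, class)  # confirm only one kind of data in each column

    write.csv(dat, "my_data-clean-20180101.csv", row.names = FALSE)   # text fields exported quoted

Keeping a script like this alongside your Changes page gives the same provenance as the manual log, with the bonus that it can be re-run when new data arrive.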

Want more depth? Check out Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders. “If you are not good at data preparation, you are NOT a good data scientist…. The validity of any analysis is resting almost completely on the preparation.” – Claudia Perlich

Happy scrubbing! Email or comment with your own favorite tips. Cheers, Ouida Meier

 

>>>Go to NSF EarthCube or the CRESCYNT website or the blog Masterpost.<<<
