CRESCYNT Toolbox – Data Repositories – Estate Planning for your Data

“Hypotheses come and go but data remain.”    – Ramon y Cajal

Taking care of our data for the long term is not just good practice, allowing us to share our data, defend our work, reassess conclusions, collaborate with colleagues, and examine broader scales of space and time – it’s also estate planning for our data, and a primary way of communicating with future scientists and managers.

Here are some great options for long-term data storage, highlighting repositories friendly to coral reef science.

First, there are some important repository networks useful for coral reef data – these can unify standards and offer collective search portals: we like DataONE (members here) and bioCaddie (members here).

KNB – the Knowledge Network for Biocomplexity offers open and private data uploads; ecological orientation. DataONE network.

NOAA CoRIS: Coral Reef Information System – often free to use and can accept coral reef related data beyond NOAA’s own data; contact them first.

BCO-DMO – Biological and Chemical Oceanography Data Management Office – if you have an NSF grant that requires data storage here, you’re fortunate. Good data management guidelines and metadata templates, excellent support staff. Now a DataONE member.

Dataverse – supported by Harvard endowments. There are multiple organizational dataverses – the Harvard Dataverse is free to use. bioCaddie member.

Zenodo – free to use, supported by the European Commission (this is a small slice of CERN’s enormous repository for the Large Hadron Collider). Assigns DOIs. We invite you to include the “Coral Reef” community when you upload. bioCaddie member.

NCBI – the National Center for Biotechnology Information is very broadly accepted for ‘omics data of all types. A bioCaddie member.

DataCite – not a repository, but if you upload a dataset to a repository that does not assign its own DOIs, you can get one at DataCite and include it when publishing your datasets.

We’ve not listed more costly repositories such as Dryad (focused on journal requirements) or repositories restricted to institutions. What about other storage options such as GitHub, Amazon Web Services, websites? Those have important uses, but are not curated repositories with long-term funding streams, so are not the best data legacy options.

Most of these repositories allow either private (closed) or public (open) access, or later conversion to open access. Some have APIs for automated access within workflows. These are repositories we really like for storing and accessing coral reef work. Share your favorite long-term data repository – or experiences with any of the repositories listed here – in the comments.
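As a rough illustration of automated access, here is a minimal Python sketch that builds a search request against Zenodo's public records endpoint. The helper names (`build_search_url`, `fetch_titles`) are hypothetical, the community identifier in the usage note is a placeholder rather than the real slug of the “Coral Reef” community, and the JSON layout read in `fetch_titles` is an assumption to check against Zenodo's current API documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public search endpoint for Zenodo records.
ZENODO_RECORDS_URL = "https://zenodo.org/api/records"


def build_search_url(query, community=None, size=10):
    """Build a Zenodo search URL (hypothetical helper, for illustration only)."""
    params = {"q": query, "size": size}
    if community:
        # Restrict results to one community; the identifier is supplied by the caller.
        params["communities"] = community
    return f"{ZENODO_RECORDS_URL}?{urlencode(params)}"


def fetch_titles(url):
    """Fetch one page of results and return the record titles.

    Assumes the response is JSON shaped like {"hits": {"hits": [...]}},
    with each hit carrying a "metadata" object that has a "title".
    """
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    hits = payload.get("hits", {}).get("hits", [])
    return [hit.get("metadata", {}).get("title", "") for hit in hits]
```

Calling `fetch_titles(build_search_url("coral reef", community="your-community-id"))` would return the titles on the first page of matches. Other repositories with APIs follow a broadly similar pattern, though their endpoints and parameters differ.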

CoralNet: deploying deep learning in the shallow seas – by Oscar Beijbom

Having dedicated my PhD to automating the annotation of coral reef survey images, I have seen my fair share of surveys and talked to my fair share of coral ecologists. In these conversations, I always heard the same story: collecting survey images is quick, fun and exciting. Annotating them is, on the other hand, slow, boring, and excruciating.

When I started CoralNet (coralnet.ucsd.edu) back in 2012 the main goal was to make the manual annotation work less tedious by deploying automated annotators alongside human experts. These automated annotators were trained on previously annotated data using what was then the state-of-the-art in computer vision and machine learning. Experiments indicated that around 50% of the annotation work could be done automatically without sacrificing the quality of the ecological indicators (Beijbom et al. PLoS ONE 2015).

The Alpha version of CoralNet was thus created and started gaining popularity across the community. I think this was partly due to the promise of reduced annotation burden, but also because it offered a convenient online system for keeping track of and managing the annotation work. By the time we started working on the Beta release this summer, the Alpha site had over 300,000 images with over 5 million point annotations – all provided by the global coral community.

There was, however, a second purpose of creating CoralNet Alpha. Even back in 2012 the machine learning methods of the day were data-hungry. Basically, the more data you have, the better the algorithms will perform. Therefore, the second purpose of creating CoralNet was quite simply to let the data come to me rather than me chasing people down to get my hands on their data.

At the same time, the CoralNet Alpha site was starting to buckle under increased usage. Long queues started to build up in the computer vision backend as power users such as NOAA CREP and Catlin Seaview Survey uploaded tens of thousands of images to the site for analysis assistance. The time was ripe for an update.

As it turned out, the timing was fortunate. A revolution has taken place in the last few years with the development of so-called deep convolutional neural networks. These large and immensely powerful nets can learn from vast databases and achieve far better performance than methods from the previous generation.

During my postdoc at UC Berkeley last year, I researched ways to adapt this new technology to the coral reef image annotation task in the development of CoralNet Beta. Leaning on the vast database accumulated in CoralNet Alpha, I tuned a net with 14 hidden layers and 150 million parameters to recognize over 1,000 types of coral substrates. The results, which are in preparation for publication, indicate that between 80% and 100% of the annotation work can be automated, depending on the survey. Remarkably, in some situations the classifier is more consistent with the human annotators than those annotators are with themselves. Indeed, we show that the combination of confident machine predictions with human annotations beats both the human and the machine alone!
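One way to picture the combination of confident machine predictions with human annotations is a simple confidence threshold: labels the classifier is sure about are accepted automatically, and the remaining points go to a human expert. The Python sketch below is an illustration under stated assumptions, not CoralNet's actual code. The function names, the 0.9 threshold, and the example labels are all hypothetical.

```python
def split_by_confidence(predictions, threshold=0.9):
    """Split point predictions into auto-accepted labels and a human review queue.

    predictions: iterable of (point_id, label, confidence) tuples,
                 with confidence in the range [0, 1].
    threshold:   hypothetical cutoff; in practice it would be tuned per survey.
    """
    auto_accepted = []  # (point_id, label) pairs taken from the classifier
    needs_human = []    # point ids left for a human annotator
    for point_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_accepted.append((point_id, label))
        else:
            needs_human.append(point_id)
    return auto_accepted, needs_human


def automation_rate(auto_accepted, needs_human):
    """Fraction of points that were labeled without human input."""
    total = len(auto_accepted) + len(needs_human)
    return len(auto_accepted) / total if total else 0.0


# Made-up example data for four annotation points.
example = [
    ("p1", "Porites", 0.97),
    ("p2", "Sand", 0.55),
    ("p3", "CCA", 0.92),
    ("p4", "Turf", 0.40),
]
auto, manual = split_by_confidence(example)
rate = automation_rate(auto, manual)
```

With this made-up data, two of the four points clear the threshold, so half of the work is automated. Raising the threshold sends more points to the human annotator, and lowering it automates more at the risk of accepting wrong labels.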

Using funding from NOAA CREP and CRCP, I worked together with UCSD alumnus Stephen Chan to develop CoralNet Beta: a major update which includes migration of all hardware to Amazon Web Services and a brand new, highly parallelizable computer vision backend. Using the new backend, the 350,000 images on the site were re-annotated in one week! Software updates include improved search, import, export and visualization tools.

With the new release in place we are happy to welcome new users to the site; the more data the merrier!

_____________

– Many thanks to Oscar Beijbom for this guest posting as well as significant technological contributions to the analysis and understanding of coral reefs. You can find Dr. Beijbom on GitHub, or see more of his projects and publications here. You can also find a series of video tutorials on using CoralNet (featuring the original Alpha interface) on CoralNet’s Vimeo channel, and technical details about the new Beta version in the release notes.
