
In preparation for an upcoming Data Science for Coral Reefs: Data Rescue workshop, Dr. James W. Porter of the University of Georgia spoke eloquently about his own efforts to preserve historic coral reef imagery captured in Discovery Bay, Jamaica, from as early as 1976. It’s a story from the trenches with a senior scientist’s perspective, outlining the effort and steps needed to accomplish preservation of critical data, in this case characterizing a healthy reef over 40 years ago.
Enjoy this insightful 26-min audio description, recorded on 2018-01-04.
Transcript from 2018-01-04 (lightly edited):
This is Dr. Jim Porter from the University of Georgia. I’m talking about the preservation of a data set that is at least 42 years old now and started with a photographic record that I began making in Discovery Bay, Jamaica on the north coast of Jamaica in 1976. I always believed that the information that photographs would reveal would be important specifically because I had tried other techniques of line transecting and those were very ephemeral. They were hard to relocate in exactly the same place. And in addition to that they only captured a line’s worth of data. And yet coral reefs are three dimensional and have a great deal of material on them not well captured in the linear transect. So those data were… I was very consistent about photographing from 1976 to 1986.
But eventually funding ran out and I began focusing on physiological studies. But toward the end of my career I realized that I was sitting on a gold mine. So, the first thing that’s important when considering a dataset and whether it should be preserved or not is the individual’s belief in the material. Now it’s not always necessary for the material to be your own for you to believe in it. For instance, I’m working on Tom Goreau, Sr.’s collection which I have here at the University of Georgia. I neither made it nor in any way contributed to its preservation but I’ve realized that it’s extremely important and therefore I’m going to be spending a lot of time on it. But in both cases, the photographic record from Jamaica, as well as the coral collection itself – those two activities have in common my belief in the importance of the material.
The reason that the belief in the material is so important is that the effort required to capture and preserve it is high, and you’ve got to have a belief in the material in order to take the steps to assure the QA/QC of the data you’re preserving, as well as the many hours required to put it into digital format. And believing in the material then should take another step, which is a very self-effacing review of whether you believe the material to be of real significance to others. There’s nothing wrong with memorabilia. We all keep scrapbooks and photographs that we like – things relating to friends and family, and times that made us who we are as scientists and people. However, the kind of data preservation that we’re talking about here goes beyond that – could have 50 or 100 years’ worth of utility.
Those kinds of data really do require them to be of some kind of value, and the value could either be global, regional, or possibly even local. Many local studies can be of importance in a variety of ways: the specialness of the environment, or the possibility that people will come back to that same special environment in the future. The other thing that then is number two on the list – first is belief in the material – second is you’ve got to understand that the context in which you place your data is much more important to assure its survival and utility than the specificity of the data. Numbers for their own sake are numbers. Numbers in the service of science become science. It is the context in which you place your data that will assure its future utility and preservation.
In my opinion, preserving data is like writing a really good lecture. The key to both is not to ask: What do you need to know, but to ask instead: Why do you need to know this? If you can answer that question, “Why do you need to know this,” you have then done more for your data than anything else that’s possible.
So when you’ve done those two things – belief in the material, and placing it in context – you then have a very difficult personal choice to make, which is: Do you have the time and financial resources to do this? Most academicians are in a very privileged position, in that coming toward the end of their career, they may have a little bit more free time. Think about that oxymoron – free time! – anyway, they’ll have a little bit more time. And if you’re able to you’ll have the financial resources to put the money into it to do the preservation, whether it’s slide scanning or computer programs to preserve it.
Once you’ve decided that all these things weigh in favor of recording a dataset, the next thing you have to do is somewhat counter-intuitive. It is not to ask: How to begin, but to ask: How does it end? Where does the material go? What is the long-term utility? How is the long-term utility of the material assured?
So you have to start asking questions that most academic scientists never ask, which is: How does a library work – you know, really work? Is there a museum that would take it? Are there data storage systems? For instance, EPA has a system called STORET, and the National Science Foundation has several other data repositories which they’ve pushed, particularly in the ocean science community, and also the Library of Congress has a system of data storage which in the past has historically related to published papers. So if you publish a paper, say in a journal like Ecology or Coral Reefs, you can be given a data archiving number, and that number is placed in the publication, so if someone goes to that publication they can see where it is.
But in fact, as far as I’m concerned, this is a frontier, and it’s an exciting frontier. What has happened in the last five years – probably not much more than that – is that the digital world has exploded and digital footprints can be made to have some degree of immortality – whether you want them to or not.
And this is not at all how most marine scientists grew up. The impermanence of what we did was… we only fought against that impermanence by publishing papers which sat on shelves in journals. But now everything is scanned, digitized, so the possibility for the long-term preservation exists. But back to my main point, though: you have to think about the end point before you think about the beginning point.
Now in my case, because of associations with museums, I always felt that there might be some way to get museums involved in the storage of the imagery that I am going to focus on here, which – even though those are not specimens (museums have much broader purposes now in life) – and I had the fortune of running across Ken Johnson from the British Museum of Natural History, which has changed its name now to the Natural History Museum, London. And they were interested in Jamaica because as a former colony, the British Museum of Natural History’s holdings are very rich in that.
Ken, as a curator of invertebrates, has thought as much about the digital space which he occupies as a scientist as he has about the coral skeletons’ and specimens’ space in which his museum has historically operated. We got together and started talking about this, and he agreed to take both the original films, which are the ones I’m scanning, as well as the digital library. So, it seems to me that this step in the process, assuring the continuity of information, has got to be front-loaded. I’m not saying don’t try to do a preservation project if you don’t know where the data are going, but understand that, as you approach the end of your career, you have to be realistic about the amount of time that you have in relationship to these data, and you have a real responsibility to think about their continuance through whatever means necessary. This also made me publish a paper specifically noting the existence of these data and where to find them. And so that’s recommended.
All right, so at that point in time I believed in the data. I knew that the imagery we had from 1976 was incredibly important, and I’ll give you an example of how that importance was contextualized and understood. It’s important in two ways. First, it is the oldest georeferenced imagery that we have of a coral reef anywhere in the world. We had put in stainless steel stakes and we knew, to within 0.5 cm, where each of the photographs in the photographic transect came from.
So it was the oldest georeferenced imagery we had of a coral reef anywhere in the world. And the other thing that I unfortunately have learned in my lifetime is that it turns out that it is the only georeferenced imagery we have of what coral reefs looked like before they began to decline. I would contend that the 200% coral cover that we photographed on Discovery Bay, Jamaica, 1976 was typical at the time.
We have good information to demonstrate that, but now is… I don’t think in the Caribbean there are more than one or two places. Even then possibly not that, and those will not be there for very much longer. So being able to put your data in a general context such as that assures that other people will sit up, take notice, and preserve the knowledge not only of its existence, but how to get to it, and the British Museum will keep those archived records.
Okay so at that point in time the next level of responsibilities for the data management, preservation, and restoration if necessary, came forward, and that was the really difficult pragmatic issues of how to preserve the data. The latest technology, in our case, was brilliant, which is: it allowed us to scan the photographs that we had, so that they now become digital: easily transferable. It’s important to think about the quality of that kind of data acquisition.
So we wanted to do high resolution scans but it was also important to preserve the original material – the film – from which the contemporary scans are being made, with the understanding that, in the race between deteriorating film and improving technology, 2018 may not be the sweet spot where the film is in the best possible condition and the scanning is at the highest possible resolution.
There may be times in the future where, even though the film continues to deteriorate, the technology will have increased to the point where even a deteriorated film of the future can give even a better scan than the film being scanned now.
So: preserve the original material as well as the digital representation that you lift from it. I knew that was going to cost money. But having gone carefully through the first two steps that I’ve mentioned, I was willing to commit funds to it. I also as a research academic kept one eye on the horizon for the possibility of funding, and those funds have indeed and in fact come through. So there was help financially to get those data scanned. We were realistic about – you had to be realistic about the scope of the challenge. Between the Dry Tortugas in Florida, which is another system we had photographed in 1976, and the Discovery Bay Marine Lab forereef slope, which we also photographed in 1976, we had close to 7,000 images. They were 2.25” x 2.25” Hasselblad square images taken about a half meter above the reef, so they have incredible detail of the coral reef beneath the camera. We knew that scanning that kind of outsized film – this was not the usual 35mm film – our film was going to be a technological challenge. We worked through to find people who could do it. That was a real commitment of time, and once we did, each of the images is almost 100 megabytes in terms of file size. Now, of course, as I speak about this issue in the beginning of the 21st century, 100 MB sounds large, but I could imagine even in a decade from now a 100 MB scan will be considered very small. The ways in which we improve the speed and accuracy of handling these large datasets means that future advancement is inevitable – it’s not even just probable, it’s inevitable. Nevertheless you do the best you can with what you have at the time – that should begin the operation.
So we’re scanning at about 100 MB per image, we’ve got 7,000 images to go. You have to set a schedule. You have to know: When am I winning, and when am I losing. Some people might be able to operate in environments where they couldn’t answer that question realistically. I suspect that if you can’t answer that question, you’re probably not going to complete the project. So a schedule is important. We have in fact been able to stick to that schedule. I predicted we would have all the slides scanned by the end of this calendar year 2017. We will miss that by a couple of months. We’ll have them all done by June of 2018. But we’re well more than two-thirds into it now, and having a goal, and knowing how we were doing also comes to the additional component of this which is: a team is almost invariably necessary, not just to get the work done – the scanning done – but also to prepare for its distribution to others who are interested in the material, to those from a scientific point of view, to others who are interested in the material from a preservation point of view.
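The arithmetic behind that scale, and the "am I winning or losing" check, can be sketched in a few lines of Python. This is only an illustration, not the project's actual tooling: the image count (7,000), the per-image size (about 100 MB), and the June 2018 target come from the transcript, while the start date, the scanned count, and all function names are hypothetical.

```python
from datetime import date


def archive_size_gb(image_count: int, mb_per_image: float) -> float:
    """Total archive size in gigabytes, taking 1 GB = 1000 MB."""
    return image_count * mb_per_image / 1000


def expected_fraction(start: date, deadline: date, today: date) -> float:
    """Fraction of the schedule that has elapsed, clamped to [0, 1]."""
    total_days = (deadline - start).days
    elapsed_days = (today - start).days
    return max(0.0, min(1.0, elapsed_days / total_days))


def is_on_schedule(scanned: int, total: int,
                   start: date, deadline: date, today: date) -> bool:
    """True when the share of images scanned keeps pace with elapsed time."""
    return scanned / total >= expected_fraction(start, deadline, today)


# Figures from the transcript: about 7,000 images at roughly 100 MB each.
print(archive_size_gb(7000, 100))  # 700.0

# Hypothetical start date and scanned count, for illustration only.
print(is_on_schedule(scanned=4900, total=7000,
                     start=date(2017, 1, 1),
                     deadline=date(2018, 6, 30),
                     today=date(2018, 1, 4)))
```

A linear pace is the simplest possible yardstick; a real project might weight the schedule by who is available to scan and when.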
As I did this, additional dimensions began to emerge that were really not ones that I had thought about, that surprised the heck out of me, and one is that really well acquired data, particularly if there is a visual component, if it’s visually based has an aesthetic, artistic component / dimension to it that should not under any circumstances be ignored. That aesthetic component of these rich datasets may, in fact, be one of the ways that they can survive the ravages of time and changing fads in science.
Now even if the project is not based upon visual images, which are obviously easy to present to the lay public, there are ways for the visualization of data which can also be presented in ways where there are harmonic dimensions to what you do. Those harmonies, whether they’re in art, science, in music, in related fields – doesn’t really matter – you should never ignore those harmonies, because they are really important to having people value your data as much as you do.
Getting organized oftentimes means organized not just in your mental space but in your physical space. At the time that I began this project I was asked a simple question by Ken Johnson who was trying to protect the British Museum of Natural History from a voluminous flood of material that it would have to preserve in perpetuity: What is the size of what you will generate? Well, it turns out in my case it’s not more than a cubic meter. One cubic meter of film and related materials, but that has turned into 15 three-ring binders of material: of photographs, of notebooks, of dive logs, of the digital spreadsheet files that relate to each and every one of the images. And that led to, I guess the final thing I’ll talk about right now in a general context, and that is the importance of metadata.
Now metadata is not something that a scientist collecting data usually thinks about, but it relates all the way back to the first thing I said which is: What is the context in which you place the data?
So each and every image – and, as I said, we’re talking about 7,000 or 8,000 images – each and every image has to be labelled in a way that allows all of the relevant information to be drawn up easily to determine its depth, its position within the survey, the exact date that it was taken, and all of the species lists and observations that were made in that same area at that same time. So final labeling turns out to be perhaps the most important part of this process of metadata and data labeling. I agonized for months – literally, months – trying to figure out: How can I create a filename that isn’t just a bunch of jargon? I hate jargon – it’s of no use. And in two years (I would say, cynically, in two days), jargon will be of no use to anyone.
So I found myself having to make relatively long filenames but saying: Where – Jamaica; and, Which station – station 4, not S4; not JZ; not 76 as opposed to 1976. You have to do this. Otherwise there is no way for someone in 100 years to understand what it is that you’re doing.
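As a rough illustration of that naming principle, here is a minimal Python sketch that spells out the location, the station, and a four-digit year. The exact filename pattern used in the project is not given in the transcript, so the pattern, the function name, the frame number, and the example date below are all assumptions.

```python
from datetime import date


def descriptive_filename(location: str, station: int, taken: date,
                         frame: int, extension: str = "tif") -> str:
    """Build a long, jargon-free filename from fully spelled-out parts."""
    parts = [
        location.strip().replace(" ", "-"),  # full place name, no abbreviation
        f"station-{station}",                # "station-4", not "S4"
        taken.isoformat(),                   # four-digit year: 1976, not 76
        f"frame-{frame:04d}",                # zero-padded so files sort in order
    ]
    return "_".join(parts) + "." + extension


# Hypothetical example: only the place, station 4, and the year 1976
# appear in the transcript; the day and frame number are made up.
print(descriptive_filename("Discovery Bay Jamaica", 4, date(1976, 7, 15), 12))
# Discovery-Bay-Jamaica_station-4_1976-07-15_frame-0012.tif
```

Passing a `date` object rather than a string makes a two-digit year hard to write by accident.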
So the way in which the data are collected was wonderful, of course, but it is worthless if it is not labelled in a way that is jargon-free and transparently obvious to someone. I began trying to think about: What would it be like for someone in a hundred years to come down and brush against these data – would they stop? Would they look at it? Would they understand the kind of value that I place in it now and that I placed in it 40 years ago when we began taking the data, and: How can we assure that the science traveler in 100 years would care about this? And I think it all comes down to: What is it called, and How is it labeled.
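One way to keep that labeling attached to each image is a plain spreadsheet-style record with fully spelled-out column names. The Python sketch below is an assumption-laden illustration: the fields follow the items mentioned above (depth, position within the survey, date, species observations), but the column names, the `ImageRecord` class, and every example value are hypothetical rather than taken from the actual archive.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ImageRecord:
    """One row of descriptive metadata per scanned image (hypothetical schema)."""
    filename: str
    location: str
    station: int
    depth_m: float
    date_taken: str        # ISO 8601 date, e.g. "1976-07-15"
    species_observed: str  # full names, separated by semicolons


def records_to_csv(records: list[ImageRecord]) -> str:
    """Serialize the records to CSV text with a self-explanatory header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer,
                            fieldnames=[f.name for f in fields(ImageRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# All values below are invented for illustration.
example = ImageRecord(
    filename="Discovery-Bay-Jamaica_station-4_1976-07-15_frame-0012.tif",
    location="Discovery Bay, Jamaica",
    station=4,
    depth_m=10.0,
    date_taken="1976-07-15",
    species_observed="Acropora cervicornis; Acropora palmata",
)
print(records_to_csv([example]))
```

CSV is used here only because it is plain text that can be opened without special software; any format with readable column names would serve the same purpose.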
So all of those things come together in terms of the data archive, retrieval, and management system. They involve personal journeys of the head and the heart, to know that at some point in time you are not going to be there to make those things as exciting to others as they were to you. But you have a responsibility to the data to assure that.
My attitude increasingly is that this concept that is common in grant applications now called broader impact, is in fact of equal value to the historically important criterion of transformational impact. It used to be that when a grant was reviewed, it was reviewed for its transformational impact: What kind of impact are you going to have on the sciences; Is it the very best? And this idea: What’s the broader impact? was secondary, part of the grant that you rushed through.
However, if you think about it from the point of view of the public, you used public funds to do the work. And I think it is important to explain to the public what you did with their money. This isn’t pandering. This is respectful. And if anyone is listening to this in the early part of the 21st century, you will also realize why scientists have done an incredibly poor job of explaining why evidence-based thinking is a very special kind of relationship to the natural or to the real world. And why “fake news” is not something that we as intelligent human beings should be a part of. So archival data is an antidote to “fake news”. Archival data that’s well preserved is an antidote to the cynicism that results in evidence-based thinking being dismissed as partisan rubbish.
And that is going on and it will continue to go on. And so I think the archival process is one of the most elevated ways in which a scientist can fight against those kinds of forces of ignorance and simplicity that we must fight against to create a sustainable world. And so, that’s my defense of archival.