What Can Open Access Data Do for You?
Data collection and usage is an essential component of scientific research. It’s arguably the most important. Without data, we can’t make observations about the world and deduce truths about how it works.
I wrote in my previous post that research articles are the primary source of data and dissemination of hypothesis-driven research. But while research articles concisely present some data, there is almost always a larger dataset that can still be of use to the authors of the article, and others.
There is a growing movement within science for open access to all data generated for a research project or collected by government agencies. Open access data means that the raw and processed data should be available, free of charge, to anyone. Access to data is a core tenet of the scientific process, and most scientific projects funded by the United States government, often through grants from the National Institutes of Health (NIH) or National Science Foundation (NSF), will ultimately be published and released to the public in an accessible format.
However, depending on where the manuscript is published, there may be a long embargo on the release of the text or data, and only those with subscriptions can access it right away. Open access data means that all the data and text are released the day the manuscript is published. This also allows collaborators, and even research competitors, to further peruse this data and use it again for different questions (if applicable).
Many journals, including those under the Public Library of Science (PLOS) umbrella, have taken this endeavor as far as they can, providing access to as much of the data used in a research article as possible. PLOS also created guidelines to determine how open a journal is and thus how open the articles published within it are. PLOS has worked hard to streamline the terminology of open access data using the ‘HowOpenIsIt?’ Open Access Spectrum and its handy tool for evaluating and rating scientific journals.
But what does open access really mean? How does one even access that data? I’ll take you through an example using the PLOS website.
I went to PLOS.org and entered in the first key term that came to mind. I started working on this post on a beautiful weekday morning in Baltimore and I heard some birds chirping at each other through my window, so I did a search using the term: bird sounds. I know very little about birds so I clicked on the first link, which directed me to a research paper entitled: Automated Sound Recognition Provides Insights into Behavioral Ecology of a Tropical Bird.
Now, there are a few things to note about this article and others like it on PLOS. First, you can download the entire article as a PDF. At many other journals, including Nature and Science, you may already hit your first obstacle at this step: a paywall. This means you’ll need a subscription to the journal or publisher to gain access, and this can get very costly for institutes and individuals.
Next, this article’s supplemental data is found near the bottom of the page, which is the case for many research articles like it. Here you often find raw datasets, metadata analytics, additional graphs, and/or tables cited in the article but not necessarily featured as essential figures in the main text. You should be able to download each of these files individually. In fact, this article on tropical birds has a set of supplementary files that include the actual recordings of the bird calls used in the analysis. (This one is my favorite. It sounds like a monkey.)
You’ll also notice there is an entire section labeled ‘Data Availability’, which is located just below the article’s abstract. Here you can find all the databases that the raw and processed data was uploaded to during the publication process. These databases, like Gene Expression Omnibus, Zenodo, and Data.gov, offer datasets that are free to download and explore on your own after each manuscript is accepted and published online. Forbes created a list of 33 databases that feature open source file sharing and storage and each has its own unique sets of data that are free to explore.
So, what should we do with this data? Why is open access data important?
In theory, open data should be provided with every manuscript that used public funding to support that research. This isn’t always the case in practice, however. I mentioned the restricted access by paywalls and embargoes, where data is often hidden from public view.
Open data is a check on accountability and reproducibility, and it can counter the pseudoscience that’s often in the news. For instance, climate change deniers like to argue that the Earth isn’t really warming and that global temperatures don’t change. Their arguments are supported by data provided by Berkeley’s Earth Laboratory. While indeed this specific dataset supports the claim that the Earth isn’t warming, the dataset only provides air temperature recordings taken above land masses. Considering seventy percent of Earth is covered by water, this dataset is incomplete. Additional datasets, on the same website no less, provide land and sea temperature data that more accurately depict what is occurring in our climate.
So open access data can go both ways, and the appropriate types of data need to be considered when applying these free documents to your own work or arguments.
Other sources of open data will even take the liberty of building analysis pipelines for you to use right away. ExAtlas was designed at the National Institute on Aging, NIH, to provide a one-stop shop for gene expression analysis. Taking data analysis to the next level, Swedish statistician Hans Rosling built GapMinder, an intuitive and interactive web-based tool that you can use to visualize raw datasets in specific contexts of health disparities. GapMinder highlights the many disparities in our world, from age and income levels in the developed and developing world all the way to the relationship between country GDP and gender-specific health span and longevity.
GapMinder is an amazing program to become familiar with, and it uses publicly funded datasets cataloged from around the world to generate meaningful results. It’s fairly intuitive to use and provides an additional dimension to analyze demographic outcomes by country and year across a variety of variables, including health, vaccination rates, income, GDP, education, gender, geopolitical region, and many others.
For example, below is a picture of the average life expectancy for every country on Earth as it relates to the total health spending each year (as a % of GDP). I’ve highlighted the United States as an example. In 1995, the US was spending almost 14% of its GDP on healthcare, with a return of about 76 years in life expectancy.
Now, compare that with 2010, which is seen below. You can see that the US spent about 18% of its GDP on healthcare in 2010, with a very small uptick in life expectancy. This points to the rising cost of healthcare in the US. Each of the other circles on the graph represents a specific country, and some of them have made great gains in healthcare for little additional expenditure in the same time frame. The size of each circle is also directly proportional to the population of the country, and the track of yellow dots indicates the year-by-year changes within the U.S.
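To put rough numbers on that jump, here is a minimal Python sketch that uses only the two approximate spending figures quoted above (almost 14% of GDP in 1995, about 18% in 2010). The variable and function names are my own, and the inputs are eyeballed from the chart rather than taken from the downloadable dataset, so treat the result as a ballpark figure.

```python
# Approximate values read off the GapMinder chart, not exact dataset entries.
spend_1995 = 14.0  # US health spending, % of GDP ("almost 14%")
spend_2010 = 18.0  # US health spending, % of GDP ("about 18%")


def percentage_point_change(old, new):
    """Absolute difference between two percentages, in percentage points."""
    return new - old


def relative_change_percent(old, new):
    """Change relative to the starting value, expressed as a percent."""
    return (new - old) / old * 100.0


points = percentage_point_change(spend_1995, spend_2010)
relative = relative_change_percent(spend_1995, spend_2010)

print(f"Change: {points:.0f} percentage points, "
      f"or about {relative:.0f}% more of GDP than in 1995")
```

With these rounded inputs, the share of GDP going to healthcare grew by about 4 percentage points, which is roughly 29% more than the 1995 share. Swapping in the exact values from the downloaded data would sharpen the estimate.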
All the raw data on GapMinder is freely available to download and you can track any country of interest over the time frame that data is available. If you want more information on the power of this program, and why it’s free for all to use, check out the two TED Talks on GapMinder: Talk 1 and Talk 2.
The U.S. government also hosts several websites that maintain public databases. Data.gov is a good place to start to see what is available for a topic of interest, from global warming to waterfowl lead exposure in Alaska and Russia. The Centers for Disease Control and Prevention (CDC) keeps a comprehensive database of health statistics and information, as does the World Health Organization.
The U.S. Environmental Protection Agency (EPA) also has its own open data website. This is not without controversy, as EPA employees are still grappling with how to respond to cultural and administrative changes due to the new Presidency. For a time, the website was even shut down, but now it appears to be up, and an archive of older data from previous Presidential administrations will be provided. Fearing the loss of publicly funded climate data, scientists around the world have banded together to download and archive climate data stored on the EPA and NOAA websites in case data were removed and/or destroyed. There is even a data repository for these datasets called Data Refuge, where open data can be cataloged, deposited, and accessed. Considering many of the scientists who advise the EPA on environmental policy were just sacked this week, this is an important endeavor.
Moving forward, it’s critical that raw and processed data be curated and provided to the public. I hope you can get a sense of how critical this is for informed policy and that this data is readily usable by anyone willing to take a few moments to explore it.
Next time, we’re going to dive into some of the recent space discoveries, including planets, space biology, and the latest NASA initiatives!
An Overview of Peer Review and Science Publication
Science News and Information is a new blog featured on Cosmic Roots and Eldritch Shores. Here, you’ll find highlights of some of the most recent discoveries and breakthroughs in science and research. I’ll try and connect each topic to important societal implications, and I will do my best to remove my own opinions.
We want this space to be a source of fact. We want this space to be relevant, entertaining, and safe to explore interesting science and any underlying implications. That’s why I was so excited when Cosmic Roots and Eldritch Shores decided to put together a new science feature like this. I remember as a kid reading science fiction and fantasy and always wondering about the real science and reality behind the stories. I hope you enjoy what will appear here in the coming weeks and months, and hopefully, years.
Today we’ll start not with the excitement of the seven planets orbiting TRAPPIST-1 or the controversies of CRISPR technology (both topics I promise to return to), but with the subject of peer review and publication. Okay, I know! There’s not a great way to make the term ‘peer review and publication’ incredibly appealing. But it’s the foundation of the entire scientific enterprise and well worth a discussion. I thought this would be the best place to build our foundation as we venture to the outer rim of what we know and don’t know.
The important thing to keep in mind is that science is a process. Scientists can be wrong; we’re human after all. In the laboratory, we constantly get our hypotheses wrong, our experiments end in failure, and we knock our heads against the lab bench hoping for inspiration. Quite often we just don’t have the tools to solve the big questions and we must splash in the waist-deep waters for years until the right technology is developed to really dive into the deep end.
But when a discovery is made, it needs to be reported. This part of the scientific process is where I want to spend the rest of our discussion: peer review.
Peer-reviewed journals are the most important source of scientific information, and all scientific researchers work towards publishing their findings in peer-reviewed research journals. Journals like Nature, Science, The Lancet, New England Journal of Medicine, Physical Review Letters, and Journal of the American Chemical Society all review and publish new science and often compete with one another for the most impactful work.
The process begins when a researcher believes they have enough data to convince other scientists their findings are valid and true. Depending on the subject matter, this data collection could be a small pilot project or a major research endeavor that encompasses thousands of hours of work and dozens of experiments. Typically, once the arrangement of the data is outlined, the researcher puts on his or her author’s hat and begins crafting a manuscript to present their new findings.
In a way, scientists are story-tellers and their research manuscripts present a data story. But a defense of the hypothesis is essential. Questions should be addressed, such as: Why was this experiment performed? What was observed? Why should the public care about these results? What does it mean in the context of what is already known? Do the findings challenge previous findings or build upon them? Typically, the methods must be specific enough that someone else picking up the paper could repeat the experiments in their own laboratory.
Once the manuscript is completed and all the authors have signed off, it’s sent to a journal. There, it meets the first person that will review the article: the editor. Journal editors are usually experts and will read the cover letter, the abstract…ideally the entire paper…and decide right then and there if the topic of the manuscript is relevant to the journal they are working with. Just like publishing in science fiction and fantasy, certain works and topics are better fits for certain journals.
Journals like Science and Nature only publish ground-breaking work that advances a specific field or features novel approaches, methods, or technologies. Some journals are more specific: Cancer Research isn’t going to feature a paper about behavioral cognition, just like Analog probably won’t feature a classic fantasy tale about dwarves attacking a dragon’s hoard…probably, unless there’s time travel!
If the editor decides to pass, the authors must decide where to submit next. However, if the editor feels the manuscript may be a good fit, the next step in the process begins. The editor will contact anywhere from one to three external reviewers to do a critical and thorough read of the entire manuscript. These reviewers are also experts in the field and are asked by the journal to provide their opinion about the quality of the science, the findings, and their interpretations. It’s the reviewers’ task to judge the entire work on its own merit.
The reviewers usually provide a written reply that is a point by point consideration of the work, with specific comments, questions, or suggested improvements to the manuscript. Comments can range from simple typographical mistakes to the proposal of several additional control experiments that must be included. Typically, the reviewers decide whether the manuscript should be accepted as is, provisionally accepted with minor corrections, provisionally accepted with major corrections, or rejected.
The editors collect all the reviewers’ comments for the authors and then make a final decision. It’s not unheard of for an editor to go against a reviewer, or to offer their own interpretation of the manuscript to the authors.
If a manuscript is invited for resubmission, the researchers will get a chance to look over the comments, address the concerns, and resubmit (with no guarantee of acceptance). The manuscript’s authors will write a rebuttal to the reviewers if needed, and occasionally a manuscript will bounce back between editor and author a few times. Depending on the journal, the specific guidelines, and the corrections needed, it can take anywhere from a few months to years for a research article to be published.
This dialog is arguably the most integral and important aspect in science communication. It’s essential, really, but the important point to keep in mind is that it’s not infallible. Mistakes are made and it’s not until results are reproduced in other independent labs that important findings are taken as truth in a field. Often, disagreements between laboratories and individuals can arise.
This social discord is a healthy part of the process. For example, in 2011 Science published a report that DNA, the blueprint of life, could incorporate the element arsenic into the ‘backbone’ of its structure. It’s a fact that DNA’s backbone contains phosphorus, and by showing that arsenic could be used in its stead, the authors of this paper argued this was proof that life could have evolved elsewhere in the universe using different starting elements and molecules.
The news sent ripples throughout the scientific community. There were many skeptics and ultimately it was shown that the data could not be reproduced outside of the publishing laboratory. After debate in the field, and multiple attempts by various labs to replicate the results, the results were shown to be nothing more than anomalies. Skeptics of science will note that because of these discrepancies, science can’t be trusted. But independent validation of research results is an integral and important aspect of peer review and in these cases is called post-publication peer review. The editors and peer reviewers at each journal get the data as presentable as possible; the rest is up to the community at large.
However, breakdowns in this entire process do occur and can lead to publication of erroneous data (unfortunately, at times, due to fraud and bias by the publishing authors), which means it is very important that the scientific community critically evaluate published work. A landmark paper in 2005, written by Professor John Ioannidis at Stanford University, presented the argument that a large portion of published research is irreproducible. This set off a flurry of introspection within the scientific community to address the growing problem that some published results can’t be replicated in independent laboratories, just like the arsenic paper. There are even watchdog groups that publicly catalog retractions of journal articles that can’t be reproduced or which contain errors.
That’s not to say that scientists are willfully publishing bad data. Far from it. Papers are retracted for a variety of reasons, including for innocuous errors in experimental design or data collection. So, it’s a very good thing the scientific community polices itself and makes this known to the rest of the world.
But there is no doubt science is facing a reproducibility crisis. Major contributors to this problem are inappropriate use of statistics, lack of specific detail in the methods section, and lax peer review at some journals. Journals like Nature have used the crisis as an opportunity to shore up their publication and review procedures, including asking for independent validation of key experiments before publication, verification of reagents and chemicals, and open access to all of the data.
Open access is a vital component of peer review and validation. Open access means that all the raw data discussed in a manuscript is provided online for anyone to access, including you! Anyone in the world can access the data and use it to validate and repeat important experiments.
This type of validation is another aspect of post-publication peer review and it helps identify those papers that need more scrutiny. Journals such as Scientific Reports and PLOS One are entirely open access with their publications and free for anyone, anytime, to download and view.
Some researchers even use a newer innovation called pre-publication peer review. Typically, most scientists can get input on their work before it is published by presenting their data at conferences. However, draft manuscripts can also be submitted and published online at places like bioRxiv, with the hope that people provide critical commentary as the manuscript is prepared for publication elsewhere.
Taking this a step further, Nature Human Behaviour published a manifesto on how to reduce publication bias and improve reproducibility and transparency. Nature Human Behaviour has also recently established a new type of peer review process for its registered reports. Researchers can begin the publication process with this journal before any experiments are performed. The research design, methods, and introduction of a manuscript are written before any experiments and are peer reviewed by the journal’s editors and external reviewers. Then, and only then, is the experiment performed, the results analyzed and examined, and the rest of the paper written. The entire manuscript is put under peer review again to check that it adheres to the approved design and is then published regardless of the findings, positive or negative.
In this way, Nature Human Behaviour hopes to reduce experimental bias and increase reproducibility, all using the standard peer review process. I expect to see more innovations like this adopted by more journals in the coming years.
To wrap up, I hope this has helped clarify a little about what peer review really is and why it’s so important. In the coming weeks and months, we’ll explore a variety of topics and findings in science, and all of it will have been examined by peer review.