What Can Open Access Data Do for You?
5/8/17

 

Collecting and using data are essential parts of scientific research, arguably the most important ones. Without data, we can’t make observations about the world and deduce truths about how it works.

I wrote in my previous post that research articles are the primary source of data and the main way hypothesis-driven research is disseminated. But while research articles concisely present some data, there is almost always a larger dataset behind them that can still be of use to the authors of the article and to others.

There is a growing movement within science for open access to all data generated for a research project or collected by government agencies. Open access data means that the raw and processed data should be available, free of charge, to anyone. Access to data is a core tenet of the scientific process, and most scientific projects funded by the United States government, often through grants from the National Institutes of Health (NIH) or the National Science Foundation (NSF), will ultimately be published and released to the public in an accessible format.

However, depending on where the manuscript is published, there may be a long embargo on the release of the text or data, and only those with subscriptions can access it right away. Open access data means that all the data and text are released the day the manuscript is published. This also allows collaborators, and even research competitors, to examine the data further and reuse it to answer different questions where applicable.

Many journals, including those under the Public Library of Science (PLOS) umbrella, have worked to provide access to as much of the data used in a research article as possible. PLOS also created guidelines to determine how open a journal is and, by extension, the articles published in it. PLOS has worked hard to standardize the terminology of open access using the ‘HowOpenIsIt?’ Open Access Spectrum and its handy tool for evaluating and rating scientific journals.

But what does open access really mean? How does one even access that data? I’ll take you through an example using the PLOS website.

I went to PLOS.org and entered in the first key term that came to mind. I started working on this post on a beautiful weekday morning in Baltimore and I heard some birds chirping at each other through my window, so I did a search using the term: bird sounds. I know very little about birds so I clicked on the first link, which directed me to a research paper entitled: Automated Sound Recognition Provides Insights into Behavioral Ecology of a Tropical Bird.

Now, there are a few things to note about this article and others like it on PLOS. First, you can download the entire article as a PDF. At many journals, including Nature and Science, you would already have hit your first obstacle: a paywall. This means you’ll need a subscription to the journal or publisher to gain access, and this can get very costly for institutes and individuals.

Next, this article’s supplemental data is found near the bottom of the page, which is the case for many research articles like it. Here you often find raw datasets, metadata analytics, additional graphs, and/or tables cited in the article but not necessarily featured as essential figures in the main text. You should be able to download each of these files individually. In fact, this article on tropical birds has a set of supplementary files that include the actual recordings of the bird calls used in the analysis. (This one is my favorite. It sounds like a monkey.)

You’ll also notice there is an entire section labeled ‘Data Availability’, which is located just below the article’s abstract. Here you can find all the databases that the raw and processed data was uploaded to during the publication process. These databases, like Gene Expression Omnibus, Zenodo, and Data.gov, offer datasets that are free to download and explore on your own after each manuscript is accepted and published online. Forbes created a list of 33 databases that feature open source file sharing and storage and each has its own unique sets of data that are free to explore.

So, what should we do with this data? Why is open access data important?

In theory, open data should be provided with every manuscript that used public funding to support the research. This isn’t always the case in practice, however. As I mentioned, paywalls and embargoes restrict access, and data is often hidden from public view.

Open data is a check on accountability and reproducibility, and it can counter the pseudoscience that’s often in the news. For instance, climate change deniers like to argue that the Earth isn’t really warming and that global temperatures don’t change. Their arguments are supported by data provided by Berkeley’s Earth Laboratory. While this specific dataset does support the claim that the Earth isn’t warming, it only provides air temperature recordings taken above land masses. Considering that seventy percent of Earth is covered by water, this dataset is incomplete. Additional datasets, on the same website no less, provide land and sea temperature data that more accurately depict what is occurring in our climate.
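To see why a land-only record can’t stand in for the whole planet, here is a minimal Python sketch of an area-weighted average. It assumes the rough split mentioned above (about seventy percent water, thirty percent land), and the two readings passed in at the bottom are made-up placeholders, not values from any real dataset.

```python
# Rough surface split taken from the text: about 70% water, 30% land.
LAND_FRACTION = 0.3


def global_mean(land_reading, ocean_reading, land_fraction=LAND_FRACTION):
    """Combine a land reading and an ocean reading, weighted by surface area."""
    ocean_fraction = 1.0 - land_fraction
    return land_fraction * land_reading + ocean_fraction * ocean_reading


# Hypothetical placeholder readings, chosen only to show the arithmetic.
land_only = 1.0
ocean_only = 0.5

combined = global_mean(land_only, ocean_only)
print("land-only reading:", land_only)
print("area-weighted reading:", round(combined, 2))
```

Because the ocean term carries more than twice the weight of the land term, any conclusion drawn from the land reading alone leaves most of the surface out of the picture.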

So open access data can go both ways, and the appropriate types of data need to be considered when applying these free documents to your own work or arguments.

Other sources of open data even take the liberty of building analysis pipelines that you can use right away. ExAtlas was designed at the National Institute on Aging, part of the NIH, to provide a one-stop shop for gene expression analysis. Taking data analysis to the next level, Swedish statistician Hans Rosling built GapMinder, an intuitive and interactive web-based tool that you can use to visualize raw datasets in the context of health disparities. GapMinder highlights the many disparities in our world, from age and income levels in the developed and developing world to the relationship between a country’s GDP and gender-specific health span and longevity.

GapMinder is an amazing program to become familiar with, and it uses publicly funded datasets cataloged from around the world to generate meaningful results. It’s fairly intuitive to use and lets you analyze demographic outcomes by country and year across a variety of variables, including health, vaccination rates, income, GDP, education, gender, geopolitical region, and many others.

For example, below is a picture of the average life expectancy for every country on Earth as it relates to the total health spending each year (as a % of GDP). I’ve highlighted the United States as an example. In 1995, the US was spending almost 14% of its GDP on healthcare, with a return of about 76 years in life expectancy.

Source: GapMinder

Now, compare that with 2010, which is seen below. You can see that the US spent about 18% of its GDP on healthcare in 2010, with only a very small uptick in life expectancy. This points to the rising cost of healthcare in the US. Each of the other circles on the graph represents a specific country, and some of them have made great gains in life expectancy for little additional expenditure over the same time frame. The size of each circle is directly proportional to the population of the country, and the track of yellow dots indicates the year-by-year changes within the US.

Source: GapMinder

All the raw data on GapMinder is freely available to download and you can track any country of interest over the time frame that data is available. If you want more information on the power of this program, and why it’s free for all to use, check out the two TED Talks on GapMinder: Talk 1 and Talk 2.
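As a minimal sketch of what tracking one country can look like once you have a table downloaded, the Python below filters a small CSV for a single country and reports how its health spending changed. The column names and the inline sample are assumptions made for illustration, not GapMinder’s actual file layout. The two United States values are the approximate figures quoted above, and the ‘Examplestan’ rows are invented placeholders.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded file. Column names are
# assumptions; the US values are the approximate figures from the text, and
# the "Examplestan" rows are invented placeholders.
SAMPLE_CSV = """country,year,health_spending_pct_gdp
United States,2010,18
United States,1995,14
Examplestan,1995,5
Examplestan,2010,6
"""


def track_country(csv_text, country, column):
    """Return (year, value) pairs for one country, sorted by year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (int(row["year"]), float(row[column]))
        for row in reader
        if row["country"] == country
    ]
    return sorted(rows)


def change_between(series):
    """Difference between the last and first values in a sorted series."""
    return series[-1][1] - series[0][1]


us = track_country(SAMPLE_CSV, "United States", "health_spending_pct_gdp")
for year, value in us:
    print(year, value)
print("change in percentage points:", change_between(us))
```

With a real download you would replace the inline sample with the contents of the file and adjust the column names to match whatever headers it actually uses.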

The U.S. government also hosts several websites that maintain public databases. Data.gov is a good place to start to see what is available for a topic of interest, from global warming to waterfowl lead exposure in Alaska and Russia. The Centers for Disease Control and Prevention (CDC) keeps a comprehensive database of health statistics and information, as does the World Health Organization.

The U.S. Environmental Protection Agency (EPA) also has its own open data website. This is not without controversy, as EPA employees are still grappling with how to respond to cultural and administrative changes under the new presidential administration. For a time, the website was even shut down, but it now appears to be back up, and an archive of older data from previous presidential administrations will be provided. Fearing the loss of publicly funded climate data, scientists around the world have banded together to download and archive climate data stored on the EPA and NOAA websites in case the data are removed or destroyed. There is even a repository for these datasets called Data Refuge, where open data can be cataloged, deposited, and accessed. Considering that many of the scientists who advise the EPA on environmental policy were just sacked this week, this is an important endeavor.

Moving forward, it’s critical that raw and processed data be curated and provided to the public. I hope you now have a sense of how important this is for informed policy, and that this data is readily usable by anyone willing to take a few moments to explore it.

Next time, we’re going to dive into some recent space discoveries, including planets, space biology, and the latest NASA initiatives!