
Working with Secondary Data

Posted on December 5, 2011

Benjamin Lê Cook discusses how best to work with secondary data.


There are plenty of public-use data sets related to mental health on the web. There's the Behavioral Risk Factor Surveillance System, the National Survey on Drug Use and Health, and the Medical Expenditure Panel Survey. AHRQ [the Agency for Healthcare Research and Quality] puts out some of those. CDC puts out some of those. SAMHSA puts out some of those.

A real difficulty in getting the secondary data, even though it's on the web, is that it's kind of cryptic. If you can imagine, the government agencies have to figure out a way to get all of that data onto the web and make it downloadable. So what they end up doing is splitting it into lots of small zip files, along with some kind of transport file, which you also have to download.

You plug that transport file into SAS, and then you download all these numbers into some comma-delimited text file, and then... The government has already approved all of the HIPAA things. They've taken out all the identifiers. But there are about 20 steps to go from that public-use data set to something that you can work with in Stata or Excel or SAS or some other statistical program.

And so, you know, I think with practice that comes very easily, but certainly when I started out doing this stuff, it was a long, time-consuming process.

You clear a bunch of space on your computer, and then you download everything to a folder. And by everything, I mean the data sets but especially the documentation. Most of these government agencies have actually done an amazingly detailed job of documenting what's in the data. And if you can wade through it, they'll tell you how to get it all into a usable form in SAS or Stata.
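As a rough illustration of the "download everything to a folder" step, here is a minimal Python sketch that unpacks every zip file in a download folder into one working folder. It assumes the data and documentation arrived as ordinary zip archives; the function name `extract_all` and the folder names are hypothetical, and the sketch does not read the SAS transport file itself.

```python
import zipfile
from pathlib import Path


def extract_all(download_dir, out_dir):
    """Unpack every .zip file in download_dir into out_dir.

    Returns the names of all extracted files so you can check that the
    data files and the documentation both arrived.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    extracted = []
    # Sort so the archives are processed in a predictable order.
    for archive in sorted(Path(download_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out_path)
            extracted.extend(zf.namelist())
    return extracted


if __name__ == "__main__":
    # Hypothetical folder names; point these at your own download folder.
    for name in extract_all("downloads", "raw_data"):
        print(name)
```

Keeping the extracted files in one folder, separate from the original downloads, makes it easier to rerun the later steps without downloading everything again.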

And so you can download all that and work your way through the documentation, and if you have the time and the patience and you know Stata and SAS well enough and you're a decent programmer, then you can eventually work this into a data set that you can use. But I never did it that way. The way that I did it was I borrowed code from friends, and I think that's a real shortcut. You have to sit down with somebody who's done this before and see how they go through all of those steps and then borrow their code about how they got the data into a usable form and then how they cleaned the data.

If you can get someone who's already cleaned a lot of the variables for you and changed them from very cryptic variable names to real names like "female" and "low income," that's even better. That's a second step better. They'll give you the code to clean that data, also.
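To make the renaming step concrete, here is a minimal Python sketch of the kind of cleaning code a colleague might share. The raw column names (`SEX`, `POVCAT`) and their codes are hypothetical stand-ins, not taken from any particular survey; the real names and codes come from each data set's documentation.

```python
import csv
import io

# Hypothetical raw extract: column names and codes are made up for illustration.
RAW_CSV = """ID,SEX,POVCAT
1,2,1
2,1,4
3,2,5
"""


def clean_row(row):
    """Turn one raw record into a record with readable indicator variables."""
    return {
        "id": row["ID"],
        # Assumed coding: 1 = male, 2 = female.
        "female": 1 if row["SEX"] == "2" else 0,
        # Assumed coding: categories 1 and 2 are the lowest income groups.
        "low_income": 1 if row["POVCAT"] in {"1", "2"} else 0,
    }


def clean_csv(text):
    """Read comma-delimited text and return a list of cleaned records."""
    reader = csv.DictReader(io.StringIO(text))
    return [clean_row(row) for row in reader]


cleaned = clean_csv(RAW_CSV)
for record in cleaned:
    print(record)
```

Writing each assumed coding as a comment next to the recode keeps the borrowed code checkable against the codebook.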





Excerpted from an interview with the researcher conducted at the 2011 NHSN Conference held in Miami, FL.

Behavioral Risk Factor Surveillance System

Medical Expenditure Panel Survey

National Survey on Drug Use & Health


