Erin Leahey

Approaches to Handling Messy Data

Posted on April 5, 2006

Erin Leahey discusses data 'editing' practices in the social sciences.


Q: You and your colleagues investigated data 'editing' practices in the social sciences. Why did you decide to look into this 'less visible' aspect of research?
A: We were surprised and somewhat concerned about the mismatch between how commonly researchers encounter messy data and how little guidance is available on handling it. There are no normative standards for data editing, yet it is a common practice. Researchers are often confronted with impossible values in their datasets, including contradictory responses to survey questions. Even when the research design and data collection have been well executed, data are rarely error-free. Researchers are typically left to make vital decisions about how to handle messy data on their own because IRBs, professional and organizational ethical guidelines, and methodology textbooks do not sufficiently address the issue of messy and problematic data. For this reason, we expected approaches to handling messy data to vary by social context, such as the discipline, department, university, professional networks, previous experience, and the type of research itself.

Q: I'm starting to look at my data, and I'm finding inconsistencies and impossible values. Can I make changes to the data?
A: The short answer is yes; many people do clean or edit their data in some way. However, cleaning or editing your data can have a significant impact on your results, and it may also raise ethical issues. For these reasons, it is best to think through and justify every change you make, and ideally to assess how robust your results are to those changes.
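
As an illustration of thinking through changes before making them, one option is to flag problem records first and leave the raw values untouched. The sketch below is only a minimal example under stated assumptions: the records, the field names (`age`, `has_children`, `num_children`), and the plausibility rules are all hypothetical and do not come from the study.

```python
# Hypothetical survey records; field names and plausibility rules are
# illustrative assumptions, not taken from the study.
records = [
    {"id": 1, "age": 34, "has_children": "no", "num_children": 0},
    {"id": 2, "age": 212, "has_children": "yes", "num_children": 2},
    {"id": 3, "age": 45, "has_children": "no", "num_children": 3},
]


def flag_problems(record):
    """Return a list of problem descriptions; the record is not modified."""
    problems = []
    # Impossible value: an age outside an assumed plausible range.
    if not 0 <= record["age"] <= 120:
        problems.append("impossible value: age")
    # Contradictory responses: reports no children but a positive count.
    if record["has_children"] == "no" and record["num_children"] > 0:
        problems.append("contradiction: has_children vs num_children")
    return problems


# Map each record id to its problems, keeping only records with at least one.
flagged = {}
for record in records:
    problems = flag_problems(record)
    if problems:
        flagged[record["id"]] = problems

print(flagged)
```

Flagging rather than overwriting keeps the original responses available, so any later decision to edit, drop, or keep a record can still be justified and reported.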

In our study of tenured sociology, psychology, and anthropology faculty in research universities across the US, there was some agreement on whether to edit a particular messy dataset. We described the messy data and its origins in a hypothetical vignette: a researcher receives inconsistent responses from participants who are hard of hearing, and would like to edit their answers based on her recollection of information that they provided during face-to-face interviews. A majority (75%) of the faculty thought that the data should not be edited in this way, and 24% had no objection to the researcher changing her data. This demonstrates some common ground, but also a fair bit of variation in approaches to dealing with messy data.

Q: What are the reasons for NOT editing your data?
A: Many faculty in our study cited ethical, scientific, and methodological objections to editing messy data:
  • It can compromise the integrity of your data
  • It can violate human subjects guidelines
  • If the edits lend more support for your hypothesis, this may be viewed as "crafting" or "doctoring" your data
  • It may violate some assumptions embedded in the use of self-reported data, such as survey responses
  • It may require reliance on the researcher's memory and thereby introduce uncertainty and concerns about validity and reliability
  • It would make replication of your research difficult, unless each data-editing decision were meticulously described


Q: What recommendations did the faculty in your study give for handling messy data?
A: The question of how to edit your data is a more difficult question than whether it should be edited. While there are definite guidelines for handling missing data, there are no standardized procedures for handling messy data. Given the paucity of formal guidance on editing data, we were eager to solicit recommendations for handling messy data from the 160 faculty members we surveyed. Of the 75% of faculty who objected to the hypothetical researcher's data editing strategy posed in the vignette, most made specific recommendations on how to proceed. These included:
  • Use the edited data, but report your changes
  • Analyze and compare both sets of data (the original data and the altered data)
  • Drop/omit the messy data (either variables or observations)
  • Go back to the participant or to literature to get more information
  • Use this study as a pilot study and learn from it
  • Write up the results in a methodological framework, making recommendations for future research
Establishing rigid rules for data editing is not necessary, or even possible, given the diversity of research areas and the autonomy and flexibility required for good research practices. However, some formalization of the decision-making process regarding messy data may improve scientific rigor, allow easier replication of studies, and provide a common language for reporting data editing decisions. Toward this end, researchers should carefully document all data-cleaning decisions, assess the sensitivity of their results to the data edits, and make both the edits and the effects on results transparent when reporting results from the study.
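
The documentation and sensitivity steps described above can be sketched in a few lines of code. This is a minimal illustration rather than a prescribed procedure: the data, the `hours` field, the `apply_edit` helper, and the assumed keying error are all hypothetical. It logs one edit, then compares a simple statistic across the original data, the edited data, and the data with the messy observation dropped.

```python
from statistics import mean

# Hypothetical data: weekly work hours, with one impossible value.
original = [
    {"id": 1, "hours": 40},
    {"id": 2, "hours": 400},  # impossible number of weekly hours
    {"id": 3, "hours": 35},
    {"id": 4, "hours": 45},
]

edit_log = []  # one entry per data-editing decision


def apply_edit(rows, row_id, field, new_value, reason):
    """Return an edited copy of rows and record the change in edit_log."""
    edited_rows = [dict(row) for row in rows]  # copy; original stays intact
    for row in edited_rows:
        if row["id"] == row_id:
            edit_log.append({
                "id": row_id,
                "field": field,
                "old": row[field],
                "new": new_value,
                "reason": reason,
            })
            row[field] = new_value
    return edited_rows


# Strategy A: edit the value (the keying error is an assumption).
edited = apply_edit(original, 2, "hours", 40,
                    "assumed keying error: 400 entered for 40")
# Strategy B: drop the messy observation instead.
dropped = [row for row in original if row["id"] != 2]

# Sensitivity check: the same statistic under each strategy.
results = {
    "original": mean(row["hours"] for row in original),
    "edited": mean(row["hours"] for row in edited),
    "dropped": mean(row["hours"] for row in dropped),
}

print(edit_log)
print(results)
```

Reporting the contents of `edit_log` alongside `results` makes both the edits and their effect on the findings visible to readers and to anyone attempting a replication.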

Based on a published article and personal communication with the researcher in March 2006. Leahey, E., Entwisle, B., & Einaudi, P. (2003). Diversity in everyday research practice: The case of data editing. Sociological Methods & Research, 32(1), 64-89.

 
