Bryan Karazsia

Analyzing Count Data

Posted on December 16, 2008

Bryan Karazsia describes current methods for handling count data that show many non-occurrences.


Q: I'm designing a study to investigate in what context adolescents use certain terms in their spoken language. In my pilot study, I found that the count data were not normally distributed; there were a lot of instances in which the adolescents did not use the terms at all. How do I analyze this kind of data?
A: Studies that look at the number of times an event occurs over a certain period of time are quite common in the social sciences (e.g., abuse of drugs among adolescents, hospitalizations due to unintentional injury), and often these data are 'zero-inflated', meaning they contain many more zeros than standard count distributions would predict. In the past, researchers have generally used one of two methods to analyze this type of data: square root transformation (in which the square root of each count is taken) or dichotomization of the data into zero and non-zero count groups (in your case, those who did not use certain linguistic terms at all, and those who did use them). Neither of these solutions is ideal. Data transformation fails to address the inflation of zeros, and dichotomization ignores the meaningful variation among counts of 1, 2, 3, and so forth. That is, when counts are dichotomized, all positive counts are treated as if they were equivalent, and that might not be the case in reality.

Count data often have skewed distributions, and other methods have been developed that you might consider. The Poisson distribution, useful across a wide variety of disciplines, was designed to handle the occurrence and frequency of rare events within a particular interval of time or space (e.g., bombings in London during WWII, the number of presidential vetoes over the course of a presidency). Poisson models make two assumptions: the variance and the mean of the count variable are equal, and occurrences of the counted behavior are independent of each other. Count data often violate one or both of these assumptions (for example, in your case, if the use of one term makes it more likely that another term will also be used).
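
As a quick, informal check of the equal mean and variance assumption, one can compare the sample mean and sample variance of the counts. The sketch below uses only Python's standard library; `pilot_counts` is made-up data for illustration, not data from any study. A variance-to-mean ratio well above 1 suggests overdispersion, though it is not a formal test.

```python
from statistics import mean, variance

def dispersion_summary(counts):
    """Summarize a list of counts: mean, sample variance, their ratio, and share of zeros."""
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    return {
        "mean": m,
        "variance": v,
        "dispersion_ratio": v / m,  # about 1 if mean and variance are equal
        "zero_share": sum(1 for c in counts if c == 0) / len(counts),
    }

# Hypothetical pilot data: how many times each adolescent used a term
pilot_counts = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 5, 0, 0, 3, 0, 8, 0, 1, 0, 0]

summary = dispersion_summary(pilot_counts)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this invented sample the variance is several times the mean and most of the counts are zero, which is the pattern described in the question.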

If your data are overdispersed (the variance is greater than the mean), or if they violate the assumption of independence, you may be better off using a negative binomial distribution, which relaxes these assumptions. For example, individuals who are hospitalized for unintentional injury once may be more likely to be hospitalized again because they are clumsy or because they take more risks (i.e., non-independence).

If your data are overdispersed because of a multitude of non-occurrences (zeros), then you can use a zero-inflated model (e.g., zero-inflated Poisson, zero-inflated negative binomial). If an inflation of zeros is the only source of overdispersion, then zero-inflated Poisson (ZIP) may model the data more accurately. Alternatively, if the overdispersion stems from an inflation of zeros and other sources, then zero-inflated negative binomial (ZINB) will likely provide a more accurate fit with the observed data.
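
One common way to write the zero-inflated Poisson distribution is as a mixture: with some probability a case is a certain zero, and otherwise its count follows an ordinary Poisson distribution. The sketch below assumes that formulation; the parameter values `lam` and `pi_zero` are illustrative choices, not estimates from any data.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing count k under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, lam, pi_zero):
    """Zero-inflated Poisson: a certain zero with probability pi_zero, else Poisson(lam)."""
    poisson_part = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + poisson_part if k == 0 else poisson_part

lam, pi_zero = 2.0, 0.4  # illustrative values only

p_zero_poisson = poisson_pmf(0, lam)
p_zero_zip = zip_pmf(0, lam, pi_zero)
print(f"P(count = 0) under Poisson: {p_zero_poisson:.3f}")
print(f"P(count = 0) under ZIP:     {p_zero_zip:.3f}")
```

With the same Poisson mean, the zero-inflated version assigns far more probability to a zero count, which is what lets it accommodate data with many non-occurrences.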

Zero-inflated models are useful beyond their statistical advantages; they have a conceptual advantage as well. They permit exploration of two interesting questions. First, what are the differences between cases that are in an 'always zero' group and cases that are in a 'not always zero' group? Cases in the always zero group will have a zero count, while cases in the not always zero group may or may not have a zero count. This analysis is similar to dichotomization, although zero-inflated models go a step further by permitting exploration of the count outcome simultaneously. Second, differences in the frequency of counts will also be modeled (e.g., among children likely to have positive counts, what variables predict the frequency of these counts?).

To illustrate this latter point, consider a hypothetical example in which a researcher is interested in predicting the number of apples visitors pick at an apple orchard. Some visitors may come just to visit and will not pick any apples (always zero group). Other visitors may come with the intention of picking apples (not always zero group). It is important to understand that individuals in the not always zero group may actually pick zero apples; perhaps they did not find any apples worthy of picking. The key is that individuals can have zero counts for different reasons, and zero-inflated models explore this possibility.
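
The orchard example can be mimicked with a small simulation. This is only a sketch under invented assumptions: the share of visitors who come just to visit (`p_just_visiting`) and the average number of apples picked by the others (`mean_apples`) are made-up values, and pickers' counts are assumed to follow a Poisson distribution, drawn here with Knuth's simple sampling method.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_orchard(n_visitors, p_just_visiting, mean_apples, seed=0):
    """Return (just_visiting, apples_picked) for each simulated visitor."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_visitors):
        just_visiting = rng.random() < p_just_visiting  # 'always zero' group
        apples = 0 if just_visiting else sample_poisson(mean_apples, rng)
        records.append((just_visiting, apples))
    return records

# Hypothetical settings: 30% come only to visit; pickers average 1.5 apples
records = simulate_orchard(1000, p_just_visiting=0.3, mean_apples=1.5, seed=1)

always_zero = sum(1 for visiting, _ in records if visiting)
picker_zeros = sum(1 for visiting, apples in records if not visiting and apples == 0)
total_zeros = sum(1 for _, apples in records if apples == 0)
print(f"zeros from the always zero group:     {always_zero}")
print(f"zeros from the not always zero group: {picker_zeros}")
print(f"all observed zeros:                   {total_zeros}")
```

In real data the group labels are not observed; only the counts are. The simulation simply shows that the observed zeros are a blend of the two kinds.
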
Q: How do I choose the best method of analyzing my count data? How do I know if my data fit the description of 'zero-inflated'?
A: There is no strict definition of how many zeros you need, or of the ratio of zero to nonzero counts that is required, for any of these particular models, so deciding on the best model can be tricky. However, a very attractive feature of these models is that the extent to which they accurately reflect observed data can be compared statistically. One can choose between Poisson and negative binomial models using a likelihood-ratio (LR) test that examines the null hypothesis that the mean and variance are equal. A test called the Vuong test can be computed that compares Poisson with ZIP, or negative binomial with ZINB. Additionally, the LR test can be used to compare ZIP with ZINB. The downside to these tests is that, to my knowledge, they need to be computed post hoc. Researchers must first run the various models and then make the aforementioned comparisons.
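
To make the Poisson versus negative binomial comparison concrete, here is a rough sketch of an LR test for the simplest possible case: models with no predictors, fitted to made-up counts. Several things here are assumptions for illustration: the `pilot_counts` data, the crude grid search over the dispersion parameter `alpha`, and the common convention of halving the chi-square p-value because the null value `alpha = 0` lies on the boundary of the parameter space. In practice one would fit regression models with statistical software rather than use this by-hand version.

```python
import math

def poisson_loglik(counts, mu):
    """Poisson log-likelihood with mean mu."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)

def negbin_loglik(counts, mu, alpha):
    """Negative binomial log-likelihood with mean mu and variance mu + alpha * mu**2."""
    r = 1.0 / alpha
    return sum(
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu))
        for y in counts
    )

def lr_test_poisson_vs_negbin(counts):
    """LR test of equal mean and variance (alpha = 0) for intercept-only models."""
    mu = sum(counts) / len(counts)  # ML estimate of the mean in both models; must be > 0
    ll_poisson = poisson_loglik(counts, mu)
    # Crude grid search for the dispersion parameter, from 1e-4 to 1e3
    alphas = [10 ** (e / 100) for e in range(-400, 301)]
    alpha_hat = max(alphas, key=lambda a: negbin_loglik(counts, mu, a))
    ll_negbin = negbin_loglik(counts, mu, alpha_hat)
    lr = max(0.0, 2.0 * (ll_negbin - ll_poisson))
    # Chi-square(1) upper tail is erfc(sqrt(lr / 2)); halved for the boundary null
    p_value = 1.0 if lr == 0.0 else 0.5 * math.erfc(math.sqrt(lr / 2.0))
    return {"alpha_hat": alpha_hat, "lr": lr, "p_value": p_value}

# Hypothetical pilot data: how many times each adolescent used a term
pilot_counts = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 5, 0, 0, 3, 0, 8, 0, 1, 0, 0]

result = lr_test_poisson_vs_negbin(pilot_counts)
print(f"estimated alpha: {result['alpha_hat']:.2f}")
print(f"LR statistic:    {result['lr']:.2f}")
print(f"p-value:         {result['p_value']:.2g}")
```

A small p-value here would favor the negative binomial model over Poisson for these counts. The Vuong test and the ZIP versus ZINB comparison require fitting the zero-inflated models and are not shown.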

Please let me clarify that this process is not akin to data 'fishing'. Researchers are not running the models and then deciding which model they like best. Rather, these procedures are important for identifying which models provide the most accurate fit with observed data. And remember, when choosing the model, you should be guided by your research questions, as well as by characteristics of your data.

Based on a published article and personal communication with the researcher in November 2008.
Karazsia, B., & van Dulmen, M. (2008). Regression models for count data: Illustrations using longitudinal predictors of childhood injury. Journal of Pediatric Psychology, 33(10), 1076-1084.

 
