Chapter 1 Introduction

The analysis of count data has received considerable attention from the statistical community over the last four decades. Since the seminal paper by Nelder and Wedderburn (1972), the class of generalized linear models (GLMs) has played a prominent role in the regression modelling of normal and non-normal data. GLMs are fitted by a simple and efficient Newton scoring algorithm that relies only on second-moment assumptions for estimation and inference. Furthermore, the theoretical background for GLMs is well established in the class of dispersion models (B. Jørgensen 1987; B. Jørgensen 1997) as a generalization of the exponential family of distributions.

In spite of the flexibility of the GLM class, the Poisson distribution is the only choice for the analysis of count data in this framework. Thus, in practice there is probably an over-emphasis on the use of the Poisson distribution for count data. A well-known limitation of the Poisson distribution is its mean-variance relationship: the variance equals the mean, a property referred to as equidispersion. In practice, however, count data can present other features, namely underdispersion (mean > variance) and overdispersion (mean < variance). There are many possible causes for departures from equidispersion, and in practical data analysis several of them may act at once.
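The three regimes can be checked on a sample by comparing the sample variance with the sample mean. The following is a minimal sketch; the function name `dispersion_index` and the two small data sets are illustrative assumptions, not taken from the text.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the ratio of sample variance to sample mean for a list of counts.

    A value near 1 suggests equidispersion, below 1 underdispersion
    and above 1 overdispersion.
    """
    sample_mean = mean(counts)
    if sample_mean == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / sample_mean


# Hypothetical samples, made up purely for illustration.
tight_counts = [2, 2, 3, 2, 3, 2, 3, 3]      # little spread around the mean
spread_counts = [0, 0, 1, 0, 9, 0, 12, 2]    # many zeros and a few large counts

print(dispersion_index(tight_counts))   # below 1: underdispersed
print(dispersion_index(spread_counts))  # above 1: overdispersed
```

With small samples the ratio is itself quite variable, so a value close to 1 should not be read as evidence of equidispersion.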

One possible cause of under- or overdispersion is departure from the Poisson process. It is well known that Poisson counts can be interpreted as the number of events in a given time interval when the times between arrivals are exponentially distributed. When this assumption is violated, the resulting counts can be under- or overdispersed (W. M. Zeviani et al. 2014). Another possible, and probably more frequent, cause of overdispersion is unobserved heterogeneity among experimental units. It can arise, for example, from correlation between individual responses, cluster sampling or omitted covariates.
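The link between waiting times and dispersion can be illustrated by simulation. The sketch below assumes gamma-distributed waiting times with mean 1 and counts the events that fall in a fixed interval; a shape of 1 gives the exponential case. The function names, the interval length, the number of replicates and the seed are all illustrative choices.

```python
import random
from statistics import mean, variance


def count_events(shape, horizon, rng):
    """Count events in (0, horizon] when waiting times are gamma with mean 1."""
    elapsed, events = 0.0, 0
    while True:
        # gammavariate(alpha, beta) has mean alpha * beta, so beta = 1 / shape.
        elapsed += rng.gammavariate(shape, 1.0 / shape)
        if elapsed > horizon:
            return events
        events += 1


def simulated_dispersion(shape, horizon=10.0, n_sim=5000, seed=1):
    """Estimate variance / mean of the counts over n_sim simulated intervals."""
    rng = random.Random(seed)
    counts = [count_events(shape, horizon, rng) for _ in range(n_sim)]
    return variance(counts) / mean(counts)


for shape in (0.3, 1.0, 5.0):
    print(shape, round(simulated_dispersion(shape), 2))
```

Under these assumptions, a shape below 1 (more variable waiting times than the exponential) should give a ratio above 1, a shape of 1 a ratio close to 1, and a shape above 1 a ratio below 1.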

In general, these departures from the Poisson distribution are manifested in the raw data as a zero-inflated or heavy-tailed count distribution. It is important to discuss the consequences of failing to take under- or overdispersion into account when analysing count data. In the case of overdispersion, the standard errors of the regression coefficients calculated under the Poisson assumption are too small, and the associated hypothesis tests will tend to give false positive results by incorrectly rejecting null hypotheses. The opposite situation arises with underdispersed data. In both cases, the Poisson model provides unreliable standard errors for the regression coefficients and hence potentially misleading inferences. However, the regression coefficients are still consistently estimated.
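The size of the effect on the standard errors can be made concrete in the simplest case. As an illustrative sketch, assume an intercept-only model fitted to n independent counts with common mean \mu and variance \phi\mu; the symbols n, \mu and \phi (the dispersion index) are introduced here for illustration and are not notation from the text.

```latex
\hat{\mu} = \bar{y}, \qquad
\operatorname{Var}_{\text{Poisson}}(\hat{\mu}) = \frac{\mu}{n}, \qquad
\operatorname{Var}_{\text{true}}(\hat{\mu}) = \frac{\phi\,\mu}{n}, \qquad
\frac{\operatorname{se}_{\text{true}}(\hat{\mu})}{\operatorname{se}_{\text{Poisson}}(\hat{\mu})} = \sqrt{\phi}.
```

Under these assumptions, \phi = 4 would make the Poisson standard error half the true one, whereas \phi < 1 would make it larger than the true one. The estimator \bar{y} itself is the same in both cases, which is why the point estimate remains consistent.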

The strategies for constructing alternative count distributions are related to the causes of non-equidispersion. When departures from the Poisson process are plausible, the class of duration dependence models (Winkelmann 2003) can be employed. This class of models changes the distribution of the time between events from the exponential to more general distributions, such as the gamma and inverse Gaussian. In this course, we shall discuss one example of this approach, namely the Gamma-Count distribution (W. M. Zeviani et al. 2014). This distribution assumes that the time between events is gamma distributed, and thus it can deal with under-, equi- and overdispersed count data.

On the other hand, unobserved heterogeneity, when present, in general implies extra variability and consequently overdispersed count data. In this case, Poisson mixtures are commonly applied. This approach consists of including random effects at the observation level, thereby taking the unobserved heterogeneity into account. Probably the most popular example of this approach is the negative binomial model, which corresponds to a Poisson-gamma mixture. In this course, we shall present the Poisson-Tweedie family of distributions, which in turn corresponds to Poisson-Tweedie mixtures (Bonat et al. 2016; Jørgensen and Kokonendji 2015).

Finally, a third approach to dealing with non-equidispersed count data consists of generalizing the Poisson distribution by adding an extra parameter to model under- and overdispersion. Such a generalization can be done using the class of weighted Poisson distributions (Del Castillo and Pérez-Casany 1998). One popular example of this approach is the Conway–Maxwell–Poisson distribution (COM-Poisson) (K. F. Sellers and Shmueli 2010). The COM-Poisson is a member of the exponential family, has the Poisson and geometric distributions as special cases and the Bernoulli distribution as a limiting case. It can deal with both under- and overdispersed count data. Given these properties, we chose to present the COM-Poisson model as part of this course.
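The special cases of the COM-Poisson distribution can be verified numerically. The sketch below assumes the usual parameterization, in which the probability of y is proportional to lambda^y / (y!)^nu, and approximates the normalizing constant by truncating the infinite sum. The function names, the truncation point `max_terms` and the parameter values are illustrative assumptions.

```python
import math


def com_poisson_pmf_table(lam, nu, max_terms=500):
    """Return COM-Poisson probabilities for y = 0, ..., max_terms - 1.

    Weights lam**y / (y!)**nu are computed on the log scale and the
    normalizing constant is approximated by the truncated sum.
    Requires lam > 0, and lam < 1 when nu == 0.
    """
    log_weights = [y * math.log(lam) - nu * math.lgamma(y + 1)
                   for y in range(max_terms)]
    shift = max(log_weights)  # subtract the maximum to avoid overflow
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def com_poisson_dispersion(lam, nu, max_terms=500):
    """Return variance / mean computed from the truncated probability table."""
    probs = com_poisson_pmf_table(lam, nu, max_terms)
    mean_y = sum(y * p for y, p in enumerate(probs))
    var_y = sum((y - mean_y) ** 2 * p for y, p in enumerate(probs))
    return var_y / mean_y


# nu = 1 recovers the Poisson; nu = 0 with lam < 1 recovers the geometric.
poisson_like = com_poisson_pmf_table(2.0, 1.0)
geometric_like = com_poisson_pmf_table(0.5, 0.0)
print(poisson_like[3], math.exp(-2.0) * 2.0 ** 3 / math.factorial(3))
print(geometric_like[3], (1 - 0.5) * 0.5 ** 3)

# nu > 1 gives underdispersion, nu < 1 gives overdispersion.
print(com_poisson_dispersion(2.0, 3.0), com_poisson_dispersion(2.0, 0.5))
```

The truncation is adequate only when the neglected tail is negligible, which holds for the values used here but may not for large lambda combined with small nu.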

In this course, we shall highlight and compare the flexibility of these distributions in dealing with count data by considering dispersion, zero-inflation and heavy-tail indexes. Furthermore, we specify regression models and illustrate their application with three worked examples.

In Chapter 2, we present the properties and regression models associated with the Poisson, Gamma-Count, Poisson-Tweedie and COM-Poisson distributions. Furthermore, we compare these distributions using the dispersion, zero-inflation and heavy-tail indexes. Estimation and inference for these models based on the likelihood paradigm are discussed in Chapter 3. In Chapter 4, we extend the Poisson-Tweedie model using only second-moment assumptions and present the fitting algorithm based on the estimating functions approach. Chapter 5 presents three worked examples.