Clustered data is commonly encountered in medical research. When clustering is present, we expect the observations within a cluster to be ‘more alike’ than observations from different clusters. This induces correlation (‘intra-cluster correlation’) between observations: observations within the same cluster are correlated, whereas observations from separate clusters are independent.
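As a rough illustration of how ‘more alike’ can be quantified, the sketch below estimates the intra-cluster correlation from a one-way analysis of variance. It assumes equal-sized clusters and a continuous outcome; the function name `anova_icc` and the clinic readings are made up for the example and are not taken from any study.

```python
from statistics import mean


def anova_icc(clusters):
    """One-way ANOVA estimate of the intra-cluster correlation.

    Assumes every cluster has the same number of observations (m >= 2)
    and that there are at least two clusters.
    """
    k = len(clusters)
    m = len(clusters[0])
    if k < 2 or m < 2 or any(len(c) != m for c in clusters):
        raise ValueError("need at least 2 equal-sized clusters of size >= 2")

    grand_mean = mean(x for c in clusters for x in c)
    cluster_means = [mean(c) for c in clusters]

    # Between-cluster mean square: spread of cluster means around the grand mean.
    msb = m * sum((cm - grand_mean) ** 2 for cm in cluster_means) / (k - 1)
    # Within-cluster mean square: spread of observations around their own cluster mean.
    msw = sum(
        (x - cm) ** 2 for c, cm in zip(clusters, cluster_means) for x in c
    ) / (k * (m - 1))

    return (msb - msw) / (msb + (m - 1) * msw)


# Hypothetical outcome values from three clinics (illustrative only).
clinics = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(round(anova_icc(clinics), 3))  # 0.897
```

A value near 1 means almost all of the variation is between clusters; a value near 0 means the clusters are barely distinguishable.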

Clustered data can arise in a range of scenarios, such as repeated measurements over time clustered within patients; subjects clustered within centres such as hospitals or GP practices; and clustering by the specialist delivering the intervention, such as a surgeon or therapist. Clustering may arise because different hospitals or practitioners attract different types of patients, owing to disease severity or the socio-demographics of their catchment areas, and because practitioners may have differential effects on outcomes.

Many statistical methods (tests and models) are based on the assumption that observations are independent. If we apply these methods to clustered data, the results may appear overly precise (standard errors and confidence intervals that are too small), and consequently incorrect conclusions may be drawn. With clustered data, it is important to undertake statistical analyses that account for the clustering.
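The loss of apparent precision can be seen in a small sketch. Below, the standard error of an overall mean is computed twice: once treating every observation as independent, and once using one summary value (the mean) per cluster, which is a simple way of respecting the clustering when clusters are of equal size. The data and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def naive_se(clusters):
    """Standard error of the mean, wrongly treating all observations as independent."""
    values = [x for c in clusters for x in c]
    return stdev(values) / sqrt(len(values))


def cluster_level_se(clusters):
    """Standard error of the mean based on one summary value (the mean) per cluster."""
    cluster_means = [mean(c) for c in clusters]
    return stdev(cluster_means) / sqrt(len(cluster_means))


# Hypothetical outcome values from three clusters (illustrative only).
clusters = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(round(naive_se(clusters), 3))          # 0.913
print(round(cluster_level_se(clusters), 3))  # 1.732
```

With these strongly clustered values, the naive standard error is roughly half the cluster-level one, so a naive analysis would overstate the precision of the estimate.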

The simplest way to fit a model that accounts for clustering (the ‘fixed effect’ method) is to add a binary predictor variable for each cluster except one, which serves as the reference cluster. This method is appropriate when the number of clusters is small, for instance, when analysing subjects from four different hospitals. With around 20 or more clusters, and when we do not want to compare these specific clusters directly, we generally use ‘random effects’ models (also called mixed models or multilevel models) or ‘generalised estimating equations’ (GEEs). The interpretation of the coefficients from these two approaches may differ (though only appreciably with small clusters, and never for linear models of continuous outcomes, where the two are equivalent), so the choice of model may depend on the objective of the study.
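For the fixed effect method, the binary predictors can be built as in the following sketch. The helper `cluster_indicators` and the hospital labels are hypothetical; in practice most statistical packages create these indicator variables automatically when a variable is declared categorical.

```python
def cluster_indicators(cluster_labels, reference):
    """Build 0/1 indicator columns for every cluster except the reference.

    Returns the column names and one row of indicators per observation.
    """
    levels = sorted(set(cluster_labels))
    if reference not in levels:
        raise ValueError("reference cluster does not appear in the data")

    # One column per non-reference cluster; the reference is the all-zero row.
    columns = [level for level in levels if level != reference]
    rows = [
        [1 if label == level else 0 for level in columns]
        for label in cluster_labels
    ]
    return columns, rows


# Hypothetical hospital label for each of five subjects.
hospitals = ["A", "B", "C", "D", "A"]
columns, rows = cluster_indicators(hospitals, reference="A")
print(columns)  # ['B', 'C', 'D']
print(rows[1])  # [1, 0, 0]
```

Four hospitals give three indicator columns, and each coefficient then compares one hospital with the reference hospital.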

Random effects methods give ‘conditional’ (cluster-specific) estimates of effects, i.e. the effect of an individual changing groups within a given cluster; a side benefit is an estimate of the between-cluster variability itself. GEEs give population-average (‘marginal’) effects, comparing outcomes averaged across clusters in one group with outcomes averaged across clusters in another group; hence they provide estimates of the average effect if an individual moved from one group to another, regardless of cluster.
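The difference between the two interpretations can be shown numerically for a binary outcome. The sketch below does not fit either model; it simply assumes three equally sized clusters with hypothetical baseline log-odds and the same within-cluster log odds ratio of 1, then computes the odds ratio within a cluster and the odds ratio after averaging risks across clusters.

```python
from math import exp
from statistics import mean


def expit(x):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + exp(-x))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


# Hypothetical cluster baselines on the log-odds scale (illustrative only).
cluster_intercepts = [-2.0, 0.0, 2.0]
# Assumed effect of the exposure within every cluster, on the log-odds scale.
conditional_log_or = 1.0

# Within-cluster odds ratio: the quantity a random effects model targets.
conditional_or = exp(conditional_log_or)

# Population-average odds ratio: average the risks over clusters first,
# then compare odds. This is the quantity a GEE targets.
p_control = mean(expit(b) for b in cluster_intercepts)
p_exposed = mean(expit(b + conditional_log_or) for b in cluster_intercepts)
marginal_or = odds(p_exposed) / odds(p_control)

print(round(conditional_or, 2), round(marginal_or, 2))
```

Under these assumptions the within-cluster odds ratio is about 2.72, while the population-average odds ratio is about 1.86. Neither is wrong; they answer different questions.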

Allowing for clustering in the analysis will generally increase the width of confidence intervals, and may therefore reduce statistical significance. To ensure your study is adequately powered, it is important to perform sample size calculations that take into account the clustered nature of the data.

Once you start looking for clusters, you might see them everywhere, and you might see several levels of clustering. Patients may be clustered within doctors, who are in turn clustered within GP practices, and so on. Often, we only need to allow for clustering of patients at one level. Clustering at the lower level (e.g. doctor) is incorporated within clustering at the higher level (e.g. GP practice), so we don’t need to allow for it separately.
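To make the sample size point above concrete, a commonly used approximation inflates the sample size required under independence by a ‘design effect’ of 1 + (m - 1) x ICC, where m is the cluster size and ICC is the intra-cluster correlation. The sketch below assumes equal-sized clusters and a comparison made between whole clusters (as in a cluster randomised trial); the function names and input values are hypothetical, and a statistician should confirm the approach for any particular design.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Design effect 1 + (m - 1) * ICC, assuming equal cluster sizes."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return 1.0 + (cluster_size - 1) * icc


def clustered_sample_size(n_independent, cluster_size, icc):
    """Inflate a sample size calculated under independence to allow for clustering."""
    inflated = n_independent * design_effect(cluster_size, icc)
    # Round away floating-point noise before rounding up to a whole subject.
    return ceil(round(inflated, 9))


# Hypothetical inputs: 200 subjects needed if independent,
# 21 subjects per cluster, intra-cluster correlation of 0.05.
print(clustered_sample_size(200, 21, 0.05))  # 400
```

Even a small intra-cluster correlation can double the required sample size when clusters are moderately large, which is why it needs to be considered at the design stage.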

Authors: Victoria Cornelius, Joint Lead and Hilary Watt, Consultant Statistician, RDS London