Limit theorems are a beautiful part of probability theory, but in real life data comes in finite samples. The Poisson limit theorem says that binomial distributions \(\mu_n\) with parameters \(p=\lambda/n\) converge in distribution to a Poisson distribution \(\mathcal{P}_\lambda\). Informally, the Poisson distribution models the total number of occurances of i.i.d. Bernoulli rare events (Wikipedia discussion).

Some natural questions follow from the classical limit theorem.

- What happens if some events are dependent?
- What if the events don’t have identical distributions?
- How fast is the convergence?

I learned from Persi Diaconis that finitely many events which are “not too dependent” and occur “not too frequently” will behave like the Poisson.

Let’s introduce some notation to understand this. Suppose we have \(n\) events \(X_1, \dots, X_n\), not necessarily independent. Each event \(X_i\) either occurs or does not occur with probability \(p_i\)

\[X_i = \left\{\begin{array}{ll} 1 & \quad \text{w.p.}\ \ p_i \\ 0 & \quad \text{w.p.}\ \ 1-p_i \end{array}\right.\]We’re relaxing the independence requirement, so we can’t just multiply to get the probability that two events both occur. We’ll use \(p_{ij}\) to denote the probability that event \(X_i X_j\) occurs, i.e. both \(X_i\) and \(X_j\) happen.

Now let \(W=\sum_{i=1}^n X_i\) and let \(\lambda = \sum_{i=1}^n p_i\). The question is then *how close is \(W\) to \(\mathcal{P}_\lambda\)*?

Naturally, the answer will depend on how the events depend on each other. If each is the same event, then their sum will not behave at all like a Poisson distribution. If they are all independent and rare, then as \(n \to \infty\) we expect the difference in distribution to \(\mathcal{P}_\lambda\) to be very small because of the classical limit theorem.

To deal with dependency, we can introduce a *dependency graph*. Each node represents one of the events, and we draw edges so that a collection of events \(A\) is independent of another collection \(B\) if and only if \(A\) and \(B\) have no edges between each other in the graph. For example, in the trivial case where each event is independent, the graph has no edges. If only the first two events depend on each other, then there is an edge between nodes 1 and 2, but no other edges.

Finally, we need to introduce a metric on probability distributions to describe what it means for distributions to be “close” in a rigorous way. We’ll use the *total variation distance*, defined between two probability distributions \(\mu\) and \(\nu\) as

i.e. the largest discrepency between the probabilities that \(\mu\) and \(\nu\) assign to the same event.

At last we can state the result, gloriously free of limits:

\[|| W - \mathcal{P}_\lambda||_{TV} \leq \min\{3, \lambda^{-1}\} \left( \sum_{i=1}^n \sum_{j \in N(i)} p_{ij} + \sum_{i=1}^{n} \sum_{j \in N(i) \cup \{i\}} p_i p_j \right)\]where \(N(i)\) denotes the neighborhood of \(i\) in the dependency graph.

The proof of this approximation theorem uses Stein’s method, a popular technique in modern probability theory. For further reading, see the paper “Poisson Approximation and the Chen-Stein Method” by Arratia, Goldstein, and Gordon and the references I list here.