## Statistics and Algebra. An Example.

Written by: Matt

There is a developing field called algebraic statistics, which explores probability and statistics problems involving discrete random variables using methods from commutative algebra and algebraic geometry. The basic point is that the parameters of such statistical models are often constrained by polynomial relationships, and polynomial relationships are exactly the subject of commutative algebra and algebraic geometry. I would like to learn more about this connection, so in this post I'll describe one example that I worked through; it comes from a book on the subject by Bernd Sturmfels. Disclaimer: the rest of this post is technical.

Suppose that you have three independent exponential random variables $X_1, X_2, X_3$ with parameters $\lambda_1, \lambda_2, \lambda_3$ respectively. Recall that this means $X_i$ has the density function

$f(t) = \lambda_i e^{ -\lambda_i t}$ when $t > 0$ and $0$ otherwise.

Such a random variable is often interpreted in terms of waiting time to first occurrence for a Poisson process.

Suppose that rather than observing the occurrence times $X_i$ directly, we only get to observe whether each of $X_1$ and $X_2$ occurs before $X_3$. (I'll leave you to think about how this could come up in practice, but it seems like it could.) So what we observe is a discrete random variable $Y$ with four possible states $(0,0), (1,0), (0,1), (1,1)$, where the $i$th coordinate indicates whether $X_i$ occurs before $X_3$. Now for an exercise in working with probability distributions (something the students in the class I taught could hopefully do!):

$P(Y = (0,0)) = \frac{\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3}$
$P(Y = (1,0)) = \frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3} \cdot \frac{\lambda_3}{\lambda_2 + \lambda_3}$
$P(Y = (0,1)) = \frac{\lambda_2}{\lambda_1 + \lambda_2 + \lambda_3} \cdot \frac{\lambda_3}{\lambda_1 + \lambda_3}$
$P(Y = (1,1)) = \frac{\lambda_1}{\lambda_1 + \lambda_3} \cdot \frac{\lambda_2}{\lambda_2 + \lambda_3} \cdot \frac{\lambda_1 + \lambda_2 + 2\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3}$
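These four formulas are easy to sanity-check by simulation. Here's a minimal Python sketch (standard library only); the rates $2.0, 1.0, 1.5$ are arbitrary test values, not anything dictated by the model:

```python
import random

# Monte Carlo sanity check of the four outcome probabilities.
# The rates below are arbitrary test values.
random.seed(0)
l1, l2, l3 = 2.0, 1.0, 1.5
N = 200_000

counts = {(0, 0): 0, (1, 0): 0, (0, 1): 0, (1, 1): 0}
for _ in range(N):
    x1 = random.expovariate(l1)
    x2 = random.expovariate(l2)
    x3 = random.expovariate(l3)
    # Y records whether X_1 and X_2 each occur before X_3.
    counts[(int(x1 < x3), int(x2 < x3))] += 1

S = l1 + l2 + l3
exact = {
    (0, 0): l3 / S,
    (1, 0): (l1 / S) * (l3 / (l2 + l3)),
    (0, 1): (l2 / S) * (l3 / (l1 + l3)),
    (1, 1): (l1 / (l1 + l3)) * (l2 / (l2 + l3)) * ((l1 + l2 + 2 * l3) / S),
}
for y in exact:
    print(y, counts[y] / N, round(exact[y], 4))
```

With $200{,}000$ samples the empirical frequencies agree with the formulas to about two decimal places, and the four exact values sum to $1$ as they must.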

Great. Now suppose that we sample the variable $Y$ some large number of times and record the counts $(a, b, c, d)$ of the four values above. Given these observations, we would like to estimate the three $\lambda$ values. One way of doing this is, for a given choice of the $\lambda$s, to compute $P(a,b,c,d \mid \lambda_1, \lambda_2, \lambda_3)$, and then choose the $\lambda$ values that maximize this probability (this is the method of maximum likelihood estimation). Rather than maximize this function directly, we'll maximize its log (which amounts to the same thing) to take advantage of the multiplicative structure. Up to an additive constant not depending on the $\lambda$s (the log of the multinomial coefficient), you can compute this function:

$p(\lambda) = (b + d) \log (\lambda_1) + (c + d) \log (\lambda_2) + (a + b + c) \log(\lambda_3) + d \log(\lambda_1 + \lambda_2 + 2\lambda_3) - (c + d) \log(\lambda_1 + \lambda_3) - (b + d) \log(\lambda_2 + \lambda_3) - (a + b + c + d) \log(\lambda_1 + \lambda_2 + \lambda_3).$
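Coefficient bookkeeping like this is error-prone, so here is a stdlib-only numeric spot-check that the expanded form agrees with $a \log p_{(0,0)} + b \log p_{(1,0)} + c \log p_{(0,1)} + d \log p_{(1,1)}$; the counts and rates are arbitrary test values:

```python
import math

# Spot-check the expanded log-likelihood against a direct computation
# from the four outcome probabilities. Counts and rates are arbitrary.
a, b, c, d = 1.0, 2.0, 3.0, 5.0
l1, l2, l3 = 0.7, 1.3, 2.1

S = l1 + l2 + l3
p00 = l3 / S
p10 = (l1 / S) * (l3 / (l2 + l3))
p01 = (l2 / S) * (l3 / (l1 + l3))
p11 = (l1 / (l1 + l3)) * (l2 / (l2 + l3)) * ((l1 + l2 + 2 * l3) / S)

direct = (a * math.log(p00) + b * math.log(p10)
          + c * math.log(p01) + d * math.log(p11))

expanded = (
    (b + d) * math.log(l1)
    + (c + d) * math.log(l2)
    + (a + b + c) * math.log(l3)   # note: lambda_3 picks up a + b + c
    + d * math.log(l1 + l2 + 2 * l3)
    - (c + d) * math.log(l1 + l3)
    - (b + d) * math.log(l2 + l3)
    - (a + b + c + d) * math.log(S)
)
print(abs(direct - expanded))  # agrees to floating-point precision
```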

Since each probability is homogeneous of degree $0$ in the $\lambda_i$ (a ratio of polynomials of the same degree), there's no harm in setting $\lambda_3 = 1$ and solving for $\lambda_1$ and $\lambda_2$; any other solution will agree with this one up to scaling all the $\lambda$ values. After doing this, the critical-point equations (the components of the gradient of $p$, set to zero) are
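The degree-$0$ claim is itself easy to check numerically: scaling all three rates by a common constant leaves each outcome probability unchanged. A small sketch (the rates and the scale factor are arbitrary test values):

```python
# Check that each outcome probability is invariant under scaling all
# three rates by the same constant, which justifies setting lambda_3 = 1.
# Rates and scale factor are arbitrary test values.
def probs(l1, l2, l3):
    S = l1 + l2 + l3
    return (l3 / S,
            (l1 / S) * (l3 / (l2 + l3)),
            (l2 / S) * (l3 / (l1 + l3)),
            (l1 / (l1 + l3)) * (l2 / (l2 + l3)) * ((l1 + l2 + 2 * l3) / S))

p = probs(0.7, 1.3, 2.1)
q = probs(0.7 * 4.5, 1.3 * 4.5, 2.1 * 4.5)
print(max(abs(x - y) for x, y in zip(p, q)))  # ~0
```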

$\frac{b + d}{\lambda_1} + \frac{d}{\lambda_1 + \lambda_2 + 2} - \frac{c + d}{\lambda_1 + 1} - \frac{a + b + c + d}{\lambda_1 + \lambda_2 + 1} = 0.$
$\frac{c + d}{\lambda_2} + \frac{d}{\lambda_1 + \lambda_2 + 2} - \frac{b + d}{\lambda_2 + 1} - \frac{a + b + c + d}{\lambda_1 + \lambda_2 + 1} = 0.$
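As a quick check that these really are the gradient components of $p$ with $\lambda_3 = 1$, one can compare them against central finite differences; the counts and the evaluation point below are arbitrary test values:

```python
import math

# Finite-difference check of the two critical-point expressions against
# the log-likelihood with lambda_3 = 1. Counts and point are arbitrary.
a, b, c, d = 1.0, 2.0, 3.0, 5.0
n = a + b + c + d

def p(l1, l2):
    # log-likelihood with lambda_3 = 1 (the (a+b+c)log(lambda_3) term drops)
    return ((b + d) * math.log(l1) + (c + d) * math.log(l2)
            + d * math.log(l1 + l2 + 2.0)
            - (c + d) * math.log(l1 + 1.0)
            - (b + d) * math.log(l2 + 1.0)
            - n * math.log(l1 + l2 + 1.0))

def grad(l1, l2):
    g1 = (b + d) / l1 + d / (l1 + l2 + 2) - (c + d) / (l1 + 1) - n / (l1 + l2 + 1)
    g2 = (c + d) / l2 + d / (l1 + l2 + 2) - (b + d) / (l2 + 1) - n / (l1 + l2 + 1)
    return g1, g2

l1, l2, h = 0.9, 1.4, 1e-6
fd1 = (p(l1 + h, l2) - p(l1 - h, l2)) / (2 * h)
fd2 = (p(l1, l2 + h) - p(l1, l2 - h)) / (2 * h)
g1, g2 = grad(l1, l2)
print(abs(fd1 - g1), abs(fd2 - g2))  # both tiny
```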

Things are starting to look more like polynomials! If someone asked you to solve this, you would probably try to clear denominators, use one equation to solve for $\lambda_1$, and plug the result into the second equation to get one equation in one unknown. The problem is that clearing denominators introduces extra solutions: you will find solutions of the new equations which are not solutions of the equations you are actually trying to solve. In this example, $(\lambda_1, \lambda_2) = (0, -1)$ is a solution of the cleared-denominator system, but not of the original equations.
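To see this concretely, here is a sketch that evaluates the two cleared-denominator polynomials at $(0, -1)$: both vanish, yet the denominators $\lambda_1$, $\lambda_2 + 1$, and $\lambda_1 + \lambda_2 + 1$ also vanish there, so the original equations aren't even defined at this point (the counts are arbitrary test values):

```python
# Evaluate the cleared-denominator polynomials at the extraneous point
# (0, -1). Counts are arbitrary test values.
a, b, c, d = 1.0, 2.0, 3.0, 5.0
n = a + b + c + d

def cleared1(l1, l2):
    # first equation times l1*(l1+l2+2)*(l1+1)*(l1+l2+1)
    return ((b + d) * (l1 + l2 + 2) * (l1 + 1) * (l1 + l2 + 1)
            + d * l1 * (l1 + 1) * (l1 + l2 + 1)
            - (c + d) * l1 * (l1 + l2 + 2) * (l1 + l2 + 1)
            - n * l1 * (l1 + l2 + 2) * (l1 + 1))

def cleared2(l1, l2):
    # second equation times l2*(l1+l2+2)*(l2+1)*(l1+l2+1)
    return ((c + d) * (l1 + l2 + 2) * (l2 + 1) * (l1 + l2 + 1)
            + d * l2 * (l2 + 1) * (l1 + l2 + 1)
            - (b + d) * l2 * (l1 + l2 + 2) * (l1 + l2 + 1)
            - n * l2 * (l1 + l2 + 2) * (l2 + 1))

print(cleared1(0.0, -1.0), cleared2(0.0, -1.0))  # both evaluate to 0
# The denominators l1, l2 + 1, and l1 + l2 + 1 all vanish at (0, -1),
# so this point says nothing about the original system.
```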

How do you know which solutions are extra and which are not? Commutative algebra has a process for exactly this, called the saturation of an ideal. In our example, we take the ideal generated by the polynomials obtained from the two equations above by clearing denominators, and saturate it with respect to the product of the denominators appearing in those equations. (This is an ideal in the ring $\mathbb{Q}(a,b,c,d)[\lambda_1, \lambda_2]$.)

Saturation of an ideal is cumbersome to carry out by hand for all but the simplest examples, but it has been implemented in computer algebra packages like Macaulay2, Sage, and Singular. In general it's computationally intensive, so it may be slow for 'large' ideals. One thing that speeds it up in this example is substituting actual numbers for $(a,b,c,d)$. For example, when $(a,b,c,d) = (1,2,3,5)$, then

$\lambda_1 = (5\lambda_2^2 + 35\lambda_2 - 16) / 56$, where $\lambda_2$ is a root of
$5\lambda_2^3 + 75\lambda_2^2 - 128\lambda_2 - 128 = 0.$

I then calculate that $\lambda_2 = 2.17559$ (the cubic has only one positive root) and $\lambda_1 = 1.49664$ (full disclosure: I used Mathematica). In general the result of the saturation lets you solve for $\lambda_1$ in terms of $\lambda_2$, with $\lambda_2$ a root of a polynomial which is usually of degree 3:
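This particular answer can be double-checked without any computer algebra system: bisect the cubic for its unique positive root, recover $\lambda_1$ from the quadratic expression above, and substitute both values into the two critical-point equations. A standard-library sketch:

```python
# Verify the (a,b,c,d) = (1,2,3,5) solution with no computer algebra:
# bisect the cubic for its positive root, recover lambda_1, and plug
# both values back into the two critical-point equations.
def cubic(x):
    return 5 * x**3 + 75 * x**2 - 128 * x - 128

lo, hi = 2.0, 3.0        # cubic(2) < 0 < cubic(3), and cubic is increasing here
for _ in range(80):      # bisection down to machine precision
    mid = (lo + hi) / 2
    if cubic(mid) < 0:
        lo = mid
    else:
        hi = mid
l2 = (lo + hi) / 2
l1 = (5 * l2**2 + 35 * l2 - 16) / 56

# Critical-point equations with lambda_3 = 1 and counts (1,2,3,5):
# b + d = 7, c + d = 8, d = 5, a + b + c + d = 11.
eq1 = 7 / l1 + 5 / (l1 + l2 + 2) - 8 / (l1 + 1) - 11 / (l1 + l2 + 1)
eq2 = 8 / l2 + 5 / (l1 + l2 + 2) - 7 / (l2 + 1) - 11 / (l1 + l2 + 1)
print(round(l1, 4), round(l2, 4))  # ~ 1.4966 2.1756
print(abs(eq1), abs(eq2))          # both tiny
```

Both residuals vanish to floating-point precision, so the saturation output really does cut out the maximum-likelihood critical point.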

$(a + 2b)(c - b)\lambda_2^3 - (ab + b^2 - 3ac - 6bc + c^2 - 2ad - 5bd + cd) \lambda_2^2 + (2a + 3b - 3c - 3d)(c + d)\lambda_2 - 2(c + d)^2 = 0$