Probability theory is the mathematical framework that underpins statistical analysis. It’s a field that has its roots in the study of gambling and uncertainty, but it’s now an essential tool for a wide range of disciplines. For computational neuroethologists, understanding probability theory is crucial for everything from designing experiments and analyzing data, to building and testing models of neural systems. In this post, we’ll cover the basics of probability theory, using examples from the field of computational neuroethology to illustrate key concepts.

Sample Space and Events

The sample space \(Ω\) is the set of all possible outcomes of a random experiment. Each outcome \(ω ∈ Ω\) can be thought of as a complete description of the state of the world at the end of the experiment. For example, in a study of rat behavior, the sample space might be the set of all possible sequences of a rat’s movements in a maze.

An event \(A\) is a subset of the sample space, i.e., \(A ⊆ Ω\). It’s a collection of possible outcomes of an experiment. For example, an event might be that a rat reaches the end of a maze.

Probability Measure

A probability measure \(P\) is a function \(P : F → R\) that assigns a probability to each event, where \(F\) denotes the collection of events (subsets of \(Ω\)). It satisfies the following properties:

  • \(P(A) ≥ 0\), for all \(A ∈ F\)
  • \(P(Ω) = 1\)
  • If \(A_1, A_2, ...\) are disjoint events (i.e., \(A_i ∩ A_j = ∅\) whenever \(i ≠ j\)), then \(P(∪_iA_i) = ∑_iP(A_i)\)

These are known as the Axioms of Probability.
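As a minimal sketch of how these axioms can be checked numerically (the outcomes and probabilities below are made up purely for illustration), consider a toy sample space of maze-trial outcomes in Python:

```python
# Toy sample space: each outcome is a (hypothetical) summary of one maze trial.
sample_space = {"fast_success", "slow_success", "gave_up", "wrong_exit"}

# A candidate probability measure, specified by the probability of each outcome.
p = {"fast_success": 0.4, "slow_success": 0.3, "gave_up": 0.2, "wrong_exit": 0.1}

def prob(event):
    """P(A) for an event A, i.e., a subset of the sample space."""
    return sum(p[outcome] for outcome in event)

# Axiom 1: probabilities are nonnegative.
assert all(p[outcome] >= 0 for outcome in sample_space)

# Axiom 2: the whole sample space has probability 1.
assert abs(prob(sample_space) - 1.0) < 1e-12

# Axiom 3 (finite additivity): for disjoint events, P(A ∪ B) = P(A) + P(B).
A = {"fast_success", "slow_success"}   # the rat reaches the end
B = {"gave_up"}                        # the rat stops searching
assert A.isdisjoint(B)
assert abs(prob(A | B) - (prob(A) + prob(B))) < 1e-12
```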

Random Variables

A random variable \(X\) is a function \(X : Ω → R\). Typically, random variables are denoted using upper case letters \(X(ω)\) or more simply \(X\) (where the dependence on the random outcome \(ω\) is implied). The value that a random variable may take on is denoted using lower case letters \(x\).

For example, in a study of rat behavior, the elements of the sample space \(Ω\) are sequences of the rat’s movements. Suppose that \(X(ω)\) is the number of wrong turns the rat makes before reaching the end of the maze. Since the maze has only finitely many junctions, \(X(ω)\) can take only a finite number of values, so it is known as a discrete random variable. If instead \(X(ω)\) were the time it takes the rat to reach the end of the maze, it could take any value in a continuous range, and we would call it a continuous random variable.
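A random variable really is just a function on the sample space. Here is a minimal sketch, with a made-up encoding of an outcome as a sequence of moves:

```python
# An outcome ω is a complete record of one trial: here, a (hypothetical)
# sequence of moves the rat makes in the maze.
omega = ["forward", "left", "forward", "backtrack", "right", "forward"]

# A random variable is a function X : Ω → R. Here X(ω) counts wrong turns,
# which can take only finitely many values in a finite maze, so X is discrete.
def X(outcome):
    return sum(1 for move in outcome if move == "backtrack")

print(X(omega))  # the value x that X takes on this particular outcome
```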

Cumulative Distribution Functions (CDFs)

A cumulative distribution function (CDF) is a function \(F_X : R → [0, 1]\) which specifies a probability measure as, \(F_X(x) ≜ P(X ≤ x)\). Here, \(F_X(x)\) is the CDF of the random variable \(X\), \(R\) represents the set of real numbers, and \(P(X ≤ x)\) is the probability that the random variable \(X\) takes a value less than or equal to \(x\).

Example

Consider a rat navigating through a maze. Let \(X\) be the time it takes for the rat to reach the end of the maze. The CDF \(F_X(t)\) gives the probability that the rat will reach the end of the maze in time less than or equal to \(t\).
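A simple way to estimate a CDF from data is the empirical CDF: the fraction of observed outcomes that are at most \(t\). The sketch below uses simulated completion times drawn from an arbitrary gamma distribution, chosen only for illustration and not as a claim about real rats:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated completion times (seconds) for many maze trials.
times = rng.gamma(shape=4.0, scale=10.0, size=1000)

def empirical_cdf(samples, t):
    """F_X(t) estimated as the fraction of observed times that are <= t."""
    return np.mean(np.asarray(samples) <= t)

print(empirical_cdf(times, 30.0))  # estimate of P(X <= 30 s)

# Plot the full empirical CDF.
grid = np.linspace(0, times.max(), 200)
plt.plot(grid, [empirical_cdf(times, t) for t in grid])
plt.xlabel("time to reach end of maze (s)")
plt.ylabel("empirical $F_X(t)$")
plt.show()
```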


Probability Mass Functions (PMFs)

When a random variable \(X\) takes on a finite set of possible values (i.e., \(X\) is a discrete random variable), a simpler way to represent its probability measure is to directly specify the probability of each value that the random variable can assume. In particular, a probability mass function (PMF) is a function \(p_X : R → [0, 1]\) such that

\[p_X(x) ≜ P(X = x)\]

In the case of a discrete random variable, we use the notation \(Val(X)\) for the set of possible values that the random variable \(X\) may assume. For example, if \(X(ω)\) is a random variable indicating the number of heads out of ten tosses of a coin, then \(Val(X) = \{0, 1, 2, \ldots, 10\}\).

Example

Consider a rat navigating through a maze. Let \(X\) be the number of turns it takes for the rat to reach the end of the maze. The PMF \(p_X(k)\) gives the probability that the rat will reach the end of the maze in \(k\) turns.
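Here is a minimal sketch of estimating a PMF from data by counting. The simulated turn counts come from an arbitrary Poisson distribution, used only for illustration:

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(1)

# Simulated numbers of turns per trial (not a model of real rat behavior).
turns = rng.poisson(lam=6.0, size=500)

# Empirical PMF: p_X(k) estimated as the fraction of trials with exactly k turns.
counts = Counter(turns.tolist())
pmf = {k: c / len(turns) for k, c in sorted(counts.items())}

print(pmf)                # estimated p_X(k) for each observed k
print(sum(pmf.values()))  # equals 1, as a PMF must sum to 1
```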

Probability Density Functions (PDFs)

For some continuous random variables, the cumulative distribution function \(F_X(x)\) is differentiable everywhere. In these cases, we define the Probability Density Function or PDF as the derivative of the CDF, i.e.,

\[f_X(x) ≜ \frac{dF_X(x)}{dx}\]

Note here, that the PDF for a continuous random variable may not always exist (i.e., if \(F_X(x)\) is not differentiable everywhere).

By the definition of the derivative, for very small \(∆x\),

\[P(x ≤ X ≤ x + ∆x) ≈ f_X(x)∆x\]

Example

Consider a rat navigating through a maze. Let \(X\) be the time it takes for the rat to reach the end of the maze. The PDF \(f_X(t)\) describes how probability is distributed over possible completion times: for a small interval \(∆t\), \(f_X(t)∆t\) is approximately the probability that the rat finishes between time \(t\) and time \(t + ∆t\).
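The approximation \(P(t ≤ X ≤ t + ∆t) ≈ f_X(t)∆t\) is easy to check numerically. The sketch below assumes, purely for illustration, an exponential model of completion times:

```python
import numpy as np

# A made-up model for illustration: X ~ Exponential(rate=0.05),
# i.e., completion times with mean 20 s. Not a claim about real data.
rate = 0.05

def f_X(t):
    """PDF of the exponential distribution."""
    return rate * np.exp(-rate * t) if t >= 0 else 0.0

def F_X(t):
    """CDF of the exponential distribution (the PDF is its derivative)."""
    return 1.0 - np.exp(-rate * t) if t >= 0 else 0.0

t, dt = 20.0, 0.1
exact = F_X(t + dt) - F_X(t)   # P(t <= X <= t + ∆t)
approx = f_X(t) * dt           # f_X(t) ∆t
print(exact, approx)           # the two values agree closely for small ∆t
```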

Joint and Marginal Probability Mass Functions

If \(X\) and \(Y\) are discrete random variables, then the joint probability mass function \(p_{XY} : R×R → [0, 1]\) is defined by

\[p_{XY}(x, y) = P(X = x, Y = y)\]

Here, \(0 ≤ p_{XY}(x, y) ≤ 1\) for all \(x, y\), and \(\sum_{x∈Val(X)}\sum_{y∈Val(Y)} p_{XY}(x, y) = 1\).

How does the joint PMF over two variables relate to the probability mass function for each variable separately? It turns out that

\[p_X(x) = \sum_{y} p_{XY}(x, y)\]

and similarly for \(p_Y(y)\). In this case, we refer to \(p_X(x)\) as the marginal probability mass function of \(X\). In statistics, the process of forming the marginal distribution with respect to one variable by summing out the other variable is often known as “marginalization”.

Example

Consider a rat navigating through a maze. Let \(X\) be the number of turns it takes for the rat to reach the end of the maze, and let \(Y\) be the number of times the rat retraces its steps. The joint PMF \(p_{XY}(k, l)\) gives the probability that the rat will reach the end of the maze in \(k\) turns and retrace its steps \(l\) times. The marginal PMF \(p_X(k)\) gives the probability that the rat will reach the end of the maze in \(k\) turns, regardless of how many times it retraces its steps.
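Here is a minimal sketch of a joint PMF and its marginals, using a small made-up probability table over (turns, retraces); marginalization is just summing the table along one axis:

```python
import numpy as np

# A small made-up joint PMF over (number of turns k, number of retraces l),
# purely for illustration. Rows index k, columns index l.
turns_vals = [4, 5, 6]
retrace_vals = [0, 1, 2]
p_XY = np.array([
    [0.10, 0.05, 0.05],
    [0.20, 0.15, 0.05],
    [0.15, 0.15, 0.10],
])
assert np.isclose(p_XY.sum(), 1.0)   # a joint PMF must sum to 1

# Marginalization: sum out the other variable.
p_X = p_XY.sum(axis=1)   # p_X(k) = sum_l p_XY(k, l)
p_Y = p_XY.sum(axis=0)   # p_Y(l) = sum_k p_XY(k, l)

print(dict(zip(turns_vals, p_X)))
print(dict(zip(retrace_vals, p_Y)))
```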

Joint and Marginal Probability Density Functions

Let \(X\) and \(Y\) be two continuous random variables with joint distribution function \(F_{XY}\). In the case that \(F_{XY}(x, y)\) is everywhere differentiable in both \(x\) and \(y\), then we can define the joint probability density function as,

\[f_{XY}(x, y) = \frac{\partial^2 F_{XY}(x, y)}{\partial x \partial y}\]

Like in the single-dimensional case, \(f_{XY}(x, y) ≠ P(X = x, Y = y)\), but rather,

\[\iint_{(x, y) \in A} f_{XY}(x, y) dx dy = P((X, Y) ∈ A)\]

Note that the values of the probability density function \(f_{XY}(x, y)\) are always nonnegative, but they may be greater than 1. Nonetheless, it must be the case that,

\[\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y) dx dy = 1\]

Analogous to the discrete case, we define

\[f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y) dy\]

as the marginal probability density function (or marginal density) of \(X\), and similarly for \(f_Y(y)\).

Example

Consider a rat navigating through a maze. Let \(X\) be the time it takes for the rat to reach the end of the maze, and let \(Y\) be the rat’s average speed. The joint PDF \(f_{XY}(t, v)\) describes how probability is jointly distributed over completion times and speeds. The marginal PDF \(f_X(t)\) describes the distribution of the completion time alone, regardless of the rat’s speed.
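As a sketch of the continuous case, the code below assumes a made-up bivariate normal joint density over (time, speed). The parameters are arbitrary, and a normal can produce negative values, so this is only a toy illustration of the mechanics; the marginal \(f_X(t)\) is approximated by numerically integrating out \(v\):

```python
import numpy as np

# Made-up joint density: (time, speed) treated as bivariate normal.
mu = np.array([3.0, 0.5])            # toy means of (t, v)
cov = np.array([[0.25, -0.05],
                [-0.05, 0.04]])      # negative covariance: faster rats finish sooner
cov_inv = np.linalg.inv(cov)
norm_const = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

def f_XY(t, v):
    d = np.array([t, v]) - mu
    return norm_const * np.exp(-0.5 * d @ cov_inv @ d)

# Marginalize numerically: f_X(t) = ∫ f_XY(t, v) dv, approximated on a grid.
v_grid = np.linspace(-1.0, 2.0, 2001)
dv = v_grid[1] - v_grid[0]

def f_X(t):
    return np.sum([f_XY(t, v) for v in v_grid]) * dv

print(f_X(3.0))   # marginal density of the completion time at t = 3.0
```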

Conditional Distributions

Conditional distributions seek to answer the question, what is the probability distribution over \(Y\), when we know that \(X\) must take on a certain value \(x\)? In the discrete case, the conditional probability mass function of \(Y\) given \(X\) is simply:

\[p_{Y|X}(y|x) = \frac{p_{XY}(x, y)}{p_X(x)}\]

assuming that \(p_X(x) ≠ 0\).

In the continuous case, the situation is technically a little more complicated because the probability that a continuous random variable \(X\) takes on a specific value \(x\) is equal to zero. Ignoring this technical point, we simply define, by analogy to the discrete case, the conditional probability density of \(Y\) given \(X = x\) to be:

\[f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}\]

provided \(f_X(x) ≠ 0\).

Example

Consider a rat navigating through a maze. Let \(X\) be the time it takes for the rat to reach the end of the maze, and let \(Y\) be the rat’s average speed. The conditional PDF \(f_{Y|X}(y|x)\) describes the distribution of the rat’s speed given that it reaches the end of the maze at time \(x\).
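The divide-by-the-marginal recipe is easiest to check numerically in the discrete case. The sketch below uses the same kind of made-up joint PMF as above and computes the conditional PMF of \(Y\) given one value of \(X\); the continuous case replaces sums with integrals:

```python
import numpy as np

# A made-up joint PMF over (turns k, retraces l), purely for illustration.
p_XY = np.array([
    [0.10, 0.05, 0.05],
    [0.20, 0.15, 0.05],
    [0.15, 0.15, 0.10],
])
p_X = p_XY.sum(axis=1)                 # marginal of X (rows)

# Conditional PMF of Y given X = x: divide the joint by the marginal,
# which is only defined where p_X(x) > 0.
k = 1                                  # condition on the second value of X
p_Y_given_X = p_XY[k, :] / p_X[k]

print(p_Y_given_X)                     # a proper PMF over Y: nonnegative...
print(p_Y_given_X.sum())               # ...and sums to 1
```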

Independence

Two random variables \(X\) and \(Y\) are independent if \(F_{XY}(x, y) = F_X(x)F_Y(y)\) for all values of \(x\) and \(y\). Equivalently,

  • For discrete random variables, \(p_{XY}(x, y) = p_X(x)p_Y(y)\) for all \(x ∈ Val(X)\), \(y ∈ Val(Y)\).
  • For discrete random variables, \(p_{Y|X}(y|x) = p_Y(y)\) whenever \(p_X(x) ≠ 0\) for all \(y ∈ Val(Y)\).
  • For continuous random variables, \(f_{XY}(x, y) = f_X(x)f_Y(y)\) for all \(x, y ∈ R\).
  • For continuous random variables, \(f_{Y|X}(y|x) = f_Y(y)\) whenever \(f_X(x) ≠ 0\) for all \(y ∈ R\).

Independent random variables often arise in machine learning algorithms where we assume that the training examples belonging to the training set represent independent samples from some unknown probability distribution. To make the significance of independence clear, consider a “bad” training set in which we first sample a single training example \((x^{(1)}, y^{(1)})\) from some unknown distribution, and then add \(m - 1\) copies of the exact same training example to the training set. In this case, we have (with some abuse of notation)

\[P((x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})) ≠ \prod_{i=1}^{m} P(x^{(i)}, y^{(i)}).\]

Despite the fact that the training set has size \(m\), the examples are not independent! While clearly the procedure described here is not a sensible method for building a training set for a machine learning algorithm, it turns out that in practice, non-independence of samples does come up often, and it has the effect of reducing the “effective size” of the training set.
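For discrete random variables, independence can be checked directly by testing whether the joint PMF factorizes into the product of its marginals. The joint tables below are made up purely for illustration:

```python
import numpy as np

def is_independent(p_XY, tol=1e-9):
    """Check whether a joint PMF factorizes as the product of its marginals."""
    p_X = p_XY.sum(axis=1, keepdims=True)
    p_Y = p_XY.sum(axis=0, keepdims=True)
    return np.allclose(p_XY, p_X * p_Y, atol=tol)

# A joint built as an outer product of marginals is independent by construction.
p_X = np.array([0.2, 0.5, 0.3])
p_Y = np.array([0.6, 0.4])
print(is_independent(np.outer(p_X, p_Y)))   # True

# A made-up joint with the same marginals but probability mass shifted
# toward the "diagonal" is not independent.
p_dep = np.array([[0.20, 0.00],
                  [0.25, 0.25],
                  [0.15, 0.15]])
print(is_independent(p_dep))                # False
```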

Bayes’s Rule

A useful formula that often arises when trying to derive an expression for the conditional probability of one variable given another is Bayes’s rule.

In the case of discrete random variables \(X\) and \(Y\),

\[P_{Y|X}(y|x) = \frac{P_{XY}(x, y)}{P_X(x)} = \frac{P_{X|Y}(x|y)P_Y(y)}{\sum_{y' ∈ Val(Y)} P_{X|Y}(x|y')P_Y(y')}.\]

If the random variables \(X\) and \(Y\) are continuous,

\[f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{f_{X|Y}(x|y)f_Y(y)}{\int_{-\infty}^{\infty} f_{X|Y}(x|y')f_Y(y')dy'}.\]

Example

In a Brain-Computer Interface (BCI) experiment, suppose \(X\) is the event that a rat successfully performs a task, and \(Y\) is the event that a specific pattern of neural activity is observed. We are interested in \(P(X|Y)\), the probability that the rat successfully performs the task given that the specific pattern of neural activity is observed. According to Bayes’s rule, we can calculate this as:

\[P(X|Y) = \frac{P(Y|X)P(X)}{P(Y|X)P(X) + P(Y|\neg X)P(\neg X)}\]

Here, \(P(Y|X)\) is the probability that the specific pattern of neural activity is observed given that the rat successfully performs the task, \(P(X)\) is the prior probability that the rat successfully performs the task, \(P(Y|\neg X)\) is the probability that the specific pattern of neural activity is observed given that the rat does not successfully perform the task, and \(P(\neg X)\) is the prior probability that the rat does not successfully perform the task.

This formula allows us to update our belief about the rat’s performance based on the observed neural activity. If the specific pattern of neural activity is highly indicative of successful task performance and the rat is generally successful at performing the task, \(P(X|Y)\) will be high. If the specific pattern of neural activity is not very indicative of successful task performance or the rat is generally not successful at performing the task, \(P(X|Y)\) will be lower.
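Here is a minimal numerical sketch of this calculation; the probabilities are made-up placeholders, and in a real experiment they would be estimated from data:

```python
# Made-up numbers purely for illustration.
P_X = 0.7             # prior probability that the rat performs the task successfully
P_Y_given_X = 0.9     # probability of observing the neural pattern given success
P_Y_given_notX = 0.2  # probability of observing the pattern given failure

# Bayes's rule: P(X | Y) = P(Y | X) P(X) / [P(Y | X) P(X) + P(Y | ¬X) P(¬X)]
numerator = P_Y_given_X * P_X
denominator = numerator + P_Y_given_notX * (1 - P_X)
P_X_given_Y = numerator / denominator

print(P_X_given_Y)   # ≈ 0.91: observing the pattern raises our belief from the prior of 0.7
```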

Conclusion

In this post, we’ve covered the basics of probability theory, using examples from the field of computational neuroethology to illustrate key concepts. We’ve seen how these concepts can be applied to understand and analyze the behavior of rats in a maze and in a BCI experiment. Understanding these concepts is crucial for designing experiments, analyzing data, and building models in computational neuroethology. I hope that this post has provided a useful introduction to these topics and has sparked your interest in further study.