Data Philly, Feb. 2024
Dante Gates
Goal: When you think of Bayesian Inference, think beyond point estimates and confidence intervals.
Non-goals: How to …
In fact… If you wish I demonstrated how to fit Bayesian models that’s a success
Bayesian methods don’t do the same things better; they do different things, which are [sometimes] better
P(A\vert B) = \frac{P(B\vert A)P(A)}{P(B)}
P(\theta\vert D) = \frac{P(D\vert \theta)P(\theta)}{P(D)}
\underbrace{P(\theta\vert D)}_{Posterior} = \frac{\overbrace{P(D\vert \theta)}^{likelihood}\ \overbrace{P(\theta)}^{priors}}{\underbrace{P(D)}_{marginal\ likelihood}}
{\bf \underbrace{P(\theta\vert D)}_{Posterior}} = \frac{\overbrace{P(D\vert \theta)}^{likelihood}\ \overbrace{P(\theta)}^{priors}}{\underbrace{P(D)}_{marginal\ likelihood}}
Alice has scored 5 times, Bob has scored 3, what was the position of the first roll?
\underbrace{P(\theta\vert D)}_{Posterior} = \frac{\overbrace{P(D\vert \theta)}^{likelihood}\ \overbrace{P(\theta)}^{priors}}{\underbrace{P(D)}_{marginal\ likelihood}}
\underbrace{P(\theta\vert D)}_{Posterior} = \frac{\overbrace{P(D\vert \theta)}^{likelihood}\ \overbrace{P(\theta)}^{priors}}{\underbrace{P(D)}_{marginal\ likelihood}}
\underbrace{P(x\vert y,n)}_{Posterior} = \frac{\overbrace{P(y,n\vert x)}^{likelihood}\ \overbrace{P(x)}^{priors}}{\underbrace{P(y,n)}_{marginal\ likelihood}}
👋
Posterior means should be correct on average, 50% intervals should contain the true values half the time, and so forth.
If we play this game many times the posterior mean produces reliable estimates of the position of the first roll
The 90% HDI of our posterior contains x_{i} ~90% of the time as expected
We can manipulate the posterior as a random variable, e.g. we can consider expressions such as
Treating the posterior as a random variable also justifies the ability to generate predictions from posterior
\overbrace{p(\hat{y}\vert y)}^{posterior\ prediction}=\overbrace{p(\hat{y}\vert \theta)}^{generative\ model}\times\overbrace{p(\theta\vert y)}^{the\ posterior}
Bob must score 3 times in a row, thus p(B\vert x) = (1-x)^{3}
Due to our uncertainty of x we can frame this as
p(B\vert y, n)=\int_{0}^{1}{\overbrace{(1-x)^{3}}^{we\ want\ this}\times\underbrace{p(x\vert y,n)}_{we\ have\ this} dx}
We can solve for this value analytically, or…
p(B\vert y, n)=\int_{0}^{1}{\overbrace{(1-x)^{3}}^{we\ want\ this}\times\underbrace{p(x\vert y,n)}_{we\ have\ this} dx}
We can use numeric methods to generate samples from the posterior without explicitly solving the integrals required to analytically compute the posterior
p(B\vert y, n)=\int_{0}^{1}{\overbrace{(1-x)^{3}}^{we\ want\ this}\times\underbrace{p(x\vert y,n)}_{we\ have\ this} dx}
The intuition is to think of the integrand as a weighted average, which is approximated by the sample frequencies of x_{i}\sim p(x\vert y,n)
p(B\vert y, n)\approx\frac{1}{N}\sum_{i=1}^{N}{(1-x_{i})^{3}}
This opens up the possibility of modeling outcomes, via simulation, that are more difficult to express analytically, such as
We can simulate these outcomes with samples from the posterior
We also expect reliable long term behavior from the posterior predictive distributions. Thus, simulations of, e.g., point spread should exhibit these properties
The point is subtle: Bayesian inference does not actually require the generative model; all it needs from the data is the likelihood, and different generative models can have the same likelihood. But Bayesian data analysis requires the generative model to be able to perform predictive simulation and model checking, … and Bayesian workflow will consider a series of generative models
pymc
make our solutions look more programmatic than analyticIn a simple game of chance that has a chance p of paying fair odds, one should invest the fraction f^{*}, of their total wealth W that maximizes the expected exponential growth rate of their wealth
p\log(1+f) + (1-p)\log(1-f)
Proposed by John Kelly in 1956
Properties of the Kelly Criterion include: a global maximum, a point of diminishing returns and “greed” that produces inevitable ruin.
What if we want to apply the Kelly criterion to a “continuous gambling game”, such as the stock market?
If a game has a chance p(s) of paying out odds s, following the Kelly Criterion, our objective function is
\underset{f}{\argmax}\ \int_{}^{}{p(s)\log(1+fs)ds}
Where s could represent, e.g., the excess S&P returns.
Importantly, this approach requires we assign a probability distribution to s
Thorp took to modeling s analytically, and using numeric methods to handle the integral
p(s)=\begin{cases} h+\frac{1}{\sqrt{2\pi\alpha}}e^{\frac{-(s-\bar{\mu})^{2}}{2\bar{\alpha}^{2}}}& A<s<B\\ 0&\text{otherwise} \end{cases}
We’ll refer to this as the “empirical pdf”, because it’s based on the observed values \bar{\mu},\ \bar{\sigma}
Instead of choosing a fixed pair (\mu,\sigma), we can integrate over a range of plausible values p(s\vert r)=p(s\vert \mu,\sigma)p(\mu,\sigma\vert r)
Then, we find our investment strategy f^{*} by optimizing over samples of the posterior predictive growth rates
\argmax_{f} \frac{1}{N}\sum_{i=1}^{N}{\log(1+fs_{i})}
# 1. generate samples of `r` from the posterior
with model:
pm.sample_posterior_predictive(idata, ...)
# just an alias
p_s_r = idata.posterior_predictive.p_s_r.values.ravel()
# 2. define the space we want to optimize over and calculate
# the expected growth rate
f = np.linspace(0, -1/np.min(p_s_r), num=40)
expected_growth_rate = np.log(
1 + (p_s_r[:, None] * f[None, :])
).mean(axis=0) # <- "integrate" over posterior predictive: p(s|μ,σ)
# 3. find the value of `f` that resulted in the highest expected
# growth rate
f[np.argmax(expected_growth_rate)]
# define optimization problem: variable and objective
f = torch.rand(1, requires_grad=True)
def objective(f):
return torch.log(1 + p_s_r*f).mean()
# torch setup
optimizer = torch.optim.Adam([f])
# ↓↓↓ do this a lot ↓↓↓
for _ in range(N):
optimizer.zero_grad()
loss = -objective(f)
loss.backward()
optimizer.step()
Not only can we extract the usual summary statistics from the posterior predictive, but we can ask questions like, “how often do we expect this strategy to produce a loss after a 100 year investment period?”
Seeing the flexibility using a tool like torch
gives us, we can now imagine scenarios such as finding the optimal solution
f=\langle f_{1},\ldots,f_{n}\rangle
of fractional investments of a portfolio with returns
s\sim \langle N(\mu_{1},\sigma_{1}),\ldots,N(\mu_{n},\sigma_{n})\rangle
Granted, this introduces new complexity such as accounting for covariance amongst the assets, but we have a starting point
We could consider the probability that one “channel” will outperform another: i.e. p(c_{1} < c_{2})
Thompson sampling allows for “live” A/B testing, where action A is taken according to the probability that A > B.
These properties also hold for more complex generative models such time series forecasts with trend and seasonal components.
Bayesian marketing mix models have been gaining popularity. One of their primary uses cases is finding the optimal allocation of a budget across marketing channels.
Husband and Father
Philly data scientist
wrote a haiku once