What makes optimization special?

(and why should I care?)


Data Philly, Feb. 2024
Dante Gates

About me

Husband and Father

Philly data scientist

wrote a haiku once

⚠️ Disclaimer

Background

As seen on machine learning

As Seen on Machine Learning

  • Gradient methods: SGD, Adam, …
  • Recursive algorithms: decision trees, K-Means, …
  • Loss functions included: MSE, log-loss, …
  • Cross validation, grid search, …

As seen on machine learning

Objective function

\begin{align*} \underset{\theta}{\text{argmin}}\ \mathcal{L}(y,f(\theta)) \end{align*}

As seen on machine learning

Penalties (constraints, sort of)

\begin{align*} \underset{\theta}{\text{argmin}}\ &\mathcal{L}(y,f(\theta))\color{blue}{+\sum_{i=1}^{l}{g_{i}(\theta, \lambda_{i})}}\\ \text{where}&\\ &\color{blue}{g_{i}(\theta,\lambda_{i})>0,\ i=1\ldots l} \end{align*}

As seen on machine learning

Continuous inputs, continuous outputs

\begin{align*} \underset{\theta}{\text{argmin}}\ &\mathcal{L}(y,f(\theta))+\sum_{i=1}^{l}{g_{i}(\theta, \lambda_{i})}\\ \text{where}&\\ &g_{i}(\theta,\lambda_{i})>0,\ i=1\ldots l \\ &\color{blue}{f(\theta): \mathbb{R}^{n}\to \mathbb{R}^{m}} \end{align*}
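For concreteness, a minimal sketch (not from the talk) of this form: a least-squares loss with an L2 penalty, minimized over real-valued parameters with scipy; the data and the penalty weight here are made up.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
lam = 0.1  # penalty weight λ, chosen arbitrarily

def loss(theta):
    residuals = y - X @ theta                            # L(y, f(θ)) with f(θ) = Xθ
    return residuals @ residuals + lam * theta @ theta   # loss + L2 penalty g(θ, λ)

theta_hat = minimize(loss, x0=np.zeros(3), method='BFGS').x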

What’s so special about optimization?

Objective no longer explicitly references a target variable

\begin{align*} \underset{\theta}{\text{argmin}}\ \color{blue}{f(\theta, X)} \end{align*}

What’s so special about optimization?

Ability to impose arbitrary (sort of) constraints

\begin{align*} \underset{\theta}{\text{argmin}}\ &f(\theta, X) \\ \text{s.t.}&\\ &\color{blue}{g_{i}(\theta, X) < C_{i},\ i=1\ldots n} \end{align*}

What’s so special about optimization?

Model parameters no longer restricted to the reals

\begin{align*} \underset{\theta}{\text{argmin}}\ &f(\theta, X) \\ \text{s.t.}&\\ &g_{i}(\theta, X) < C_{i},\ i=1\ldots n \\ &\color{blue}{f(\theta, X): \{\mathbb{R},\mathbb{Z},\ldots\}^{n}\to \mathbb{R}\ \text{(Usually)}} \end{align*}

Classic example: traveling salesman

\underset{x_{i,j}}{\text{argmin}}\ \sum_{i=1}^{n}{\sum_{j=1,i\ne j}^{n}{c_{i,j}x_{i,j}}} Minimize distance traveled
x_{i,j}\in \{0, 1\} Decision variable represents path assignments
\sum_{i=1,i\ne j}^{n}{x_{i,j}}=1,\ \forall j All cities have exactly one incoming path
\sum_{j=1,i\ne j}^{n}{x_{i,j}}=1,\ \forall i All cities have exactly one outgoing path
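As a rough sketch (not from the talk), the assignment part of this formulation translates almost line for line into Pyomo; here c is assumed to be an (n, n) distance matrix, and the subtour-elimination constraints a full TSP model needs are omitted.

import pyomo.environ as pyo

n = len(c)
m = pyo.ConcreteModel()
m.x = pyo.Var(range(n), range(n), domain=pyo.Binary)

# minimize total distance traveled
m.obj = pyo.Objective(
    expr=sum(c[i][j] * m.x[i, j] for i in range(n) for j in range(n) if i != j)
)

# every city has exactly one incoming and one outgoing path
m.incoming = pyo.ConstraintList()
m.outgoing = pyo.ConstraintList()
for j in range(n):
    m.incoming.add(sum(m.x[i, j] for i in range(n) if i != j) == 1)
for i in range(n):
    m.outgoing.add(sum(m.x[i, j] for j in range(n) if j != i) == 1)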

What makes optimization special?


“ML optimization” \begin{align*} \underset{\theta}{\text{argmin}}\ &\mathcal{L}(y,f(\theta))+\sum_{i=1}^{l}{g_{i}(\theta, \lambda_{i})}\\ \text{where}&\\ &g_{i}(\theta,\lambda_{i})>0,\ i=1\ldots l \\ &f(\theta): \mathbb{R}^{n}\to \mathbb{R}^{m} \end{align*}

Mathematical optimization

\begin{align*} \underset{\theta}{\text{argmin}}\ &f(\theta, X) \\ \text{s.t.}&\\ &g_{i}(\theta, X) < C_{i},\ i=1\ldots n \\ &f(\theta, X): \{\mathbb{R},\mathbb{Z},\ldots\}^{n}\to \mathbb{R} \end{align*}


Mathematical optimization is not special

Three examples

(or, the truth, a half-truth and nothing but a lie, but not necessarily in that order)

Bro, do you even read?

x: 14 pages contain underlining

t_{x}: Last underline appears on page 84

T: 324 pages total

What was the final page read?

Conceptually

x: 14 pages contain underlining

t_{x}: Last underline appears on page 84

T: 324 pages total

Is this an optimization problem?


Objective: ??

Decision variable: Rate of underlining (\lambda), final page read (\tau)

Constraints: Parameters are positive

Is this an optimization problem?


Objective: Maximize likelihood of observed data (number of underlines and final page with an underline)

Decision variable: Rate of underlining (\lambda), final page read (\tau)

Constraints: Parameters are positive

\begin{equation}\tag{7} \text{LL}(r,\alpha,a,b)=\sum_{i=1}^{n}{\ln \left[\text{L}(r,\alpha,a,b\vert X_{i}=x_{i},t_{x_{i}},T_{i})\right]} \end{equation}

Maximum likelihood estimates of the model parameters (r, α, a, b) are obtained by maximizing the log-likelihood function given in (7) above.

Excel?

This is very easy to code in Excel—see Figure 1 for complete details.

Or not

from autograd import value_and_grad
from scipy.optimize import minimize

def _fit(self, ...):
    ...
    # scipy does the heavy lifting: minimize the negative log-likelihood,
    # with autograd supplying the gradient (jac=True says the callable
    # returns both the value and its gradient)
    output = minimize(
        value_and_grad(self._negative_log_likelihood),
        jac=True,
        method=None,
        tol=tol,
        x0=current_init_params,
        args=minimizing_function_args,
        options=minimize_options,
        bounds=bounds,
    )

sklearn

Pause

Immaculate Grid

Is this an optimization problem?


Objective: Maximize number of boxes filled

Decision variable: Which players go in which boxes

Constraints: Players used once, one player per box, player must satisfy criteria

Jamie Moyer

     PHI  SEA  STL
HOU    0    0    0
COL    1    1    1
KC     0    0    0

Bookkeeping

# player: n
# box: i, j
# s[n, i, j] == 1 when player n is eligible for box (i, j)
>>> s[n, i, j]
1
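One way the eligibility tensor might be built; a sketch where players, n_rows, n_cols, and satisfies are hypothetical stand-ins, not from the talk.

import numpy as np

# s[n, i, j] == 1 iff player n satisfies the criteria for box (i, j)
s = np.zeros((len(players), n_rows, n_cols), dtype=int)
for n, player in enumerate(players):
    for i in range(n_rows):
        for j in range(n_cols):
            s[n, i, j] = int(satisfies(player, i, j))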

Decision variable

import numpy as np
import pyomo.environ as pyo

m = pyo.ConcreteModel()

x = []
for (n, i, j), _ in np.ndenumerate(s):
    name = f'x{n}_{i},{j}'
    setattr(m, name, pyo.Var(domain=pyo.Binary))
    x.append(getattr(m, name))
x = np.array(x).reshape(s.shape)

Each variable assigns player n to box (i, j): x\in\{0,1\}^{n,i,j}

Objective function


def objective(model):
    # maximize the number of filled boxes by minimizing the negative sum
    return -x.sum()

m.objective = pyo.Objective(rule=objective)

Maximize the number of boxes assigned to a player: \underset{x}{\text{argmin}}\ -\sum_{n=1}^{N}{\sum_{i=1}^{9}{\sum_{j=1}^{9}{x_{n,i,j}}}}

Constraints

m.c1 = pyo.ConstraintList()   # each player assigned to at most one box
for c in x.sum(axis=(1, 2)):
    m.c1.add(c <= 1)


m.c2 = pyo.ConstraintList()   # player must satisfy the box criteria
for (n, i, j), _ in np.ndenumerate(s):
    m.c2.add(x[n, i, j] <= s[n, i, j])


m.c3 = pyo.ConstraintList()   # each box assigned at most one player
for c in x.sum(axis=0).ravel():
    m.c3.add(c <= 1)

Each player assigned to at most one box: \sum_{i=1}^{9}{\sum_{j=1}^{9}{x_{n,i,j}}}\leq 1,\ \forall n

Player must satisfy criteria: x_{n,i,j}\le s_{n,i,j}

Each box assigned at most one player: \sum_{n=1}^{N}{x_{n,i,j}}\leq 1,\ \forall i,j
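From here, a minimal sketch (not shown on the slides) of solving the model and reading off the assignments, assuming a MILP solver such as CBC is installed:

pyo.SolverFactory('cbc').solve(m)

filled = [
    (n, i, j)
    for (n, i, j), var in np.ndenumerate(x)
    if pyo.value(var) > 0.5
]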


Pause

Cryptograms

Is this an optimization problem?


Objective: The most matches?

Decision variable: The cipher

Constraints: Cipher is a one-to-one mapping

Matching what?

Scale

A different perspective

Hello, old friend

P(w^{\prime}\vert \pi)=\sum_{w\in W}^{}{P(w^{\prime}\vert \pi, w)P(w)}

Hello, old friend

\begin{align*} P(w^{\prime}\vert \pi)&=\sum_{w\in W}^{}{P(w^{\prime}\vert \pi, w)P(w)} \\ &=\sum_{w\in W}^{}{w^{\prime}\pi^{T}w P(w)} \end{align*}

Permutation matrix
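To make the idea concrete, a toy example (not from the talk): with a 3-letter alphabet {a, b, c}, a permutation matrix simply relabels one-hot encoded letters.

import numpy as np

# the cipher a→c, b→a, c→b as a permutation matrix (rows: plaintext, columns: ciphertext)
π = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])
cab = np.eye(3)[[2, 0, 1]]   # one-hot rows spelling "cab"
print(cab @ π)               # one-hot rows spelling "bca", the enciphered word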

Computation

import torch

ϵ = 1e-8  # small constant to keep the log finite (value assumed)

def negative_log_likelihood(π, w_prime, W, p_w, mask):
    """
    π: (n, n) permutation matrix
    w_prime: (m, L, n) vectorized text
    W: (D, L, n) vectorized dictionary
    p_w: (D,) dictionary log likelihood
    mask: (m, D) boolean mask indicating which values in dictionary
          could not possibly match w'
    """
    Xt = w_prime @ π  # apply the cipher to the vectorized text
    # score each deciphered word against every dictionary word -> (m, D)
    matches = torch.tensordot(torch.log(Xt + ϵ), W, dims=[[1, 2], [1, 2]])
    return (matches + p_w[None, :]) * mask

P(w^{\prime}\vert \pi) = \sum_{w\in W}^{}{w^{\prime}\pi^{T}w P(w)}

It liiiiiives

Now what?


from scipy.optimize import linear_sum_assignment

# round the optimized score matrix π_max to the best one-to-one letter mapping
cipher = {
    ALPHABET[i]: ALPHABET[j]
    for i, j in zip(*linear_sum_assignment(π_max, maximize=True))
}
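Usage sketch (ciphertext here is a hypothetical string holding the encrypted text): apply the recovered mapping character by character.

decoded = ''.join(cipher.get(c, c) for c in ciphertext)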

Does it work?

‘THE GREAT SHILS HUNG POTIONMESS IN THE AIR, OVER EVERY NATION ON EARTH.’

Does it work?

THE GREAT SHIPS HUNG MOTIONLESS IN THE AIR, OVER EVERY NATION ON EARTH. MOTIONLESS THEY HUNG, HUGE, HEAVY, STEADY IN THE SKY, A BLASPHEMY AGAINST NATURE. MANY PEOPLE WENT STRAIGHT INTO SHOCK AS THEIR MINDS TRIED TO ENCOMPASS WHAT THEY WERE LOOKING AT. THE SHIPS HUNG IN THE SKY IN MUCH THE SAME WAY THAT BRICKS DON’T.

Motivating Examples From Industry

Health Care

AdTech

Takeaway

Optimization is like this picture of icicles

“ML Optimization” is like my thumb

Or maybe auto focus