Verified Self-Attention Mechanism #

This module implements the complete self-attention mechanism for Transformers with verified interval bounds.

Architecture #

Input X: [seq_len, d_model]
    │
    ├──► W_Q ──► Q: [seq_len, d_k]
    ├──► W_K ──► K: [seq_len, d_k]
    └──► W_V ──► V: [seq_len, d_v]
              │
              ▼
    Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
              │
              ▼
    Output: [seq_len, d_v]

Key Components #

Scaled Dot-Product Attention: softmax(QK^T / √d_k) × V
Multi-Head Attention: Split into h heads, attend, concat, project

Interval Arithmetic Strategy #

Q, K, V are interval matrices (bounds on each element)
Q × K^T uses verified matmul from Matrix.lean
Softmax uses algebraic cancellation from Softmax.lean
Final multiplication with V uses matmul again

The composition of these verified operations maintains soundness.

Scaling Factor #

source

def LeanCert.ML.Attention.invSqrtDim (d_k : ℕ) (prec : ℤ) :

Core.IntervalDyadic

Compute 1/√d_k as an interval for scaling attention scores. We use a rational approximation that bounds the true value.

Equations

One or more equations did not get rendered due to their size.

Instances For

Scaled Dot-Product Attention (Vector Form) #

source

def LeanCert.ML.Attention.attentionWeights (q : IntervalVector) (K : List IntervalVector) (d_k : ℕ) (prec : ℤ) :

IntervalVector

Attention scores for a single query against all keys.

Given:

q: query vector [d_k]
K: key matrix [seq_len, d_k] (as list of key vectors)

Returns: attention weights [seq_len] after softmax

Formula: softmax(q · k_i / √d_k for all i)

Equations

One or more equations did not get rendered due to their size.

Instances For

source

def LeanCert.ML.Attention.applyAttention (weights : IntervalVector) (V : List IntervalVector) (prec : ℤ) :

IntervalVector

Apply attention weights to values.

Given:

weights: attention weights [seq_len]
V: value matrix [seq_len, d_v] (as list of value vectors)

Returns: weighted sum of values [d_v]

Formula: Σ_i weights[i] * V[i]

Equations

One or more equations did not get rendered due to their size.

Instances For

source

def LeanCert.ML.Attention.scaledDotProductAttention (Q K V : List IntervalVector) (d_k : ℕ) (prec : ℤ) :

List IntervalVector

Single-head scaled dot-product attention.

Given:

Q: query matrix [seq_len, d_k] (as list of query vectors)
K: key matrix [seq_len, d_k] (as list of key vectors)
V: value matrix [seq_len, d_v] (as list of value vectors)

Returns: output matrix [seq_len, d_v]

Formula: softmax(Q × K^T / √d_k) × V

Equations

One or more equations did not get rendered due to their size.

Instances For

Multi-Head Attention #

source

structure LeanCert.ML.Attention.MultiHeadAttentionParams :

Type

Parameters for multi-head attention

d_model : ℕ
Model dimension
num_heads : ℕ
Number of attention heads
d_k : ℕ
Key/Query dimension per head
d_v : ℕ
Value dimension per head
W_Q : List (List (List ℚ))
Query projection weights for each head: [num_heads, d_model, d_k]
W_K : List (List (List ℚ))
Key projection weights for each head: [num_heads, d_model, d_k]
W_V : List (List (List ℚ))
Value projection weights for each head: [num_heads, d_model, d_v]
W_O : List (List ℚ)
Output projection weights: [num_heads * d_v, d_model]

Instances For

source

def LeanCert.ML.Attention.instReprMultiHeadAttentionParams.repr :

MultiHeadAttentionParams → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

Instances For

source

instance LeanCert.ML.Attention.instReprMultiHeadAttentionParams :

Repr MultiHeadAttentionParams

Equations

LeanCert.ML.Attention.instReprMultiHeadAttentionParams = { reprPrec := LeanCert.ML.Attention.instReprMultiHeadAttentionParams.repr }

source

def LeanCert.ML.Attention.linearProject (X : List IntervalVector) (W : List (List ℚ)) (prec : ℤ) :

List IntervalVector

Linear projection: X × W^T X: [seq_len, d_in] as list of vectors W: [d_out, d_in] as list of lists Returns: [seq_len, d_out]

Equations

One or more equations did not get rendered due to their size.

Instances For

source

def LeanCert.ML.Attention.singleHead (X : List IntervalVector) (W_Q W_K W_V : List (List ℚ)) (d_k : ℕ) (prec : ℤ) :

List IntervalVector

Single attention head computation

Equations

One or more equations did not get rendered due to their size.

Instances For

source

def LeanCert.ML.Attention.concatVectors (heads : List IntervalVector) :

IntervalVector

Concatenate vectors horizontally

Equations

LeanCert.ML.Attention.concatVectors heads = List.foldl (fun (x1 x2 : LeanCert.Engine.IntervalVector) => x1 ++ x2) [] heads

Instances For

source

def LeanCert.ML.Attention.multiHeadAttention (params : MultiHeadAttentionParams) (X : List IntervalVector) (prec : ℤ) :

List IntervalVector

Multi-head attention forward pass.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) × W_O where head_i = Attention(X × W_Q^i, X × W_K^i, X × W_V^i)

Equations

One or more equations did not get rendered due to their size.

Instances For

Real Specifications #

source

noncomputable def LeanCert.ML.Attention.Real.scaledDotProductAttention (Q K V : List (List ℝ)) (d_k : ℕ) :

List (List ℝ)

Real-valued scaled dot-product attention

Equations

One or more equations did not get rendered due to their size.

Instances For

Soundness #

source

theorem LeanCert.ML.Attention.mem_attentionWeights {q_real : List ℝ} {K_real : List (List ℝ)} {q : IntervalVector} {K : List IntervalVector} (_hq : q_real.length = List.length q) (hK : K_real.length = K.length) (d_k : ℕ) (prec : ℤ) :

have weights_real := List.map (fun (k : List ℝ) => (List.zipWith (fun (x1 x2 : ℝ) => x1 * x2) q_real k).sum / √↑d_k) K_real; have weights := attentionWeights q K d_k prec; weights_real.length ≤ List.length weights

Soundness of attention weights computation.

If query q is bounded by interval q_I, and keys K are bounded by K_I, then the attention weights (after softmax) are bounded by the computed intervals.

source

theorem LeanCert.ML.Attention.mem_scaledDotProductAttention {Q_real K_real V_real : List (List ℝ)} {Q K V : List IntervalVector} (hQ : Q_real.length = Q.length) (_hK : K_real.length = K.length) (_hV : V_real.length = V.length) (d_k : ℕ) (prec : ℤ) :

have output_real := Real.scaledDotProductAttention Q_real K_real V_real d_k; have output := scaledDotProductAttention Q K V d_k prec; output_real.length ≤ output.length

Main soundness theorem for scaled dot-product attention.

If Q, K, V real matrices are bounded element-wise by interval matrices Q_I, K_I, V_I, then the output of attention is bounded by scaledDotProductAttention Q_I K_I V_I.