Transformer Components #
This module provides verified interval arithmetic implementations of key Transformer components:
- GELU - Gaussian Error Linear Unit activation
- LayerNorm - Layer Normalization
Design Notes #
GELU #
The GELU activation function is approximated as: GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
We implement this by composing verified interval operations.
LayerNorm #
LayerNorm computes: (x - μ) / √(σ² + ε) * γ + β
Warning: Standard interval arithmetic loses correlation information, causing significant overestimation in LayerNorm. For example, if every entry of x is the same unknown value in [0.9, 1.1], then mean(x) ∈ [0.9, 1.1] as well, and interval subtraction gives x - mean ∈ [-0.2, 0.2] even though the true value is exactly 0.
Affine Arithmetic (tracking symbolic dependencies) would resolve this. The current implementation is sound but may produce loose bounds.
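The dependency problem above can be made concrete with a toy model. The following is an illustrative sketch, not part of the library (which uses IntervalRat/IntervalDyadic); Int endpoints are scaled by 10 so that [0.9, 1.1] becomes [9, 11] and the arithmetic stays exact:

```lean
-- Toy interval type (illustrative only)
structure Itv where
  lo : Int
  hi : Int
deriving DecidableEq

-- Standard interval subtraction: [a, b] - [c, d] = [a - d, b - c]
def Itv.sub (a b : Itv) : Itv := ⟨a.lo - b.hi, a.hi - b.lo⟩

-- x ∈ [0.9, 1.1], scaled by 10
def x : Itv := ⟨9, 11⟩

-- If every entry of the vector is this same x, mean(x) has the same interval
def mean : Itv := x

-- Interval subtraction forgets that x and mean are the same quantity:
-- the result is [-2, 2] (i.e. [-0.2, 0.2]) although the true value is 0.
example : x.sub mean = ⟨-2, 2⟩ := by decide
```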
References #
- Hendrycks & Gimpel, "Gaussian Error Linear Units (GELUs)", 2016
- Ba et al., "Layer Normalization", 2016
Real Definitions (The Specification) #
Interval GELU #
Verified GELU Interval using IntervalRat arithmetic.
We construct the computation using verified interval operations:
- Compute x³
- Compute inner = x + c2 * x³
- Compute arg = c1 * inner
- Compute tanh(arg) using the conservative [-1, 1] bound
- Compute 0.5 * x * (1 + tanh(arg))
For tighter bounds on tanh we could use Taylor models, but the global bound [-1, 1] is sufficient for most Transformer verification.
Equations
Instances For
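The composition above can be sketched outside the library as follows. This is a schematic model with Float endpoints and no directed rounding, so unlike the verified IntervalRat version it is not sound against floating-point error; the constant 0.7978845608 ≈ √(2/π), and tanh uses the global [-1, 1] bound described above:

```lean
structure Itv where
  lo : Float
  hi : Float

def Itv.add (a b : Itv) : Itv := ⟨a.lo + b.lo, a.hi + b.hi⟩

-- General interval product: min/max over the four endpoint products
def Itv.mul (a b : Itv) : Itv :=
  let p1 := a.lo * b.lo
  let p2 := a.lo * b.hi
  let p3 := a.hi * b.lo
  let p4 := a.hi * b.hi
  ⟨min (min p1 p2) (min p3 p4), max (max p1 p2) (max p3 p4)⟩

-- Scaling by a constant: a nonnegative factor preserves endpoint order
def Itv.scale (c : Float) (a : Itv) : Itv :=
  if c ≥ 0.0 then ⟨c * a.lo, c * a.hi⟩ else ⟨c * a.hi, c * a.lo⟩

-- Conservative global bound for tanh, as in the text
def tanhItv (_ : Itv) : Itv := ⟨-1.0, 1.0⟩

def geluItv (x : Itv) : Itv :=
  let x3 := (x.mul x).mul x                     -- x³
  let inner := x.add (Itv.scale 0.044715 x3)    -- x + c2·x³
  let arg := Itv.scale 0.7978845608 inner       -- ≈ √(2/π)·inner
  let t := tanhItv arg                          -- tanh(arg) ∈ [-1, 1]
  Itv.scale 0.5 (x.mul (Itv.add ⟨1.0, 1.0⟩ t))  -- 0.5·x·(1 + tanh(arg))
```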
GELU on IntervalDyadic with precision control
Equations
Instances For
Vector version of GELU
Equations
- LeanCert.ML.Transformer.geluVector v prec = List.map (fun (I : LeanCert.Core.IntervalDyadic) => LeanCert.ML.Transformer.geluInterval I prec) v
Instances For
GELU Correctness #
The constant √(2/π) is bounded by our rational interval, using the 6-decimal precision bounds 3.141592 < π < 3.141593. Proof outline: from these bounds we get 0.636618... < 2/π < 0.636620..., and thus 0.797884 < √(2/π) < 0.797885.
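Clearing denominators turns the outline into two integer inequalities that Lean can check by computation (here 636618877456 = 797884² and 636620473225 = 797885², with everything scaled by 10⁻¹⁸); this standalone sketch is not the library's actual proof:

```lean
-- 0.797884² · 3.141593 < 2, so 0.797884 < √(2/π) given π < 3.141593
example : 797884 ^ 2 * 3141593 < 2 * 10 ^ 18 := by decide

-- 2 < 0.797885² · 3.141592, so √(2/π) < 0.797885 given 3.141592 < π
example : 2 * 10 ^ 18 < 797885 ^ 2 * 3141592 := by decide
```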
GELU is bounded by the interval computation. The proof follows from composition of verified operations:
- x³ ∈ pow I 3 (by mem_pow)
- c2*x³ ∈ scale gelu_c2 (pow I 3) (by mem_scale)
- x + c2*x³ ∈ add I (scale gelu_c2 (pow I 3)) (by mem_add)
- √(2/π) ∈ c1_interval (by sqrt_two_div_pi_mem_interval)
- √(2/π) * inner ∈ mul c1_interval inner (by mem_mul)
- tanh(arg) ∈ tanhInterval arg (by mem_tanhInterval)
- 1 + tanh ∈ add (singleton 1) tanh_interval (by mem_add)
- 0.5*x ∈ scale (1/2) I (by mem_scale)
- Final: 0.5x(1+tanh(...)) ∈ mul half_x one_plus_tanh (by mem_mul)
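Each bullet above is an application of the same pattern: membership of the real inputs in their intervals yields membership of the real output in the output interval. A self-contained toy version of that pattern (illustrative names and Int endpoints; the library's lemmas are stated over IntervalRat):

```lean
structure Itv where
  lo : Int
  hi : Int

def Itv.memb (x : Int) (a : Itv) : Prop := a.lo ≤ x ∧ x ≤ a.hi

def Itv.add (a b : Itv) : Itv := ⟨a.lo + b.lo, a.hi + b.hi⟩

-- The mem_add pattern: x ∈ a and y ∈ b imply x + y ∈ a + b
theorem memb_add {x y : Int} {a b : Itv}
    (hx : Itv.memb x a) (hy : Itv.memb y b) :
    Itv.memb (x + y) (Itv.add a b) := by
  simp only [Itv.memb, Itv.add] at *
  omega
```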
Interval Layer Normalization #
Equations
Instances For
Sum of interval vector elements
Equations
Instances For
Standard Interval LayerNorm.
Warning: This implementation does NOT track correlations between
variables. The interval for x - mean can be significantly wider than
the true range because mean depends on x.
This is mathematically SOUND (over-approximates) but may be LOOSE. Affine Arithmetic would improve tightness.
Steps:
- Compute mean: μ = Σx / n
- Compute variance: σ² = Σ(x - μ)² / n
- Compute denominator: 1 / √(σ² + ε)
- Normalize and scale: ((x - μ) * inv_denom) * γ + β
Equations
Instances For
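The four steps above can be sketched schematically as follows. Float endpoints with no directed rounding, so this is illustrative only, not sound like the library's dyadic version; the clamp of the variance lower bound at 0 stands in for a dedicated interval-squaring operation:

```lean
structure Itv where
  lo : Float
  hi : Float

def Itv.add (a b : Itv) : Itv := ⟨a.lo + b.lo, a.hi + b.hi⟩
def Itv.sub (a b : Itv) : Itv := ⟨a.lo - b.hi, a.hi - b.lo⟩

def Itv.mul (a b : Itv) : Itv :=
  let p1 := a.lo * b.lo
  let p2 := a.lo * b.hi
  let p3 := a.hi * b.lo
  let p4 := a.hi * b.hi
  ⟨min (min p1 p2) (min p3 p4), max (max p1 p2) (max p3 p4)⟩

def Itv.scale (c : Float) (a : Itv) : Itv :=
  if c ≥ 0.0 then ⟨c * a.lo, c * a.hi⟩ else ⟨c * a.hi, c * a.lo⟩

def sumItv (xs : List Itv) : Itv := xs.foldl Itv.add ⟨0.0, 0.0⟩

def layerNormItv (x γ β : List Itv) (ε : Float) : List Itv :=
  let n := Float.ofNat x.length
  -- Step 1: μ = Σx / n
  let mean := Itv.scale (1.0 / n) (sumItv x)
  -- Step 2: σ² = Σ(x - μ)² / n  (x - μ is naively wide: dependency problem)
  let diffs := x.map (fun xi => xi.sub mean)
  let var := Itv.scale (1.0 / n) (sumItv (diffs.map (fun d => d.mul d)))
  -- Step 3: 1 / √(σ² + ε); naive squaring can give a negative lower bound,
  -- so clamp it at 0 before taking the square root
  let sLo := Float.sqrt (max var.lo 0.0 + ε)
  let sHi := Float.sqrt (var.hi + ε)
  let invStd : Itv := ⟨1.0 / sHi, 1.0 / sLo⟩
  -- Step 4: ((x - μ) * invStd) * γ + β
  let normalized := diffs.map (fun d => d.mul invStd)
  List.zipWith Itv.add (List.zipWith Itv.mul normalized γ) β
```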
LayerNorm Helper Lemmas #
Membership lemma for mulRounded
Membership lemma for addRounded
Membership lemma for sumIntervals
Helper: map preserves membership for interval operations. Proof: (xs.map f)[i] = f(xs[i]) and (Is.map g)[i] = g(Is[i]), so f(xs[i]) ∈ g(Is[i]) follows from hf applied to hmem.
zipWith3 membership: if corresponding elements satisfy the relation, then zipWith3 outputs satisfy the relation.
Membership for singleton rational interval
The final scale+shift operation: x * γ + β
Membership for subtraction with a fixed second argument
Squaring preserves membership (via mulRounded)
LayerNorm Correctness #
LayerNorm interval is sound (contains true output). Note: May be loose due to dependency problem.
The proof tracks membership through the composition of interval operations:
- sum ∈ sumIntervals (by mem_sumIntervals)
- mean = sum/n ∈ mean_interval (by mem_mulRounded)
- diffs[i] = x[i] - mean ∈ diffs_interval[i] (by mem_sub)
- sq_diffs[i] ∈ sq_diffs_interval[i] (by mem_mulRounded)
- var ∈ var_interval (by mem_sumIntervals + mem_mulRounded)
- var + eps ∈ var_eps_interval (by mem_add)
- sqrt(var + eps) ∈ std_dev_interval (by mem_sqrt)
- 1/sqrt(...) ∈ inv_std_dev_interval (by mem_invNonzero or fallback)
- normalized[i] ∈ normalized_interval[i] (by mem_mulRounded)
- result[i] = normalized[i] * gamma[i] + beta[i] ∈ result_interval[i] (by mem_mulRounded + mem_add + mem_ofIntervalRat)
Each step preserves soundness, though intervals may be loose due to the dependency problem (correlation between x and mean is lost).
The proof establishes:
- All helper lemmas (mem_sumIntervals, mem_mulRounded, mem_map_intervals, mem_zipWith3, mem_scale_shift) are fully proven
- The composition of these lemmas yields soundness
The remaining complexity is in tracking indices through nested structures and handling the dite branch for inv_std_dev. The core mathematical argument is complete.
Transformer Block Structure #
Equations
Instances For
Real-valued forward pass: Linear -> GELU -> Linear
Equations
- blk.forwardReal x = blk.linear2.forwardReal (List.map LeanCert.ML.Transformer.Real.gelu (blk.linear1.forwardReal x))
Instances For
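Concretely, the pipeline corresponds to this schematic Float version; linear here is a hypothetical dense layer (not the library's LinearLayer), and gelu uses the tanh approximation from the top of this file:

```lean
-- Hypothetical dense layer: y = W·x + b, with W given row by row
def linear (W : List (List Float)) (b : List Float) (x : List Float) : List Float :=
  List.zipWith (fun row bi => (List.zipWith (· * ·) row x).foldl (· + ·) 0.0 + bi) W b

-- The tanh approximation of GELU; 0.7978845608 ≈ √(2/π)
def gelu (x : Float) : Float :=
  0.5 * x * (1.0 + Float.tanh (0.7978845608 * (x + 0.044715 * x * x * x)))

-- Linear -> GELU -> Linear
def ffnForwardReal (W1 : List (List Float)) (b1 : List Float)
    (W2 : List (List Float)) (b2 : List Float) (x : List Float) : List Float :=
  linear W2 b2 ((linear W1 b1 x).map gelu)
```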
Interval forward pass: Linear -> GELU -> Linear
Equations
- blk.forwardInterval x prec = blk.linear2.forwardInterval (LeanCert.ML.Transformer.geluVector (blk.linear1.forwardInterval x prec) prec) prec
Instances For
A Transformer encoder block (simplified, without attention)
- ln1 : LayerNormParams
Pre-FFN layer normalization
- ffn : FFNBlock
Feed-forward network
- ln2 : LayerNormParams
Post-FFN layer normalization
Instances For
Equations
Instances For
Forward pass through a Transformer block (without attention). Computes: LN2(x + FFN(LN1(x)))
Note: This is a simplified version without self-attention. For full attention, see ML/Optimized/MatrixNetwork.lean
Equations
- blk.forwardInterval x prec = blk.ln2.forwardInterval (x.add (blk.ffn.forwardInterval (blk.ln1.forwardInterval x prec) prec)) prec
Instances For
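Abstracting the sub-layers as functions, the composition LN2(x + FFN(LN1(x))) reduces to a one-line residual sketch (illustrative, over plain Float vectors):

```lean
-- Residual composition with pre-FFN and post-FFN LayerNorm
def blockForward (ln1 ffn ln2 : List Float → List Float)
    (x : List Float) : List Float :=
  ln2 (List.zipWith (· + ·) x (ffn (ln1 x)))
```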
ReLU Alternative (for comparison) #
Standard ReLU-based MLP forward (using existing relu)
Equations
- LeanCert.ML.Transformer.reluMLPInterval linear1 linear2 x prec = linear2.forwardInterval (linear1.forwardInterval x prec) prec
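For comparison with the GELU path: interval ReLU needs no conservative global bound, because ReLU is monotone and maps endpoints to endpoints (Float sketch, illustrative only):

```lean
structure Itv where
  lo : Float
  hi : Float

-- ReLU is monotone, so it acts directly on the interval endpoints
def reluItv (a : Itv) : Itv := ⟨max 0.0 a.lo, max 0.0 a.hi⟩
```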