Parameter Golf

Research Trajectory

The BPB Journey

Each loop narrows the gap — dead ends eliminated, winning techniques compounded.

Loops 1–3 — Multi-Agent Orchestrator Built

Analyzed 25 leaderboard submissions at the code level. Discovered that MTP (Multi-Token Prediction) is fully implemented but disabled (MTP_NUM_HEADS=0). Generated 32 ranked hypotheses. Built 14 reusable experiment skills.

Loop 4 — LeakyReLU Confirmed #1 Single Change

Mac Mini M4 ablation campaign. LeakyReLU(0.5)² activation confirmed as the strongest single improvement over the baseline. Architecture advantage of 11L/512d family validated empirically.

~1.47 BPB baseline

Loop 6 — Proxy Compressor Gate Established

Critical lesson: all quant probes MUST use bit_shuffle+Brotli-11, not LZMA-9 or plain Brotli. H64 Lloyd-Max embeddings and H-R4P2 both killed by this trap — MSE savings evaporated under the real pipeline.
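
A minimal sketch of a compressor-aware probe, assuming bit_shuffle means a bit-plane transpose (similar in spirit to the Bitshuffle filter); the actual layout used by the pipeline is not specified here. zlib is only a stand-in so the sketch runs on the standard library; a real probe should pass in the actual Brotli-11 call, which is the whole point of the lesson above.

```python
import zlib

def bit_shuffle(values, bits=6):
    """Group bit k of every value together (bit-plane transpose). Assumed layout."""
    out = bytearray()
    for k in range(bits):
        acc, n = 0, 0
        for v in values:
            acc = (acc << 1) | ((v >> k) & 1)
            n += 1
            if n == 8:
                out.append(acc)
                acc, n = 0, 0
        if n:
            out.append(acc << (8 - n))  # pad the last byte of this plane
    return bytes(out)

def bit_unshuffle(data, count, bits=6):
    """Inverse of bit_shuffle for `count` values."""
    plane_bytes = (count + 7) // 8
    values = [0] * count
    for k in range(bits):
        plane = data[k * plane_bytes:(k + 1) * plane_bytes]
        for i in range(count):
            bit = (plane[i // 8] >> (7 - i % 8)) & 1
            values[i] |= bit << k
    return values

def probe_size(values, compress, bits=6):
    """Byte size of the quantized payload under the given compressor."""
    return len(compress(bit_shuffle(values, bits)))

# Stand-in compressor (NOT the canonical one): swap in the real Brotli-11 call.
stand_in = lambda b: zlib.compress(b, 9)
```

Passing the compressor as an argument keeps the probe honest: the same quantized values can be scored under each candidate packer, and only the canonical one decides whether a hypothesis survives.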

Loop 8 — 1.0810 BPB Combined Solution

SDClip + Brotli-11 pipeline established as the real SOTA packaging. GPTQ quantization stack locked in. H64 Lloyd-Max definitively killed under canonical pipeline.

1.0810 BPB

L10

Loop 10 — MTP / MoE / Mamba All Killed

Rigorous elimination round. Multi-Token Prediction, Mixture-of-Experts, Mamba/SSM, and Knowledge Distillation all confirmed dead under budget constraints. Heuristic established: 1 ms of overhead ≈ 0.006 BPB.

Dead directions closed
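
The overhead heuristic can be turned into a quick break-even check. This is a sketch that reads the heuristic as "each added millisecond must buy back about 0.006 BPB" and assumes it scales linearly; both the reading and the linearity are assumptions, and the function names are hypothetical.

```python
BPB_PER_MS = 0.006  # heuristic value quoted in the text

def breakeven_bpb_gain(overhead_ms: float) -> float:
    """BPB improvement needed to pay for the added overhead (linear assumption)."""
    return BPB_PER_MS * overhead_ms

def worth_keeping(measured_bpb_gain: float, overhead_ms: float) -> bool:
    """True if the measured gain exceeds the break-even cost of the overhead."""
    return measured_bpb_gain > breakeven_bpb_gain(overhead_ms)
```

Under this reading, a technique that adds 2.5 ms would need to improve BPB by more than about 0.015 to stay in the stack.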

L11

PR #1716 / #1700 — Canonical Control Lines Established

NS0 (PR #1716): BigramHash d=32 + Int8 control tensors + LZMA code wrapper. NS1 (PR #1700): Multi-phase SGD TTT + Phased LoRA on NS0. Both retained as hygiene references and transfer surfaces — not the headline target.

NS0 1.07882 · NS1 1.07219 — controls only

L12

KKVShare Wider FLA — Locked Benchmark

Opensens reproduction of the KKVShare Wider FLA architecture. 3-seed mean under canonical scorer. This is the benchmark every new lane must beat, not NS0.

1.03385760 BPB — locked benchmark

SYS

TMA G3 — Systems Gate Passed

Fused FC + activation kernel passed on the softmax control line. No BPB claim from TMA alone. Next step: transfer this throughput gain onto the 1.03385760 benchmark family before any recurrence isolate.

415.3298 ms → 374.5369 ms (−9.82%) ✓
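
For reference, the percentage comes from the two step times as follows. The sketch also shows the equivalent throughput gain, since a reduction in step time and an increase in throughput are not the same number; the function names are illustrative.

```python
def relative_change(before_ms: float, after_ms: float) -> float:
    """Fractional change in step time (negative means faster)."""
    return (after_ms - before_ms) / before_ms

def throughput_gain(before_ms: float, after_ms: float) -> float:
    """Fractional increase in steps per unit time."""
    return before_ms / after_ms - 1.0

change = relative_change(415.3298, 374.5369)
speedup = throughput_gain(415.3298, 374.5369)
```

With the figures above, the step-time change is about −9.82%, which corresponds to roughly 10.9% more steps in the same wall-clock time.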

V7 Campaign — Active (2026-04-20)

Compression-first: audit exact-payload ANS/rANS/Brotli-11 on the 1.03386 artifact first. Port TMA G3 onto benchmark family. Only then isolate Loop45 or DeltaShare. NV-015 DeltaShare is the new V7 architecture lane. NV-002b is a bridge, not the automatic main bet.

Target: sub-1.000 BPB by 2026-04-30

Methods Explored

Techniques & Experiments

Every direction tested rigorously — winners compounded, dead ends documented.

🗜️

GPTQ + SDClip Quantization

Int6 weight quantization with adaptive sigma clipping. SDClip reduces outlier sensitivity. Stacked with bit_shuffle + Brotli-11 for canonical byte budget.

Active SOTA
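
A minimal sketch of sigma clipping before int6 quantization, assuming SDClip clips each weight row at ±k standard deviations and then quantizes uniformly. The value of k and the per-row granularity are assumptions, and this uses plain round-to-nearest rather than GPTQ's error-compensating update.

```python
import statistics

def sdclip_quantize(row, k=3.0, bits=6):
    """Clip at k*sigma, then quantize symmetrically to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6
    sigma = statistics.pstdev(row)
    clip = k * sigma if sigma > 0 else 1.0
    scale = clip / qmax                  # one quantization step
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate weights."""
    return [v * scale for v in q]
```

The trade-off is visible in the scale: a smaller k gives a finer step for the bulk of the weights at the cost of saturating outliers, which is why the threshold needs retuning whenever the weight distribution changes.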

🔄

Loop45 Recurrence (TMA)

45-step recurrent inference, paired with the TMA G3 kernel that cut step time by 9.82% (415.3298 ms → 374.5369 ms). Systems-gate approach: unlock recurrence depth only after throughput is validated.

V7 Priority

🧠

CLA / Cross-Layer KV Sharing

NV-002b: sharing KV caches across middle layers frees ~1.3M parameters for reinvestment. Non-GDN parts survive canonical scorer audit.

V7 Bridge
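
The parameter saving from sharing K/V projections can be estimated with simple arithmetic. This is a sketch only: d_model=512 comes from the 11L/512d family mentioned earlier, while the KV width and the number of sharing layers below are hypothetical values picked to land near the ~1.3M figure, not the actual NV-002b configuration.

```python
def kv_params_per_layer(d_model: int, kv_dim: int) -> int:
    """Weights in one layer's K and V projections (biases ignored)."""
    return 2 * d_model * kv_dim

def params_freed(d_model: int, kv_dim: int, layers_in_group: int) -> int:
    """One layer in the group keeps its K/V projections; the others reuse them."""
    return kv_params_per_layer(d_model, kv_dim) * (layers_in_group - 1)

# Hypothetical configuration: six middle layers sharing one K/V pair.
freed = params_freed(d_model=512, kv_dim=256, layers_in_group=6)
```

With these illustrative numbers, five layers drop their own K/V projections and about 1.31M weights become available for reinvestment elsewhere.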

📐

SpinQuant (Hadamard Rotation)

H98: Hadamard rotation before GPTQ quantization reduces quantization error. Expected −0.005 BPB. Must retune sigma to 7–9σ post-rotation.

Queued
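
A minimal sketch of why a Hadamard rotation helps before quantization: an orthonormal Walsh-Hadamard transform preserves the vector norm but spreads a single outlier across all coordinates, shrinking the dynamic range the quantizer has to cover. The function name is illustrative, and this shows only the rotation, not the GPTQ step.

```python
import math

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in y]

# One large outlier in an otherwise zero vector.
x = [0.0] * 63 + [8.0]
y = hadamard_rotate(x)
```

Here the peak magnitude drops from 8 to 1 while the norm is unchanged, and applying the same transform again recovers the original vector. Because the rotated distribution has a different shape, a clipping threshold tuned before rotation no longer fits, which is consistent with the retuning note above.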

⚡

Phased LoRA TTT

PRC-006: Test-Time Training where only LoRA parameters update at evaluation. Parameter-efficient and legal under canonical scorer rules.

NS1 Control

📦

BigramHash Embeddings

PRC-001: d=32 bigram hashing on the tightest n-gram size. ~200 KB artifact cost. Part of the canonical PR #1716 SOTA stack.

In SOTA
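
A minimal sketch of a hashed bigram embedding lookup with d=32. The hash constants, the bucket count, and the class name are hypothetical; the source does not specify how PRC-001 hashes token pairs or how many buckets it uses.

```python
def bigram_bucket(prev_token: int, token: int, num_buckets: int) -> int:
    """Map a token pair to a table row. The multipliers are arbitrary (hypothetical)."""
    h = (prev_token * 1000003) ^ (token * 998244353)
    return h % num_buckets

class BigramHashEmbedding:
    """Fixed-size table indexed by a hash of (previous token, current token)."""

    def __init__(self, num_buckets: int, dim: int = 32):
        self.num_buckets = num_buckets
        self.table = [[0.0] * dim for _ in range(num_buckets)]

    def lookup(self, prev_token: int, token: int):
        return self.table[bigram_bucket(prev_token, token, self.num_buckets)]

emb = BigramHashEmbedding(num_buckets=4096)
vec = emb.lookup(17, 42)
```

Hashing keeps the table size independent of the vocabulary squared: collisions are accepted in exchange for a fixed artifact cost of num_buckets × dim weights.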

🔗

DeltaShare (Low-Rank)

New V7 structural lane: share middle-block weights with tiny low-rank delta corrections. Inspired by DeltaLLM. Avoids all-or-nothing hard sharing.

V7 New Lane
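
The idea can be sketched as one shared matrix plus a per-layer low-rank correction, y = W x + U (V x). The shapes, rank, and function names below are illustrative assumptions, not the NV-015 configuration.

```python
def matvec(m, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def deltashare_apply(shared, u, v, x):
    """Shared weight plus a rank-r delta: shared @ x + u @ (v @ x)."""
    base = matvec(shared, x)
    delta = matvec(u, matvec(v, x))
    return [b + d for b, d in zip(base, delta)]

def unique_params(d_in, d_out, n_layers):
    """Every layer owns a full matrix."""
    return d_in * d_out * n_layers

def deltashare_params(d_in, d_out, n_layers, rank):
    """One shared matrix plus a (U, V) pair per layer."""
    return d_in * d_out + n_layers * rank * (d_in + d_out)
```

For example, four 512×512 layers cost 1,048,576 weights when unique, versus 294,912 with one shared matrix and rank-8 deltas. The rank is the dial between hard sharing (rank 0) and fully unique layers.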

❌

GDN / FLA (Non-Canonical)

H85 showed 1.034 BPB, but only because of a buggy scorer with a non-canonical LUT. Canonical 3-seed mean: 1.223 BPB. Entire GDN lane killed. The failure is documented so future teams avoid the same trap.

Killed

❌

Per-Loop LoRA (H87)

All 4 gates failed empirically. Artifact busts 16 MB budget by 50 KB. Structurally incompatible with GPTQ pipeline. Confirmed killed 2026-04-18.

Killed

V7 Execution Plan

Next PR Path

The cleanest "combine with TMA G3, then PR" sequence from V7 planning docs.

0

Exact-payload compression audit on 1.03385760 family

Compare Brotli-11 vs rANS vs block-adaptive Huffman on the current quantized artifact. Measure true byte headroom before spending architectural budget. Must include code overhead — lossless wins are free budget.
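
A minimal sketch of the accounting this audit needs, under two assumptions: that a zeroth-order byte entropy is a useful reference point for a static, context-free entropy coder (context-modeling compressors such as Brotli can go below it), and that any custom decoder ships inside the artifact and therefore counts against the budget. Function names are hypothetical.

```python
import math
from collections import Counter

def entropy_floor_bytes(payload: bytes) -> float:
    """Zeroth-order entropy of the byte stream, expressed in bytes."""
    n = len(payload)
    if n == 0:
        return 0.0
    counts = Counter(payload)
    bits = -sum(c * math.log2(c / n) for c in counts.values())
    return bits / 8

def net_saving(baseline_bytes: int, candidate_payload_bytes: int,
               decoder_code_bytes: int) -> int:
    """Bytes actually freed once the candidate's decoder code is charged."""
    return baseline_bytes - (candidate_payload_bytes + decoder_code_bytes)
```

A candidate that shrinks the payload by 100 bytes but needs 150 bytes of decoder code is a net loss of 50 bytes, which is why the code overhead has to be part of the comparison.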

1a

Port TMA G3 onto the benchmark family

TMA G3 already passed on the softmax control (415.3 ms → 374.5 ms, −9.82%). Test the fused FC + activation path on the 1.03385760 family. No BPB claims from TMA alone — goal is confirming the benchmark family can absorb the same throughput gain.

1b

Loop45 isolated on the TMA-safe path (only after 1a passes)

Exact parent stack + deeper recurrence only. No CaseOps, no pre-quant TTT, no extra packaging changes beyond already-trusted Phase 0 compressor. Benchmark family first; softmax control as sanity cross-check.

2

NV-015 DeltaShare — new V7 architecture lane

Cross-layer weight sharing in middle blocks + tiny low-rank delta corrections to restore flexibility. Edge and specialized blocks stay unique. Runs in parallel with Loop45 isolate as an independent architecture bet.

PR

Submit — beating 1.03385760 under canonical scorer

3-seed mean required. Must clear the canonical scorer with bit_shuffle + Brotli-11 packaging. First winner from Phase 0+1 establishes the new benchmark; subsequent wins compound one validated change at a time.
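
The submission rule can be written as a small gate. This sketch assumes the 16 MB budget mentioned for H87 applies here and reads it as decimal megabytes; both points are assumptions, and the function name is hypothetical.

```python
from statistics import mean

BENCHMARK_BPB = 1.03385760   # locked benchmark from the text
BUDGET_BYTES = 16_000_000    # assumption: "16 MB" read as decimal megabytes

def ready_to_submit(seed_bpbs, artifact_bytes,
                    benchmark=BENCHMARK_BPB, budget=BUDGET_BYTES):
    """Require at least three seeds, a mean below the benchmark, and a fitting artifact."""
    if len(seed_bpbs) < 3:
        return False
    return mean(seed_bpbs) < benchmark and artifact_bytes <= budget
```

The seed BPB values passed in should come from the canonical scorer on the canonically packaged artifact; the gate itself only checks the arithmetic.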

Current Stack

Architecture Snapshots

The two trust anchors that define the V7 starting point.

🎯 Locked Benchmark

KKVShare Wider FLA

1.03385760 BPB

Source: Opensens reproduction

Attention: Linear / FLA + KV-sharing

MLP: Wider (KKVShare variant)

Seeds: 3-seed mean (canonical)

TTT: None (pre-TTT baseline)

Rule: Every new lane beats this

🔬 Canonical Controls (hygiene only)

NS0 / NS1

1.07882 / 1.07219 BPB

NS0 (PR #1716): BigramHash + Int8 + LZMA

NS1 (PR #1700): NS0 + Phased LoRA TTT

Quantization: Int6 GPTQ + SDClip

Compression: bit_shuffle + Brotli-11

Role: Transfer checks, not targets

Note: Not the headline target

⚡ Systems Gate (Passed)

TMA G3 Megakernel

−9.82% step time

Before: 415.3298 ms / step

After: 374.5369 ms / step

Tested on: Softmax control line

Next: Port to 1.03386 family

BPB claim? No, systems donor only

Unlock: Loop45 after transfer

🗺️ V7 Execution Queue

Sub-1.00 Campaign

< 1.000 BPB

① Phase 0: ANS/rANS payload audit

② Phase 1a: Port TMA G3 to benchmark

③ Phase 1b: Loop45 isolated (TMA-safe)

④ Phase 2: NV-015 DeltaShare lane

⑤ Phase 3: PRC-009 TTT (legal gate)

Deadline: 2026-04-30

83 Seconds. Every Byte Counts.
