Active Research Campaign

Parameter Golf

Minimizing bits-per-byte for a 10-minute, 16 MB language model. Locked benchmark: KKVShare Wider FLA at 1.03385760. V7 active — port TMA G3, isolate recurrence, then PR.

1.03386 BPB locked benchmark  (KKVShare Wider FLA)

Campaign Overview

83 Seconds. Every Byte Counts.

A full walkthrough of the research campaign — from the problem statement through architecture experiments to the current frontier.

Campaign Metrics

1.03386 · Locked benchmark BPB · KKVShare Wider FLA (Opensens)
1.07882 · Canonical control · NS0 PR #1716, hygiene reference
−9.82% · TMA G3 step-time reduction · 415.3 ms → 374.5 ms (passed)
25+ · Submissions analyzed · Leaderboard & code audit
32 · Hypotheses generated · Across 10+ research loops
<1.00 · V7 target BPB · Deadline 2026-04-30
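The percentages in these metrics follow from simple ratios. A minimal sketch using only numbers quoted on this page; the helper name `relative_change` is illustrative, not taken from the campaign code:

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change going from `before` to `after`."""
    return (after - before) / before


# Step times quoted for the TMA G3 systems gate (ms per step).
step_before_ms = 415.3298
step_after_ms = 374.5369

# A shorter step is a negative step-time change ...
step_time_change = relative_change(step_before_ms, step_after_ms)
# ... and a positive throughput change (steps per unit time).
throughput_change = step_before_ms / step_after_ms - 1.0

# Relative BPB drop still needed to move from the locked benchmark to 1.00.
locked_bpb = 1.03385760
target_bpb = 1.00
required_bpb_drop = relative_change(locked_bpb, target_bpb)

print(f"step time:  {step_time_change:+.2%}")
print(f"throughput: {throughput_change:+.2%}")
print(f"BPB needed: {required_bpb_drop:+.2%}")
```

The −9.82% figure is a step-time reduction; expressed as throughput it is a gain of roughly 10.9%. The sub-1.00 target implies a relative BPB drop of about 3.3% from the locked benchmark.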

Research Trajectory

The BPB Journey

Each loop narrows the gap — dead ends eliminated, winning techniques compounded.

L1
Loops 1–3 — Multi-Agent Orchestrator Built
Analyzed 25 leaderboard submissions at code level. Discovered MTP is fully implemented but disabled (MTP_NUM_HEADS=0). Generated 32 ranked hypotheses. Built 14 reusable experiment skills.
L4
Loop 4 — LeakyReLU Confirmed #1 Single Change
Mac Mini M4 ablation campaign. LeakyReLU(0.5)² activation confirmed as the strongest single improvement over the baseline. Architecture advantage of 11L/512d family validated empirically.
~1.47 BPB baseline
L6
Loop 6 — Proxy Compressor Gate Established
Critical lesson: all quant probes MUST use bit_shuffle+Brotli-11, not LZMA-9 or plain Brotli. H64 Lloyd-Max embeddings and H-R4P2 both killed by this trap — MSE savings evaporated under the real pipeline.
L8
Loop 8 — 1.0810 BPB Combined Solution
SDClip + Brotli-11 pipeline established as the real SOTA packaging. GPTQ quantization stack locked in. H64 Lloyd-Max definitively killed under canonical pipeline.
1.0810 BPB
L10
Loop 10 — MTP / MoE / Mamba All Killed
Rigorous elimination round. Multi-Token Prediction, Mixture-of-Experts, Mamba/SSM, and Knowledge Distillation all confirmed dead under budget constraints. Established the working heuristic that 1 ms of added overhead costs about 0.006 BPB.
Dead directions closed
L11
PR #1716 / #1700 — Canonical Control Lines Established
NS0 (PR #1716): BigramHash d=32 + Int8 control tensors + LZMA code wrapper. NS1 (PR #1700): Multi-phase SGD TTT + Phased LoRA on NS0. Both retained as hygiene references and transfer surfaces — not the headline target.
NS0 1.07882 · NS1 1.07219 — controls only
L12
KKVShare Wider FLA — Locked Benchmark
Opensens reproduction of the KKVShare Wider FLA architecture. 3-seed mean under canonical scorer. This is the benchmark every new lane must beat, not NS0.
1.03385760 BPB — locked benchmark
SYS
TMA G3 — Systems Gate Passed
Fused FC + activation kernel passed on the softmax control line. No BPB claim from TMA alone. Next step: transfer this throughput gain onto the 1.03385760 benchmark family before any recurrence isolate.
415.3298 ms → 374.5369 ms (−9.82%) ✓
V7
V7 Campaign — Active (2026-04-20)
Compression-first: audit exact-payload ANS/rANS/Brotli-11 on the 1.03386 artifact before anything else. Then port TMA G3 onto the benchmark family. Only then isolate Loop45 or DeltaShare. NV-015 DeltaShare is the new V7 architecture lane. NV-002b is a bridge, not the automatic main bet.
Target: sub-1.000 BPB by 2026-04-30
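The Loop 10 heuristic (1 ms of overhead ≈ 0.006 BPB) can be used as a quick screen for proposed changes. The sketch below assumes the cost scales linearly with overhead, which the page does not state; the function names are hypothetical.

```python
BPB_PER_MS = 0.006  # heuristic quoted for Loop 10


def overhead_bpb_cost(overhead_ms: float, bpb_per_ms: float = BPB_PER_MS) -> float:
    """Estimated BPB penalty for extra overhead (assumes linear scaling)."""
    return overhead_ms * bpb_per_ms


def break_even_ms(expected_bpb_gain: float, bpb_per_ms: float = BPB_PER_MS) -> float:
    """Largest overhead a change can add before the heuristic cancels its gain."""
    return expected_bpb_gain / bpb_per_ms


def is_worth_it(expected_bpb_gain: float, overhead_ms: float) -> bool:
    """True when the expected gain exceeds the heuristic overhead penalty."""
    return expected_bpb_gain > overhead_bpb_cost(overhead_ms)


# Example: a change expected to save 0.005 BPB.
print(f"break-even overhead: {break_even_ms(0.005):.3f} ms")
print("0.5 ms overhead acceptable:", is_worth_it(0.005, 0.5))
print("1.0 ms overhead acceptable:", is_worth_it(0.005, 1.0))
```

Under this reading, a change expected to save 0.005 BPB pays for itself only if it adds less than about 0.83 ms of overhead.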

Techniques & Experiments

Every direction tested rigorously — winners compounded, dead ends documented.

🗜️
GPTQ + SDClip Quantization
Int6 weight quantization with adaptive sigma clipping. SDClip reduces outlier sensitivity. Stacked with bit_shuffle + Brotli-11 for canonical byte budget.
Active SOTA
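To illustrate the idea of sigma clipping before low-bit quantization, here is a minimal sketch of symmetric int6 rounding with a clip at k standard deviations. It is not GPTQ and not the campaign's SDClip implementation; the function names, the default `k_sigma`, and the symmetric [-31, 31] code range are all assumptions made for illustration.

```python
import statistics


def sigma_clip_quantize(weights, k_sigma=3.0, bits=6):
    """Clip to +/- k_sigma * std, then round to symmetric signed integer codes.

    Returns (codes, scale). Hypothetical helper, plain round-to-nearest.
    """
    qmax = 2 ** (bits - 1) - 1            # 31 for 6-bit symmetric codes
    sigma = statistics.pstdev(weights)
    clip = k_sigma * sigma
    if clip == 0:
        return [0] * len(weights), 1.0
    scale = clip / qmax                   # real-valued width of one code step
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real weights."""
    return [c * scale for c in codes]


# One large outlier among small weights.
weights = [0.1, -0.2, 0.05, 0.15, -0.1, 5.0]
codes, scale = sigma_clip_quantize(weights, k_sigma=1.0)
print(codes, round(scale, 5))
print([round(v, 4) for v in dequantize(codes, scale)])
```

A tighter clip spends the 63 available levels on the bulk of the distribution and saturates the outlier; a looser clip keeps the outlier exact but coarsens every other weight. Choosing `k_sigma` is the tuning knob.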
🔄
Loop45 Recurrence (TMA)
45-step recurrent inference on the TMA G3 kernel, which cut step time by 9.82%. Systems-gate approach: unlock recurrence depth only after throughput is validated.
V7 Priority
🧠
CLA / Cross-Layer KV Sharing
NV-002b: sharing KV caches across middle layers frees ~1.3M parameters for reinvestment. Non-GDN parts survive canonical scorer audit.
V7 Bridge
📐
SpinQuant (Hadamard Rotation)
H98: Hadamard rotation before GPTQ quantization reduces quantization error. Expected −0.005 BPB. Must retune sigma to 7–9σ post-rotation.
Queued
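Why a Hadamard rotation can help quantization: it is an orthonormal transform that spreads a single large value across all coordinates, so the range the quantizer must cover shrinks. The sketch below is a generic fast Walsh-Hadamard transform in plain Python, written for illustration; it is not the H98 code, and the function name is hypothetical.

```python
import math


def hadamard_rotate(x):
    """Apply the orthonormal Walsh-Hadamard transform to a length-2^k vector."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)             # makes the transform orthonormal
    return [v * norm for v in out]


# A vector dominated by one outlier.
x = [8.0] + [0.0] * 7
rotated = hadamard_rotate(x)
print("max |x| before:", max(abs(v) for v in x))
print("max |x| after: ", round(max(abs(v) for v in rotated), 4))
# The normalized transform is its own inverse.
print("round trip ok: ", all(abs(a - b) < 1e-9 for a, b in zip(hadamard_rotate(rotated), x)))
```

Because the rotated distribution is flatter, a clip threshold tuned before rotation will generally no longer be appropriate afterward, which is consistent with the note above about retuning sigma.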
Phased LoRA TTT
PRC-006: Test-Time Training where only LoRA parameters update at evaluation. Parameter-efficient and legal under canonical scorer rules.
NS1 Control
📦
BigramHash Embeddings
PRC-001: d=32 bigram hashing on the tightest n-gram size. ~200 KB artifact cost. Part of the canonical PR #1716 SOTA stack.
In SOTA
🔗
DeltaShare (Low-Rank)
New V7 structural lane: share middle-block weights with tiny low-rank delta corrections. Inspired by DeltaLLM. Avoids all-or-nothing hard sharing.
V7 New Lane
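The parameter arithmetic behind shared weights plus low-rank deltas can be sketched as follows. Each block's effective weight is taken to be W_shared + A_i·B_i; the block count, rank, and function names are hypothetical, and only the 512 width echoes the 11L/512d family mentioned earlier.

```python
def unique_block_params(d_in: int, d_out: int, n_blocks: int) -> int:
    """Parameters when every block owns a full d_in x d_out matrix."""
    return n_blocks * d_in * d_out


def delta_share_params(d_in: int, d_out: int, n_blocks: int, rank: int) -> int:
    """One shared matrix plus a rank-`rank` correction (A_i, B_i) per block."""
    shared = d_in * d_out
    deltas = n_blocks * rank * (d_in + d_out)
    return shared + deltas


def effective_weight(shared, a, b):
    """Return shared + a @ b for nested-list matrices (a: d_in x r, b: r x d_out)."""
    rank = len(b)
    return [
        [shared[i][j] + sum(a[i][k] * b[k][j] for k in range(rank))
         for j in range(len(shared[0]))]
        for i in range(len(shared))
    ]


# Hypothetical sizing: 4 middle blocks of 512 x 512, rank-4 deltas.
full = unique_block_params(512, 512, 4)
shared = delta_share_params(512, 512, 4, rank=4)
print(f"unique: {full:,}  shared+delta: {shared:,}  saved: {full - shared:,}")
```

With these illustrative sizes the shared variant stores 278,528 parameters instead of 1,048,576. Whether the low-rank correction restores enough flexibility is the empirical question the lane is meant to test.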
GDN / FLA (Non-Canonical)
H85's 1.034 BPB came from a buggy scorer with a non-canonical LUT; the canonical 3-seed mean is 1.223 BPB. The entire GDN lane is killed. The documented failure spares future teams the same trap.
Killed
Per-Loop LoRA (H87)
All 4 gates failed empirically. Artifact busts 16 MB budget by 50 KB. Structurally incompatible with GPTQ pipeline. Confirmed killed 2026-04-18.
Killed

Leaderboard Submissions

Key Pull Requests

Selected PRs on the openai/parameter-golf leaderboard, ordered by impact.

#1716 · BigramHash + Int8 control tensors + LZMA code wrapper (NS0 canonical SOTA) · 1.07882 · Live SOTA
#1700 · Multi-phase SGD TTT + Phased LoRA on NS0 base (NS1 challenger) · 1.07219 · Challenger
#1693 · AttnOutGate per-loop (PRC-003), zero-init sigmoid gate · Research
#1670 · Adaptive per-component GPTQ sigma clip (PRC-008) · Research
#1687 · CLA architecture, cross-layer KV sharing (NV-002b non-GDN parts) · Research

V7 Execution Plan

Next PR Path

The cleanest "combine with TMA G3, then PR" sequence from V7 planning docs.

0
Exact-payload compression audit on 1.03385760 family
Compare Brotli-11 vs rANS vs block-adaptive Huffman on the current quantized artifact. Measure true byte headroom before spending architectural budget. Must include code overhead — lossless wins are free budget.
1a
Port TMA G3 onto the benchmark family
TMA G3 already passed on the softmax control (415.3 ms → 374.5 ms, −9.82%). Test the fused FC + activation path on the 1.03385760 family. No BPB claims from TMA alone — goal is confirming the benchmark family can absorb the same throughput gain.
1b
Loop45 isolated on the TMA-safe path (only after 1a passes)
Exact parent stack + deeper recurrence only. No CaseOps, no pre-quant TTT, no extra packaging changes beyond already-trusted Phase 0 compressor. Benchmark family first; softmax control as sanity cross-check.
2
NV-015 DeltaShare — new V7 architecture lane
Cross-layer weight sharing in middle blocks + tiny low-rank delta corrections to restore flexibility. Edge and specialized blocks stay unique. Runs in parallel with Loop45 isolate as an independent architecture bet.
PR
Submit — beating 1.03385760 under canonical scorer
3-seed mean required. Must clear the canonical scorer with bit_shuffle + Brotli-11 packaging. First winner from Phase 0+1 establishes the new benchmark; subsequent wins compound one validated change at a time.
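The Phase 0 audit compares compressors on the same payload and charges each one for its decoder code. A minimal harness for that comparison is sketched below. It uses the standard-library `zlib` and `lzma` as stand-ins and adds Brotli only if the third-party `brotli` package happens to be installed; rANS and block-adaptive Huffman are not included. The budget constant assumes "16 MB" means 16,000,000 bytes, and the code-overhead figure is a placeholder.

```python
import lzma
import zlib

try:
    import brotli  # optional third-party package; skipped when absent
except ImportError:
    brotli = None

BUDGET_BYTES = 16_000_000  # assumption: decimal reading of the 16 MB limit


def candidate_compressors():
    """Map compressor name to a bytes -> bytes function."""
    comps = {
        "zlib-9": lambda b: zlib.compress(b, 9),
        "lzma-9": lambda b: lzma.compress(b, preset=9),
    }
    if brotli is not None:
        comps["brotli-11"] = lambda b: brotli.compress(b, quality=11)
    return comps


def audit(payload: bytes, code_overhead: dict) -> list:
    """Return (name, total_bytes, headroom) rows sorted by total size.

    total_bytes = compressed payload + decoder code size for that compressor.
    """
    rows = []
    for name, compress in candidate_compressors().items():
        total = len(compress(payload)) + code_overhead.get(name, 0)
        rows.append((name, total, BUDGET_BYTES - total))
    return sorted(rows, key=lambda row: row[1])


# Synthetic stand-in payload; a real audit would load the quantized artifact.
payload = bytes(range(256)) * 64
for name, total, headroom in audit(payload, {"zlib-9": 100}):
    print(f"{name:10s} total={total:8d} B  headroom={headroom:10d} B")
```

Ranking by total bytes rather than compressed bytes alone reflects the rule above that code overhead must be included in the comparison.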

Architecture Snapshots

The two trust anchors that define the V7 starting point.

🎯 Locked Benchmark
KKVShare Wider FLA
1.03385760 BPB
Source: Opensens reproduction
Attention: Linear / FLA + KV-sharing
MLP: Wider (KKVShare variant)
Seeds: 3-seed mean (canonical)
TTT: None (pre-TTT baseline)
Rule: Every new lane beats this
🔬 Canonical Controls (hygiene only)
NS0 / NS1
1.07882 / 1.07219 BPB
NS0 (PR #1716): BigramHash + Int8 + LZMA
NS1 (PR #1700): NS0 + Phased LoRA TTT
Quantization: Int6 GPTQ + SDClip
Compression: bit_shuffle + Brotli-11
Role: Transfer checks, not targets
Note: Not the headline target
⚡ Systems Gate (Passed)
TMA G3 Megakernel
−9.82% step time
Before: 415.3298 ms / step
After: 374.5369 ms / step
Tested on: Softmax control line
Next: Port to 1.03386 family
BPB claim? No, systems donor only
Unlock: Loop45 after transfer
🗺️ V7 Execution Queue
Sub-1.00 Campaign
< 1.000 BPB
① Phase 0: ANS/rANS payload audit
② Phase 1a: Port TMA G3 to benchmark
③ Phase 1b: Loop45 isolated (TMA-safe)
④ Phase 2: NV-015 DeltaShare lane
⑤ Phase 3: PRC-009 TTT (legal gate)
Deadline: 2026-04-30
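The rule that every new lane must beat the locked benchmark on a 3-seed mean can be expressed as a small check. This is a sketch: the function name is hypothetical, the seed values are made up for illustration, and it assumes the per-seed BPB numbers already come from the canonical scorer.

```python
from statistics import mean

LOCKED_BENCHMARK_BPB = 1.03385760  # KKVShare Wider FLA, 3-seed mean


def passes_gate(seed_bpbs, benchmark=LOCKED_BENCHMARK_BPB, required_seeds=3):
    """True when the mean BPB over exactly `required_seeds` runs beats the benchmark."""
    if len(seed_bpbs) != required_seeds:
        raise ValueError(f"need exactly {required_seeds} seeds, got {len(seed_bpbs)}")
    return mean(seed_bpbs) < benchmark


# Made-up seed results for two hypothetical candidates.
print(passes_gate([1.031, 1.033, 1.035]))  # mean 1.033, below the benchmark
print(passes_gate([1.030, 1.040, 1.040]))  # one good seed is not enough
```

Requiring the full seed count up front keeps a single lucky run from being mistaken for a win.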