This article is a companion to Paper 8 — “The Vision Basin: Cross-Modal Throughput Measurement Reveals Modality-Specific Information Extraction Rates” (2026). The full manuscript is at github.com/Windstorm-Institute/vision-basin.


Paper 7 proved that language models converge on ~4 bits per token not because of anything in the architecture, but because that’s how much information is in the training data. The natural question: what about images? What about audio? If the basin is genuinely about the data, then different data should sit at a different basin.

It does.

Three modalities, three basins

We measured throughput — how many bits of useful prediction a model extracts per source unit — across 12 models spanning language, vision, and audio. The results are not ambiguous:

Language (9 models, Pythia 70M–1.4B + GPT-2 small–XL): 0.85–1.30 bits per source byte. Larger models extract more. The old “τ = 4.16” was bits per BPE token; in tokenizer-independent units, the basin is ~1 bit per byte.

Vision (MAE-Base and MAE-Large): 1.33–1.36 bits per pixel. These are generative models that must reconstruct masked image patches — not classifiers that collapse images to labels. The classification floor is 0.02 bits per patch (essentially zero information retained).

Audio (real LJ Speech, 3,000 utterances, 5.5 hours): 1.89 bits per mel-spectrogram dimension. Music extracts even more: 2.69 bits per mel dimension. Controls are perfect: white noise and silence both extract exactly zero.

The visual shuffling cascade

Paper 6 showed that destroying linguistic structure (shuffling words) raises BPT from ~4 to ~10.8 — a structural bonus of ~6.7 bits. We ran the same experiment on images.

We trained a next-patch prediction model from scratch on STL-10 photographs (96×96), then evaluated it on progressively destroyed versions: quadrants shuffled, blocks shuffled, rows shuffled, pixels shuffled. The result: a spatial structural bonus of 0.69 bits per pixel. Spatial hierarchy is exploitable — the model gets better predictions from intact images than from scrambled ones.

The visual bonus (0.69) is about 10× smaller than language’s (~6.7), per source unit. This makes sense: language has deep compositional hierarchy — phonology, morphology, syntax, semantics, discourse, pragmatics — while images have shallower spatial hierarchy (pixel adjacency, objects, scenes). More levels of exploitable structure means more bits of bonus.

The ruler problem

Paper 7 proved that bits per token is tokenizer-dependent. We found the same problem in vision: bits per pixel depends on patch size.

We trained the same model at six patch sizes (8×8 through 48×48) and measured bits per pixel. It varies by 4× — from 1.19 at 8×8 to 0.29 at 48×48. Larger patches cover more pixels, which mechanically changes the per-pixel number without changing how much the model actually learned.

This is the visual version of Paper 7’s R5 finding: the measurement depends on the ruler. For text, the ruler is the tokenizer. For images, the ruler is the patch size. For audio, it’s the mel-spectrogram parameters. Comparing throughput across modalities requires a universal ruler that nobody has yet built.

The visual entropy tracking curve

If the visual basin is genuinely data-driven, then images with more complexity should produce higher reconstruction loss in a generative vision model. We tested this using a Masked Autoencoder (MAE-Base) on synthetic images at controlled complexity levels plus real photographs (CIFAR-10):

Image Type MAE Loss Bits/Pixel Est
Uniform gray0.0000.00
Gradient0.0000.00
CIFAR-10 (real photos)0.0710.14
Gaussian noise1.1352.14
Uniform noise1.6472.41
Random 4×4 blocks1.7972.47
Random 16×16 blocks2.6052.74

The ordering is exactly what the data-driven hypothesis predicts. Trivially predictable images (uniform, gradient) have near-zero loss. Real photographs (CIFAR-10) have very low loss because natural images have massive exploitable structure — spatial coherence, object regularity, texture patterns. Random noise is hard to predict. The most difficult case is random 16×16 blocks: each patch is a random color, and the MAE’s 16×16 patches see no useful spatial context at all.

A ViT-Base classifier confirms this from the discriminative side: output entropy is 3.6 bits for real images (the model is confident) versus 6.5–8.4 bits for synthetic patterns (the model is confused). The basin sits where the model can exploit structure.

Trained from scratch — the visual SYN-8

The pretrained-MAE results above are evaluative, which a critical reviewer would correctly flag: the model’s confidence on real images may reflect its training distribution rather than image entropy itself. To rule that out, we trained a 112-million-parameter ViT-MAE from scratch at each entropy level — the visual equivalent of Paper 7’s SYN-8 experiment. Three random seeds per level, fifteen epochs each, on synthetic images at 224×224.

Image Type Approx Entropy Eval Loss (mean ± std)
Uniform color~0 bits0.000001 ± 0.000000
Natural-like (gradients + objects)structured0.014 ± 0.000
4-color blocks~2 bits0.065 ± 0.025
16-color blocks~4 bits0.076 ± 0.004
Gaussian noise~7 bits0.058 ± 0.000
64-color pixels~6 bits0.084 ± 0.001
Uniform noise~8 bits0.083 ± 0.000

The two key results: trivially predictable images (uniform) reconstruct perfectly, and the natural-like images with exploitable spatial structure (gradients, blob-shaped objects) reconstruct six times better than uniform noise of similar entropy. The MAE learns to compress what is compressible. This is the visual equivalent of f(structural_depth) from Paper 7 — the basin sits at source entropy minus exploitable structure, in vision just as in language. Welch’s t-test comparing uniform noise against uniform color: t = 249,994, p = 1.6 × 10−11, Cohen’s d = 204,119.

The 4-bit coincidence

Natural images contain about 4.0 bits per pixel under WebP lossless compression. Natural language sits at about 4.16 bits per token. The numerical similarity is striking — but it’s an artifact.

Visual source entropy is resolution-dependent. STL-10 at 96×96 gives 5.07 bpp under PNG. The same images resized to 224×224 give 3.20 bpp — the upsampling creates spatial redundancy that compression exploits. The “4 bpp” number depends on which resolution and which compressor you use. It is not a universal constant.

Audio follows the same pattern

We measured mel-spectrogram entropy across synthetic audio of increasing complexity:

Audio Type Mel Entropy (bits/frame)
Silence0.00
AM sweep5.30
Pure 440 Hz tone5.35
Three-note chord5.37
Pink noise (1/f)6.28
White noise6.30

The same pattern holds: silence has zero entropy, structured signals (tones, chords) sit around 5.3 bits, and random noise reaches 6.3 bits.

We also tested wav2vec2 (a speech recognition model) on these signals plus 200 real utterances from LJ Speech. The result mirrors the vision finding: real speech produces output entropy of just 0.095 bits/frame — far below any synthetic signal (0.6–3.0 bits). The model is extremely confident about real speech because it was trained on speech, just as the ViT is confident about real images because it was trained on ImageNet. In both cases, the basin sits where the model can exploit learned structure. Audio throughput tracks source complexity, just as vision and language do. Each modality has its own basin, but the basin is always data-driven.

What this means

The throughput basin is not a single number shared across modalities. It is a modality-specific equilibrium — each data type has its own entropy, its own exploitable structure, and its own basin. Language sits at ~1 bit per byte. Vision sits at ~1.3 bits per pixel. Audio speech sits at ~1.9 bits per mel dimension.

The equation from Paper 7 — BPT ≈ source_entropy − f(structural_depth) — holds qualitatively for all three. But the function f() and the definition of “source unit” differ across modalities. Establishing a modality-independent throughput metric — a universal ruler — remains the central open problem.


The Vision Basin is Paper 8 of the Windstorm series.
Zenodo: DOI pending · Code & data: github.com/Windstorm-Institute/vision-basin
Download the full paper (PDF) · Grand Slam Supplementary Materials (PDF)