AIM³ Institute · Computational Biology · 8Z-LO Framework

Math Hidden in Your DNA

The DNA Scanner discovers deterministic mathematical structure in genomic sequences with Z-scores up to 74. The FASTA Encoder turns that knowledge into compression that beats 7-Zip on 44 out of 50 genomes — including the complete 3.15 GB human genome.

Z=74

Peak Z-Score

44/50

Genomes Beat 7-Zip

3.15 GB

Largest Genome

LLM Encoders Tested

Genomes in Corpus

DNA Scanner · Mathematical Discovery

Your Genome Is Not Random

The DNA Scanner (M6.3) detects deterministic mathematical patterns in genomic sequences that persist far beyond what random chance or simple Markov models can explain. Seven validated generators — including periodic, gradient, and cellular automata patterns — produce Z-scores of 28 to 74 across a 50-genome collection spanning bacteria, yeast, plants, worms, and humans.

Fisher shuffling destroys the signal. Block shuffling preserves it. The structure is global, not local — it spans entire chromosomes, not just local tandem repeats.
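The difference between the two shuffles can be illustrated with a toy sketch. It assumes "Fisher shuffling" means a full Fisher-Yates permutation of bases and that block shuffling permutes fixed-size blocks; `periodic_score` and the 60-base block size are hypothetical choices for illustration, not the Scanner's actual statistic or settings.

```python
import random

def fisher_yates_shuffle(seq: str, rng: random.Random) -> str:
    """Full permutation: keeps base composition, destroys all ordering."""
    bases = list(seq)
    # Classic Fisher-Yates: swap each position with a random position at or before it.
    for i in range(len(bases) - 1, 0, -1):
        j = rng.randint(0, i)
        bases[i], bases[j] = bases[j], bases[i]
    return "".join(bases)

def block_shuffle(seq: str, block_size: int, rng: random.Random) -> str:
    """Permute whole blocks: ordering inside each block is left intact."""
    blocks = [seq[i:i + block_size] for i in range(0, len(seq), block_size)]
    rng.shuffle(blocks)
    return "".join(blocks)

def periodic_score(seq: str, period: int) -> float:
    """Hypothetical statistic: fraction of bases equal to the base `period` steps back."""
    matches = sum(seq[i] == seq[i - period] for i in range(period, len(seq)))
    return matches / (len(seq) - period)

rng = random.Random(0)
# Toy sequence: two different period-6 regions back to back.
seq = "ACGTTG" * 100 + "GGATCC" * 100
full = fisher_yates_shuffle(seq, rng)
blocked = block_shuffle(seq, 60, rng)

print("original     :", round(periodic_score(seq, 6), 3))
print("block shuffle:", round(periodic_score(blocked, 6), 3))
print("full shuffle :", round(periodic_score(full, 6), 3))
```

On this toy input the block-shuffled score stays close to the original while the fully shuffled score drops toward the level expected from base composition alone.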

The 6-Tool Pipeline

8Z-LO DNA Analysis Pipeline

Detector (find candidates) → Detective (characterize hits) → Scanner (M6.3, core engine) → Profiler (cross-genome stats) → Validator (Monte Carlo, FDR) → InvisibleDNA (invisible structure)

Progressive Amplification — The Strongest Evidence

Genuine signals become more significant under harder null models. This is the opposite of noise, which fades under stricter controls. Yeast chromosome I demonstrates this clearly:

Yeast chrI · Z-Score Under Progressively Harder Null Models

Null model	Z-score
Fisher	13.4
Markov-1	20.4
Markov-2	38.0
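As a rough illustration of how a Z-score against a null model can be computed, the sketch below draws surrogate sequences from two nulls (a full shuffle and a first-order Markov chain fitted to the input) and standardizes the observed statistic against each. The `lag_match` statistic, the toy sequence, and the 100 surrogates are assumptions made for the example; the Scanner's real statistics and generators are not specified here, and a toy input is not guaranteed to show progressive amplification.

```python
import random
import statistics
from collections import Counter, defaultdict

def lag_match(seq: str, lag: int = 7) -> float:
    """Hypothetical statistic: fraction of bases equal to the base `lag` steps back."""
    return sum(seq[i] == seq[i - lag] for i in range(lag, len(seq))) / (len(seq) - lag)

def shuffle_surrogate(seq: str, rng: random.Random) -> str:
    """Null 1: same base composition, all ordering destroyed."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def markov1_surrogate(seq: str, rng: random.Random) -> str:
    """Null 2: sample from the first-order transition counts of `seq`."""
    transitions = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        transitions[a][b] += 1
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        nxt = transitions[out[-1]]
        if not nxt:
            # Dead end (symbol never followed by anything): restart from the first base.
            out.append(seq[0])
            continue
        symbols, weights = zip(*nxt.items())
        out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)

def z_score(seq, statistic, surrogate, n_iter, rng):
    """Standardize the observed statistic against n_iter surrogate draws."""
    observed = statistic(seq)
    null = [statistic(surrogate(seq, rng)) for _ in range(n_iter)]
    return (observed - statistics.mean(null)) / statistics.stdev(null)

rng = random.Random(1)
clean = "ACGGTCA" * 150
# Toy input: a period-7 motif with roughly 10% of positions redrawn at random.
noisy = "".join(rng.choice("ACGT") if rng.random() < 0.1 else b for b in clean)

z_fisher = z_score(noisy, lag_match, shuffle_surrogate, 100, rng)
z_markov1 = z_score(noisy, lag_match, markov1_surrogate, 100, rng)
print(f"shuffle null : z = {z_fisher:.1f}")
print(f"Markov-1 null: z = {z_markov1:.1f}")
```

A Markov-2 null would follow the same pattern with transition counts keyed on the previous two bases.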

What This Means

The harder you try to explain the signal away with sophisticated random models, the stronger it gets. This progressive amplification pattern is the hallmark of genuine deterministic structure — not statistical artifact. Monte Carlo validation with 160 iterations and Benjamini-Hochberg FDR correction confirms these aren't false discoveries.
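For readers unfamiliar with the two validation steps named above, here is a minimal sketch of a one-sided Monte Carlo p-value and the Benjamini-Hochberg step-up procedure. The function names, the `alpha = 0.05` level, and the example p-values are assumptions for illustration; the Validator's actual settings are not given in this text.

```python
def empirical_p_value(observed: float, null_values: list) -> float:
    """One-sided Monte Carlo p-value with the +1 correction, so p is never exactly 0."""
    exceed = sum(v >= observed for v in null_values)
    return (exceed + 1) / (len(null_values) + 1)

def benjamini_hochberg(p_values: list, alpha: float = 0.05) -> list:
    """Return booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    # Everything ranked at or below the cutoff is rejected, even if it failed its own threshold.
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# With 160 null draws that never reach the observed value, p bottoms out at 1/161.
p_min = empirical_p_value(10.0, [0.0] * 160)
decisions = benjamini_hochberg([0.001, 0.015, 0.035, 0.036, 0.9])
print(p_min, decisions)
```

One consequence worth noting: with 160 iterations the smallest attainable empirical p-value is 1/161, about 0.0062, so the Monte Carlo step bounds false discoveries rather than reproducing the size of the Z-scores.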

8Z-FASTA Encoder · Genomic Compression

Beat 7-Zip on 44 of 50 Genomes

The 8Z-FASTA encoder (gemZ / HYB4) applies MDL-governed compression with domain-specific transforms: 2-bit packing, nibble coding, context-3 models, and tandem repeat detection. It beats generic compressors convincingly — and the advantage increases on larger genomes.

Landmark Results

Genome	Size	8Z (.8z)	7-Zip	vs 7-Zip
A. thaliana chloroplast	154 KB	16.5 KB	19.2 KB	−14.3%
Human chr19 (gene-dense)	58.4 MB	12.0 MB	13.8 MB	−13.0%
Human chr1	230 MB	50.3 MB	59.9 MB	−15.9%
Wheat Chr3B	805 MB	116.2 MB	140.1 MB	−17.1%
Human T2T complete genome	3.15 GB	677.2 MB	767.3 MB	−11.8%

Scaling Insight

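8Z's advantage over 7-Zip grows from −14.3% on the 154 KB chloroplast to −17.1% on the 805 MB wheat chromosome, although the trend in the table is not monotonic: human chr19 (58.4 MB) comes in at −13.0% and the 3.15 GB T2T genome at −11.8%. The domain-specific transforms (NIB, periodic detection) can capture more structure as the data gets bigger. On the 3.15 GB T2T human genome, 8Z saves 90 MB over 7-Zip in a single-threaded run.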

Competitive Landscape

Compressor	Type	Human chr1	bpb	vs 7-Zip
GeCo3 -l5	Context-mixing AC (C)	47.1 MB	1.636	−21.3%
MFCompress -3	Finite-context (C)	48.6 MB	1.685	−18.9%
JARVIS3 -l7	Neural CM (C)	49.8 MB	1.727	−16.9%
8Z CLA (.8z)	MDL + transforms (Python)	50.3 MB	1.747	−15.9%
8Z gemZ (.8z)	MDL + transforms (Python)	51.6 MB	1.791	−13.8%
7-Zip	LZMA2 (generic)	59.9 MB	2.078	baseline
ZIP	deflate (generic)	68.4 MB	2.374	+14.2%

The GeCo3 Gap

GeCo3 (context-mixing with arithmetic coding, written in C) leads on human chr1: 8Z CLA's output is ~6.8% larger (50.3 MB vs 47.1 MB). This is the gap to close. GeCo3 uses fundamentally different statistics; 8Z uses transforms + entropy coding. A hybrid approach, MDL-governed context mixing, is the next frontier.
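The gap figure can be checked directly from the human chr1 sizes in the table. The helper name `percent_vs` is made up for this sketch; the three sizes are the rounded MB values listed above.

```python
def percent_vs(size_mb: float, reference_mb: float) -> float:
    """Relative size difference in percent; negative means smaller than the reference."""
    return (size_mb / reference_mb - 1.0) * 100.0

# Human chr1 output sizes, copied from the Competitive Landscape table.
geco3_mb, cla_mb, sevenzip_mb = 47.1, 50.3, 59.9

gap_to_geco3 = percent_vs(cla_mb, geco3_mb)    # how much larger 8Z CLA is than GeCo3
cla_vs_7zip = percent_vs(cla_mb, sevenzip_mb)  # 8Z CLA relative to the 7-Zip baseline

print(f"8Z CLA vs GeCo3: {gap_to_geco3:+.1f}%")
print(f"8Z CLA vs 7-Zip: {cla_vs_7zip:+.1f}%")
```

Because the inputs are rounded to 0.1 MB, the recomputed percentages can differ from the table by about 0.1 percentage point.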

Round 2 Multi-LLM Benchmark

9 AI Encoders, 1 Architecture, Total Convergence

Round 2 tested 9 encoder variants, all starting from the same gemZ v9.3 baseline: the unmodified original plus rewrites by 8 different LLMs. The result: 6 of 9 encoders converged within 1 KB of each other across 1.58 GB of data, and a seventh (GROz) landed within 0.5 MB. The architecture has reached its local optimum.

Encoder Rankings

#	Encoder	LLM	Total	Ratio	vs 7-Zip
—	GeCo3	Reference (C)	266.4 MB	16.88%	−22.2%
1	DSEz	DeepSeek-V3 R1	305.4 MB	19.35%	−10.8%
1	GEMz_orig	Original v9.3	305.4 MB	19.35%	−10.8%
3	GEMz	Gemini 3 Pro	305.4 MB	19.35%	−10.8%
3	MMAz	MiniMax M2.5	305.4 MB	19.35%	−10.8%
5	KIMz	Kimi K2.5	305.4 MB	19.35%	−10.8%
6	QWEz	Qwen 3.5 Plus	305.4 MB	19.35%	−10.8%
7	GROz	Grok 4.2	305.9 MB	19.38%	−10.7%
8	GLMz	GLM 5 DeepThink	311.5 MB	19.74%	−9.0%
9	GPTz	ChatGPT 5.2	364.1 MB	23.07%	+6.3%

Convergence = Architecture Saturation

Six encoders differ by less than 1 KB total across 1.58 GB. The LLMs made near-zero meaningful changes. The gemZ v9.3 architecture is at its ceiling for this codec family. Next step: architectural evolution, not parameter tuning.

Cautionary Tale: GPTz Catastrophe

ChatGPT 5.2's rewrite discarded the NIB transform and periodic detection — the two features that account for the entire compression advantage. At 364.1 MB it's worse than plain 7-Zip (342.4 MB). LLM rewrites can destroy domain-specific features without understanding their purpose.


Encoder Architecture

How gemZ v9.3 Compresses DNA

644 lines of Python. The core insight: DNA has only 4 bases (A, C, G, T), so pack 4 bases per byte (4:1 immediately), then apply domain-specific transforms before generic entropy coding.

gemZ v9.3 Encoding Pipeline

Parse FASTA (headers + sequence) → 2-bit Pack (A, C, G, T as 00, 01, 10, 11) → Mode Battle (RAW · DELTA · CTX3 · PER) → MDL Select (smallest wins) → Codec Battle (LZMA · Brotli · Zstd) → SHA3-256 (verify lossless)
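The last two pipeline stages can be sketched with the Python standard library. Brotli and Zstd are not in the standard library, so this example swaps in `zlib` and `bz2` next to `lzma` purely to stay self-contained; the record layout and function names are assumptions, not the .8z container format.

```python
import bz2
import hashlib
import lzma
import zlib

# Stand-in codec set: lzma is real, zlib and bz2 replace Brotli and Zstd for this sketch.
CODECS = {
    "lzma": (lzma.compress, lzma.decompress),
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}

def codec_battle(payload: bytes):
    """Compress with every codec and keep the smallest output."""
    results = {name: enc(payload) for name, (enc, _) in CODECS.items()}
    winner = min(results, key=lambda name: len(results[name]))
    return winner, results[winner]

def encode(payload: bytes) -> dict:
    """Hash the input first, then store the winning codec's output."""
    digest = hashlib.sha3_256(payload).hexdigest()
    name, blob = codec_battle(payload)
    return {"codec": name, "sha3_256": digest, "blob": blob}

def decode(record: dict) -> bytes:
    """Decompress and refuse to return data whose hash does not match."""
    _, dec = CODECS[record["codec"]]
    payload = dec(record["blob"])
    if hashlib.sha3_256(payload).hexdigest() != record["sha3_256"]:
        raise ValueError("SHA3-256 mismatch: round trip is not lossless")
    return payload

original = b"ACGTTGCA" * 2000
record = encode(original)
restored = decode(record)
print(record["codec"], len(record["blob"]), restored == original)
```

Hashing the input before compression means the check covers the whole encode and decode path, not just the codec.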

🧬

Transform

2-Bit Packing

ACGT → 00/01/10/11. Four bases per byte. Immediate 4:1 reduction on sequence data before any compression begins.
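A minimal sketch of the packing step, assuming the first base occupies the high bits of each byte and the input contains only uppercase A, C, G, T. The bit order and the handling of a short final chunk are assumptions; other symbols (such as N or lowercase) would need a separate stream.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"

def pack_2bit(seq: str) -> bytes:
    """Pack four bases per byte, first base in the two highest bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpacking can read from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack_2bit(data: bytes, n_bases: int) -> str:
    """Inverse of pack_2bit; n_bases trims the padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])

packed = pack_2bit("ACGTAC")
print(packed.hex(), unpack_2bit(packed, 6))
```

The base count has to be stored alongside the packed bytes, since the padding bits in the last byte are indistinguishable from trailing A bases.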

🔄

Mode Selection

Per-Block MDL Battle

Four modes compete per 1 MB block: RAW (packed), DELTA (differential), CTX3 (trigram context model), and PERIODIC (tandem repeat detection). MDL picks the cheapest.
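The selection rule can be sketched as a cost comparison per block. This example implements only RAW and DELTA and uses the zlib-compressed size plus a one-byte mode tag as a stand-in for the description length; CTX3 and PERIODIC are omitted, and the cost model is an assumption rather than gemZ's actual MDL accounting.

```python
import zlib

BLOCK_SIZE = 1 << 20  # 1 MB blocks, as described in the text

def raw_mode(block: bytes) -> bytes:
    """RAW: pass the packed bytes through unchanged."""
    return block

def delta_mode(block: bytes) -> bytes:
    """DELTA: store each byte as the difference from the previous byte (mod 256)."""
    out = bytearray()
    prev = 0
    for b in block:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)

def undo_delta(data: bytes) -> bytes:
    """Inverse of delta_mode: running sum mod 256."""
    out = bytearray()
    prev = 0
    for d in data:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)

MODES = {"RAW": raw_mode, "DELTA": delta_mode}

def description_length(transformed: bytes) -> int:
    """Proxy cost: 1 byte for the mode tag plus the zlib-compressed size."""
    return 1 + len(zlib.compress(transformed, 9))

def choose_mode(block: bytes):
    """Run every mode on the block and return the cheapest one with all costs."""
    costs = {name: description_length(fn(block)) for name, fn in MODES.items()}
    best = min(costs, key=costs.get)
    return best, costs

def encode_blocks(data: bytes):
    """Pick a mode independently for each block."""
    return [choose_mode(data[i:i + BLOCK_SIZE])[0]
            for i in range(0, len(data), BLOCK_SIZE)]

ramp = bytes(range(256)) * 16  # every byte is the previous byte plus one
best, costs = choose_mode(ramp)
print(best, costs)
```

Charging for the mode tag is what makes this an MDL-style comparison: a mode has to pay for the bits needed to announce it.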

🔁

Key Feature

Tandem Repeat Detection

PERIODIC mode finds repeating DNA patterns (period + template + sparse mismatches). Captures ~10% of human chromosome blocks. Must beat best non-math mode by 15% to win (MATH_GATE = 0.85).
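One way to read "period + template + sparse mismatches" is sketched below: take the most common base at each phase as the template, list the positions that disagree, and let PERIODIC win only if its cost is at most `MATH_GATE` times the best other mode. The byte costs in `periodic_cost_bytes` and the non-strict comparison are assumptions for illustration; only the constant 0.85 comes from the text.

```python
MATH_GATE = 0.85  # from the text: PERIODIC must beat the best non-math mode by 15%

def fit_periodic(seq: str, period: int):
    """Template = most common base at each phase; mismatches = positions that differ."""
    template = []
    for phase in range(period):
        column = seq[phase::period]
        # sorted() makes tie-breaking deterministic.
        template.append(max(sorted(set(column)), key=column.count))
    template = "".join(template)
    mismatches = [(i, b) for i, b in enumerate(seq) if b != template[i % period]]
    return template, mismatches

def rebuild(template: str, mismatches, length: int) -> str:
    """Regenerate the sequence from the template, then patch the mismatches."""
    out = [template[i % len(template)] for i in range(length)]
    for i, b in mismatches:
        out[i] = b
    return "".join(out)

def periodic_cost_bytes(template: str, mismatches) -> int:
    """Hypothetical cost: 2 bytes for the period, 2 bits per template base, 5 bytes per mismatch."""
    return 2 + (len(template) + 3) // 4 + 5 * len(mismatches)

def periodic_wins(periodic_cost: int, best_other_cost: int) -> bool:
    """Gate: PERIODIC is chosen only with a clear margin over the best other mode."""
    return periodic_cost <= MATH_GATE * best_other_cost

seq = list("ACGTTG" * 50)
seq[10] = "A"  # one substitution inside an otherwise perfect repeat
seq = "".join(seq)

template, mismatches = fit_periodic(seq, 6)
cost = periodic_cost_bytes(template, mismatches)
print(template, mismatches, cost, periodic_wins(cost, 75))
```

The gate guards against choosing PERIODIC on marginal wins, where the savings would be within the noise of the cost estimate.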

⚔️

Strategy

Solid vs Split Battle

Tries compressing 9 metadata streams (sequence, headers, math, lengths, newlines, case, etc.) together vs. separately. Picks whichever is smaller. Split wins on large genomes.
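The comparison can be sketched as two size measurements over the same set of streams. The stream names, the 4-byte length prefix, and the use of `lzma` for both variants are assumptions for this example.

```python
import lzma

def solid_size(streams: dict) -> int:
    """Compress all streams as one length-prefixed blob."""
    blob = b"".join(len(s).to_bytes(4, "big") + s for s in streams.values())
    return len(lzma.compress(blob))

def split_size(streams: dict) -> int:
    """Compress each stream on its own and add up the sizes."""
    return sum(len(lzma.compress(s)) for s in streams.values())

def solid_vs_split(streams: dict):
    """Return the winning layout and both sizes; ties go to solid."""
    solid, split = solid_size(streams), split_size(streams)
    return ("solid" if solid <= split else "split"), solid, split

# Tiny hypothetical streams: per-stream container overhead dominates here.
tiny = {
    "sequence": b"ACGT" * 10,
    "headers": b">chr1",
    "case": b"\x00\x01",
}
choice, solid, split = solid_vs_split(tiny)
print(choice, solid, split)
```

On small inputs the fixed overhead of each compressed stream favors solid mode; the text reports that split wins on large genomes, where each stream is big enough for its own statistics to pay off.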

Publication Target

DNA Math Paper — Peer Review

The paper documenting mathematical structure in genomic DNA is in its final stages (v3.0). Z-scores of 28–74 for mathematical generators beyond Markov-2, with progressive amplification under harder null models. Target journals: Bioinformatics, PLOS Computational Biology, or Nucleic Acids Research. Preprint on bioRxiv for immediate visibility.

📄

Status

Paper v3.0 Nearly Complete

Full manuscript with appendices covering methodology, all 50 genomes, Monte Carlo validation, progressive amplification analysis, and multi-null model hierarchy.

v3.0 Draft With Appendices

🎯