AIM³ Institute · Computational Biology · 8Z-LO Framework

Math Hidden
in Your DNA

The DNA Scanner discovers deterministic mathematical structure in genomic sequences with Z-scores up to 74. The FASTA Encoder turns that knowledge into compression that beats 7-Zip on 44 out of 50 genomes — including the complete 3.15 GB human genome.

Z=74
Peak Z-Score
44/50
Genomes Beat 7-Zip
3.15 GB
Largest Genome
9
LLM Encoders Tested
50
Genomes in Corpus
DNA Scanner · Mathematical Discovery

Your Genome Is Not Random

The DNA Scanner (M6.3) detects deterministic mathematical patterns in genomic sequences that persist far beyond what random chance or simple Markov models can explain. Seven validated generators — including periodic, gradient, and cellular automata patterns — produce Z-scores of 28 to 74 across a 50-genome collection spanning bacteria, yeast, plants, worms, and humans.

Fisher shuffling destroys the signal. Block shuffling preserves it. The structure is global, not local — it spans entire chromosomes, not just local tandem repeats.
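
As a rough illustration of the two shuffle nulls, the sketch below scores a sequence against a full shuffle and a block shuffle. The test statistic (a lag-3 match rate), the block size, and every function name are assumptions made for the example; the scanner's actual generators and statistics are not described on this page.

```python
import math
import random
import statistics

def lag_match_rate(seq: str, lag: int = 3) -> float:
    """Fraction of positions whose base equals the base `lag` positions later."""
    pairs = len(seq) - lag
    return sum(seq[i] == seq[i + lag] for i in range(pairs)) / pairs

def full_shuffle(seq: str, rng: random.Random) -> str:
    """Permute every base: keeps composition, destroys all ordering."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def block_shuffle(seq: str, block: int, rng: random.Random) -> str:
    """Permute fixed-size blocks: ordering survives only inside each block."""
    blocks = [seq[i:i + block] for i in range(0, len(seq), block)]
    rng.shuffle(blocks)
    return "".join(blocks)

def z_score(seq, statistic, null_sampler, n_null=200, seed=0) -> float:
    """Z-score of the observed statistic against a simulated null distribution."""
    rng = random.Random(seed)
    observed = statistic(seq)
    null = [statistic(null_sampler(seq, rng)) for _ in range(n_null)]
    mu, spread = statistics.mean(null), statistics.stdev(null)
    if spread == 0:
        # Degenerate null: every shuffled copy scored the same value.
        return 0.0 if observed == mu else math.copysign(math.inf, observed - mu)
    return (observed - mu) / spread

# Toy input with an exact period of 3 (hypothetical, not genome data).
toy = "ACG" * 200
z_full = z_score(toy, lag_match_rate, full_shuffle)
z_block = z_score(toy, lag_match_rate, lambda s, r: block_shuffle(s, 30, r))
```

On this toy input the full shuffle yields a large Z-score, while the block shuffle leaves the statistic unchanged, so its null is indistinguishable from the observed value.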

M6.3
Scanner Version
7
Validated Generators
50
Genomes Tested
Z=74
Peak Z-Score
6
Pipeline Tools
73
Regression Tests

The 6-Tool Pipeline

8Z-LO DNA Analysis Pipeline
Detector: Find candidates
Detective: Characterize hits
Scanner: M6.3 · Core engine
Profiler: Cross-genome stats
Validator: Monte Carlo · FDR
InvisibleDNA: Invisible structure

Progressive Amplification — The Strongest Evidence

Genuine signals become more significant under harder null models. This is the opposite of noise, which fades under stricter controls. Yeast chromosome I demonstrates this clearly:

Yeast chrI · Z-Score Under Progressively Harder Null Models
Fisher: z = 13.4
Markov-1: z = 20.4
Markov-2: z = 38.0
What This Means

The harder you try to explain the signal away with sophisticated random models, the stronger it gets. This progressive amplification pattern is the hallmark of genuine deterministic structure — not statistical artifact. Monte Carlo validation with 160 iterations and Benjamini-Hochberg FDR correction confirms these aren't false discoveries.
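
A minimal sketch of the validation step, assuming a one-sided empirical p-value with the usual +1 correction followed by the standard Benjamini-Hochberg step-up rule. The function names and the choice of p-value estimator are assumptions; the Validator tool's actual procedure is not shown here.

```python
def empirical_p(observed: float, null_values: list[float]) -> float:
    """One-sided Monte Carlo p-value: share of null draws at least as extreme."""
    exceed = sum(v >= observed for v in null_values)
    return (exceed + 1) / (len(null_values) + 1)

def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return a reject/keep flag per hypothesis at false discovery rate `alpha`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected
```

Under this estimator, 160 null draws cannot produce a p-value below 1/161, so very large Z-scores all map to the same floor.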

8Z-FASTA Encoder · Genomic Compression

Beat 7-Zip on 44 of 50 Genomes

The 8Z-FASTA encoder (gemZ / HYB4) applies MDL-governed compression with domain-specific transforms: 2-bit packing, nibble coding, context-3 models, and tandem repeat detection. On the landmark genomes below it produces files 11.8% to 17.1% smaller than 7-Zip.

Landmark Results

Genome | Size | 8Z (.8z) | 7-Zip | vs 7-Zip
A. thaliana chloroplast | 154 KB | 16.5 KB | 19.2 KB | −14.3%
Human chr19 (gene-dense) | 58.4 MB | 12.0 MB | 13.8 MB | −13.0%
Human chr1 | 230 MB | 50.3 MB | 59.9 MB | −15.9%
Wheat Chr3B | 805 MB | 116.2 MB | 140.1 MB | −17.1%
Human T2T complete genome | 3.15 GB | 677.2 MB | 767.3 MB | −11.8%
Scaling Insight

8Z's advantage over 7-Zip is largest on the biggest single chromosome: −14.3% on 154 KB versus −17.1% on 805 MB. The trend is not monotonic, though: human chr19 (58.4 MB) comes in at −13.0% and the 3.15 GB T2T human genome at −11.8%. The domain-specific transforms (NIB, periodic detection) are what capture the extra structure. On the T2T genome, 8Z saves 90 MB over 7-Zip in a single-threaded run.

Competitive Landscape

Compressor | Type | Human chr1 | bpb | vs 7-Zip
GeCo3 -l5 | Context-mixing AC (C) | 47.1 MB | 1.636 | −21.3%
MFCompress -3 | Finite-context (C) | 48.6 MB | 1.685 | −18.9%
JARVIS3 -l7 | Neural CM (C) | 49.8 MB | 1.727 | −16.9%
8Z CLA (.8z) | MDL + transforms (Python) | 50.3 MB | 1.747 | −15.9%
8Z gemZ (.8z) | MDL + transforms (Python) | 51.6 MB | 1.791 | −13.8%
7-Zip | LZMA2 (generic) | 59.9 MB | 2.078 | baseline
ZIP | deflate (generic) | 68.4 MB | 2.374 | +14.2%
The GeCo3 Gap

GeCo3 (context-mixing with arithmetic coding, written in C) is ~6.8% ahead of 8Z CLA on human chr1. This is the gap to close. GeCo3 uses fundamentally different statistics; 8Z uses transforms + entropy coding. A hybrid approach — MDL-governed context mixing — is the next frontier.
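
The percentages in this section can be rechecked with a few lines of arithmetic. The sketch below assumes "bpb" means bits per base and uses the rounded MB figures from the table, so results can differ from the published values by about a tenth of a percentage point; the helper names are made up for the example.

```python
def relative_change(size: float, baseline: float) -> float:
    """Percent change of `size` relative to `baseline` (negative means smaller)."""
    return (size - baseline) / baseline * 100.0

def bits_per_base(compressed_bytes: int, n_bases: int) -> float:
    """Average number of output bits spent per input base."""
    return compressed_bytes * 8 / n_bases

# Human chr1 sizes in MB, copied from the table above.
chr1_mb = {"GeCo3 -l5": 47.1, "8Z CLA": 50.3, "7-Zip": 59.9}

geco3_vs_7zip = relative_change(chr1_mb["GeCo3 -l5"], chr1_mb["7-Zip"])
cla_vs_7zip = relative_change(chr1_mb["8Z CLA"], chr1_mb["7-Zip"])
# How much larger the 8Z CLA output is than GeCo3's: the gap to close.
gap_8z_vs_geco3 = relative_change(chr1_mb["8Z CLA"], chr1_mb["GeCo3 -l5"])

# Plain 2-bit packing spends exactly 2 bits per base, a useful reference point.
packed_only = bits_per_base(1_000_000, 4_000_000)
```

The ~6.8% gap is measured relative to GeCo3's output size, not relative to 7-Zip.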

Round 2 Multi-LLM Benchmark

9 AI Encoders, 1 Architecture, Total Convergence

Round 2 tested 9 encoder variants: the original gemZ v9.3 baseline plus 8 rewrites, each produced by a different LLM starting from that baseline. The result: 6 of 9 encoders landed within 1 KB of each other across 1.58 GB of data, and a seventh (GROz) within 0.5 MB. The architecture has reached its local optimum.

1.579 GB
Total Corpus
50
FASTA Files
9
Encoder Variants
8
Different LLMs
<1 KB
Top-6 Spread

Encoder Rankings

# | Encoder | LLM | Total | Ratio | vs 7-Zip
- | GeCo3 | Reference (C) | 266.4 MB | 16.88% | −22.2%
1 | DSEz | DeepSeek-V3 R1 | 305.4 MB | 19.35% | −10.8%
1 | GEMz_orig | Original v9.3 | 305.4 MB | 19.35% | −10.8%
3 | GEMz | Gemini 3 Pro | 305.4 MB | 19.35% | −10.8%
3 | MMAz | MiniMax M2.5 | 305.4 MB | 19.35% | −10.8%
5 | KIMz | Kimi K2.5 | 305.4 MB | 19.35% | −10.8%
6 | QWEz | Qwen 3.5 Plus | 305.4 MB | 19.35% | −10.8%
7 | GROz | Grok 4.2 | 305.9 MB | 19.38% | −10.7%
8 | GLMz | GLM 5 DeepThink | 311.5 MB | 19.74% | −9.0%
9 | GPTz | ChatGPT 5.2 | 364.1 MB | 23.07% | +6.3%
Convergence = Architecture Saturation

Six encoders differ by less than 1 KB total across 1.58 GB. The LLMs made near-zero meaningful changes. The gemZ v9.3 architecture is at its ceiling for this codec family. Next step: architectural evolution, not parameter tuning.

Cautionary Tale: GPTz Catastrophe

ChatGPT 5.2's rewrite discarded the NIB transform and periodic detection — the two features that account for the entire compression advantage. At 364.1 MB it's worse than plain 7-Zip (342.4 MB). LLM rewrites can destroy domain-specific features without understanding their purpose.

Encoder Architecture

How gemZ v9.3 Compresses DNA

644 lines of Python. The core insight: DNA has only 4 bases (A, C, G, T), so pack 4 bases per byte (4:1 immediately), then apply domain-specific transforms before generic entropy coding.

gemZ v9.3 Encoding Pipeline
Parse FASTA: Headers + sequence
2-bit Pack: ACGT → 00 01 10 11
Mode Battle: RAW · DELTA · CTX3 · PER
MDL Select: Smallest wins
Codec Battle: LZMA · Brotli · Zstd
SHA3-256: Verify lossless
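
The last two pipeline stages can be sketched with the standard library alone. This is an illustration under assumptions, not gemZ code: zlib and bz2 stand in for Brotli and Zstd (which need third-party packages), and all function names are hypothetical.

```python
import bz2
import hashlib
import lzma
import zlib

# Stand-in codec table: name -> (compress, decompress).
CODECS = {
    "lzma": (lzma.compress, lzma.decompress),
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}

def codec_battle(payload: bytes) -> tuple[str, bytes]:
    """Compress with every codec and keep the smallest output."""
    results = {name: enc(payload) for name, (enc, _) in CODECS.items()}
    winner = min(results, key=lambda name: len(results[name]))
    return winner, results[winner]

def encode_verified(payload: bytes) -> tuple[str, bytes, str]:
    """Return (codec name, compressed blob, SHA3-256 hex digest of the input)."""
    digest = hashlib.sha3_256(payload).hexdigest()
    name, blob = codec_battle(payload)
    return name, blob, digest

def decode_verified(name: str, blob: bytes, digest: str) -> bytes:
    """Decompress and refuse to return data whose digest does not match."""
    payload = CODECS[name][1](blob)
    if hashlib.sha3_256(payload).hexdigest() != digest:
        raise ValueError("SHA3-256 mismatch: round trip is not lossless")
    return payload
```

Storing the digest of the original input alongside the compressed blob lets the decoder confirm a bit-exact round trip.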
🧬
Transform
2-Bit Packing
ACGT → 00/01/10/11. Four bases per byte. Immediate 4:1 reduction on sequence data before any compression begins.
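
A minimal sketch of 2-bit packing using the mapping stated above. Placing the first base in the high bits of each byte and zero-padding a short final byte are assumptions, as are the function names. This toy version handles only uppercase A, C, G, and T; real FASTA files also contain N and lowercase bases, which would need separate handling.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"

def pack_2bit(seq: str) -> bytes:
    """Pack four bases per byte, first base in the two highest bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpacking reads it in order.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack_2bit(data: bytes, n_bases: int) -> str:
    """Inverse of pack_2bit; n_bases trims the padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])
```

The base count has to be stored separately, because the packed bytes alone cannot distinguish padding from trailing A bases.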
🔄
Mode Selection
Per-Block MDL Battle
Four modes compete per 1 MB block: RAW (packed), DELTA (differential), CTX3 (trigram context model), and PERIODIC (tandem repeat detection). MDL picks the cheapest.
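
The per-block battle might look like the sketch below, which assumes description length is approximated as one mode-tag byte plus the LZMA-compressed size of the transformed block. Only RAW and DELTA are implemented; CTX3 and PERIODIC are left out, and the cost model, block size constant, and names are illustrative rather than taken from gemZ.

```python
import lzma

BLOCK_SIZE = 1 << 20  # assumed reading of the "1 MB block" in the text

def raw_mode(block: bytes) -> bytes:
    """RAW: pass the packed bytes through unchanged."""
    return block

def delta_mode(block: bytes) -> bytes:
    """DELTA: each byte minus the previous one (mod 256), starting from 0."""
    prev = 0
    out = bytearray()
    for b in block:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)

MODES = {"RAW": raw_mode, "DELTA": delta_mode}

def description_length(mode_name: str, block: bytes) -> int:
    """Approximate cost in bytes: 1 tag byte naming the mode + compressed data."""
    return 1 + len(lzma.compress(MODES[mode_name](block)))

def pick_mode(block: bytes) -> tuple[str, int]:
    """Return the cheapest mode for this block and its cost."""
    costs = {name: description_length(name, block) for name in MODES}
    best = min(costs, key=costs.get)
    return best, costs[best]

def encode_plan(packed: bytes) -> list[str]:
    """Choose a mode independently for every block of the packed sequence."""
    return [pick_mode(packed[i:i + BLOCK_SIZE])[0]
            for i in range(0, len(packed), BLOCK_SIZE)]
```

Charging each mode for its own tag keeps the comparison honest: a transform only wins if it pays for the bytes needed to announce it.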
🔁
Key Feature
Tandem Repeat Detection
PERIODIC mode finds repeating DNA patterns (period + template + sparse mismatches). Captures ~10% of human chromosome blocks. Must beat best non-math mode by 15% to win (MATH_GATE = 0.85).
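
To make the period + template + mismatches idea concrete, here is a small sketch with a made-up bit-cost model and a gate check. Only MATH_GATE = 0.85 comes from the text; the cost formula, the majority-vote template, the search range, and the function names are assumptions.

```python
import math

MATH_GATE = 0.85  # from the text: periodic must cost at most 85% of the best other mode

def fit_period(seq: str, period: int) -> tuple[str, list[tuple[int, str]]]:
    """Majority-vote template for `period`, plus (position, base) mismatches."""
    template = ""
    for phase in range(period):
        column = seq[phase::period]
        # sorted() makes tie-breaking deterministic.
        template += max(sorted(set(column)), key=column.count)
    mismatches = [(i, b) for i, b in enumerate(seq) if b != template[i % period]]
    return template, mismatches

def periodic_cost_bits(seq: str, period: int) -> int:
    """Assumed cost: 16-bit period + 2 bits per template base + each mismatch."""
    template, mismatches = fit_period(seq, period)
    pos_bits = max(1, math.ceil(math.log2(len(seq))))
    return 16 + 2 * len(template) + len(mismatches) * (pos_bits + 2)

def best_period(seq: str, max_period: int = 32) -> int:
    """Period with the lowest assumed cost, never longer than the sequence."""
    limit = min(max_period, len(seq))
    return min(range(1, limit + 1), key=lambda p: periodic_cost_bits(seq, p))

def periodic_wins(seq: str, best_other_bits: int, max_period: int = 32) -> bool:
    """Apply the gate: periodic must undercut the best other mode by 15%."""
    cost = periodic_cost_bits(seq, best_period(seq, max_period))
    return cost <= MATH_GATE * best_other_bits
```

The gate means a periodic fit that is only marginally cheaper than the best non-math mode still loses, which guards against spurious matches.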
⚔️
Strategy
Solid vs Split Battle
Tries compressing the 9 output streams (sequence, headers, math, lengths, newlines, case, etc.) together vs. separately and keeps whichever is smaller. Split wins on large genomes.
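
A simplified sketch of the solid-versus-split comparison, assuming LZMA as the only codec and ignoring the framing a real container would need to separate streams again. The stream names and function names are hypothetical.

```python
import lzma

def solid_size(streams: dict[str, bytes]) -> int:
    """Size when all streams are concatenated (in key order) and compressed once."""
    joined = b"".join(streams[name] for name in sorted(streams))
    return len(lzma.compress(joined))

def split_size(streams: dict[str, bytes]) -> int:
    """Size when every stream is compressed on its own."""
    return sum(len(lzma.compress(data)) for data in streams.values())

def solid_or_split(streams: dict[str, bytes]) -> tuple[str, int]:
    """Return the cheaper layout and its total compressed size."""
    solid, split = solid_size(streams), split_size(streams)
    return ("solid", solid) if solid <= split else ("split", split)

# Toy streams (hypothetical content, not real genome data).
example = {
    "headers": b">chr1 example\n",
    "sequence": b"\x1b" * 5000,
    "lengths": b"\x3c" * 100,
}
layout, total = solid_or_split(example)
```

Each separate compressor call carries fixed overhead, so split only pays off once the streams are large enough for their individual statistics to matter.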
Publication Target

DNA Math Paper — Peer Review

The paper documenting mathematical structure in genomic DNA is in its final stages (v3.0). Z-scores of 28–74 for mathematical generators beyond Markov-2, with progressive amplification under harder null models. Target journals: Bioinformatics, PLOS Computational Biology, or Nucleic Acids Research. Preprint on bioRxiv for immediate visibility.

📄
Status
Paper v3.0 Nearly Complete
Full manuscript with appendices covering methodology, all 50 genomes, Monte Carlo validation, progressive amplification analysis, and multi-null model hierarchy.
v3.0 Draft With Appendices
🎯
Strategy
Dual Track: Preprint + Journal
bioRxiv preprint for immediate scientific visibility and community feedback, simultaneous journal submission for peer-reviewed credibility. The preprint establishes priority.
bioRxiv Bioinformatics PLOS Comp Bio
Reports & Data

Read the Full Reports