The DNA Scanner discovers deterministic mathematical structure in genomic sequences with Z-scores up to 74. The FASTA Encoder turns that knowledge into compression that beats 7-Zip on 44 out of 50 genomes — including the complete 3.15 GB human genome.
The DNA Scanner (M6.3) detects deterministic mathematical patterns in genomic sequences that persist far beyond what random chance or simple Markov models can explain. Seven validated generators — including periodic, gradient, and cellular automata patterns — produce Z-scores of 28 to 74 across a 50-genome collection spanning bacteria, yeast, plants, worms, and humans.
Fisher-Yates shuffling destroys the signal; block shuffling preserves it. The structure is global, not local: it spans entire chromosomes rather than being confined to local tandem repeats.
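To make the two null models concrete, here is a minimal sketch of how a Z-score against each kind of shuffle could be computed. It is not the DNA Scanner's code: `periodicity_score` is a hypothetical stand-in statistic (not one of the seven generators), the block size is an arbitrary choice, and the default of 160 null samples simply reuses the Monte Carlo iteration count quoted on this page.

```python
import random
import statistics

def full_shuffle(seq, rng):
    """Shuffle individual bases; only base composition survives."""
    bases = list(seq)
    rng.shuffle(bases)  # random.shuffle is a Fisher-Yates shuffle
    return "".join(bases)

def block_shuffle(seq, block_size, rng):
    """Shuffle fixed-size blocks; order inside each block survives."""
    blocks = [seq[i:i + block_size] for i in range(0, len(seq), block_size)]
    rng.shuffle(blocks)
    return "".join(blocks)

def periodicity_score(seq, period=3):
    """Hypothetical statistic: fraction of positions where seq[i] == seq[i + period]."""
    n = len(seq) - period
    return sum(1 for i in range(n) if seq[i] == seq[i + period]) / n

def z_score(seq, null_sampler, stat=periodicity_score, iterations=160, seed=0):
    """Z-score of the observed statistic against samples from a null model."""
    rng = random.Random(seed)
    observed = stat(seq)
    null = [stat(null_sampler(seq, rng)) for _ in range(iterations)]
    spread = statistics.stdev(null)
    if spread == 0:
        return float("nan")  # undefined when every null sample is identical
    return (observed - statistics.mean(null)) / spread
```

For example, `z_score(seq, full_shuffle)` tests against a composition-only null, while `z_score(seq, lambda s, rng: block_shuffle(s, 100, rng))` tests against a null that keeps 100-base neighbourhoods intact.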
Genuine signals become more significant under harder null models. This is the opposite of noise, which fades under stricter controls. Yeast chromosome I demonstrates this clearly.
The harder you try to explain the signal away with sophisticated random models, the stronger it gets. This progressive amplification pattern is the hallmark of genuine deterministic structure — not statistical artifact. Monte Carlo validation with 160 iterations and Benjamini-Hochberg FDR correction confirms these aren't false discoveries.
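For readers unfamiliar with the correction step, the following is a generic sketch of a one-sided Monte Carlo p-value followed by the standard Benjamini-Hochberg step-up procedure. The function names and the `alpha=0.05` default are assumptions for illustration; the page does not state which FDR level the Scanner uses.

```python
def empirical_p(observed, null_samples):
    """One-sided Monte Carlo p-value; the +1 terms keep it from ever being 0."""
    exceed = sum(1 for x in null_samples if x >= observed)
    return (exceed + 1) / (len(null_samples) + 1)

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected
```

One consequence worth noting: with 160 null samples the smallest p-value this estimator can return is 1/161, so the number of iterations bounds how small a corrected p-value can get.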
The 8Z-FASTA encoder (gemZ / HYB4) applies MDL-governed compression with domain-specific transforms: 2-bit packing, nibble coding, context-3 models, and tandem repeat detection. On the genomes listed below, its output is 11.8% to 17.1% smaller than 7-Zip's.
| Genome | Size | 8Z (.8z) | 7-Zip | vs 7-Zip |
|---|---|---|---|---|
| A. thaliana chloroplast | 154 KB | 16.5 KB | 19.2 KB | −14.3% |
| Human chr19 (gene-dense) | 58.4 MB | 12.0 MB | 13.8 MB | −13.0% |
| Human chr1 | 230 MB | 50.3 MB | 59.9 MB | −15.9% |
| Wheat Chr3B | 805 MB | 116.2 MB | 140.1 MB | −17.1% |
| Human T2T complete genome | 3.15 GB | 677.2 MB | 767.3 MB | −11.8% |
8Z's advantage over 7-Zip grows from −14.3% on the 154 KB chloroplast to −17.1% on the 805 MB wheat chromosome, although the trend is not monotonic: human chr19 comes in at −13.0% and the 3.15 GB T2T human genome at −11.8%. The domain-specific transforms (NIB, periodic detection) are what capture the extra structure. On the T2T human genome, 8Z saves 90 MB over 7-Zip in a single-threaded run.
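The page does not spell out how the MDL governance works, so the following is only a generic sketch of two-part minimum description length selection: each block is encoded with every candidate, and the shortest result, including the one-byte tag that tells the decoder which candidate was used, wins. The tags and the three candidates (raw, zlib, LZMA) are hypothetical and stand in for 8Z's own transforms, which are not reproduced here.

```python
import lzma
import zlib

# Hypothetical candidates: tag byte -> (encoder, decoder).
CANDIDATES = {
    b"R": (lambda b: b, lambda b: b),                        # store raw
    b"Z": (lambda b: zlib.compress(b, 9), zlib.decompress),  # deflate
    b"L": (lzma.compress, lzma.decompress),                  # LZMA (.xz container)
}

def mdl_encode(block: bytes) -> bytes:
    """Return tag + payload for the candidate with the shortest total length."""
    best = None
    for tag, (encode, _decode) in CANDIDATES.items():
        coded = tag + encode(block)
        if best is None or len(coded) < len(best):
            best = coded
    return best

def mdl_decode(coded: bytes) -> bytes:
    """Read the tag byte and dispatch to the matching decoder."""
    tag, payload = coded[:1], coded[1:]
    return CANDIDATES[tag][1](payload)
```

Counting the tag in the length is the point of the exercise: a transform is only chosen when it pays for the cost of describing it, which is why very short blocks fall back to raw storage.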
| Compressor | Type | Human chr1 (compressed size) | bpb | vs 7-Zip |
|---|---|---|---|---|
| GeCo3 -l5 | Context-mixing AC (C) | 47.1 MB | 1.636 | −21.3% |
| MFCompress -3 | Finite-context (C) | 48.6 MB | 1.685 | −18.9% |
| JARVIS3 -l7 | Neural CM (C) | 49.8 MB | 1.727 | −16.9% |
| 8Z CLA (.8z) | MDL + transforms (Python) | 50.3 MB | 1.747 | −15.9% |
| 8Z gemZ (.8z) | MDL + transforms (Python) | 51.6 MB | 1.791 | −13.8% |
| 7-Zip | LZMA2 (generic) | 59.9 MB | 2.078 | baseline |
| ZIP | deflate (generic) | 68.4 MB | 2.374 | +14.2% |
GeCo3 (context-mixing with arithmetic coding, written in C) leads on human chr1: 8Z CLA's output is about 6.8% larger (50.3 MB vs 47.1 MB). This is the gap to close. GeCo3 mixes predictions from context models and feeds them to an arithmetic coder; 8Z applies transforms followed by entropy coding. A hybrid approach, MDL-governed context mixing, is the next frontier.
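The arithmetic behind the table's last two columns and the 6.8% gap can be checked with a few lines. This sketch assumes one byte per base in the original file and uses the rounded sizes printed above, so results match the table only to within rounding.

```python
def bits_per_base(compressed_mb: float, original_mb: float) -> float:
    """Output bits per input base, assuming one byte per base in the original."""
    return 8.0 * compressed_mb / original_mb

def relative_to_baseline(size_mb: float, baseline_mb: float) -> float:
    """Signed percentage difference versus a baseline size (negative means smaller)."""
    return 100.0 * (size_mb / baseline_mb - 1.0)

# Sizes taken from the human chr1 table (original size: 230 MB).
geco3_bpb = bits_per_base(47.1, 230)              # table lists 1.636
cla_vs_7zip = relative_to_baseline(50.3, 59.9)    # table lists -15.9%
cla_vs_geco3 = relative_to_baseline(50.3, 47.1)   # roughly +6.8%
```

Note that the 6.8% figure is measured relative to GeCo3's output; measured relative to 8Z CLA's output, the same 3.2 MB difference is about 6.4%.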
Round 2 tested 9 encoder variants (the original gemZ v9.3 plus rewrites by 8 different LLMs), all starting from the same gemZ v9.3 baseline. The result: 6 of 9 encoders landed within 1 KB of each other across 1.58 GB of data, and a seventh within about 0.5 MB. The architecture has reached its local optimum.
| # | Encoder | LLM | Total | Ratio | vs 7-Zip |
|---|---|---|---|---|---|
| — | GeCo3 | Reference (C) | 266.4 MB | 16.88% | −22.2% |
| 1 | DSEz | DeepSeek-V3 R1 | 305.4 MB | 19.35% | −10.8% |
| 1 | GEMz_orig | Original v9.3 | 305.4 MB | 19.35% | −10.8% |
| 3 | GEMz | Gemini 3 Pro | 305.4 MB | 19.35% | −10.8% |
| 3 | MMAz | MiniMax M2.5 | 305.4 MB | 19.35% | −10.8% |
| 5 | KIMz | Kimi K2.5 | 305.4 MB | 19.35% | −10.8% |
| 6 | QWEz | Qwen 3.5 Plus | 305.4 MB | 19.35% | −10.8% |
| 7 | GROz | Grok 4.2 | 305.9 MB | 19.38% | −10.7% |
| 8 | GLMz | GLM 5 DeepThink | 311.5 MB | 19.74% | −9.0% |
| 9 | GPTz | ChatGPT 5.2 | 364.1 MB | 23.07% | +6.3% |
Six encoders differ by less than 1 KB total across 1.58 GB. The LLMs made near-zero meaningful changes. The gemZ v9.3 architecture is at its ceiling for this codec family. Next step: architectural evolution, not parameter tuning.
ChatGPT 5.2's rewrite discarded the NIB transform and periodic detection — the two features that account for the entire compression advantage. At 364.1 MB it's worse than plain 7-Zip (342.4 MB). LLM rewrites can destroy domain-specific features without understanding their purpose.
644 lines of Python. The core insight: DNA has only 4 bases (A, C, G, T), so pack 4 bases per byte (4:1 immediately), then apply domain-specific transforms before generic entropy coding.
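The packing step can be sketched as follows. This is an illustration, not an excerpt from the 644-line encoder: the A=0, C=1, G=2, T=3 mapping and the function names are assumptions, and real FASTA input also contains headers, line breaks, lowercase bases, and `N` symbols, which this sketch does not handle.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit mapping
BASES = "ACGT"

def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def unpack_2bit(packed: bytes, n_bases: int) -> str:
    """Invert pack_2bit; n_bases trims the padding in the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])
```

Because the last byte may be padded, the base count has to be stored alongside the packed bytes; any later transform or entropy coder then operates on the packed stream rather than on the text.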
The paper documenting mathematical structure in genomic DNA is in its final stages (v3.0). Z-scores of 28–74 for mathematical generators beyond Markov-2, with progressive amplification under harder null models. Target journals: Bioinformatics, PLOS Computational Biology, or Nucleic Acids Research. Preprint on bioRxiv for immediate visibility.