Z01 Technical Documentation

What is Z01?

Z01 is a private, locally-run chat interface for large language models (LLMs) that goes beyond basic conversation. It remembers what you've discussed across sessions, adapts to your preferences through customizable personas, and — uniquely — lets you see what the AI is actually doing as it generates each word.

Key Capabilities

  • Persistent memory — Z01 remembers facts from past conversations and uses them in future chats, so the AI builds a genuine understanding of your context over time.
  • Personas — Create specialized AI personalities (scientist, doctor, coder, etc.) with their own voice, model preferences, and behavior. Switch instantly between them.
  • Real-time trajectory monitoring — A 3D visualization shows the AI's internal "thought path" as it writes. When the path becomes erratic, Z01 flags the response as potentially unreliable — this is hallucination detection in action.
  • Layer analysis — A diagnostic probe that tests every internal layer of the AI model with a structured prompt (math, history, creative writing, logic, references). It automatically identifies the optimal layer for monitoring and provides 8 analytical sub-charts — from statistical change-point detection to UMAP manifold anomaly scoring — giving deep insight into how the model processes different types of content.
  • Voice interaction — Speak to the AI and hear it respond with natural-sounding speech, using voice activity detection that knows when you've finished talking.
  • Model-portable memory — Your conversation history and accumulated knowledge belong to you, not to any particular model. Switch between AI models freely — all your memories, facts, and preferences carry over automatically. Try a new model without starting from scratch; go back to your previous one with everything intact.

What Makes It Novel?

Most AI chat interfaces treat the model as a black box — you type, it responds, and you hope the answer is correct. Z01 opens the box. By extracting and analyzing the model's internal hidden states at every token, Z01 provides a real-time "health monitor" for AI generation. The BRIDGE framework classifies trajectory behavior (stable, strained, fragmented, runaway) and can even steer the model back on track when it starts to drift. The layer analysis probe goes further, testing model behavior across all internal layers with multiple statistical methods to identify exactly where and why instability occurs. No other chat interface provides this level of introspection into live AI inference.

Equally important, Z01 decouples your data from the model. Memory, personas, and conversation history are stored independently — not locked to a specific AI model. This means you can migrate between models (e.g. from Mistral to MiniMax to Qwen) without losing any accumulated knowledge or context. The AI landscape evolves rapidly; Z01 ensures your investment in building up a personalized, memory-rich assistant is never lost when a better model arrives.

1. Z01: An Intelligent Functional Wrapper for Large Language Models

Z01 (from Greek ζωή, "life") is a local-first interface layer designed to augment large language model capabilities through persistent memory, user personalization, real-time inference monitoring, and multimodal interaction. The system addresses fundamental limitations in standard LLM deployments: the absence of long-term context retention, lack of user-specific adaptation, and limited transparency into model behavior during generation.

1.1 Architectural Overview

The platform operates as a stateful wrapper mediating between users and underlying language models. Z01 interposes a persistence and analysis layer that maintains conversation histories, user preferences, and accumulated knowledge across sessions. The backend (FastAPI with Granian, a Rust-based ASGI server) runs natively on the host for direct GPU access, while nginx and PostgreSQL run in Docker containers. This architecture draws on established patterns in cognitive architectures (Laird, 2012) and memory-augmented neural networks (Graves et al., 2014).

The frontend is built as modular vanilla JavaScript (ES modules) with 21 specialized modules handling chat, streaming, memory, personas, TTS, STT, trajectory visualization, token coloring, and UI state management.

1.2 Long-Term Memory System

Standard transformer models operate within fixed context windows with no native mechanism for cross-session persistence. Z01 implements a hybrid memory architecture combining:

  • Episodic Memory: Verbatim conversation storage with temporal indexing in PostgreSQL
  • Semantic Memory: Vector-embedded knowledge facts extracted from conversations via Mem0, stored in PostgreSQL with pgvector extension (Guo et al., 2022)
  • Working Memory: Session-scoped context accumulation with relevance-weighted retrieval

The memory subsystem utilizes the Mem0 framework with Ollama (llama3.2:3b for fact extraction, nomic-embed-text for 768-dimensional embeddings), implementing a retrieve-augment-generate pattern analogous to RAG architectures (Lewis et al., 2020) but operating on user-specific accumulated knowledge rather than static document corpora.

Critically, all memory layers are model-agnostic — stored in PostgreSQL independently of any particular LLM. When the user switches between models (e.g. Mistral → MiniMax → Qwen), the full memory graph transfers seamlessly. This decoupling ensures that accumulated user knowledge and conversation history survive model migrations, making the system future-proof as the LLM landscape evolves.

1.3 Temporal Memory Awareness

Memories are temporally weighted using exponential recency decay. The system over-fetches 15 candidate memories from pgvector, then re-ranks by a combined score:

score_t = similarity × exp(−λ × age_days)

where λ = 0.03 (half-life ~23 days). The top 5 are injected into the system prompt with relative timestamps (e.g., "3 days ago", "2 months ago"), enabling the LLM to reason about recency and resolve contradictions.
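
A minimal sketch of this re-ranking step, assuming candidates arrive from pgvector as dicts with a similarity score and an age; the function name and record shape are illustrative, while the λ = 0.03 decay and the 15 → 5 candidate counts come from the description above:

    import math

    DECAY_LAMBDA = 0.03  # per-day decay rate; half-life = ln(2)/0.03 ≈ 23 days

    def rerank_memories(candidates, top_k=5):
        """Re-rank over-fetched pgvector candidates by similarity x recency decay.

        candidates: list of dicts with keys 'text', 'similarity', 'age_days'.
        """
        def combined_score(mem):
            return mem["similarity"] * math.exp(-DECAY_LAMBDA * mem["age_days"])

        return sorted(candidates, key=combined_score, reverse=True)[:top_k]

    # Over-fetch 15 by cosine similarity, keep the best 5 by combined score:
    # top5 = rerank_memories(pgvector_hits, top_k=5)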

Supersession tracking: When new memories are extracted, the system searches for high-similarity existing memories (cosine similarity > 0.7) and marks them as superseded via metadata flags. Superseded memories receive a 90% score penalty during retrieval, preventing stale facts from competing with current ones.
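
A companion sketch of the supersession flow; cosine_similarity and the in-memory record shape are assumptions, while the 0.7 threshold and the 90% penalty are the values quoted above:

    import numpy as np

    SUPERSEDE_THRESHOLD = 0.7  # similarity above which an old fact is marked superseded
    SUPERSEDED_PENALTY = 0.1   # superseded memories keep 10% of their retrieval score

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def mark_superseded(new_embedding, existing_memories):
        """Flag stored memories that a newly extracted fact supersedes."""
        for mem in existing_memories:  # mem: {'embedding': ndarray, 'metadata': dict}
            if cosine_similarity(new_embedding, mem["embedding"]) > SUPERSEDE_THRESHOLD:
                mem["metadata"]["superseded"] = True

    def retrieval_score(mem, base_score):
        """Down-weight superseded facts so current ones win at retrieval time."""
        return base_score * (SUPERSEDED_PENALTY if mem["metadata"].get("superseded") else 1.0)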

Three memory modes are available: Full (recall + record), Read-only (recall only), and Off (disabled).

1.4 Personalization Through Personas

User interaction is mediated through database-backed personas—structured system prompts that establish domain expertise, communication style, and behavioral constraints. Each persona encapsulates:

  • Domain-specific instruction sets (scientific, medical, technical, general)
  • Voice and speech synthesis parameters (voice selection, speech rate in WPM)
  • Preferred LLM model — selecting a persona can auto-load a different model
  • Thinking visibility control (show/hide reasoning sections)

Personas support full CRUD operations, duplication, and voice preview from the UI. This implements prompt-based specialization (Wei et al., 2022), allowing a single base model to exhibit task-specific behaviors without fine-tuning.

1.5 Multi-Model Backend

Z01 abstracts over multiple inference backends:

  • MLX: Apple Silicon optimization via Metal GPU (primary — DeepSeek-V3.2 672B at 4-bit, ~378GB)
  • Ollama: Local model serving for smaller models
  • OpenAI-compatible: Remote API providers with configurable base URL and API key
  • GGUF models: For NVIDIA GPU deployment (Qwen3-32B, DeepSeek-R1-Distill-32B, Kimi-K2.5)

Model selection is decoupled from conversation state and memory. LLM configurations are stored in the database with hot-swap capability. Per-conversation settings allow overriding sampling parameters (temperature, top-p, top-k, min-p, repetition penalty) independently.

1.6 Streaming and Rendering

An incremental streaming markdown renderer splits LLM output into stable (committed, markdown-rendered) and tail (speculative, lightly-styled) regions. Lines are committed to the stable region only when no multi-line construct (code fence, math block) is open. Upon stream completion, the entire message is rewritten using the same formatter applied to historical messages, ensuring consistent rendering with KaTeX math, syntax highlighting, and collapsible thinking sections.
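
The commit rule can be sketched as follows (Python for consistency with the other sketches here, though the actual renderer is frontend JavaScript); tracking code fences and $$ delimiters is an assumed minimal way to detect open multi-line constructs:

    class StreamingRenderer:
        """Split streamed output into committed 'stable' lines and a speculative tail."""

        def __init__(self):
            self.stable = []      # committed lines, fully markdown-rendered
            self.pending = []     # complete lines held while a construct is open
            self.tail = ""        # partial last line, lightly styled
            self.open_fence = False
            self.open_math = False

        def feed(self, chunk: str):
            self.tail += chunk
            *complete, self.tail = self.tail.split("\n")
            for line in complete:
                if line.lstrip().startswith("```"):
                    self.open_fence = not self.open_fence
                if line.strip() == "$$":
                    self.open_math = not self.open_math
                self.pending.append(line)
                if not (self.open_fence or self.open_math):
                    # Safe point: no multi-line construct open, so commit held lines.
                    self.stable.extend(self.pending)
                    self.pending.clear()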

1.7 Multimodal Interaction

Text-to-Speech: Real-time streaming TTS via Pocket TTS with 8 preloaded voices. The client performs prosody-aware markdown cleaning and ICU-based sentence segmentation before sending chunks to a server-side queue. Audio playback uses the Web Audio API to avoid Safari's HTMLAudioElement latency overhead.

Speech-to-Text: Voice input via whisper.cpp (large-v3-turbo model). Audio is captured in the browser with the MediaRecorder API and transcribed server-side with 4-thread parallelism.

1.8 Auto-Tagging and Session Context

Conversations are automatically tagged with 1-4 topic tags after each response, extracted using Ollama (llama3.2:3b). Tags are stored per-conversation (max 8) and displayed on the conversation list and in the Session Info sidebar alongside memory state.

2. The Problem of Hallucination and the BRIDGE Methodology

2.1 Defining the Problem

Large language models exhibit a well-documented tendency to generate fluent but factually incorrect content—a phenomenon termed "hallucination" (Ji et al., 2023). This behavior manifests across model scales and architectures, representing a fundamental challenge for deployment in high-stakes domains. Existing taxonomies distinguish intrinsic hallucinations (contradicting source material) from extrinsic hallucinations (unverifiable claims), though both emerge from the same underlying mechanism: autoregressive generation optimized for coherence rather than factual accuracy (Maynez et al., 2020).

2.2 Limitations of Current Approaches

Prior work on hallucination detection has pursued several directions:

  • Calibration methods: Using token probabilities or entropy as uncertainty proxies (Kadavath et al., 2022). These fail when models are confidently wrong.
  • Self-consistency: Sampling multiple outputs and measuring agreement (Wang et al., 2023). Computationally expensive and unreliable for systematic errors.
  • Retrieval augmentation: Grounding generation in retrieved documents (Lewis et al., 2020). Effective but requires curated knowledge bases.
  • Learned probes: Training classifiers on internal representations (Burns et al., 2022). Requires labeled training data and may not generalize.

These approaches share a common limitation: they treat model states as independent observations rather than as points on a continuous trajectory through representation space.

2.3 The BRIDGE Framework

BRIDGE (Basin-Referenced Inference Dynamics for Grounded Evaluation) reconceptualizes hallucination detection through the lens of dynamical systems theory. The central thesis:

Hallucination is a second-order dynamical failure of latent trajectories—characterized by high curvature in representation space—rather than a first-order semantic property of individual tokens.

2.4 The Support × Curvature Decomposition

BRIDGE decomposes model behavior into two orthogonal factors:

Support (S): The distance of the current hidden state from a dynamically-computed semantic basin, measuring grounding in established context.

Curvature (κ): The second derivative of the trajectory through representation space, measuring directional consistency across tokens.

This decomposition resolves an ambiguity that single-axis methods cannot address:

  • Low Support, Low Curvature: Creative extrapolation — novel but coherent
  • Low Support, High Curvature: Confabulation — ungrounded and incoherent
  • High Support, Low Curvature: Grounded generation — factual and stable
  • High Support, High Curvature: Uncertain retrieval — grounded but unstable

2.5 Online Basin Estimation

Unlike methods requiring precomputed reference vectors or contrastive datasets, BRIDGE constructs its semantic basin online from the model's own hidden states during generation. The basin centroid μ_t is initialized from early-generation tokens and updated via exponential moving average for states within a stability threshold:

μ_{t+1} = (1 − β) μ_t + β h_t,  when d(h_t, μ_t) < τ

This approach requires no external supervision and adapts to each prompt's semantic context.
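
A minimal sketch of the update, assuming hidden states are NumPy vectors; the β and τ defaults here are illustrative, not the tuned values:

    import numpy as np

    def update_basin(mu, h, beta=0.05, tau=1.0):
        """EMA update of the basin centroid, applied only when the state lies
        within the stability threshold: d(h_t, mu_t) < tau."""
        if np.linalg.norm(h - mu) < tau:
            mu = (1.0 - beta) * mu + beta * h  # mu_{t+1} = (1 - beta) mu_t + beta h_t
        return mu

    # Initialization from early-generation tokens:
    # mu = hidden_states[:warmup].mean(axis=0)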

2.6 Excursion Detection and Classification

Tokens generating hidden states beyond the basin threshold are classified as excursions. The system uses robust z-scores (median/MAD normalization) with composite instability I_t = 0.5 × z_jump + 0.5 × z_curvature. Excursion detection triggers at the Q95 threshold with a minimum span of 8 tokens. Recovery is assessed over the final 25 tokens. Qualitative labels are assigned as follows (a sketch of the detection logic appears after the list):

  • Stable: Excursion fraction < 5% — well-grounded generation
  • Brief detour: 5-15% fraction with recovery — transient uncertainty
  • Strained: 15-40% fraction — extended instability
  • Fragmented: Multiple discrete excursions — inconsistent reasoning
  • Runaway: ≥ 40% fraction or failure to recover — sustained hallucination
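
A sketch of the detection logic referenced above, assuming per-token jump and curvature series as NumPy arrays; the span-handling details are one plausible reading of the thresholds:

    import numpy as np

    def robust_z(x):
        """Median/MAD z-scores, resistant to outliers in the metric series."""
        med = np.median(x)
        mad = np.median(np.abs(x - med)) + 1e-9
        return (x - med) / (1.4826 * mad)

    def detect_excursions(jump, curvature, min_span=8):
        """Composite instability I_t = 0.5*z_jump + 0.5*z_curvature; runs of at
        least min_span tokens above the Q95 threshold count as excursions."""
        instability = 0.5 * robust_z(jump) + 0.5 * robust_z(curvature)
        above = instability > np.quantile(instability, 0.95)

        spans, start = [], None
        for t, flag in enumerate(above):
            if flag and start is None:
                start = t
            elif not flag and start is not None:
                if t - start >= min_span:
                    spans.append((start, t))
                start = None
        if start is not None and len(above) - start >= min_span:
            spans.append((start, len(above)))
        return instability, spans

The qualitative label then follows from the fraction of tokens inside excursion spans and from whether the final 25 tokens have returned below threshold.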

Tags are stored per-message in the database and displayed in trajectory thumbnails alongside each response.

2.7 Token-Level Instability Visualization

Individual tokens in the rendered response are colored by their instability score using a gradient from green (stable) through yellow (warning) to red (unstable). This coloring is based on the jump metric—cosine distance between consecutive hidden states—and can be toggled on/off. Tooltips display exact instability percentages for each span.
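
A sketch of the jump metric and its mapping to display classes; the two cutoff values are illustrative, not the product's exact boundaries:

    import numpy as np

    def cosine_jumps(hidden_states):
        """Cosine distance between consecutive hidden states, one value per token."""
        h = hidden_states / np.linalg.norm(hidden_states, axis=1, keepdims=True)
        return 1.0 - np.sum(h[1:] * h[:-1], axis=1)

    def color_class(score, warn=0.33, bad=0.66):
        """Map a normalized [0, 1] instability score to a color class."""
        if score < warn:
            return "stable"    # rendered green
        if score < bad:
            return "warning"   # rendered yellow
        return "unstable"      # rendered red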

3. Implementation of Vector Insertion for Inference Guidance

3.1 Theoretical Foundation

Representation engineering (Zou et al., 2023) has demonstrated that transformer hidden states encode interpretable features that can be manipulated through additive interventions. Activation steering extends this insight to runtime control, modifying intermediate representations to influence generation behavior without weight modification (Turner et al., 2023).

BRIDGE implements a variant of activation steering with two key distinctions: interventions are (a) state-dependent rather than constant, and (b) computed relative to a dynamically-defined basin rather than precomputed steering vectors.

3.2 The Feedback Control Mechanism

At each generation step, the system computes the displacement vector from the current hidden state to the basin centroid:

Δh_t = μ_t − h_t

A proportional correction is applied when the state exceeds the basin threshold:

h′_t = h_t + α · Δh_t · σ(d(h_t, μ_t) − τ)

where α is the steering strength, τ is the basin threshold, and σ is a smooth activation function. This implements a return-to-basin dynamic analogous to control-theoretic regulation (Åström & Murray, 2008).
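
A minimal sketch of one correction step, assuming a logistic sigmoid for σ; the α and τ defaults are illustrative:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def steer_to_basin(h, mu, alpha=0.1, tau=1.0):
        """h'_t = h_t + alpha * (mu_t - h_t) * sigma(d(h_t, mu_t) - tau)."""
        d = np.linalg.norm(h - mu)
        gate = sigmoid(d - tau)  # ~0 inside the basin, approaches 1 far outside
        return h + alpha * (mu - h) * gate

The gate keeps interventions negligible for well-grounded states, so the correction only engages as the trajectory leaves the basin.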

3.3 Layer Selection via Multi-Dimensional Probe

Rather than relying on heuristic layer selection (e.g., "middle third"), Z01 includes an automated layer analysis probe that empirically determines the optimal extraction layer for each model. The probe:

  1. Sends a multi-dimensional prompt spanning five cognitive domains (Math, History, Creative, Logic, Reference/PubMed citations)
  2. Generates ~800 response tokens and uses LLM-based classification to assign each token to its content segment
  3. Performs a single memory-efficient forward pass scoring every layer on trajectory spread, segment separation, jump variance, clarity, divergence, centroid velocity, and cross-layer incoherence
  4. Selects the layer with the highest weighted composite score (sketched below) and persists the result per-model in the database
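
A sketch of the composite scoring in steps 3-4, assuming each per-layer metric has been normalized to a comparable scale; the weight values and sign conventions shown are placeholders (actual weight profiles are stored per-model in the database):

    # Illustrative weight profile; negative weights penalize undesirable metrics.
    WEIGHTS = {
        "trajectory_spread": 1.0,
        "segment_separation": 1.5,
        "jump_variance": -1.0,
        "clarity": 1.0,
        "divergence": -0.5,
        "centroid_velocity": -0.5,
        "cross_layer_incoherence": -1.0,
    }

    def best_layer(layer_metrics):
        """layer_metrics: {layer_index: {metric_name: normalized score}}.
        Returns the layer with the highest weighted composite score."""
        def composite(metrics):
            return sum(w * metrics[name] for name, w in WEIGHTS.items())
        return max(layer_metrics, key=lambda layer: composite(layer_metrics[layer]))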

The probe provides a 3D trajectory visualization of all layers simultaneously (vertical Y-axis stacking with scroll navigation), and an Isolate mode that enables eight analytical sub-charts for the selected layer:

  • Laplacian Energy: Second-difference energy of per-token cosine jumps (windowed), detecting sudden directional changes
  • HMM Instability: 2-state Hidden Markov Model (Stable ↔ Unstable) with forward-backward P(Unstable|observations) and hysteresis thresholds
  • CUSUM Change-Point: Bidirectional cumulative sum on z-normalized instability ratio, detecting regime transitions (onset/recovery)
  • Entropy + KL: Shannon entropy slope ΔH and KL divergence between consecutive token distributions, with dual-axis display
  • Perplexity: Per-token PPL_t = exp(H_t) with lin/log toggle; log-perplexity avoids exponential spike amplification
  • Logits: Three z-scored signals — Surprisal S_t = −log p(x_t), Inverse Margin −M_t (top-1 vs top-2 gap), and JS Drift JSD(p_t ∥ p_{t−1}) — with EMA-based running standardization (α=0.05, warmup=20)
  • UMAP Mahalanobis: On-demand per-layer PCA(50)→UMAP(8D)→Mahalanobis D² anomaly scoring, with χ²(8, p=0.01)=20.09 threshold. Computed ~3s per layer, cached after first request
  • Geometry: Per-segment sparse geometry classification using a tiered classifier (anchor drift → density/smoothness/pivot → PC1 residual), categorizing each content segment as Anchored, Basin Wandering, or Sparse Runaway

All sub-charts share bidirectional hover cross-highlighting with the segment-colored response text, and automatically re-render when switching layers. Weight profiles and layer overrides are saved per-model. Layer analysis can also be launched directly from the persona editor when selecting a model.
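
As one example of the sub-chart signal processing, here is a sketch of the EMA-based running standardization applied to the Logits signals, using the α=0.05 and 20-token warmup quoted above; the exponentially weighted variance update is a standard formulation assumed here, not a confirmed implementation detail:

    class RunningZScore:
        """Exponential-moving-average standardization for a streaming signal."""

        def __init__(self, alpha=0.05, warmup=20):
            self.alpha, self.warmup = alpha, warmup
            self.mean, self.var, self.n = 0.0, 1.0, 0

        def update(self, x: float) -> float:
            self.n += 1
            delta = x - self.mean
            self.mean += self.alpha * delta
            self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
            if self.n < self.warmup:
                return 0.0  # suppress z-scores until the running statistics settle
            return (x - self.mean) / (self.var ** 0.5 + 1e-9)

    # One instance per signal: surprisal, inverse margin, and JS drift.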

3.4 Curvature-Weighted Intervention

The steering magnitude incorporates trajectory curvature as a secondary signal:

α_eff = α · (1 + γ · κ_t)

where κ_t is the instantaneous curvature and γ is a sensitivity parameter. High-curvature states—indicative of confabulation—receive stronger corrective intervention, while smooth trajectories through unfamiliar territory receive minimal interference.
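
Folding this into the steering sketch from Section 3.2, with γ as an illustrative value:

    def effective_alpha(alpha, curvature, gamma=0.5):
        """alpha_eff = alpha * (1 + gamma * kappa_t)."""
        return alpha * (1.0 + gamma * curvature)

    # corrected = steer_to_basin(h, mu, alpha=effective_alpha(0.1, kappa_t), tau=1.0)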

3.5 Visualization and Monitoring

The three-dimensional trajectory visualization projects hidden states via PCA with auto-camera orientation, enabling real-time observation of:

  • Basin location and extent (dynamically updated)
  • Trajectory path with curvature-coded coloring (green → yellow → red)
  • Segment thickness proportional to semantic jump distance
  • Excursion events with onset/recovery markers
  • Direction arrows showing temporal flow every 15 tokens

A timeline panel below the 3D view displays curvature (red) and jump (blue) metrics over time with warning bands. Each assistant message includes a clickable trajectory thumbnail (100×24px) that opens the full 3D modal.

Post-hoc re-projection: Hidden states are saved as .npy files, enabling re-analysis with different algorithms (PCA, t-SNE, UMAP) without re-generating responses.
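
A sketch of this re-projection path, assuming one .npy file of shape (num_tokens, hidden_dim) per response and scikit-learn's PCA as the projector:

    import numpy as np
    from sklearn.decomposition import PCA

    def reproject(npy_path, dims=3):
        """Load saved hidden states and project them to a low-dimensional trajectory."""
        states = np.load(npy_path)  # shape: (num_tokens, hidden_dim)
        return PCA(n_components=dims).fit_transform(states)

    # The same states can be re-projected with t-SNE or UMAP instead, e.g.:
    # from sklearn.manifold import TSNE
    # coords = TSNE(n_components=3).fit_transform(states)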

Layer analysis probe: The multi-layer probe provides an all-layers 3D view (88 layers stacked vertically with scroll navigation), segment-colored trajectories consistent across layers, and eight analytical sub-charts (Laplacian, HMM, CUSUM, Entropy, Perplexity, Logits, UMAP, Geometry) accessible in Isolate mode. Cross-layer comparison reveals how trajectory quality evolves through the network stack.

3.6 Limitations and Future Directions

Current limitations include: (a) computational overhead of per-token hidden state extraction during streaming, (b) sensitivity to PCA projection artifacts in low-variance regions, (c) the need for empirical tuning of basin and intervention parameters, and (d) UMAP manifold analysis requiring on-demand computation (~3s per layer) rather than real-time. Future work will explore learned basin definitions, multi-layer intervention, and integration with retrieval-augmented generation for hybrid grounding.

References

Åström, K. J., & Murray, R. M. (2008). Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press.

Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2022). Discovering Latent Knowledge in Language Models Without Supervision. arXiv:2212.03827.

Elhage, N., et al. (2022). Toy Models of Superposition. Transformer Circuits Thread.

Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing Machines. arXiv:1410.5401.

Guo, Y., et al. (2022). pgvector: Open-source vector similarity search for PostgreSQL.

Ji, Z., et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1-38.

Kadavath, S., et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221.

Laird, J. E. (2012). The Soar Cognitive Architecture. MIT Press.

Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

Maynez, J., et al. (2020). On Faithfulness and Factuality in Abstractive Summarization. ACL 2020.

Turner, A., et al. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv:2308.10248.

Wang, X., et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023.

Wei, J., et al. (2022). Finetuned Language Models Are Zero-Shot Learners. ICLR 2022.

Zou, A., et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405.
