Z01 is a private, locally run chat interface for large language models that goes beyond basic conversation. It remembers what you've discussed across sessions, adapts to your preferences through customizable personas, and, uniquely, lets you see what the AI is actually doing as it generates each word.
Most AI chat interfaces treat the model as a black box — you type, it responds, and you hope the answer is correct. Z01 opens the box. By extracting and analyzing the model's internal hidden states at every token, Z01 provides a real-time "health monitor" for AI generation. The BRIDGE framework classifies trajectory behavior (stable, strained, fragmented, runaway) and can even steer the model back on track when it starts to drift. The layer analysis probe goes further, testing model behavior across all internal layers with multiple statistical methods to identify exactly where and why instability occurs. No other chat interface provides this level of introspection into live AI inference.
Equally important, Z01 decouples your data from the model. Memory, personas, and conversation history are stored independently rather than locked to a specific AI model. This means you can migrate between models (e.g., from Mistral to MiniMax to Qwen) without losing any accumulated knowledge or context. The AI landscape evolves rapidly; Z01 ensures your investment in building a personalized, memory-rich assistant is never lost when a better model arrives.
Z01 (from Greek ζωή, "life") is a local-first interface layer designed to augment large language model capabilities through persistent memory, user personalization, real-time inference monitoring, and multimodal interaction. The system addresses fundamental limitations in standard LLM deployments: the absence of long-term context retention, lack of user-specific adaptation, and limited transparency into model behavior during generation.
The platform operates as a stateful wrapper mediating between users and underlying language models. Z01 interposes a persistence and analysis layer that maintains conversation histories, user preferences, and accumulated knowledge across sessions. The backend (FastAPI with Granian, a Rust-based ASGI server) runs natively on the host for direct GPU access, while nginx and PostgreSQL run in Docker containers. This architecture draws on established patterns in cognitive architectures (Laird, 2012) and memory-augmented neural networks (Graves et al., 2014).
The frontend is built as modular vanilla JavaScript (ES modules) with 21 specialized modules handling chat, streaming, memory, personas, TTS, STT, trajectory visualization, token coloring, and UI state management.
Standard transformer models operate within fixed context windows and have no native mechanism for cross-session persistence. Z01 addresses this with a hybrid, multi-layer memory architecture.
The memory subsystem utilizes the Mem0 framework with Ollama (llama3.2:3b for fact extraction, nomic-embed-text for 768-dimensional embeddings), implementing a retrieve-augment-generate pattern analogous to RAG architectures (Lewis et al., 2020) but operating on user-specific accumulated knowledge rather than static document corpora.
Critically, all memory layers are model-agnostic — stored in PostgreSQL independently of any particular LLM. When the user switches between models (e.g. Mistral → MiniMax → Qwen), the full memory graph transfers seamlessly. This decoupling ensures that accumulated user knowledge and conversation history survive model migrations, making the system future-proof as the LLM landscape evolves.
Memories are temporally weighted using exponential recency decay. The system over-fetches 15 candidate memories from pgvector, then re-ranks by a combined score:
score_t = similarity × exp(−λ × age_days)
where λ = 0.03 (half-life ~23 days). The top 5 are injected into the system prompt with relative timestamps (e.g., "3 days ago", "2 months ago"), enabling the LLM to reason about recency and resolve contradictions.
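A minimal sketch of this re-ranking step (function and field names are illustrative, not the actual Z01 code):

```python
import math
from datetime import datetime, timezone

LAMBDA = 0.03      # decay constant; half-life ln(2)/0.03 ≈ 23 days
TOP_K = 5          # memories injected into the system prompt

def rerank(candidates, now=None):
    """Re-rank over-fetched pgvector candidates by similarity × recency decay.

    Each candidate is assumed to carry a cosine `similarity` score and a
    timezone-aware `created_at` timestamp from the vector search.
    """
    now = now or datetime.now(timezone.utc)
    scored = []
    for mem in candidates:
        age_days = (now - mem["created_at"]).total_seconds() / 86400
        score = mem["similarity"] * math.exp(-LAMBDA * age_days)
        scored.append((score, mem))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mem for _, mem in scored[:TOP_K]]
```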
Supersession tracking: When new memories are extracted, the system searches for high-similarity existing memories (cosine similarity > 0.7) and marks them as superseded via metadata flags. Superseded memories receive a 90% score penalty during retrieval, preventing stale facts from competing with current ones.
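And a sketch of the supersession logic, again with hypothetical field names:

```python
SUPERSEDE_SIMILARITY = 0.7   # a new memory this close to an old one supersedes it
SUPERSEDED_PENALTY = 0.1     # superseded memories keep 10% of their retrieval score

def mark_superseded(new_mem, existing, cosine_sim):
    """Flag older near-duplicate memories so stale facts stop competing."""
    for old in existing:
        if cosine_sim(new_mem["embedding"], old["embedding"]) > SUPERSEDE_SIMILARITY:
            old["metadata"]["superseded"] = True

def effective_score(mem, base_score):
    """Apply the 90% retrieval penalty to superseded memories."""
    if mem.get("metadata", {}).get("superseded"):
        return base_score * SUPERSEDED_PENALTY
    return base_score
```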
Three memory modes are available: Full (recall + record), Read-only (recall only), and Off (disabled).
User interaction is mediated through database-backed personas: structured system prompts that establish a persona's domain expertise, communication style, and behavioral constraints.
Personas support full CRUD operations, duplication, and voice preview from the UI. This implements prompt-based specialization (Wei et al., 2022), allowing a single base model to exhibit task-specific behaviors without fine-tuning.
Z01 abstracts over multiple inference backends, presenting the same interface regardless of the engine serving the model.
Model selection is decoupled from conversation state and memory. LLM configurations are stored in the database with hot-swap capability. Per-conversation settings allow overriding sampling parameters (temperature, top-p, top-k, min-p, repetition penalty) independently.
An incremental streaming markdown renderer splits LLM output into stable (committed, markdown-rendered) and tail (speculative, lightly-styled) regions. Lines are committed to the stable region only when no multi-line construct (code fence, math block) is open. Upon stream completion, the entire message is rewritten using the same formatter applied to historical messages, ensuring consistent rendering with KaTeX math, syntax highlighting, and collapsible thinking sections.
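A simplified sketch of the commit rule (Python here for illustration; the real renderer is one of the frontend ES modules):

```python
def split_stable_tail(buffer: str):
    """Split streamed markdown into a committable 'stable' prefix and a
    speculative 'tail'. Lines are committed only when no multi-line construct
    (code fence, math block) is still open. Illustrative logic only."""
    lines = buffer.split("\n")
    open_fence = open_math = False
    last_safe = 0
    for i, line in enumerate(lines[:-1]):          # the final line is always tail
        stripped = line.strip()
        if stripped.startswith("```"):
            open_fence = not open_fence
        elif stripped == "$$":
            open_math = not open_math
        if not open_fence and not open_math:
            last_safe = i + 1                      # safe to commit through this line
    return "\n".join(lines[:last_safe]), "\n".join(lines[last_safe:])
```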
Text-to-Speech: Real-time streaming TTS via Pocket TTS with 8 preloaded voices. The client performs prosody-aware markdown cleaning and ICU-based sentence segmentation before sending chunks to a server-side queue. Audio playback uses the Web Audio API to avoid Safari's HTMLAudioElement latency overhead.
Speech-to-Text: Voice input via whisper.cpp (large-v3-turbo model). Audio is captured in the browser with the MediaRecorder API and transcribed server-side with 4-thread parallelism.
Conversations are automatically tagged with 1-4 topic tags after each response, extracted using Ollama (llama3.2:3b). Tags are stored per-conversation (max 8) and displayed on the conversation list and in the Session Info sidebar alongside memory state.
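A hedged sketch of how such tag extraction might call the local Ollama HTTP API (the prompt wording and the assumption that the model returns clean JSON are illustrative, not the actual Z01 implementation):

```python
import json
import requests

TAG_PROMPT = (
    "Extract 1-4 short topic tags for the following conversation. "
    "Reply with only a JSON array of strings.\n\n{conversation}"
)

def extract_tags(conversation: str) -> list[str]:
    """Ask a small local model for topic tags via Ollama's /api/generate endpoint."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:3b",
              "prompt": TAG_PROMPT.format(conversation=conversation),
              "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])[:4]
```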
Large language models exhibit a well-documented tendency to generate fluent but factually incorrect content—a phenomenon termed "hallucination" (Ji et al., 2023). This behavior manifests across model scales and architectures, representing a fundamental challenge for deployment in high-stakes domains. Existing taxonomies distinguish intrinsic hallucinations (contradicting source material) from extrinsic hallucinations (unverifiable claims), though both emerge from the same underlying mechanism: autoregressive generation optimized for coherence rather than factual accuracy (Maynez et al., 2020).
Prior work on hallucination detection has pursued several directions, including sampling-based self-consistency checks (Wang et al., 2023), model self-assessment of factual confidence (Kadavath et al., 2022), and unsupervised probing of latent knowledge in hidden states (Burns et al., 2022).
These approaches share a common limitation: they treat model states as independent observations rather than as points on a continuous trajectory through representation space.
BRIDGE (Basin-Referenced Inference Dynamics for Grounded Evaluation) reconceptualizes hallucination detection through the lens of dynamical systems theory. The central thesis:
Hallucination is a second-order dynamical failure of latent trajectories—characterized by high curvature in representation space—rather than a first-order semantic property of individual tokens.
BRIDGE decomposes model behavior into two orthogonal factors:
Support (S): The distance of the current hidden state from a dynamically-computed semantic basin, measuring grounding in established context.
Curvature (κ): The second derivative of the trajectory through representation space, measuring directional consistency across tokens.
This decomposition resolves an ambiguity that single-axis methods cannot address:
| Support | Curvature | Interpretation |
| --- | --- | --- |
| Low | Low | Creative extrapolation: novel but coherent |
| Low | High | Confabulation: ungrounded and incoherent |
| High | Low | Grounded generation: factual and stable |
| High | High | Uncertain retrieval: grounded but unstable |
Unlike methods requiring precomputed reference vectors or contrastive datasets, BRIDGE constructs its semantic basin online from the model's own hidden states during generation. The basin centroid μ_t is initialized from early-generation tokens and updated via exponential moving average for states within a stability threshold:

μ_{t+1} = (1 − β)·μ_t + β·h_t, when d(h_t, μ_t) < τ
This approach requires no external supervision and adapts to each prompt's semantic context.
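A minimal sketch of this update rule (the β and τ values here are placeholders, not Z01's tuned parameters):

```python
import numpy as np

BETA = 0.05   # EMA rate (placeholder)
TAU = 1.0     # stability threshold on d(h_t, μ_t) (placeholder)

def update_basin(mu: np.ndarray, h: np.ndarray, beta: float = BETA, tau: float = TAU) -> np.ndarray:
    """Move the basin centroid toward the current hidden state, but only
    when that state is already within the stability threshold."""
    if np.linalg.norm(h - mu) < tau:
        return (1.0 - beta) * mu + beta * h
    return mu
```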
Tokens generating hidden states beyond the basin threshold are classified as excursions. The system uses robust z-scores (median/MAD normalization) with a composite instability score I_t = 0.5 × z_jump + 0.5 × z_curvature. Excursion detection triggers at the Q95 threshold with a minimum span of 8 tokens, and recovery is assessed over the final 25 tokens. Each response is then assigned a qualitative label (stable, strained, fragmented, or runaway).
Tags are stored per-message in the database and displayed in trajectory thumbnails alongside each response.
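A sketch of the excursion-detection logic above (the 1.4826 MAD consistency factor and the span bookkeeping are assumptions):

```python
import numpy as np

def robust_z(x):
    """Robust z-score using median/MAD normalization."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med)) + 1e-9
    return (x - med) / (1.4826 * mad)

def detect_excursions(jump, curvature, min_span=8, q=95):
    """Composite instability I_t = 0.5*z_jump + 0.5*z_curvature; runs of at
    least `min_span` tokens above the Q95 threshold are flagged as excursions."""
    instability = 0.5 * robust_z(jump) + 0.5 * robust_z(curvature)
    threshold = np.percentile(instability, q)
    above = instability > threshold
    spans, start = [], None
    for i, flag in enumerate(np.append(above, False)):   # sentinel closes the last span
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_span:
                spans.append((start, i))
            start = None
    return instability, spans
```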
Individual tokens in the rendered response are colored by their instability score using a gradient from green (stable) through yellow (warning) to red (unstable). This coloring is based on the jump metric—cosine distance between consecutive hidden states—and can be toggled on/off. Tooltips display exact instability percentages for each span.
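For illustration, the jump signal and one possible green-to-red mapping might look like this (the 0.5 pivot and the RGB ramp are assumptions; Z01's actual palette may differ):

```python
import numpy as np

def jump_metric(hidden_states):
    """Cosine distance between consecutive hidden states (the 'jump' signal)."""
    h = np.asarray(hidden_states, dtype=float)
    a, b = h[:-1], h[1:]
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9)
    return 1.0 - cos

def token_color(score: float, warn: float = 0.5):
    """Map a normalized instability score in [0, 1] to an RGB triple:
    green (stable) through yellow (warning) to red (unstable)."""
    if score < warn:
        t = score / warn
        return (int(255 * t), 200, 0)            # green toward yellow
    t = (score - warn) / (1.0 - warn)
    return (255, int(200 * (1 - t)), 0)          # yellow toward red
```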
Representation engineering (Zou et al., 2023) has demonstrated that transformer hidden states encode interpretable features that can be manipulated through additive interventions. Activation steering extends this insight to runtime control, modifying intermediate representations to influence generation behavior without weight modification (Turner et al., 2023).
BRIDGE implements a variant of activation steering with two key distinctions: interventions are (a) state-dependent rather than constant, and (b) computed relative to a dynamically-defined basin rather than precomputed steering vectors.
At each generation step, the system computes the displacement vector from the current hidden state to the basin centroid:
Δh_t = μ_t − h_t
A proportional correction is applied when the state exceeds the basin threshold:
h'_t = h_t + α · Δh_t · σ(d(h_t, μ_t) − τ)
where α is the steering strength, τ is the basin threshold, and σ is a smooth activation function. This implements a return-to-basin dynamic analogous to control-theoretic regulation (Åström & Murray, 2008).
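A sketch of the intervention under these definitions (the α and τ values and the choice of a logistic σ are assumptions):

```python
import numpy as np

def steer(h: np.ndarray, mu: np.ndarray, alpha: float = 0.1, tau: float = 1.0) -> np.ndarray:
    """Proportional return-to-basin correction, applied only once the hidden
    state drifts past the basin threshold. alpha/tau values are illustrative."""
    d = np.linalg.norm(h - mu)
    if d <= tau:
        return h                                   # inside the basin: no intervention
    gate = 1.0 / (1.0 + np.exp(-(d - tau)))        # smooth logistic gate σ(d − τ)
    return h + alpha * (mu - h) * gate             # h'_t = h_t + α·Δh_t·σ(d − τ)
```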
Rather than relying on heuristic layer selection (e.g., the "middle third"), Z01 includes an automated layer analysis probe that empirically determines the optimal extraction layer for each model by testing behavior across all internal layers with multiple statistical methods.
The probe provides a 3D trajectory visualization of all layers simultaneously (vertical Y-axis stacking with scroll navigation) and an Isolate mode that enables eight analytical sub-charts for the selected layer (Laplacian, HMM, CUSUM, Entropy, Perplexity, Logits, UMAP, and Geometry).
All sub-charts share bidirectional hover cross-highlighting with the segment-colored response text, and automatically re-render when switching layers. Weight profiles and layer overrides are saved per-model. Layer analysis can also be launched directly from the persona editor when selecting a model.
The steering magnitude incorporates trajectory curvature as a secondary signal:
α_eff = α · (1 + γ · κ_t)
where κ_t is the instantaneous curvature and γ is a sensitivity parameter. High-curvature states, indicative of confabulation, receive stronger corrective intervention, while smooth trajectories through unfamiliar territory receive minimal interference.
The three-dimensional trajectory visualization projects hidden states via PCA with auto-camera orientation, enabling real-time observation of the trajectory as each response is generated.
A timeline panel below the 3D view displays curvature (red) and jump (blue) metrics over time with warning bands. Each assistant message includes a clickable trajectory thumbnail (100×24px) that opens the full 3D modal.
Post-hoc re-projection: Hidden states are saved as .npy files, enabling re-analysis with different algorithms (PCA, t-SNE, UMAP) without re-generating responses.
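For example, a saved trajectory can be re-projected offline (the filename is illustrative; scikit-learn's PCA stands in for whichever projector is selected):

```python
import numpy as np
from sklearn.decomposition import PCA

# Per-token hidden states saved during generation, shape (num_tokens, hidden_dim).
states = np.load("hidden_states.npy")

# Re-project to 3D without regenerating the response; t-SNE or UMAP can be
# swapped in here for a different view of the same trajectory.
coords = PCA(n_components=3).fit_transform(states)
print(coords.shape)   # (num_tokens, 3)
```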
Layer analysis probe: The multi-layer probe provides an all-layers 3D view (88 layers stacked vertically with scroll navigation), segment-colored trajectories consistent across layers, and eight analytical sub-charts (Laplacian, HMM, CUSUM, Entropy, Perplexity, Logits, UMAP, Geometry) accessible in Isolate mode. Cross-layer comparison reveals how trajectory quality evolves through the network stack.
Current limitations include: (a) computational overhead of per-token hidden state extraction during streaming, (b) sensitivity to PCA projection artifacts in low-variance regions, (c) the need for empirical tuning of basin and intervention parameters, and (d) UMAP manifold analysis requiring on-demand computation (~3s per layer) rather than real-time. Future work will explore learned basin definitions, multi-layer intervention, and integration with retrieval-augmented generation for hybrid grounding.
Åström, K. J., & Murray, R. M. (2008). Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press. [link]
Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2022). Discovering Latent Knowledge in Language Models Without Supervision. arXiv:2212.03827. [arxiv]
Elhage, N., et al. (2022). Toy Models of Superposition. Transformer Circuits Thread. [link]
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing Machines. arXiv:1410.5401. [arxiv]
Guo, Y., et al. (2022). pgvector: Open-source vector similarity search for PostgreSQL. [github]
Ji, Z., et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1-38. [acm]
Kadavath, S., et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221. [arxiv]
Laird, J. E. (2012). The Soar Cognitive Architecture. MIT Press. [link]
Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. [arxiv]
Maynez, J., et al. (2020). On Faithfulness and Factuality in Abstractive Summarization. ACL 2020. [acl]
Turner, A., et al. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv:2308.10248. [arxiv]
Wang, X., et al. (2023). Self-Consistency Improves Chain of Thought Reasoning. ICLR 2023. [arxiv]
Wei, J., et al. (2022). Finetuned Language Models Are Zero-Shot Learners. ICLR 2022. [arxiv]
Zou, A., et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405. [arxiv]
Z01 · Local-first LLM Interface · Multi-Model · Memory-Augmented · Apple Silicon + NVIDIA