It would be fair to say that this is where the actual coding begins. For that very reason, I added this “Architecture Design” post to the roadmap last. Working on it feels like adding a few AI-powered team members to my workflow. However, the AI plans and builds only as far as what I already know and what I propose.
I used to believe that AI would eventually take all our jobs, leaving human developers with nothing to do. Now I see that is not the case. Instead, we are entering a world where highly skilled developers will thrive while the rest are filtered out, and this shift will likely happen across every industry. To survive, developers can no longer just be good at coding; they must have deep domain knowledge, a grasp of software engineering fundamentals, and solid real-world development experience. (Perhaps that sounds too obvious?)

- Project Repository: github.com/zafrem/bastion-rag
Series Name: Bastion-RAG – Project Security RAG
- [Bastion-RAG] Project Security RAG
- [Bastion-RAG 0] Get help from AI (Architecture Design) – Here!
- [Bastion-RAG 1 – Sentinel]
  - Prompt Injection Defense
  - Metadata Filtering
- [Bastion-RAG 2 – Vault]
  - Multi-tenancy
  - Deterministic De-identification
- [Bastion-RAG 3 – Navigator]
  - Hybrid Reranking
  - Logical Partitioning
- [Bastion-RAG 4 – Anchor]
  - Embedding Noise Injection
  - Embedding Model Bias Verification
- [Bastion-RAG 5 – Tracker]
  - Data Lineage Tracking
  - Honey-token Injection
- [Bastion-RAG Demo]
I started building the [Bastion-RAG] Project (Secure RAG) with an ambitious goal: implementing highly complex AI security governance layers, such as prompt injection defense, deterministic tokenization, and differential noise injection, in a functional, production-ready system. Having watched modern AI coding assistants make exponential performance leaps, I was thoroughly optimistic that the design and implementation would be a breeze.
I thought, “With an AI assistant, I can build an enterprise governance layer in no time.” And indeed, the code itself materialized at unimaginable speed. Architectural design, however, still involved far too many variables.
Designing an architecture that satisfies strict enterprise compliance requirements (such as PIPA and GDPR) while keeping runtime latency overhead on live data pipelines to a minimum was anything but magical. Every time a security control layer was blindly appended, system latency climbed further past our targets. To resolve these bottlenecks, I had to engage in countless grueling design debates with my AI assistant, tearing down and rebuilding the core architecture from scratch three times.
This post is not just a dry list of technical specifications. It is a raw engineering story documenting how the pipeline’s structure evolved through rigorous AI question-and-answer debates, how we meticulously measured execution overhead, and what finally led us to construct this secure RAG governance framework. (And it really is raw—I am writing this post while actively working on the project!)

2. Evolutionary Analysis: The Dramatic Structural Journey from v1 to v3
Reviewing the architectural design prompt records exchanged with the AI assistant reveals a continuous cycle of technical friction, rapid prototyping, and a stark realization of my own initial research gaps.

2.1 [Version 1.0] The Swamp of Functional Fragmentation and Asymmetry (Failure)
- Architectural Profile: A unidirectional defense architecture focused almost entirely on the input pipeline. Each security component (Sentinel-IN, Vault-IN) operated as a disconnected microservice or independent container. Even the Honey-Token intrusion detection system was planned as an isolated function confined entirely within the Tracker module.
- The AI Debate:
  - Me: “I want to completely decouple the security modules. Let’s write the inbound gateway in Go, but isolate the embedding generation and reranking components into a separate Python model server that we call via HTTP.”
  - AI: “This represents a textbook Microservice Architecture Design. You will gain perfect independent deployment boundaries across all domains.”
- The Brutal Reality and Collapse: The moment I ran integration benchmarks on the actual implementation, I was hit with absolute despair. Every single user retrieval request triggered a chain of external microservice calls, accumulating HTTP network hop delays and serialization/deserialization overhead at each step. The p95 latency climbed beyond 300ms, making real-time processing impossible. An even bigger oversight was the complete lack of symmetry. Because we secured only the inbound path, we ignored the ‘output security gap’ entirely, leaving the system vulnerable to the LLM leaking raw personal data or hallucinating restricted enterprise secrets in the final generated response.
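To make the ‘output security gap’ concrete, here is a minimal sketch of the kind of outbound check v1 was missing: scanning the generated response for raw personal data before it leaves the system. The function name scan_output, the two regex patterns (an email pattern and a Korean mobile-number pattern), and the redaction format are hypothetical illustrations, not the actual Sentinel-OUT implementation; real PII detection needs far broader coverage than two regexes.

```python
import re

# Hypothetical patterns for illustration only; production PII detection
# needs much broader coverage (or an NER model) than two regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "KR_PHONE": re.compile(r"\b01[016789]-?\d{3,4}-?\d{4}\b"),
}


def scan_output(response: str) -> tuple[str, list[str]]:
    """Mask raw PII in an LLM response and report which types were found.

    This is the outbound (Sentinel-OUT-style) check that an
    inbound-only pipeline never performs.
    """
    findings: list[str] = []
    masked = response
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(masked):
            findings.append(label)
            masked = pattern.sub(f"[{label}_REDACTED]", masked)
    return masked, findings


if __name__ == "__main__":
    safe, found = scan_output("Contact hong@naver.com or 010-1234-5678 for details.")
    print(safe)   # Contact [EMAIL_REDACTED] or [KR_PHONE_REDACTED] for details.
    print(found)  # ['EMAIL', 'KR_PHONE']
```

Redacting (rather than blocking the whole response) keeps the answer usable; a stricter policy could refuse the response whenever findings is non-empty.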
2.2 [Version 2.0] Discovery of Symmetrical Integration and Cross-Cutting Dynamics (Transition)
- Architectural Profile: This phase marked a complete structural overhaul. We combined the previously decoupled input (Phase 1) and output (Phase 2) modules into a unified, bidirectional, symmetric architecture hosted entirely within a single Go service container. Furthermore, we recognized that Honey-Tokens, Multi-Tenancy, and Data Lineage could not be the burden of a single module; they were elevated to full, cross-cutting Tier 3 system-wide coordinators.
- The AI Debate:
  - Me: “We successfully eliminated the network hop overhead by merging everything into a single Go process, but now our machine learning operations are completely broken. The Go ecosystem lacks native primitives for WEAT bias analysis or Laplacian noise distributions. I am writing hundreds of lines of code manually just to reinvent the wheel. The codebase is becoming an unmaintainable mess!”
  - AI: “Abandoning language independence entirely creates a rigid ceiling. You must redesign a hybrid interface contract that preserves performance isolation while natively embracing the Python machine learning ecosystem.”
- The Limitations of the Transition: While the single-language Go consolidation successfully resolved network latency, it introduced a catastrophic maintenance wall due to heavy machine learning serving overhead and highly complex CGO bindings.
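For a sense of what the “Laplacian noise distributions” in the v2 debate involve, here is a minimal sketch that perturbs an embedding vector with Laplace noise using only the Python standard library. It relies on a standard identity: the difference of two independent exponential draws with rate 1/scale is Laplace-distributed with location 0. The function names, the scale parameter, and the fixed seed are illustrative assumptions; this is not the Anchor module's code, and it makes no formal differential-privacy guarantee, since that would depend on the embedding's sensitivity and the chosen privacy budget.

```python
import random
from typing import Optional


def laplace_sample(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential draws with rate 1/scale
    follows a Laplace distribution with location 0 and the given scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def add_laplace_noise(embedding: list[float], scale: float,
                      seed: Optional[int] = None) -> list[float]:
    """Return a perturbed copy of the embedding; the input list is not modified.

    A seeded generator makes runs reproducible for testing; a real
    deployment would not reuse a fixed seed.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    rng = random.Random(seed)
    return [x + laplace_sample(scale, rng) for x in embedding]


if __name__ == "__main__":
    vec = [0.12, -0.40, 0.33]
    print(add_laplace_noise(vec, scale=0.01, seed=7))
```

If a module uses Gaussian noise instead (the emulator trace later reports a sigma value), rng.gauss(0.0, sigma) is the drop-in replacement for laplace_sample.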
2.3 [Version 3.0] Completion of the High-Performance Polyglot Wire Contract (Current)
- Architectural Profile: Retaining the symmetrical dual-phase pipeline and absolute module independence (Standalone Value) established in v2.0, the architecture evolved into a highly optimized, hybrid polyglot structure. High-speed text pattern matching and cryptographic token mapping are assigned exclusively to Go, while complex vector embeddings and matrix numerical operations are managed entirely by Python.
- The AI-Driven Breakthrough:
  - The Decisive Question (Me): “How can we seamlessly tap into the rich Python machine learning ecosystem without introducing the devastating HTTP network communication latency we suffered in v1?”
  - The AI Solution (The Core of v3): “You must shatter the conventional assumption of running a separate model sidecar or microservice. Redesign the Navigator (Search) and Anchor (Security) modules as self-contained Python processes, and host the sentence-transformers and CrossEncoder models directly inside the process memory. By serving them in-process, you execute inference with a single function call, completely eliminating network hops. To bridge the multi-language boundary cleanly, utilize a gRPC infrastructure but replace the binary protobuf layer with a highly transparent JSON Codec contract on the wire. This delivers maximum structural flexibility alongside rigid type safety.”
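The “JSON Codec on the wire” idea can be sketched without any gRPC machinery: a codec is just a pair of functions that turn a typed message into bytes and back. In the Python grpcio library, functions with this shape are what you pass as request_serializer and response_deserializer when creating a stub method with channel.unary_unary(...); on the Go side, grpc-go allows registering a custom codec through its encoding package. The SearchRequest fields below are hypothetical, chosen to mirror the tenant and collection filters that appear in the emulator trace, and the validation shown is only a basic contract check, not full schema enforcement.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SearchRequest:
    """Hypothetical wire message for a Navigator search call."""
    query: str
    tenant_id: str
    collections: list[str] = field(default_factory=list)
    top_k: int = 5


def serialize(msg: SearchRequest) -> bytes:
    """Encode a request as compact, key-sorted UTF-8 JSON."""
    return json.dumps(asdict(msg), separators=(",", ":"), sort_keys=True).encode("utf-8")


def deserialize(data: bytes) -> SearchRequest:
    """Decode bytes back into a typed request.

    Unknown keys or missing required keys raise TypeError from the
    dataclass constructor; field value types are NOT validated here.
    """
    payload = json.loads(data.decode("utf-8"))
    return SearchRequest(**payload)


if __name__ == "__main__":
    req = SearchRequest(query="refund policy", tenant_id="acme", collections=["customer_docs"])
    wire = serialize(req)
    print(wire)  # b'{"collections":["customer_docs"],"query":"refund policy","tenant_id":"acme","top_k":5}'
    print(deserialize(wire) == req)  # True
```

Compared with binary protobuf, this trades some payload size and parse speed for messages that are human-readable on the wire and decodable by Go's encoding/json and Python's json without generated code.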

3. [Bastion-RAG 0] Virtual Emulation Simulation for Architecture Design
Through three separate architectural restructurings, I learned a painful but invaluable lesson: “You must completely validate your configuration’s compliance integrity and exception handling paths before writing a single line of code.” To institutionalize this philosophy, we established [Bastion-RAG 0] as a proactive architectural auditing layer at the entry point of the framework.
When enterprise security policies and compliance constraints are supplied, [Bastion-RAG 0] runs a virtual emulation of the pipeline’s event stream over a high-performance NATS event bus topology without loading heavy ML models or spinning up concrete services.
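The core idea of such an emulator, replaying the event flow through stub stages instead of real models and services, can be sketched in a few lines. In this hypothetical sketch an in-memory list stands in for the NATS bus, each stage only annotates the event and returns a trace line, and a final audit stage fails the dry run if a mandatory control never touched the event. The stage names echo the modules in this series, but the code is an illustration, not the project's actual emulator.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    """A pipeline event as the emulator passes it between stages (hypothetical shape)."""
    prompt: str
    tenant_id: str
    annotations: dict = field(default_factory=dict)


# A stage records what it would have done and returns a log message.
Stage = Callable[[Event], str]


def sentinel_in(ev: Event) -> str:
    # Stub: no classifier is loaded; the emulator only records that the check ran.
    ev.annotations["injection_checked"] = True
    return "Prompt input validated (stub). Status: PASSED"


def navigator(ev: Event) -> str:
    # Stub: records the metadata filter a real search would apply.
    ev.annotations["filters"] = {"tenant_id": ev.tenant_id}
    return f"Pre-filter injected: tenant_id={ev.tenant_id}"


def compliance_audit(ev: Event) -> str:
    # Fails the dry run if a mandatory control never touched the event.
    required = ("injection_checked", "filters")
    missing = [key for key in required if key not in ev.annotations]
    status = "PASSED" if not missing else f"FAILED (missing: {', '.join(missing)})"
    return f"Compliance audit. Status: {status}"


def emulate(stages: list[tuple[str, Stage]], ev: Event) -> list[str]:
    # An in-memory list stands in for the NATS bus: each stage publishes one trace line.
    bus: list[str] = []
    for name, stage in stages:
        bus.append(f"[emulator/{name}] INFO: {stage(ev)}")
    return bus


if __name__ == "__main__":
    pipeline = [("Sentinel-IN", sentinel_in), ("Navigator", navigator),
                ("Audit", compliance_audit)]
    for line in emulate(pipeline, Event("What is our refund policy?", "acme")):
        print(line)
```

Removing the Navigator stage from the pipeline makes the audit line report "FAILED (missing: filters)", which is the kind of configuration gap a dry run can surface before any real service exists.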
Once a developer inputs their proposed pipeline configuration, the [Bastion-RAG 0] audit engine dynamically generates a real-time Virtual Ingestion Trace Log to diagnose data lineage flows and isolate potential architectural bottlenecks:
[Bastion-RAG 0: Virtual Emulator Ingestion Trace Log]
- [emulator/Sentinel-IN] INFO: Prompt input validated. Status: PASSED. Injection score: 0.05
- [emulator/Vault-Phase1] INFO: Multi-strategy anonymization executed.
  - Input matching: "Hong Gildong" -> KR_NAME_8f3d2a (PERSON)
  - Input matching: "hong@naver.com" -> EMAIL_c3a91f (EMAIL)
- [emulator/Navigator] INFO: Executing structural pre-filtering isolation.
  - Injected filters: tenant_id=acme, collections=[customer_docs]
- [emulator/Anchor-IN] INFO: Differential noise injection executed. Sigma applied: 0.01
- [emulator/Vault-Phase2] INFO: Evaluating selective detokenization via OPA policy rules.
- [emulator/Sentinel-OUT] INFO: Grounding and hallucination checks completed. Status: PASSED.
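The Vault lines in the trace map each input to a stable token (for example, "hong@naver.com" to EMAIL_c3a91f). One common way to get that deterministic property is a keyed hash: an HMAC of the value, truncated and prefixed with the entity type. The sketch below assumes HMAC-SHA256, a six-hex-digit suffix, an in-memory reverse map for Phase 2 detokenization, and a simple role set as a hypothetical stand-in for an OPA policy decision. These are illustrative choices, and the digests it produces will not match the tokens shown in the trace.

```python
import hashlib
import hmac


class Tokenizer:
    """Deterministic, keyed pseudonymization with a reverse map for Phase 2."""

    def __init__(self, key: bytes, digest_chars: int = 6) -> None:
        self._key = key
        # Short suffixes keep tokens readable but collide sooner; a production
        # system would use a longer digest or detect collisions.
        self._digest_chars = digest_chars
        self._reverse: dict[str, str] = {}  # token -> original value

    def tokenize(self, value: str, entity: str) -> str:
        # Same key + same value -> same token, so joins and filters still work
        # on pseudonymized data; without the key the mapping cannot be recomputed.
        mac = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        token = f"{entity}_{mac[: self._digest_chars]}"
        self._reverse[token] = value
        return token

    def detokenize(self, token: str, role: str, allowed_roles: set[str]) -> str:
        # Hypothetical stand-in for a policy-engine (e.g. OPA) decision:
        # unauthorized callers keep seeing the token.
        if role not in allowed_roles:
            return token
        return self._reverse.get(token, token)


if __name__ == "__main__":
    vault = Tokenizer(key=b"demo-key-do-not-use")
    tok = vault.tokenize("hong@naver.com", "EMAIL")
    print(tok)  # EMAIL_ followed by six hex digits
    print(vault.detokenize(tok, "auditor", {"auditor"}))  # hong@naver.com
    print(vault.detokenize(tok, "guest", {"auditor"}))    # the token, still masked
```

Using a keyed HMAC rather than a plain hash matters here: without the key, an attacker cannot precompute tokens for guessed names or emails.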
4. Architectural Principles and Hard Constraints for Total Isolation
Hammered out through intense technical debates with my AI assistant and codified directly into our core 01_architecture-principles.md foundation document, the absolute architectural constraints of the Bastion-RAG framework are defined as follows:
- Core Functional Autonomy (Standalone Value): Every single module must be architected as a self-contained, autonomous unit capable of delivering meaningful security value even if attached directly and exclusively to an LLM. In accordance with the principles of Graceful Degradation, the failure of peripheral modules must never cause a cascading collapse of the primary data path.
- Strict Prohibition of Direct Coupling (Forbidden Coupling): Modules are strictly prohibited from instantiating other modules or calling their APIs directly. For example, the Navigator search layer does not hold a reference to the Vault; it relies entirely on a zero-trust data flow contract in which the upstream orchestrator fetches the required user permission tokens and passes them directly into the request payload.
- Non-Invasive Observability Architecture (Non-Invasive Observer): The Tracker module, which aggregates system audit records and maps data lineage paths, must never add even a single millisecond of synchronous blocking time to the primary data path. It functions strictly as a non-invasive observer (the “CCTV” of the framework), consuming fire-and-forget JSON event streams pushed asynchronously over a decoupled NATS message bus.
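The three constraints above can be illustrated together in one small, hypothetical sketch: a Navigator that receives tenant and collection permissions in its request payload instead of calling the Vault, a Tracker-style observer whose emit() uses a non-blocking put and drops events rather than stalling, and a wrapper that lets a peripheral stage fail without taking down the main path. The corpus, class names, and function names are illustrative assumptions, and the background consumer that would forward queued events to NATS is omitted.

```python
import queue
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class NavigatorRequest:
    """Permissions arrive in the payload, resolved upstream by the orchestrator.

    The Navigator holds no reference to the Vault (no direct coupling).
    """
    query: str
    tenant_id: str
    allowed_collections: list[str] = field(default_factory=list)


# Hypothetical in-memory corpus with per-document metadata.
DOCS = [
    {"tenant_id": "acme", "collection": "customer_docs", "text": "Refund policy: 30 days."},
    {"tenant_id": "acme", "collection": "hr_docs", "text": "Refund of travel expenses."},
    {"tenant_id": "globex", "collection": "customer_docs", "text": "Globex refund rules."},
]


def navigator_search(req: NavigatorRequest) -> list[str]:
    # Metadata pre-filter runs first, so other tenants' documents are never scored.
    scoped = [d for d in DOCS
              if d["tenant_id"] == req.tenant_id
              and d["collection"] in req.allowed_collections]
    # Naive substring match stands in for real hybrid retrieval.
    return [d["text"] for d in scoped if req.query.lower() in d["text"].lower()]


class FireAndForgetObserver:
    """Tracker-style observer whose emit() never blocks the caller."""

    def __init__(self, maxsize: int = 1000) -> None:
        self.events: queue.Queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def emit(self, event: dict) -> None:
        try:
            # A background consumer (omitted) would forward these to NATS.
            self.events.put_nowait(event)
        except queue.Full:
            # Lose the audit event rather than stall the request path.
            self.dropped += 1


def degrade_gracefully(stage: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a peripheral stage so a failure passes its input through unchanged.

    Fail-open suits optional stages such as reranking; a mandatory security
    gate would more likely fail closed.
    """
    def wrapped(text: str) -> str:
        try:
            return stage(text)
        except Exception:
            return text
    return wrapped
```

With a request for tenant "acme" scoped to ["customer_docs"], a search for "refund" returns only the first document, even though the HR and Globex documents also mention refunds.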

5. Conclusion: The AI Assistant Is an Incredible Debating Partner, Not a Blindly Trusted Tool
Initially, the AI coding assistant pushed me toward a very generic, textbook microservice pattern that entirely ignored the strict latency constraints of an enterprise runtime environment, which caused significant friction. However, by establishing rigid technical boundaries, such as microsecond-level latency targets and language-specific ecosystem limits, and by treating the AI as an unyielding debating partner, we successfully co-architected a highly innovative, modular polyglot topology (v3).
It was an exhausting mountain to climb. The entire design phase alone consumed roughly four weeks. At one point, I fell into the trap of thinking, “Wow, this Architecture Design is absolutely flawless,” only to start writing the physical implementation, realize a core dependency flaw, and have to roll back two entire weeks of progress. Furthermore, navigating modern LLM flaws—such as absolute token context limits and context window degradation—helped me acquire profound, practical intuition regarding LLM troubleshooting.
Years ago, I managed a massive refactoring project for a highly successful legacy application. During that refactoring, we managed to shrink the total codebase to 1/5 of its original size while building tens of thousands of automated test cases. We transformed a product deployment that used to take a full month of manual verification into an automated pipeline that completed comprehensive validation in under three hours. Ever since completing that specific engineering milestone, my ability to analyze, dissect, and re-architect large-scale software systems improved dramatically.
In much the same way, building a framework like Bastion-RAG from the ground up alongside an LLM has given me a deep, fundamental understanding of how the new generation of developers builds applications—and it has ensured that I remain completely adaptable in this rapidly changing era.