API Layer
WHY: Entry point for all requests
HOW: FastAPI async endpoints handle validation, auth, and routing
SCALE: Horizontal scaling behind load balancer
Handles JWT auth, rate limiting, logging, and request normalization. Integrates with an API gateway for throttling.
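A minimal sketch of one endpoint, assuming PyJWT for token validation and a hypothetical answer_query() pipeline entry point; the names, secret handling, and validation limits are illustrative, not the production code.

```python
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel, Field
import jwt  # PyJWT

app = FastAPI()
JWT_SECRET = "replace-me"  # assumption: HS256 shared secret; production would use a key vault

class QueryRequest(BaseModel):
    query: str = Field(..., min_length=1, max_length=2000)  # request validation

class QueryResponse(BaseModel):
    answer: str

def verify_token(authorization: str) -> dict:
    """Decode the bearer token and reject the request if it is invalid."""
    try:
        token = authorization.removeprefix("Bearer ").strip()
        return jwt.decode(token, JWT_SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

async def answer_query(text: str) -> str:
    """Placeholder for the RAG pipeline described in the rest of this section."""
    return f"stub answer for: {text}"

@app.post("/query", response_model=QueryResponse)
async def query(req: QueryRequest, authorization: str = Header(...)) -> QueryResponse:
    verify_token(authorization)               # JWT auth
    normalized = " ".join(req.query.split())  # request normalization
    return QueryResponse(answer=await answer_query(normalized))
```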
Embedding Service
WHY: Convert text → vector space
HOW: Azure OpenAI embedding model
SCALE: Batch async processing to reduce cost
Chunking strategy (recursive splitting), batch embedding, and caching frequently used embeddings in Redis.
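A sketch of the batch-embed-with-cache path, assuming chunking has already happened upstream, a Redis instance on localhost, and an Azure OpenAI embedding deployment named "text-embedding-3-small"; all of those names are assumptions.

```python
import hashlib
import json

import redis
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="...",                                    # assumption: key-based auth
    api_version="2024-02-01",
    azure_endpoint="https://example.openai.azure.com",
)
cache = redis.Redis(host="localhost", port=6379)
DEPLOYMENT = "text-embedding-3-small"  # assumption: deployment name

def embed_batch(chunks: list[str]) -> list[list[float]]:
    """Embed all chunks in a single batched API call, serving repeats from Redis."""
    results: dict[int, list[float]] = {}
    misses: list[str] = []
    miss_idx: list[int] = []
    for i, chunk in enumerate(chunks):
        key = "emb:" + hashlib.sha256(chunk.encode()).hexdigest()
        hit = cache.get(key)
        if hit:
            results[i] = json.loads(hit)   # cache hit: no API cost
        else:
            misses.append(chunk)
            miss_idx.append(i)
    if misses:
        # One batched request instead of one call per chunk keeps cost down.
        resp = client.embeddings.create(model=DEPLOYMENT, input=misses)
        for i, item in zip(miss_idx, resp.data):
            key = "emb:" + hashlib.sha256(chunks[i].encode()).hexdigest()
            cache.set(key, json.dumps(item.embedding))
            results[i] = item.embedding
    return [results[i] for i in range(len(chunks))]
```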
Vector DB (FAISS)
WHY: Fast similarity search
HOW: IndexFlatIP over L2-normalized vectors (inner product equals cosine similarity)
SCALE: Sharding + memory optimization
Top-K retrieval. IndexFlatIP performs exact search; at larger corpus sizes it can be swapped for approximate nearest neighbor indexes (e.g., IVF or HNSW). The index is held in memory for ultra-low latency.
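A minimal sketch of exact cosine search with FAISS: normalizing both document and query vectors makes the inner-product index score by cosine similarity. The dimensionality and the random vectors are stand-ins.

```python
import numpy as np
import faiss

dim = 1536  # assumption: embedding dimensionality of the chosen model
index = faiss.IndexFlatIP(dim)  # exact inner-product index

# Normalize document vectors so inner product == cosine similarity.
doc_vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(doc_vecs)
index.add(doc_vecs)

# Normalize the query the same way, then take the top-K most similar documents.
query_vec = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 5)  # top-5 retrieval
print(ids[0], scores[0])

# At larger scale, a trained IVF index trades exactness for speed, e.g.:
#   quantizer = faiss.IndexFlatIP(dim)
#   index = faiss.IndexIVFFlat(quantizer, dim, 1024, faiss.METRIC_INNER_PRODUCT)
```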
Reranker
WHY: Improve relevance
HOW: Cross-encoder model ranks retrieved docs
SCALE: GPU inference
Uses a transformer-based cross-encoder to rescore retrieved documents, filtering noisy FAISS results before they reach the LLM.
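A sketch of the reranking step, assuming the public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint from sentence-transformers; the production model may differ. Unlike the bi-encoder embedding step, the cross-encoder reads query and document together, which is slower but more accurate, hence reranking only a small candidate set on GPU.

```python
from sentence_transformers import CrossEncoder

# Loads onto GPU automatically if one is available.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_n: int = 3) -> list[str]:
    """Score each (query, doc) pair jointly and keep the best top_n."""
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```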
LLM Layer
WHY: Generate final answer
HOW: Azure OpenAI GPT
SCALE: Token optimization + caching
Prompt engineering, context injection, and token limit optimization to reduce cost.
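A sketch of context injection with capped output tokens, assuming a chat deployment named "gpt-4o"; the system prompt wording and token budget are illustrative choices, not the tuned production values.

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="...",
    api_version="2024-02-01",
    azure_endpoint="https://example.openai.azure.com",
)

def generate_answer(question: str, context_docs: list[str]) -> str:
    """Inject reranked documents as context and generate a grounded answer."""
    context = "\n\n".join(context_docs)
    messages = [
        {"role": "system", "content": "Answer using ONLY the provided context. "
                                      "If the answer is not in the context, say so."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    resp = client.chat.completions.create(
        model="gpt-4o",    # assumption: deployment name
        messages=messages,
        max_tokens=512,    # cap output tokens to control cost
        temperature=0.2,   # low temperature favors grounded answers
    )
    return resp.choices[0].message.content
```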
Guardrails
WHY: Prevent hallucinations and unsafe output
HOW: NVIDIA NeMo Guardrails policies
SCALE: Rule-based + LLM hybrid
Validates output, enforces policies, blocks unsafe responses, and ensures compliance.
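A sketch of routing a turn through NeMo Guardrails, assuming a rails configuration directory at ./guardrails_config containing the Colang flows and model settings; the path and the policies themselves are assumptions.

```python
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

def guarded_answer(question: str) -> str:
    """Run the turn through input/output rails; unsafe requests get a refusal."""
    response = rails.generate(messages=[{"role": "user", "content": question}])
    return response["content"]
```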