Warden — Self-Healing Kubernetes Security Agent
Production-grade Kubernetes security at ~$2 total cloud spend — two independent detection layers, AI-driven triage, and automated remediation proven end-to-end on AKS. Warden combines OPA Gatekeeper admission control and Falco eBPF runtime detection with Claude Sonnet 4.6 triage: threats detected in milliseconds, low-severity incidents auto-patched in under 3 seconds, high-severity incidents escalated with a Claude-drafted runbook for human review.
- Detection → Auto-patch: 3 sec
- Cloud spend (all sessions): ~$2
- Security layers: 2
- Audit trail: 100%
Architecture Components
- Python FastAPI webhook server receiving alerts from Falco and OPA Gatekeeper
- OPA Gatekeeper ConstraintTemplates (Rego) blocking privileged containers, images from unverified registries, and containers that run as root at admission
- Falco DaemonSet using eBPF kernel probes for real-time runtime detection of shell spawns, privilege escalation, and suspicious syscalls
- Falcosidekick forwarding Falco events to the Warden webhook endpoint inside the cluster
- Claude Sonnet 4.6 for structured alert triage — severity classification and remediation recommendation
- Azure Key Vault storing the Claude API key, injected at runtime via managed identity — never in source control
- Prometheus metrics tracking alert volume by severity, auto-patch outcomes, and Claude API latency
- Terraform provisioning the AKS cluster and Azure supporting resources
- Azure DevOps pipeline for infrastructure deployment
Problem
Kubernetes clusters generate a constant stream of security events — admission violations, runtime anomalies, syscall alerts — that require expert triage, manual response, and accurate severity judgment. Without automation, security teams are buried in noise, slow to respond to real threats, and unable to scale coverage.
- Runtime security events require immediate triage, but manual review is slow and error-prone at volume.
- Admission control and runtime detection are often siloed with no unified response layer connecting them.
- Low-severity incidents that could be auto-remediated still consume engineer time, reducing capacity for real threats.
Solution
A Python FastAPI webhook server that sits between Kubernetes security tooling and on-call engineers. Warden receives Falco and OPA alerts, classifies severity with Claude Sonnet 4.6, and acts — auto-patching low-severity incidents immediately and surfacing AI-drafted runbooks for high-severity threats that require a human decision.
- Layer 1 — OPA Gatekeeper: Rego policies block non-compliant workloads at admission: no privileged containers, no images from unverified registries, no containers running as root.
- Layer 2 — Falco: eBPF kernel probes detect threats at runtime — shell spawns, privilege escalation, and suspicious syscall patterns.
- Claude Sonnet 4.6 triage: structured severity classification — low severity triggers auto-patch, high severity generates a runbook for human review.
- Azure Key Vault for secret management: the Claude API key is injected at runtime, never stored in source control or container images.

OPA Gatekeeper blocking a privileged container at admission — Layer 1 enforcement working on AKS.
Security Design
- Secrets management: Claude API key stored in Azure Key Vault, injected at runtime via managed identity — never in source control, never in container images, never in environment variables passed at build time.
- Least-privilege RBAC: Warden agent has only the Kubernetes permissions it needs — patch pods in the warden-system namespace, nothing else. Defined explicitly in a ClusterRole manifest.
- Zero-trust between components: Falcosidekick communicates with Warden via in-cluster HTTP only — no external exposure, no auth tokens in transit.
- Audit trail by design: every Claude triage decision is logged with user context, original payload, severity classification, and action taken — not as an afterthought but as a required output of the triage function.
- Namespace isolation: Falco runs in its own namespace with elevated permissions; the BlockPrivilegedContainers constraint explicitly excludes the falco namespace to prevent policy self-conflict.
Observability & Operations
- Prometheus metrics exposed at /metrics: alert_count by severity label, auto_patch_total (success/failure), claude_api_latency_seconds (histogram), warden_webhook_requests_total.
- What each metric tells you operationally: alert_count spike = active threat or noisy rule; auto_patch failure rate climbing = Kubernetes API permissions issue or cluster instability; Claude API latency p95 > 2s = triage pipeline degrading, consider fallback severity classification.
- Grafana dashboard: three panels — alerts by severity over time (bar chart), auto-patch success rate (stat panel), Claude API latency p50/p95 (time series). Dashboard JSON exported to repo at /docs/grafana-dashboard.json.
- Alerting intent: in a production deployment, page immediately on any auto-patch failure (auto_patch_total with a failure outcome), warn when claude_api_latency_seconds p95 exceeds 3s, and flag a sustained alert_count rate above 10/minute as a potential attack or rule misconfiguration.
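The alerting intent above can be written down as a small threshold check. This is a minimal sketch in plain Python rather than Prometheus alerting rules; the `MetricsSnapshot` type, the `evaluate_alerts` function, and the "investigate" level for the alert-rate condition are illustrative assumptions, while the three thresholds come from the list above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MetricsSnapshot:
    """Hypothetical summary of the Warden metrics over one evaluation window."""
    auto_patch_failures: int            # failed auto-patches in the window
    claude_latency_p95_seconds: float   # p95 of claude_api_latency_seconds
    alerts_per_minute: float            # sustained incoming alert rate


def evaluate_alerts(snapshot: MetricsSnapshot) -> List[Tuple[str, str]]:
    """Return (level, message) pairs for every threshold that is breached."""
    fired = []
    if snapshot.auto_patch_failures > 0:
        fired.append(("page", "auto-patch failure recorded"))
    if snapshot.claude_latency_p95_seconds > 3.0:
        fired.append(("warning", "Claude API p95 latency above 3s"))
    if snapshot.alerts_per_minute > 10:
        # The source names no level for this condition; "investigate" is assumed.
        fired.append(("investigate", "alert rate above 10/minute"))
    return fired
```

In a real deployment these conditions would more likely live in Prometheus alerting rules; the Python form is only meant to make the thresholds explicit.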
Outcome
End-to-end Kubernetes security automation proven on a live AKS cluster: Falco detected a shell spawn, Falcosidekick forwarded the alert to Warden, Claude triaged it as low severity, Warden auto-patched and returned HTTP 200 — all within 3 seconds. Total cloud spend: ~$2.
- Full AKS pipeline proven: shell spawn → Falcosidekick → Claude triage (severity=low) → auto-patch → HTTP 200.
- Two-layer security coverage: OPA admission control and Falco runtime detection operating independently on the same cluster.
- Total cloud spend across all sessions: ~$2 — production-grade Kubernetes security at near-zero cost.
- All AI triage decisions logged with full reasoning chain, original alert payload, and action taken — complete audit trail.

Falco detecting a shell spawn inside a running container — Layer 2 runtime detection triggering the full triage pipeline.
Real-World Use Cases
The Warden architecture applies directly to any organisation running Kubernetes in a regulated or security-sensitive environment.
Healthcare Kubernetes Clusters
HIPAA-regulated environments need documented proof that runtime threats are detected, contained, and logged within seconds. Warden's audit trail records every triage decision with the original payload, severity classification, and action taken, which gives auditors evidence that manual response processes rarely produce consistently.
Financial Services & PCI-DSS
SOC 2 and PCI-DSS compliance require continuous runtime monitoring and automated remediation workflows with full audit evidence. Warden's two-layer detection and structured Claude reasoning chain provide the documented response record auditors require.
Multi-Tenant SaaS Platforms
In a shared Kubernetes cluster, a compromised tenant workload can attempt to pivot to adjacent namespaces. Warden's runtime detection catches lateral movement attempts — shell spawns, privilege escalation, sensitive file reads — before they succeed.
Government & Defence Contractors
FedRAMP and ITAR environments require continuous monitoring with documented remediation workflows and zero manual gaps in the security response chain. Warden's approval gate for high-severity incidents ensures human oversight is enforced by architecture, not policy.
Key Learnings & Decisions
Kubernetes Security
- OPA Gatekeeper and Falco address different threat surfaces — admission control stops misconfigurations; runtime detection catches adversarial behaviour that only appears after a workload is running.
- Namespace-scoped exclusions are the right pattern for security tooling that needs elevated permissions — Falco's DaemonSet is privileged, so the falco namespace is excluded from BlockPrivilegedContainers constraints.
- Parameterised OPA ConstraintTemplates add schema validation complexity that can fail across Kubernetes versions; for a small known-registry list, hardcoded Rego is simpler and more reliable.
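To make the admission logic concrete, here is a rough Python restatement of what the two constraints check. It is not the project's Rego: the function names and the `registry.example.com/` entry are hypothetical, and whether the falco exclusion also applies to the registry constraint is not stated in the source, so the sketch applies it only to the privileged-container check.

```python
# Namespace exempted from the privileged-container check (from the text above).
EXCLUDED_NAMESPACES = {"falco"}

# Hardcoded allow-list, mirroring the "hardcoded Rego" decision.
# The registry name is a placeholder, not the project's real list.
ALLOWED_REGISTRIES = ("registry.example.com/",)


def _containers(pod_spec):
    """Collect regular and init containers from a pod spec dict."""
    return list(pod_spec.get("containers") or []) + list(pod_spec.get("initContainers") or [])


def privileged_violations(namespace, pod_spec):
    """List privileged containers, skipping excluded namespaces entirely."""
    if namespace in EXCLUDED_NAMESPACES:
        return []
    return [
        f"container {c.get('name', '<unnamed>')!r} is privileged"
        for c in _containers(pod_spec)
        if (c.get("securityContext") or {}).get("privileged") is True
    ]


def registry_violations(pod_spec):
    """List containers whose image does not start with an allowed registry prefix."""
    return [
        f"container {c.get('name', '<unnamed>')!r} uses an unlisted registry"
        for c in _containers(pod_spec)
        if not str(c.get("image", "")).startswith(ALLOWED_REGISTRIES)
    ]
```

An empty result from both functions corresponds to a workload being admitted; any entry corresponds to a denial message.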
AI-Driven Automation
- Claude Sonnet 4.6 works as a reliable triage component when given structured inputs and constrained output formats — it classifies alerts consistently with defined severity rubrics.
- Human approval gates for high-severity actions are non-negotiable; auto-remediation without bounds is a liability, not a feature.
- Full logging of AI reasoning with the original alert payload enables post-incident review and builds trust in automated decisions over time.
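One way to enforce a constrained output format is to validate the model's reply before acting on it. The sketch below assumes the reply is a JSON object with `severity` and `recommendation` keys and that only "low" and "high" are valid severities; those names, and the choice to fall back to "high" so that a human reviews anything unparseable, are assumptions rather than details confirmed by the project.

```python
import json

# Assumed severity vocabulary; the source only mentions low and high.
ALLOWED_SEVERITIES = {"low", "high"}


def parse_triage(raw: str) -> dict:
    """Validate a model reply; fall back to high severity on any problem."""
    fallback = {"severity": "high", "recommendation": "", "valid": False}
    try:
        data = json.loads(raw)
    except (ValueError, TypeError):
        return fallback  # not JSON at all
    if not isinstance(data, dict):
        return fallback  # JSON, but not an object
    severity = str(data.get("severity", "")).strip().lower()
    if severity not in ALLOWED_SEVERITIES:
        return fallback  # unknown or missing severity
    return {
        "severity": severity,
        "recommendation": str(data.get("recommendation", "")),
        "valid": True,
    }
```

Failing towards "high" keeps the human approval gate in place whenever the output cannot be trusted, so a malformed reply can never trigger an auto-patch.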
Secrets & Operations
- Secrets injected from external vaults can be silently corrupted — validate API key format and length at startup before accepting traffic, not reactively after the first 401.
- The ConfigMap-mounted code pattern enables rapid iteration on cluster-internal tooling without custom image builds — acceptable for development, not for production.
- WSL2 eBPF limitations mean local Falco detection is not possible; design the dev loop around real cluster validation for eBPF-dependent tooling from the start.
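The startup check described in the first point might look like the following. This is a sketch under stated assumptions: the `sk-ant-` prefix and the minimum length of 40 are illustrative values, not the project's actual rules, and how the key reaches the process from Key Vault is left out.

```python
from typing import List, Optional

EXPECTED_PREFIX = "sk-ant-"  # assumed key prefix
MIN_LENGTH = 40              # hypothetical lower bound


def validate_api_key(key: Optional[str]) -> List[str]:
    """Return a list of problems; an empty list means the key looks plausible."""
    if not key:
        return ["key is missing or empty"]
    problems = []
    stripped = key.strip()
    if key != stripped:
        problems.append("key has leading or trailing whitespace")
    if any(ch.isspace() for ch in stripped):
        problems.append("key contains embedded whitespace")
    if not stripped.startswith(EXPECTED_PREFIX):
        problems.append("key does not start with the expected prefix")
    if len(stripped) < MIN_LENGTH:
        problems.append("key is shorter than expected")
    return problems


def require_valid_key(key: Optional[str]) -> str:
    """Raise before the server accepts traffic if the key looks corrupted."""
    problems = validate_api_key(key)
    if problems:
        # Report only the problems, never the key material itself.
        raise RuntimeError("refusing to start: " + "; ".join(problems))
    return key
```

Calling `require_valid_key` once during application startup turns a silent secret corruption into an immediate, visible failure instead of a 401 on the first triage request.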
Implementation Milestones
A breakdown of the key tasks and milestones that brought this project to life.
AKS Cluster Provisioning
Status: Complete
AKS cluster and supporting Azure resources provisioned with Terraform. Azure Key Vault configured for runtime secret injection.
Key Tasks Completed
Terraform AKS + Azure Resources
Cluster provisioned, Key Vault wired up, and Azure DevOps pipeline configured. Infrastructure ready for Warden deployment.
Azure Key Vault Integration
Claude API key stored in Key Vault and injected at runtime. Startup validation added to catch corrupted secrets before traffic is accepted.
OPA Gatekeeper Policies
Status: Complete
Admission control layer implemented with Rego ConstraintTemplates. Namespace exclusions configured for Falco's privileged DaemonSet.
Key Tasks Completed
BlockPrivilegedContainers ConstraintTemplate
Rego policy blocks privileged containers cluster-wide. Falco namespace excluded to allow Falco's own DaemonSet to run.
AllowedImageRegistries ConstraintTemplate
Registries hardcoded in Rego after parameterised schema validation errors on AKS. Simpler and more reliable.
Falco Runtime Detection
Status: Complete
Falco DaemonSet deployed on AKS with eBPF probes active. Falcosidekick configured to forward events to the Warden webhook endpoint.
Key Tasks Completed
Falco DaemonSet on AKS
Falco running on every node with eBPF probes. WSL2 eBPF limitations documented — full detection validated on-cluster only.
Falcosidekick Webhook Forwarding
Falcosidekick configured to POST structured Falco events to the Warden webhook endpoint inside the cluster.
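On the receiving side, a forwarded event can be normalised into an internal alert before triage. The sketch below assumes the payload carries Falco's usual JSON keys (`rule`, `priority`, `output`, `output_fields`) and that the rule's output includes the `k8s.ns.name` and `k8s.pod.name` fields; the `Alert` type and `alert_from_falco` function are hypothetical names, and the exact payload shape should be checked against the deployed Falcosidekick version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Hypothetical internal representation of one security alert."""
    source: str
    rule: str
    priority: str
    message: str
    namespace: Optional[str]
    pod: Optional[str]
    raw: dict  # original payload, kept for the audit trail


def alert_from_falco(event: dict) -> Alert:
    """Map a Falco-style JSON event to an Alert, tolerating missing fields."""
    fields = event.get("output_fields") or {}
    return Alert(
        source="falco",
        rule=event.get("rule", "unknown"),
        priority=str(event.get("priority", "unknown")).lower(),
        message=event.get("output", ""),
        namespace=fields.get("k8s.ns.name"),
        pod=fields.get("k8s.pod.name"),
        raw=event,
    )
```

Keeping the untouched payload in `raw` means the later audit record can include exactly what Falco sent, not a lossy summary.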
Warden Webhook Server
Status: Complete
FastAPI webhook server deployed via ConfigMap-mounted code. Claude Sonnet 4.6 triage integrated with auto-patch and runbook generation.
Key Tasks Completed
FastAPI Webhook Endpoint
Webhook server receives Falco and OPA alert payloads and routes to the triage pipeline.
Claude Sonnet Triage Integration
Structured triage with severity classification. Low severity triggers auto-patch; high severity generates a runbook for human review.
ConfigMap-Mounted Agent Pattern
Agent code mounted as a Kubernetes ConfigMap and executed via a base Python image — no custom image build required for development iterations.
End-to-End Pipeline Proven
Status: Complete
Full AKS pipeline validated: Falco detected a shell spawn, Falcosidekick forwarded to Warden, Claude triaged severity=low, auto-patch applied, HTTP 200 returned.
Key Tasks Completed
AKS End-to-End Demo
Shell spawn in a container triggered Falco. Alert forwarded via Falcosidekick. Claude returned severity=low. Warden auto-patched and returned HTTP 200. Pipeline complete.
Case Study Published
Status: Complete
Portfolio case study written and published to adventuringghost.com.
Key Tasks Completed
Portfolio Entry
Case study published. Grafana Cloud dashboard screenshot captured showing warden_alerts_total spike from live Claude triage run.
Evidence of Completion

Grafana Cloud dashboard showing warden_alerts_total — the high-severity ShellSpawnedInContainer alert recorded after Claude triage, pushed via Grafana Alloy from the local agent.
Monitoring & Analysis
Prometheus Metrics
Custom metrics track alert volume by severity, auto-patch success and failure rate, Claude API response latency, and runbook generation count. Scraped by the cluster's Prometheus instance and queryable for post-incident analysis.
Triage Audit Log
Every AI triage decision is logged with the original alert payload, Claude's severity classification, the action taken, and a timestamp — giving a complete audit trail for post-incident review and trust-building in automated decisions.
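A record with the fields listed above could be serialised as one JSON line per decision. This is a minimal sketch: the `audit_record` function and its key names are assumptions for illustration, not the project's actual log schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(alert: dict, severity: str, action: str,
                 now: Optional[datetime] = None) -> str:
    """Build one JSON log line for a triage decision."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "timestamp": timestamp,
        "alert": alert,        # original alert payload, unmodified
        "severity": severity,  # Claude's classification
        "action": action,      # what Warden did in response
    }
    # sort_keys gives stable output, which keeps log lines easy to diff.
    return json.dumps(record, sort_keys=True)
```

Returning the line from the function, rather than logging as a side effect, makes it straightforward for the caller to treat the audit record as a required output of triage.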
Warden webhook handler — receive alert, triage with Claude, act on severity
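The flow named in this caption can be sketched without any framework. In Warden it would sit inside a FastAPI route; here the Claude call, the patch step, and the runbook drafting are passed in as plain callables so the example runs on its own. All function and key names are hypothetical.

```python
from typing import Callable


def handle_alert(
    alert: dict,
    classify: Callable[[dict], str],       # stands in for the Claude triage call
    auto_patch: Callable[[dict], bool],    # stands in for the Kubernetes patch
    draft_runbook: Callable[[dict], str],  # stands in for runbook generation
) -> dict:
    """Receive an alert, triage it, and act on the resulting severity."""
    severity = classify(alert)
    if severity == "low":
        # Low severity: remediate immediately and report the outcome.
        patched = auto_patch(alert)
        return {"severity": "low", "action": "auto_patch", "success": patched}
    # Anything not explicitly low is escalated for a human decision.
    return {
        "severity": severity,
        "action": "escalate",
        "runbook": draft_runbook(alert),
    }
```

Routing every non-low result to escalation, rather than matching on "high" alone, means an unexpected classification still ends up in front of a person.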
Part of a larger arc
The AI Security & Resilience Stack
Three independent projects that together cover the full surface of an AI-augmented infrastructure stack. Warden secures the Kubernetes runtime — Falco and OPA detecting threats as they happen, Claude triaging before an engineer is paged. Covenant controls access at the application layer — OPA as the hard gate between JWT identity and Claude, policy in code not prompts. Watershed closes the loop at the edge — async telemetry buffered through connectivity loss, with Claude flagging anomalies before the data reaches the cloud. Each project stands alone; together they tell one story.
- Warden (AKS): ~$2.00
- Covenant (local Docker): $0.00
- Watershed (AWS IoT Core): ~$0.05
- Combined: ~$2.05
Related project
Covenant — Policy-Enforced AI Access Control
OPA as the hard gate between JWT identity and Claude — the AI doesn't decide who sees what
Related project
Watershed — Edge-Resilient IoT Telemetry Pipeline
Async Python agent with offline buffering and AI anomaly detection — built for edge environments where connectivity is unreliable