Warden — Self-Healing Kubernetes Security Agent
A self-healing Kubernetes security agent deployed on Azure Kubernetes Service. Warden is a Python FastAPI webhook server that receives Falco runtime security alerts and OPA Gatekeeper admission violations, triages each incident with Claude Sonnet 4.6, auto-patches low-severity threats immediately, and drafts remediation runbooks for high-severity incidents requiring human approval. Two-layer security coverage proven end-to-end on a live AKS cluster for ~$2 in total cloud spend.
Core Technologies
Architecture Components
- Python FastAPI webhook server receiving alerts from Falco and OPA Gatekeeper
- OPA Gatekeeper ConstraintTemplates (Rego) that reject privileged containers, images from unverified registries, and containers that violate the non-root requirement at admission
- Falco DaemonSet using eBPF kernel probes for real-time runtime detection of shell spawns, privilege escalation, and suspicious syscalls
- Falcosidekick forwarding Falco events to the Warden webhook endpoint inside the cluster
- Claude Sonnet 4.6 for structured alert triage — severity classification and remediation recommendation
- Azure Key Vault storing the Claude API key, injected at runtime via managed identity — never in source control
- Prometheus metrics tracking alert volume by severity, auto-patch outcomes, and Claude API latency
- Terraform provisioning the AKS cluster and Azure supporting resources
- Azure DevOps pipeline for infrastructure deployment
Problem
Kubernetes clusters generate a constant stream of security events — admission violations, runtime anomalies, syscall alerts — that require expert triage, manual response, and accurate severity judgment. Without automation, security teams are buried in noise, slow to respond to real threats, and unable to scale coverage.
- Runtime security events require immediate triage, but manual review is slow and error-prone at volume.
- Admission control and runtime detection are often siloed with no unified response layer connecting them.
- Low-severity incidents that could be auto-remediated still consume engineer time, reducing capacity for real threats.
Solution
A Python FastAPI webhook server that sits between Kubernetes security tooling and on-call engineers. Warden receives Falco and OPA alerts, classifies severity with Claude Sonnet 4.6, and acts — auto-patching low-severity incidents immediately and surfacing AI-drafted runbooks for high-severity threats that require a human decision.
- Layer 1 (OPA Gatekeeper): Rego policies reject non-compliant workloads at admission, namely privileged containers, images from unverified registries, and containers that violate the non-root requirement.
- Layer 2 (Falco): eBPF kernel probes detect threats at runtime, including shell spawns, privilege escalation, and suspicious syscall patterns.
- Claude Sonnet 4.6 triage: structured severity classification — low severity triggers auto-patch, high severity generates a runbook for human review.
- Azure Key Vault for secret management: the Claude API key is injected at runtime, never stored in source control or container images.

OPA Gatekeeper blocking a privileged container at admission — Layer 1 enforcement working on AKS.
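The routing rule in the Solution list (low severity is patched automatically, high severity goes to a human) can be sketched as a small pure function. This is a minimal illustration, not Warden's actual code: the `Severity`, `Decision`, and `route` names are hypothetical, and treating any unrecognised severity as high is an assumption made here so that an unexpected model reply never triggers an automatic patch.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(str, Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Decision:
    action: str           # "auto_patch" or "draft_runbook"
    needs_approval: bool  # True when a human must sign off first


def route(severity: str) -> Decision:
    """Map a triage severity onto the action to take."""
    try:
        level = Severity(severity.strip().lower())
    except ValueError:
        # Assumption: anything unrecognised is handled like high severity,
        # so an odd model reply can never trigger an automatic patch.
        level = Severity.HIGH
    if level is Severity.LOW:
        return Decision(action="auto_patch", needs_approval=False)
    return Decision(action="draft_runbook", needs_approval=True)
```

Keeping the decision separate from the code that performs the patch makes the approval gate easy to test without a cluster.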
Outcome
End-to-end Kubernetes security automation proven on a live AKS cluster: Falco detected a shell spawn, Falcosidekick forwarded the alert to Warden, Claude triaged it as low severity, Warden auto-patched and returned HTTP 200 — all within 3 seconds. Total cloud spend: ~$2.
- Full AKS pipeline proven: shell spawn → Falcosidekick → Claude triage (severity=low) → auto-patch → HTTP 200.
- Two-layer security coverage: OPA admission control and Falco runtime detection operating independently on the same cluster.
- Total cloud spend across all sessions: ~$2, covering two-layer Kubernetes security validated on a live cluster at near-zero cost.
- All AI triage decisions logged with full reasoning chain, original alert payload, and action taken — complete audit trail.

Falco detecting a shell spawn inside a running container — Layer 2 runtime detection triggering the full triage pipeline.
Key Learnings & Decisions
Kubernetes Security
- OPA Gatekeeper and Falco address different threat surfaces — admission control stops misconfigurations; runtime detection catches adversarial behaviour that only appears after a workload is running.
- Namespace-scoped exclusions are the right pattern for security tooling that needs elevated permissions — Falco's DaemonSet is privileged, so the falco namespace is excluded from BlockPrivilegedContainers constraints.
- Parameterised OPA ConstraintTemplates add schema validation complexity that can fail across Kubernetes versions; for a small known-registry list, hardcoded Rego is simpler and more reliable.
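The admission rules above are written in Rego in the project itself. Purely to illustrate the logic, the sketch below mirrors them in Python: a privileged-container check that skips the falco namespace, and a registry check against a hardcoded list. The registry prefixes and the `admission_violations` function are hypothetical, and applying the registry check to every namespace (including falco) is an assumption.

```python
# Hypothetical registry prefixes; the real list lives in the Rego policy.
ALLOWED_REGISTRIES = ("example.azurecr.io/", "mcr.microsoft.com/")
# Namespaces exempt from the privileged-container check.
EXCLUDED_NAMESPACES = {"falco"}


def admission_violations(namespace: str, pod_spec: dict) -> list[str]:
    """Return one message per rule that a pod spec would break."""
    found = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        privileged = container.get("securityContext", {}).get("privileged", False)
        if privileged and namespace not in EXCLUDED_NAMESPACES:
            found.append(f"{name}: privileged containers are not allowed")
        # str.startswith accepts a tuple, so the list needs no loop.
        if not container.get("image", "").startswith(ALLOWED_REGISTRIES):
            found.append(f"{name}: image registry is not on the allowed list")
    return found
```

With a short, known registry list, a fixed tuple like this has no parameter schema to validate, which is the trade-off the third bullet describes.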
AI-Driven Automation
- Claude Sonnet 4.6 works as a reliable triage component when given structured inputs and a constrained output format: with a defined severity rubric, it classifies alerts consistently.
- Human approval gates for high-severity actions are non-negotiable; auto-remediation without bounds is a liability, not a feature.
- Full logging of AI reasoning with the original alert payload enables post-incident review and builds trust in automated decisions over time.
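One way to enforce a constrained output format is to ask the model for a single JSON object and reject anything that does not match before acting on it. The sketch below assumes such a reply; the field names `severity`, `reasoning`, and `remediation` and the `parse_triage` helper are illustrative, not taken from the project.

```python
import json

ALLOWED_SEVERITIES = {"low", "high"}


def parse_triage(raw: str) -> dict:
    """Validate a model reply that is expected to be one JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"triage reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("triage reply must be a JSON object")
    severity = str(data.get("severity", "")).lower()
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unexpected severity: {severity!r}")
    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("triage reply must include non-empty reasoning")
    return {
        "severity": severity,
        "reasoning": reasoning.strip(),
        "remediation": str(data.get("remediation", "")).strip(),
    }
```

A reply that fails validation can be escalated to a human rather than retried blindly, which keeps the approval gate intact.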
Secrets & Operations
- Secrets injected from external vaults can be silently corrupted — validate API key format and length at startup before accepting traffic, not reactively after the first 401.
- The ConfigMap-mounted code pattern enables rapid iteration on cluster-internal tooling without custom image builds — acceptable for development, not for production.
- WSL2 eBPF limitations mean local Falco detection is not possible; design the dev loop around real cluster validation for eBPF-dependent tooling from the start.
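The startup check from the first bullet can be as small as the sketch below. The prefix, minimum length, and allowed character set are assumptions chosen for illustration and should be checked against the provider's current key format; `validate_api_key` is a hypothetical name.

```python
import re

# Assumptions: illustrative values, not an authoritative key specification.
KEY_PREFIX = "sk-ant-"
MIN_KEY_LENGTH = 40
_KEY_CHARS = re.compile(r"[A-Za-z0-9_-]+")


def validate_api_key(raw: str) -> str:
    """Return the key unchanged, or raise ValueError naming the defect."""
    # A trailing newline is a common artefact of secrets piped through a shell.
    if raw != raw.strip():
        raise ValueError("API key has leading or trailing whitespace")
    if not raw.startswith(KEY_PREFIX):
        raise ValueError("API key does not start with the expected prefix")
    if len(raw) < MIN_KEY_LENGTH:
        raise ValueError(f"API key is too short ({len(raw)} characters)")
    if not _KEY_CHARS.fullmatch(raw):
        raise ValueError("API key contains unexpected characters")
    return raw
```

The error messages report only the kind of defect and the length, never the key itself, so they are safe to write to logs. Calling this before the server starts listening turns a corrupted secret into a failed deployment rather than a stream of 401 responses.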
Implementation Milestones
A breakdown of the key tasks and milestones that brought this project to life.
AKS Cluster Provisioning
Status: Complete
AKS cluster and supporting Azure resources provisioned with Terraform. Azure Key Vault configured for runtime secret injection.
Key Tasks Completed
Terraform AKS + Azure Resources
Cluster provisioned, Key Vault wired up, and Azure DevOps pipeline configured. Infrastructure ready for Warden deployment.
Azure Key Vault Integration
Claude API key stored in Key Vault and injected at runtime. Startup validation added to catch corrupted secrets before traffic is accepted.
OPA Gatekeeper Policies
Status: Complete
Admission control layer implemented with Rego ConstraintTemplates. Namespace exclusions configured for Falco's privileged DaemonSet.
Key Tasks Completed
BlockPrivilegedContainers ConstraintTemplate
Rego policy blocks privileged containers cluster-wide. Falco namespace excluded to allow Falco's own DaemonSet to run.
AllowedImageRegistries ConstraintTemplate
Registries hardcoded in Rego after parameterised schema validation errors on AKS. Simpler and more reliable.
Falco Runtime Detection
Status: Complete
Falco DaemonSet deployed on AKS with eBPF probes active. Falcosidekick configured to forward events to the Warden webhook endpoint.
Key Tasks Completed
Falco DaemonSet on AKS
Falco running on every node with eBPF probes. WSL2 eBPF limitations documented — full detection validated on-cluster only.
Falcosidekick Webhook Forwarding
Falcosidekick configured to POST structured Falco events to the Warden webhook endpoint inside the cluster.
Warden Webhook Server
Status: Complete
FastAPI webhook server deployed via ConfigMap-mounted code. Claude Sonnet 4.6 triage integrated with auto-patch and runbook generation.
Key Tasks Completed
FastAPI Webhook Endpoint
Webhook server receives Falco and OPA alert payloads and routes to the triage pipeline.
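Before routing, the handler needs a common shape for incoming events. The sketch below normalises a Falco event using the standard Falco JSON fields (`rule`, `priority`, `output`, `output_fields`). The `Alert` dataclass and `from_falco` helper are hypothetical, and the OPA Gatekeeper payload is left out because its shape is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str
    rule: str
    priority: str
    message: str
    details: dict = field(default_factory=dict)


def from_falco(payload: dict) -> Alert:
    """Build an Alert from a Falco JSON event forwarded by Falcosidekick."""
    missing = [key for key in ("rule", "priority", "output") if key not in payload]
    if missing:
        raise ValueError(f"Falco payload is missing fields: {', '.join(missing)}")
    return Alert(
        source="falco",
        rule=payload["rule"],
        priority=str(payload["priority"]).lower(),
        message=payload["output"],
        # output_fields carries per-event context such as the container name.
        details=dict(payload.get("output_fields") or {}),
    )
```

A FastAPI route would call a function like this on the parsed request body and pass the resulting `Alert` to the triage step.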
Claude Sonnet Triage Integration
Structured triage with severity classification. Low severity triggers auto-patch; high severity generates a runbook for human review.
ConfigMap-Mounted Agent Pattern
Agent code mounted as a Kubernetes ConfigMap and executed via a base Python image — no custom image build required for development iterations.
End-to-End Pipeline Proven
Status: Complete
Full AKS pipeline validated: Falco detected a shell spawn, Falcosidekick forwarded to Warden, Claude triaged severity=low, auto-patch applied, HTTP 200 returned.
Key Tasks Completed
AKS End-to-End Demo
Shell spawn in a container triggered Falco. Alert forwarded via Falcosidekick. Claude returned severity=low. Warden auto-patched and returned HTTP 200. Pipeline complete.
Case Study Published
Status: In Progress
Portfolio case study written; publication to adventuringghost.com is under way.
Key Tasks Completed
Portfolio Entry
Case study content complete. Publishing to portfolio site now.
Monitoring & Analysis
Prometheus Metrics
Custom metrics track alert volume by severity, auto-patch success and failure rate, Claude API response latency, and runbook generation count. Scraped by the cluster's Prometheus instance and queryable for post-incident analysis.
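The four metrics listed above could be declared with the `prometheus_client` library roughly as follows. This is a sketch under assumptions: the metric names, label names, and helper functions are illustrative, and a dedicated `CollectorRegistry` is used only to keep the example isolated.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A dedicated registry keeps this example separate from the default one.
registry = CollectorRegistry()

ALERTS = Counter(
    "warden_alerts_total", "Alerts received, by severity.",
    ["severity"], registry=registry,
)
AUTO_PATCHES = Counter(
    "warden_auto_patches_total", "Auto-patch attempts, by outcome.",
    ["outcome"], registry=registry,
)
RUNBOOKS = Counter(
    "warden_runbooks_total", "Runbooks drafted for human review.",
    registry=registry,
)
CLAUDE_LATENCY = Histogram(
    "warden_claude_latency_seconds", "Claude API call latency in seconds.",
    registry=registry,
)


def record_triage(severity: str, latency_seconds: float) -> None:
    """Count one triaged alert and record how long the model call took."""
    ALERTS.labels(severity=severity).inc()
    CLAUDE_LATENCY.observe(latency_seconds)


def record_auto_patch(succeeded: bool) -> None:
    """Count one auto-patch attempt under its outcome label."""
    AUTO_PATCHES.labels(outcome="success" if succeeded else "failure").inc()
```

Serving `generate_latest(registry)` from a `/metrics` route would expose these in the text format that Prometheus scrapes.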
Triage Audit Log
Every AI triage decision is logged with the original alert payload, Claude's severity classification, the action taken, and a timestamp — giving a complete audit trail for post-incident review and trust-building in automated decisions.
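An audit entry with the fields described above fits naturally into one JSON object per line. The sketch below is an assumption about the format, not the project's actual log schema; the `audit_record` function and its key names are hypothetical.

```python
import json
from datetime import datetime, timezone


def audit_record(alert: dict, severity: str, reasoning: str, action: str) -> str:
    """Serialise one triage decision as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        "action": action,
        "reasoning": reasoning,
        "alert": alert,  # original payload, stored unmodified
    }
    # json.dumps escapes newlines inside strings, so the result stays one line.
    return json.dumps(record, sort_keys=True)
```

Writing each line to standard output lets the cluster's log collection handle storage, and the embedded payload means a reviewer can replay the exact input behind any decision.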