Warden — Self-Healing Kubernetes Security Agent

Production-grade Kubernetes security at ~$2 total cloud spend — two independent detection layers, AI-driven triage, and automated remediation proven end-to-end on AKS. Warden combines OPA Gatekeeper admission control and Falco eBPF runtime detection with Claude Sonnet 4.6 triage: threats detected in milliseconds, low-severity incidents auto-patched in under 3 seconds, high-severity incidents escalated with a Claude-drafted runbook for human review.

  • Detection → Auto-patch: 3 sec
  • Cloud spend (all sessions): ~$2
  • Security layers: 2
  • Audit trail: 100%

Core Technologies

Python, FastAPI, Kubernetes / AKS, Falco, OPA Gatekeeper, Claude Sonnet 4.6, Azure Key Vault, Prometheus, Terraform, Azure DevOps

Architecture Components

  • Python FastAPI webhook server receiving alerts from Falco and OPA Gatekeeper
  • OPA Gatekeeper ConstraintTemplates (Rego) blocking privileged containers, images from unverified registries, and containers that violate the run-as-non-root requirement at admission
  • Falco DaemonSet using eBPF kernel probes for real-time runtime detection of shell spawns, privilege escalation, and suspicious syscalls
  • Falcosidekick forwarding Falco events to the Warden webhook endpoint inside the cluster
  • Claude Sonnet 4.6 for structured alert triage — severity classification and remediation recommendation
  • Azure Key Vault storing the Claude API key, injected at runtime via managed identity — never in source control
  • Prometheus metrics tracking alert volume by severity, auto-patch outcomes, and Claude API latency
  • Terraform provisioning the AKS cluster and Azure supporting resources
  • Azure DevOps pipeline for infrastructure deployment

[Diagram: Warden security architecture. Layer 1, admission control: Developer / CI submits workload → OPA Gatekeeper blocks bad configs → K8s API Server accepts approved workloads. Layer 2, runtime detection and response: running pods (active workloads) → Falco runtime detection (eBPF syscall probes) → Warden Agent (FastAPI webhook) → Claude Sonnet AI triage (severity classification). Low severity → auto-patch applied via the K8s API. High severity → approval gate (runbook → human review).]

Problem

Kubernetes clusters generate a constant stream of security events — admission violations, runtime anomalies, syscall alerts — that require expert triage, manual response, and accurate severity judgment. Without automation, security teams are buried in noise, slow to respond to real threats, and unable to scale coverage.

  • Runtime security events require immediate triage, but manual review is slow and error-prone at volume.
  • Admission control and runtime detection are often siloed with no unified response layer connecting them.
  • Low-severity incidents that could be auto-remediated still consume engineer time, reducing capacity for real threats.

Solution

A Python FastAPI webhook server that sits between Kubernetes security tooling and on-call engineers. Warden receives Falco and OPA alerts, classifies severity with Claude Sonnet 4.6, and acts — auto-patching low-severity incidents immediately and surfacing AI-drafted runbooks for high-severity threats that require a human decision.

  • Layer 1 (OPA Gatekeeper): Rego policies block non-compliant workloads at admission, rejecting privileged containers, images from unverified registries, and containers that violate the run-as-non-root requirement.
  • Layer 2 (Falco): eBPF kernel probes detect threats at runtime, including shell spawns, privilege escalation, and suspicious syscall patterns.
  • Claude Sonnet 4.6 triage: structured severity classification — low severity triggers auto-patch, high severity generates a runbook for human review.
  • Azure Key Vault for secret management: the Claude API key is injected at runtime, never stored in source control or container images.

OPA Gatekeeper blocking a privileged container at admission — Layer 1 enforcement working on AKS.

Security Design

  • Secrets management: Claude API key stored in Azure Key Vault, injected at runtime via managed identity — never in source control, never in container images, never in environment variables passed at build time.
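  • Least-privilege RBAC: the Warden agent has only the Kubernetes permissions it needs: patching pods in the warden-system namespace, nothing else. The rule is defined explicitly in a ClusterRole manifest; because a ClusterRole is cluster-scoped, the namespace limit holds only when it is bound with a RoleBinding in warden-system rather than a ClusterRoleBinding.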
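  • In-cluster-only communication: Falcosidekick reaches Warden over in-cluster HTTP only, so the webhook has no external exposure and no auth tokens travel in transit. This relies on the cluster network boundary rather than per-request authentication, so it is network isolation rather than zero trust in the strict sense.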
  • Audit trail by design: every Claude triage decision is logged with user context, original payload, severity classification, and action taken — not as an afterthought but as a required output of the triage function.
  • Namespace isolation: Falco runs in its own namespace with elevated permissions; the BlockPrivilegedContainers constraint explicitly excludes the falco namespace to prevent policy self-conflict.
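
The least-privilege grant described above could be written as a ClusterRole carrying a single rule, bound inside one namespace. The sketch below builds both objects as Python dictionaries and renders them as a JSON List, a format `kubectl apply -f` accepts. The object name `warden-pod-patcher` and the service account `warden-agent` are hypothetical; only the `warden-system` namespace and the "patch pods" permission come from the text.

```python
import json

NAMESPACE = "warden-system"  # namespace named in the text

# Hypothetical names, chosen for illustration only.
ROLE_NAME = "warden-pod-patcher"
SERVICE_ACCOUNT = "warden-agent"

cluster_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": ROLE_NAME},
    "rules": [
        # Core API group (""), pods only, patch only.
        {"apiGroups": [""], "resources": ["pods"], "verbs": ["patch"]},
    ],
}

# A RoleBinding (not a ClusterRoleBinding) limits the grant to one namespace.
role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": ROLE_NAME, "namespace": NAMESPACE},
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "ClusterRole",
        "name": ROLE_NAME,
    },
    "subjects": [
        {"kind": "ServiceAccount", "name": SERVICE_ACCOUNT, "namespace": NAMESPACE},
    ],
}


def render_manifests() -> str:
    """Return both objects as a JSON List document."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": [cluster_role, role_binding]},
        indent=2,
    )


if __name__ == "__main__":
    print(render_manifests())
```

Keeping the binding namespaced is what turns a cluster-scoped rule into a namespace-scoped permission.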

Observability & Operations

  • Prometheus metrics exposed at /metrics: alert_count by severity label, auto_patch_total (success/failure), claude_api_latency_seconds (histogram), warden_webhook_requests_total.
  • What each metric tells you operationally: alert_count spike = active threat or noisy rule; auto_patch failure rate climbing = Kubernetes API permissions issue or cluster instability; Claude API latency p95 > 2s = triage pipeline degrading, consider fallback severity classification.
  • Grafana dashboard: three panels — alerts by severity over time (bar chart), auto-patch success rate (stat panel), Claude API latency p50/p95 (time series). Dashboard JSON exported to repo at /docs/grafana-dashboard.json.
  • Alerting intent: in a production deployment, alert on auto_patch_failure_total > 0 (immediate page), claude_api_latency_seconds p95 > 3s (warning), and alert_count rate > 10/minute sustained (potential attack or rule misconfiguration).
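
The alerting intent above can be illustrated with a small evaluation function. In a real deployment these checks would more likely live in Prometheus alerting rules; the Python below only shows the threshold logic. The function names and the message strings are assumptions, and the alert rate is taken as an already-averaged input rather than computed over a sustained window.

```python
from statistics import quantiles

P95_WARNING_SECONDS = 3.0     # warning threshold from the text
ALERT_RATE_PER_MINUTE = 10.0  # rate threshold from the text


def p95(samples: list) -> float:
    """Return the 95th percentile of the samples (needs at least two)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples, n=100, method="inclusive")[94]


def evaluate(latencies: list, patch_failures: int, alerts_per_minute: float) -> list:
    """Return the list of alert conditions that currently hold."""
    findings = []
    if patch_failures > 0:
        findings.append("page: auto-patch failure")
    if len(latencies) >= 2 and p95(latencies) > P95_WARNING_SECONDS:
        findings.append("warning: Claude API p95 latency above 3s")
    if alerts_per_minute > ALERT_RATE_PER_MINUTE:
        findings.append("warning: alert rate above 10/minute")
    return findings


if __name__ == "__main__":
    # Hand-written sample values, not measurements from the project.
    print(evaluate([0.8, 1.1, 0.9, 4.2, 1.0], patch_failures=0, alerts_per_minute=2.0))
```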

Outcome

End-to-end Kubernetes security automation proven on a live AKS cluster: Falco detected a shell spawn, Falcosidekick forwarded the alert to Warden, Claude triaged it as low severity, Warden auto-patched and returned HTTP 200 — all within 3 seconds. Total cloud spend: ~$2.

  • Full AKS pipeline proven: shell spawn → Falcosidekick → Claude triage (severity=low) → auto-patch → HTTP 200.
  • Two-layer security coverage: OPA admission control and Falco runtime detection operating independently on the same cluster.
  • Total cloud spend across all sessions: ~$2 — production-grade Kubernetes security at near-zero cost.
  • All AI triage decisions logged with full reasoning chain, original alert payload, and action taken — complete audit trail.

Falco detecting a shell spawn inside a running container — Layer 2 runtime detection triggering the full triage pipeline.

Real-World Use Cases

The Warden architecture applies directly to any organisation running Kubernetes in a regulated or security-sensitive environment.

Healthcare Kubernetes Clusters

HIPAA-regulated environments need documented proof that runtime threats are detected, contained, and logged within seconds. Warden's audit trail records every triage decision with the original payload, severity classification, and action taken, which is the kind of evidence manual response processes struggle to produce consistently.

Financial Services & PCI-DSS

SOC 2 and PCI-DSS compliance requires continuous runtime monitoring and automated remediation workflows with full audit evidence. Warden's two-layer detection and structured Claude reasoning chain provide the documented response record auditors require.

Multi-Tenant SaaS Platforms

In a shared Kubernetes cluster, a compromised tenant workload can attempt to pivot to adjacent namespaces. Warden's runtime detection catches lateral movement attempts — shell spawns, privilege escalation, sensitive file reads — before they succeed.

Government & Defence Contractors

FedRAMP and ITAR environments require continuous monitoring with documented remediation workflows and zero manual gaps in the security response chain. Warden's approval gate for high-severity incidents ensures human oversight is enforced by architecture, not policy.

Key Learnings & Decisions

Kubernetes Security

  • OPA Gatekeeper and Falco address different threat surfaces — admission control stops misconfigurations; runtime detection catches adversarial behaviour that only appears after a workload is running.
  • Namespace-scoped exclusions are the right pattern for security tooling that needs elevated permissions — Falco's DaemonSet is privileged, so the falco namespace is excluded from BlockPrivilegedContainers constraints.
  • Parameterised OPA ConstraintTemplates add schema validation complexity that can fail across Kubernetes versions; for a small known-registry list, hardcoded Rego is simpler and more reliable.

AI-Driven Automation

  • Claude Sonnet 4.6 works as a reliable triage component when given structured inputs and constrained output formats — it classifies alerts consistently with defined severity rubrics.
  • Human approval gates for high-severity actions are non-negotiable; auto-remediation without bounds is a liability, not a feature.
  • Full logging of AI reasoning with the original alert payload enables post-incident review and builds trust in automated decisions over time.
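
One way to enforce a constrained output format is to validate the model's reply before acting on it. The sketch below assumes the reply is a JSON object with `severity` and `reason` keys, which is an illustrative shape rather than the project's actual schema. Falling back to `high` on any malformed reply is also an assumption, chosen so that an unparseable answer reaches a human instead of triggering an auto-patch.

```python
import json

ALLOWED_SEVERITIES = {"low", "high"}  # the two tiers named in the text


def parse_triage(raw: str) -> dict:
    """Parse the model's reply; fall back to 'high' so a human reviews it."""
    fallback = {"severity": "high", "reason": "unparseable triage response"}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    severity = str(data.get("severity", "")).strip().lower()
    if severity not in ALLOWED_SEVERITIES:
        return fallback
    return {"severity": severity, "reason": str(data.get("reason", ""))}


if __name__ == "__main__":
    print(parse_triage('{"severity": "low", "reason": "interactive shell in a dev pod"}'))
    print(parse_triage("I think this is probably fine"))
```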

Secrets & Operations

  • Secrets injected from external vaults can be silently corrupted — validate API key format and length at startup before accepting traffic, not reactively after the first 401.
  • The ConfigMap-mounted code pattern enables rapid iteration on cluster-internal tooling without custom image builds — acceptable for development, not for production.
  • WSL2 eBPF limitations mean local Falco detection is not possible; design the dev loop around real cluster validation for eBPF-dependent tooling from the start.
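
The startup validation mentioned in the first learning above might look like the following. The `sk-ant-` prefix check, the minimum length, and the function name are assumptions made for illustration; the project's real checks are not shown in the source. A stray trailing newline is used as the example of silent corruption.

```python
import re

# Assumptions: the prefix and the minimum length are illustrative checks,
# not values taken from the project.
KEY_PATTERN = re.compile(r"sk-ant-[A-Za-z0-9_-]+")
MIN_KEY_LENGTH = 40


def validate_api_key(key) -> list:
    """Return a list of problems; an empty list means the key looks usable."""
    if not key:
        return ["key is missing or empty"]
    problems = []
    stripped = key.strip()
    if stripped != key:
        problems.append("key has leading or trailing whitespace")
    if len(stripped) < MIN_KEY_LENGTH:
        problems.append("key is shorter than expected")
    if not KEY_PATTERN.fullmatch(stripped):
        problems.append("key has an unexpected format")
    return problems


if __name__ == "__main__":
    # Hand-made placeholder key with a trailing newline, as a corrupted secret might have.
    print(validate_api_key("sk-ant-" + "a" * 40 + "\n"))
```

Calling a check like this before the server starts listening, and refusing to start when the list is non-empty, surfaces a bad secret at deploy time instead of on the first triage request.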

Implementation Milestones

A breakdown of the key tasks and milestones that brought this project to life.

AKS Cluster Provisioning

Complete

AKS cluster and supporting Azure resources provisioned with Terraform. Azure Key Vault configured for runtime secret injection.

Key Tasks Completed

  • Terraform AKS + Azure Resources

    Cluster provisioned, Key Vault wired up, and Azure DevOps pipeline configured. Infrastructure ready for Warden deployment.

  • Azure Key Vault Integration

    Claude API key stored in Key Vault and injected at runtime. Startup validation added to catch corrupted secrets before traffic is accepted.

OPA Gatekeeper Policies

Complete

Admission control layer implemented with Rego ConstraintTemplates. Namespace exclusions configured for Falco's privileged DaemonSet.

Key Tasks Completed

  • BlockPrivilegedContainers ConstraintTemplate

    Rego policy blocks privileged containers cluster-wide. Falco namespace excluded to allow Falco's own DaemonSet to run.

  • AllowedImageRegistries ConstraintTemplate

    Registries hardcoded in Rego after parameterised schema validation errors on AKS. Simpler and more reliable.
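
The decision logic behind the two constraints can be illustrated in Python. This is not the Rego the project uses, only a sketch of what such policies evaluate. The registry prefixes are hypothetical, and applying the `falco` exclusion to the privileged check alone is an assumption based on the description above.

```python
EXCLUDED_NAMESPACES = {"falco"}  # exclusion described in the text

# Hypothetical registry prefixes; the project's real list is not shown.
# The trailing slash stops look-alike hosts such as "mcr.microsoft.com.evil.example".
ALLOWED_REGISTRIES = ("myregistry.azurecr.io/", "mcr.microsoft.com/")


def violations(pod: dict) -> list:
    """Return human-readable policy violations for a pod manifest."""
    namespace = (pod.get("metadata") or {}).get("namespace", "default")
    spec = pod.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    found = []
    for container in containers:
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        privileged = (container.get("securityContext") or {}).get("privileged", False)
        if privileged and namespace not in EXCLUDED_NAMESPACES:
            found.append(f"{name}: privileged containers are not allowed")
        if not image.startswith(ALLOWED_REGISTRIES):
            found.append(f"{name}: image {image!r} is not from an allowed registry")
    return found


if __name__ == "__main__":
    pod = {
        "metadata": {"namespace": "default"},
        "spec": {"containers": [{"name": "app", "image": "docker.io/library/nginx",
                                 "securityContext": {"privileged": True}}]},
    }
    for line in violations(pod):
        print(line)
```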

Falco Runtime Detection

Complete

Falco DaemonSet deployed on AKS with eBPF probes active. Falcosidekick configured to forward events to the Warden webhook endpoint.

Key Tasks Completed

  • Falco DaemonSet on AKS

    Falco running on every node with eBPF probes. WSL2 eBPF limitations documented — full detection validated on-cluster only.

  • Falcosidekick Webhook Forwarding

    Falcosidekick configured to POST structured Falco events to the Warden webhook endpoint inside the cluster.
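
On the receiving side, the forwarded event has to be reduced to the fields triage needs. The sketch below assumes the payload follows Falco's JSON output shape (`rule`, `priority`, `output`, `output_fields`) with Kubernetes metadata under keys such as `k8s.ns.name`; the exact payload should be checked against a real event from the cluster. The `Alert` class and the sample event are hand-written for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    rule: str
    priority: str
    namespace: str
    pod: str
    output: str


def normalise(event: dict) -> Alert:
    """Reduce a forwarded event to the fields used for triage."""
    fields = event.get("output_fields") or {}
    return Alert(
        rule=event.get("rule", "unknown"),
        priority=str(event.get("priority", "unknown")).lower(),
        namespace=fields.get("k8s.ns.name") or "unknown",
        pod=fields.get("k8s.pod.name") or "unknown",
        output=event.get("output", ""),
    )


# Hand-written sample, not a captured event.
SAMPLE_EVENT = {
    "rule": "Terminal shell in container",
    "priority": "Notice",
    "output": "A shell was spawned in a container (sample text)",
    "output_fields": {"k8s.ns.name": "default", "k8s.pod.name": "demo-pod"},
}

if __name__ == "__main__":
    print(normalise(SAMPLE_EVENT))
```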

Warden Webhook Server

Complete

FastAPI webhook server deployed via ConfigMap-mounted code. Claude Sonnet 4.6 triage integrated with auto-patch and runbook generation.

Key Tasks Completed

  • FastAPI Webhook Endpoint

    Webhook server receives Falco and OPA alert payloads and routes to the triage pipeline.

  • Claude Sonnet Triage Integration

    Structured triage with severity classification. Low severity triggers auto-patch; high severity generates a runbook for human review.

  • ConfigMap-Mounted Agent Pattern

    Agent code mounted as a Kubernetes ConfigMap and executed via a base Python image — no custom image build required for development iterations.

End-to-End Pipeline Proven

Complete

Full AKS pipeline validated: Falco detected a shell spawn, Falcosidekick forwarded to Warden, Claude triaged severity=low, auto-patch applied, HTTP 200 returned.

Key Tasks Completed

  • AKS End-to-End Demo

    Shell spawn in a container triggered Falco. Alert forwarded via Falcosidekick. Claude returned severity=low. Warden auto-patched and returned HTTP 200. Pipeline complete.

Case Study Published

Complete

Portfolio case study written and published to adventuringghost.com.

Key Tasks Completed

  • Portfolio Entry

    Case study published. Grafana Cloud dashboard screenshot captured showing warden_alerts_total spike from live Claude triage run.

Evidence of Completion

Grafana Cloud dashboard showing warden_alerts_total — the high-severity ShellSpawnedInContainer alert recorded after Claude triage, pushed via Grafana Alloy from the local agent.

Monitoring & Analysis

Prometheus Metrics

Custom metrics track alert volume by severity, auto-patch success and failure rate, Claude API response latency, and runbook generation count. Scraped by the cluster's Prometheus instance and queryable for post-incident analysis.

Triage Audit Log

Every AI triage decision is logged with the original alert payload, Claude's severity classification, the action taken, and a timestamp — giving a complete audit trail for post-incident review and trust-building in automated decisions.
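
A log entry of the kind described above could be emitted as one JSON line per decision. The field names and the `audit_line` helper are assumptions for illustration; the source only states which pieces of information are recorded.

```python
import json
from datetime import datetime, timezone


def audit_line(payload: dict, severity: str, action: str, now=None) -> str:
    """Return one JSON line describing a single triage decision."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "timestamp": timestamp,  # when the decision was made (UTC)
        "severity": severity,    # classification returned by triage
        "action": action,        # what Warden did in response
        "payload": payload,      # original alert, kept verbatim
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(audit_line({"rule": "Terminal shell in container"}, "low", "auto_patch"))
```

Writing the record inside the same function that decides the action makes it hard to act without leaving a trace.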

Warden webhook handler — receive alert, triage with Claude, act on severity
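
The flow named in the caption above can be sketched as follows. This is a minimal, framework-agnostic reconstruction rather than the project's actual code: the function name `handle_alert`, the injected callables, and the shape of the returned dictionary are all assumptions. In the project, logic like this would sit behind a FastAPI route, with the Claude call and the Kubernetes patch supplied as the `triage` and `auto_patch` callables.

```python
from typing import Callable


def handle_alert(
    alert: dict,
    triage: Callable[[dict], str],
    auto_patch: Callable[[dict], None],
    draft_runbook: Callable[[dict], str],
) -> dict:
    """Receive an alert, classify it, and act on the severity."""
    severity = triage(alert)
    if severity == "low":
        # Low severity: remediate immediately, no human in the loop.
        auto_patch(alert)
        return {"severity": "low", "action": "auto_patch"}
    # Anything else, including unexpected values, goes to a human with a runbook.
    runbook = draft_runbook(alert)
    return {"severity": "high", "action": "await_approval", "runbook": runbook}


if __name__ == "__main__":
    # Stub callables stand in for the Claude call and the Kubernetes API.
    result = handle_alert(
        {"rule": "Terminal shell in container"},
        triage=lambda a: "low",
        auto_patch=lambda a: None,
        draft_runbook=lambda a: "1. Isolate the pod. 2. Review the shell history.",
    )
    print(result)
```

Passing the dependencies in as callables keeps the routing decision testable without a cluster or an API key.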

Part of a larger arc

The AI Security & Resilience Stack

Three independent projects that together cover the full surface of an AI-augmented infrastructure stack. Warden secures the Kubernetes runtime — Falco and OPA detecting threats as they happen, Claude triaging before an engineer is paged. Covenant controls access at the application layer — OPA as the hard gate between JWT identity and Claude, policy in code not prompts. Watershed closes the loop at the edge — async telemetry buffered through connectivity loss, with Claude flagging anomalies before the data reaches the cloud. Each project stands alone; together they tell one story.

  • Warden (AKS): ~$2.00
  • Covenant (local Docker): $0.00
  • Watershed (AWS IoT Core): ~$0.05
  • Combined: ~$2.05