All posts

Writing

Speccing a Workstation for Robotics DevSecOps: Decisions, Not Defaults

Every component has a reason that traces back to the actual workload. CUDA requirements, power envelope constraints, and the Deploy-Prove-Destroy logic applied to hardware.


The previous post in this series made the case for building a dedicated workstation before anything else in the Stewardship Node project. This post is where the argument gets concrete.

Hardware specs without reasoning are just a parts list. What I want to document here is the decision process — the constraints that shaped each choice, the tradeoffs I accepted deliberately, and the ones I deferred. This is the same approach I used in the Nomad Edge cost architecture writeup, and the same approach I'd expect from any serious infrastructure design document.

The Hard Requirements

Before selecting any components, I defined the minimum performance floor that would make this machine worth building:

GPU: enough VRAM to run Isaac Sim with a full humanoid model loaded. NVIDIA Isaac Sim recommends a minimum of 8GB VRAM for basic simulation. A full bipedal robot model with articulated limbs, sensor simulation (RealSense, LiDAR), and physics is not a basic workload. 12GB is the functional minimum. 16GB+ is comfortable. This requirement alone eliminates most of the GPU market.

NVIDIA GPU required. ROS 2 GPU acceleration, Isaac Sim, and the LeRobot training pipeline all depend on CUDA. AMD's ROCm support has improved, but the robotics software stack is built on CUDA assumptions in enough places that betting on ROCm for a primary development machine introduces unnecessary friction. This is not a philosophical preference — it's a dependency graph.

CPU: sustained multi-core performance under mixed workloads. Simulation, compilation, Terraform runs, and containerized services all want cores simultaneously. I want a CPU that handles long-duration parallel loads without thermal throttling, which means paying attention to the TDP and the cooling solution, not just the benchmark.

RAM: 64GB minimum. This isn't conservative — it's the floor for running a local Kubernetes cluster, an active simulation session, a browser with documentation open, and a code editor at the same time without watching the swap file.

Storage: NVMe primary, with room for expansion. Simulation assets, Docker image caches, and Terraform state can eat through a small drive quickly. 2TB NVMe as the OS and working drive, with secondary storage for archives and training datasets.

CPU: AMD Ryzen 9

The AMD Ryzen 9 7900X or 7950X sits in the right intersection of multi-core throughput, power efficiency at load, and AM5 platform longevity. The AM5 socket will be supported through the next several CPU generations, which matters when you're building a machine intended to last through a multi-year project.

Intel's competing parts are strong, but AMD's performance-per-watt advantage at sustained loads is measurable, and the AM5 power profile integrates better with the long-term goal of running this machine on a renewable power system. At peak load a Ryzen 9 7900X draws around 170W. An equivalent Intel i9-13900K draws significantly more. Over a year of daily use that difference compounds.

The decision is partly about today's performance and partly about the power envelope when this machine eventually lives on a 48V LiFePO4 battery bank.

GPU: NVIDIA RTX 4070 Ti or 4080

The RTX 4070 Ti offers 12GB GDDR6X VRAM, full CUDA support, and a TDP of 285W. The 4080 steps up to 16GB VRAM and broader headroom at 320W. The 4GB difference matters for larger simulation scenes and training runs.

My current thinking lands on the 4070 Ti as the initial build, with the architecture designed to allow a GPU upgrade when the simulation workload grows beyond what 12GB handles comfortably. This is the Deploy-Prove-Destroy logic applied to hardware: don't overbuild before you've measured the actual load. The 4070 Ti handles everything on the immediate roadmap. When Kevin reaches the training phase and 12GB becomes a real constraint, the case is pre-justified and the upgrade path is clear.

The power envelope comparison matters here too. 285W vs 320W — across the full build, this is the difference between a 550W typical load and 650W. For a machine running on solar with a limited panel array, that 100W gap is meaningful.

Cooling and Case

This gets less attention in architecture writeups than it deserves. A machine that thermally throttles under sustained load is a machine that performs worse than its spec sheet suggests. For a workstation running simulation sessions measured in hours, not minutes, a high-quality 360mm AIO or a well-specified air cooler for the CPU is not optional — it's the component that determines whether the Ryzen 9 runs at rated boost clocks or 200MHz below them under load.

The case is selected for airflow, not aesthetics. Front-to-back airflow with a positive pressure configuration, enough 3.5" bays for the secondary storage expansion, and PCIe clearance for the GPU. No glass panels required — this machine is in a workshop, not a showroom.

The Power Number

Total system draw under sustained full load: approximately 500–600W depending on exact component selection and concurrent workload.

At 8 hours/day of active use, that's 4–4.8 kWh daily. A well-sized off-grid system for a northern Alberta climate needs to account for this load alongside the rest of the community power budget. This number goes directly into the Stewardship Node power architecture calculations — a concrete data point, not an estimate.

What I'm Not Doing

I'm not building a NAS into the primary workstation. Storage expansion lives on a dedicated machine.

I'm not running the edge compute nodes (the Raspberry Pi 4B units in the Nomad Cluster) on the workstation. Those are separate physical devices with dedicated purposes. The workstation is for development and simulation. The cluster is for running services.

I'm not buying everything at once. The CPU, motherboard, RAM, and storage go in first. The GPU follows. This is financial discipline applied to hardware purchasing — the same way I sequence cloud infrastructure builds.

The Point

Every component in this list has a reason that traces back to the actual workload. NVIDIA because the robotics stack requires CUDA. AMD because the power envelope matters downstream. 64GB RAM because I've measured what the concurrent workload actually needs. The cooling solution because sustained performance under load is a different spec than peak benchmark performance.

A workstation for serious robotics and infrastructure work isn't a consumer purchase. It's an infrastructure node. Design it accordingly.


Next in the series: the network layer — how Nomad Edge topology concepts apply when the edge is a community LAN.