Designing continuity for aging GPU fleets
Continuum GPU is an early-stage project exploring how AI teams and data centers can keep older GPU clusters online longer — with better spares strategy, failure modeling, and uptime planning.
We’re not offering commercial services yet. Right now, we’re talking to operators, engineers, and founders to deeply understand how post-warranty GPU hardware is managed.
The problem we’re seeing
As AI workloads grow, many teams are running on “mixed-generation” clusters with aging GPUs, limited spares, and expiring support windows. When a single GPU or node fails mid-training, there’s often no clear plan — just scrambling.
When hardware fails, teams often don’t know if they’re down for 30 minutes or 3 days. That uncertainty adds stress to every long-running job.
Spares are bought ad hoc, tracked in spreadsheets, and never modeled across racks, GPU generations, or utilization levels.
It’s hard to see which nodes are becoming unreliable, which GPUs are “suspect,” and where risk is quietly accumulating in the cluster.
As hardware moves out of its original support window, teams are left stitching together a strategy from vendors, used parts, and internal effort.
What we’re exploring
Continuum GPU is not a live service yet. We’re working with operators to understand what a responsible, high-signal offering for post-warranty GPU fleets should look like.
A model for predictable continuity
Instead of “we’ll see what breaks,” we’re designing an approach that models fleet health, spares pools, and recovery paths so teams can make informed decisions about cost vs. risk.
- Failure and degradation modeling over time
- Spares planning across generations and SKUs
- Playbooks for handling GPU/node failure scenarios
- Clear cost/benefit framing for extending hardware life
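To make the cost-vs-risk framing concrete, here is a minimal spares-sizing sketch. It is illustrative only: the `spares_needed` function and its parameters are hypothetical, and it assumes GPUs fail independently at a constant rate (real fleets age, correlate, and degrade), modeling failures during one replacement lead time as a Poisson process.

```python
import math

def poisson_cdf(k: int, mu: float) -> float:
    """P(X <= k) for X ~ Poisson(mu), summed directly."""
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))

def spares_needed(fleet_size: int, annual_failure_rate: float,
                  lead_time_days: float, stockout_risk: float = 0.05) -> int:
    """Smallest spares pool s such that the chance of more than s
    failures during one replacement lead time stays below stockout_risk.
    Simplifying assumption: independent failures at a constant rate."""
    mu = fleet_size * annual_failure_rate * (lead_time_days / 365.0)
    s = 0
    while 1.0 - poisson_cdf(s, mu) > stockout_risk:
        s += 1
    return s

# Example: 256 GPUs, 9% annual failure rate, 45-day replacement lead time,
# tolerating a 5% chance of running out of spares before replacements arrive.
print(spares_needed(256, 0.09, 45))  # -> 6
```

Even this toy model makes the trade-off visible: halving the lead time or the tolerated stockout risk changes the spares budget in a way a spreadsheet of ad hoc purchases never shows.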
Where conversations are most useful
We’re especially interested in teams who:
- Run their own on-prem or colo GPU clusters
- Have a mix of older and newer GPUs
- Feel pain when a single node drops during training
- Want clarity on what “good” lifecycle strategy looks like
If that’s you, we’d love to learn how you’re handling things today — and share what we’re seeing across other operators.
Who we’re talking to
We’re currently having off-the-record conversations with people responsible for keeping AI workloads running on real hardware.
- Teams training models on in-house or colo clusters who feel every hour of downtime.
- Operators who host mixed-generation GPU nodes for customers and want better lifecycle visibility.
- University and lab environments running long jobs on hardware that is aging but still critical.
- The people who get pinged when a node dies at 3 a.m. and want fewer surprises.
Interested in an early conversation?
If you’re responsible for a GPU fleet (even a small one) and want to talk about lifecycle, spares, or uptime risk, we’d love to learn from you. These are research conversations — not a commitment to buy anything.
What a conversation looks like
Typically 20–30 minutes, founder-to-founder or operator-to-operator. We’ll ask about:
- How you handle GPU/node failures today
- What you do when hardware goes out of warranty
- How you think about spares, RMA, and swaps
- What “peace of mind” would look like for you
In return, we’ll share aggregated patterns (anonymized) from other teams we talk to.