This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large, AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, which makes them a major barrier to scaling AI’s impact.
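
As a rough illustration of the figures above, mean time to failure can be converted into an expected number of interruptions per day. The sketch below assumes failures arrive at a constant average rate, which is a simplification and not something the cited research is stated to claim; the function name is hypothetical.

```python
def failures_per_day(mttf_hours: float) -> float:
    """Expected interruptions per 24 hours, assuming a constant failure rate."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return 24.0 / mttf_hours


# MTTF figures quoted above (hours), keyed by cluster size in GPUs.
reported_mttf = {1024: 7.9, 16384: 1.8}

for gpus, mttf in reported_mttf.items():
    print(f"{gpus:>6} GPUs: about {failures_per_day(mttf):.1f} failures per day")
```

Under that assumption, the quoted figures correspond to roughly 3 interruptions per day at 1,024 GPUs and roughly 13 per day at 16,384 GPUs.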

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.
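
To make the cost of a rollback concrete, the sketch below uses a common back-of-the-envelope model: if a failure is equally likely at any point between two checkpoints, the work discarded averages half the checkpoint interval, and the restart overhead is added on top. The model, the function name, and the example values are assumptions for illustration, not figures from Clockwork.io.

```python
def expected_loss_minutes(checkpoint_interval_min: float,
                          restart_overhead_min: float) -> float:
    """Average minutes lost per failure under checkpoint-restart.

    Assumes failures are uniformly distributed within a checkpoint
    interval, so half the interval is discarded on average.
    """
    if checkpoint_interval_min < 0 or restart_overhead_min < 0:
        raise ValueError("inputs must be non-negative")
    return checkpoint_interval_min / 2.0 + restart_overhead_min


# Hypothetical values: checkpoint every 60 minutes, 20 minutes to
# reprovision resources and restart the job.
per_failure = expected_loss_minutes(60, 20)
print(f"about {per_failure:.0f} minutes lost per failure")
```

Shortening the checkpoint interval reduces the discarded work but increases the time spent writing checkpoints, which is the trade-off that restart-free recovery is meant to avoid.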

TorchPass addresses this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
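
The three scenarios above can be pictured as a simple classification of triggering events. The sketch below is purely illustrative: the event names, the enum, and the `classify` function are hypothetical and do not represent the TorchPass API or its internal logic.

```python
from enum import Enum


class MigrationKind(Enum):
    UNPLANNED = "unplanned"      # sudden hard failure; state rebuilt from healthy replicas
    PREEMPTIVE = "pre-emptive"   # early warning; controlled move before a hard failure
    PLANNED = "planned"          # operator-initiated maintenance or rebalancing


# Hypothetical event names, grouped by the scenario described above.
_EVENT_KINDS = {
    "kernel_crash": MigrationKind.UNPLANNED,
    "power_failure": MigrationKind.UNPLANNED,
    "gpu_fault": MigrationKind.UNPLANNED,
    "rising_temperature": MigrationKind.PREEMPTIVE,
    "ecc_memory_error": MigrationKind.PREEMPTIVE,
    "maintenance": MigrationKind.PLANNED,
    "patching": MigrationKind.PLANNED,
    "rebalancing": MigrationKind.PLANNED,
}


def classify(event: str) -> MigrationKind:
    """Map an event name to the migration scenario it would trigger."""
    try:
        return _EVENT_KINDS[event]
    except KeyError:
        raise ValueError(f"unknown event: {event!r}") from None


print(classify("ecc_memory_error").value)
```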

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
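
The arithmetic behind that reduction can be checked directly. The sketch below takes "approximately three hours" as 180 minutes and uses ten minutes as the upper bound for "under ten minutes"; the function name is hypothetical.

```python
def reduction_percent(before_minutes: float, after_minutes: float) -> float:
    """Percentage reduction in lost time going from before to after."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return 100.0 * (before_minutes - after_minutes) / before_minutes


before = 3 * 60  # approximately three hours lost per day
after = 10       # upper bound implied by "under ten minutes"

print(f"{reduction_percent(before, after):.1f}% reduction")
```

At exactly ten minutes the reduction is about 94.4%; the quoted 95% corresponds to nine minutes of lost time per day, which is consistent with "under ten minutes".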

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX, SemiAnalysis’ independent benchmark for large-scale AI training, stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.
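
For readers unfamiliar with the metric, Model FLOPs Utilization is generally defined as the model FLOPs a job actually achieves per second divided by the aggregate peak FLOPs of the hardware. The sketch below shows that ratio, along with how time lost to recovery lowers the figure over a whole run. All numbers are placeholders chosen for illustration; they are not results from the test described above, and the function names are hypothetical.

```python
def mfu(model_flops_per_token: float, tokens_per_second: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Achieved model FLOPs per second divided by aggregate peak FLOPs."""
    achieved = model_flops_per_token * tokens_per_second
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


def effective_mfu(steady_state_mfu: float, productive_hours: float,
                  wall_clock_hours: float) -> float:
    """MFU over a whole run, counting time lost to recovery as zero progress."""
    return steady_state_mfu * productive_hours / wall_clock_hours


# Placeholder inputs, not measured values or hardware specifications.
steady = mfu(model_flops_per_token=3.0e10, tokens_per_second=800_000,
             num_gpus=64, peak_flops_per_gpu=1.0e15)
print(f"steady-state MFU: {steady:.3f}")
print(f"with 3 of 24 hours lost: {effective_mfu(steady, 21, 24):.3f}")
```

The second function illustrates why faster recovery matters for the metric: every hour spent rolling back or restarting counts against wall-clock time without contributing useful FLOPs.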

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork


Information contained on this page is provided by an independent third-party content provider. XPRMedia and this Site make no warranties or representations in connection therewith. If you are affiliated with this page and would like it removed please contact pressreleases@xpr.media
