NVIDIA GPUs on Google Cloud Run Now Generally Available!

Quick Take: Google just flipped the switch: NVIDIA GPU support for Cloud Run is now Generally Available! This means you can run your AI workloads on a serverless, pay-per-second platform that scales to zero, with no quota requests needed for NVIDIA L4 GPUs. Think rapid startup, full streaming support, multi-regional deployments, and now, GPU power for Cloud Run jobs (in preview) for tasks like model fine-tuning and batch inference.


🚀 The Crunch

🎯 Why This Matters: Google Cloud Run just made deploying GPU-accelerated AI workloads dead simple and seriously cost-effective. With NVIDIA L4 GPUs now GA, you get pay-per-second billing, scale-to-zero (no idle costs!), and rapid startup (under 5s for GPU instances). This means you can spin up AI inference, model fine-tuning, or batch processing jobs without wrestling with infrastructure or breaking the bank.

💰
Pay-Per-Second & Scale-to-Zero
Only pay for the GPU resources you consume, billed down to the second. Instances scale to zero when idle, eliminating costs for sporadic workloads. Huge for cost optimization!
⚡
Rapid Startup & Scaling
Go from zero to a running GPU instance (drivers installed) in under 5 seconds. Google reports a time-to-first-token of roughly 19 seconds for a gemma3:4b model from a cold start. Respond to demand instantly.
🚫
No Quota Needed for L4 GPUs
Get immediate access to NVIDIA L4 GPU acceleration for your Cloud Run services. Just pass --gpu 1 or check the box in the console. No waiting for quota approvals!

⚡ Developer Tip: Try it right away by deploying an open model server like Ollama across multiple regions: gcloud run deploy my-global-service --image ollama/ollama --port 11434 --gpu 1 --regions us-central1,europe-west1,asia-southeast1. The “no quota needed” policy for L4 GPUs is a massive green light to experiment with GPU acceleration on existing or new Cloud Run services without any upfront hassle.
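Before going multi-regional, a single-region deployment is the simplest way to kick the tires. A hedged sketch (service name and region are illustrative; the resource minimums in the comments reflect Cloud Run's documented GPU requirements, but double-check them for your gcloud version):

```shell
# Deploy Ollama with one NVIDIA L4 GPU to a single region (sketch).
# GPU services require generous CPU/memory allocations and CPU always
# allocated -- adjust to your workload.
gcloud run deploy my-gpu-service \
  --image ollama/ollama \
  --port 11434 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --region us-central1 \
  --cpu 4 \
  --memory 16Gi \
  --no-cpu-throttling \
  --max-instances 3
```

Setting --max-instances is a cheap guardrail while experimenting: scale-to-zero protects you when idle, but a cap protects you from a surprise traffic spike.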

Critical Caveats & Considerations

  • L4 GPUs Only for No-Quota: The “no quota request needed” policy applies specifically to NVIDIA L4 GPUs.
  • Cloud Run Jobs GPU in Preview: While services are GA, using GPUs with Cloud Run Jobs is currently in private preview – sign up required.
  • Zonal Redundancy Pricing: Default zonal redundancy offers resilience. Opting out for a lower price means best-effort failover for GPU workloads during zonal outages.
  • Region Availability: Currently in 5 regions (us-central1, europe-west1, europe-west4, asia-southeast1, asia-south1), with more planned.
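For the zonal-redundancy trade-off above, opting out is done at deploy time. A minimal sketch (the service name is a placeholder, and the exact flag name is an assumption based on the announcement; verify against `gcloud run deploy --help` for your gcloud version):

```shell
# Opt out of GPU zonal redundancy for a lower price; failover during a
# zonal outage becomes best-effort. Flag name assumed -- check your
# gcloud version before relying on it.
gcloud run deploy my-gpu-service \
  --image ollama/ollama \
  --port 11434 \
  --gpu 1 \
  --no-gpu-zonal-redundancy \
  --region europe-west1
```

Keep the default (redundant) setting for latency-sensitive production services; the opt-out is best suited to interruptible or batch-style workloads.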

🔬 The Dive

Serverless GPUs: No Longer a Pipe Dream. Developers have long loved Google Cloud Run for its simplicity, flexibility, and killer scalability. Now, Google Cloud is bringing that same magic to GPU-accelerated workloads. The general availability of NVIDIA GPU support for Cloud Run is a significant milestone, making powerful AI and compute tasks more accessible and cost-effective than ever.

Source: Live Demo

Google showcased the incredible scalability of Cloud Run with GPUs in a live demo at Google Cloud Next ’25, scaling a Stable Diffusion service from 0 to 100 GPU instances in just four minutes. This kind of rapid, on-demand scaling is precisely what modern AI applications need to handle fluctuating loads efficiently.
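Once a GPU-backed Ollama service like the one in the developer tip is live, clients can stream tokens from it over plain HTTP. A minimal Python sketch, assuming Ollama's standard /api/generate streaming endpoint (the service URL is a placeholder for whatever `gcloud run deploy` prints; gemma3:4b matches the model from the cold-start benchmark above):

```python
import json
import os

# Hypothetical service URL -- replace with the URL printed by
# `gcloud run deploy` for your own service.
SERVICE_URL = os.environ.get(
    "OLLAMA_SERVICE_URL", "https://my-global-service-example.run.app"
)


def build_generate_payload(model: str, prompt: str, stream: bool = True) -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}


def parse_stream_chunk(line: bytes) -> str:
    """Each streamed line is one JSON object; 'response' holds the next text chunk."""
    return json.loads(line).get("response", "")


if os.environ.get("OLLAMA_SERVICE_URL"):
    # Only runs against a real deployment; sketch of a streaming call.
    import urllib.request

    req = urllib.request.Request(
        SERVICE_URL + "/api/generate",
        data=json.dumps(
            build_generate_payload("gemma3:4b", "Why is the sky blue?")
        ).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            print(parse_stream_chunk(line), end="", flush=True)
```

Because responses stream line by line, the first token arrives as soon as the model produces it, which is exactly where the sub-5-second instance startup and ~19s time-to-first-token numbers pay off.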

Early adopters like vivo, Wayfair, and Midjourney are already reporting significant benefits, from reduced operational costs and faster iteration cycles to the ability to process millions of images efficiently. Wayfair, for instance, highlighted an 85% reduction in cost.

🎯 TL;DR: Google Cloud Run + NVIDIA GPUs = GA! Get pay-per-second, scale-to-zero AI power with no quota needed for L4s. Rapid startup, multi-region deploys, and GPU jobs (preview) make serverless AI simpler and cheaper. Go build something awesome!

Tom Furlanis
Researcher. Narrative designer. Wannabe Developer.
Twenty years ago, Tom was coding his first web applications in PHP. But then he left it all to pursue studies in the humanities. Now, two decades later, empowered by his coding assistants, a degree in AI ethics, and a plethora of unrealized dreams, Tom is determined to develop his apps. Developer heaven or bust? Stay tuned to discover!