Quick Take: Google just flipped the switch: NVIDIA GPU support for Cloud Run is now Generally Available! This means you can run your AI workloads on a serverless, pay-per-second platform that scales to zero, with no quota requests needed for NVIDIA L4 GPUs. Think rapid startup, full streaming support, multi-regional deployments, and now, GPU power for Cloud Run jobs (in preview) for tasks like model fine-tuning and batch inference.
🚀 The Crunch
🎯 Why This Matters: Google Cloud Run just made deploying GPU-accelerated AI workloads dead simple and seriously cost-effective. With NVIDIA L4 GPUs now GA, you get pay-per-second billing, scale-to-zero (no idle costs!), and rapid startup (under 5s for GPU instances). This means you can spin up AI inference, model fine-tuning, or batch processing jobs without wrestling with infrastructure or breaking the bank.
Getting started is as simple as adding `--gpu 1` to your deploy command or checking the GPU box in the console. No waiting for quota approvals!

⚡ Developer Tip: Try deploying an open model runtime like Ollama right away: `gcloud run deploy my-global-service --image ollama/ollama --port 11434 --gpu 1 --regions us-central1,europe-west1,asia-southeast1`. The “no quota needed” policy for L4 GPUs is a massive green light to experiment with GPU acceleration on your existing or new Cloud Run services without any upfront hassle.
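Once a service like the Ollama example above is running, you can smoke-test it over HTTP. A minimal sketch, with assumptions: the service URL is a placeholder (use the one `gcloud` prints after deploy), a model such as `gemma2:2b` has already been pulled inside the container, and the service allows unauthenticated requests:

```shell
# Placeholder URL -- substitute the real one printed by gcloud run deploy.
SERVICE_URL="https://my-global-service-xxxxx-uc.a.run.app"

# Ollama's generate endpoint; "stream": false returns one JSON response
# instead of a stream of partial tokens.
curl "$SERVICE_URL/api/generate" \
  -d '{"model": "gemma2:2b", "prompt": "Say hello in one sentence.", "stream": false}'
```

Because Cloud Run scales to zero, the first request after idle will include a cold-start delay while the instance (and model) spins up.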
Critical Caveats & Considerations
- L4 GPUs No Quota: The “no quota request needed” policy applies specifically to NVIDIA L4 GPUs.
- Cloud Run Jobs GPU in Preview: While services are GA, using GPUs with Cloud Run Jobs is currently in private preview – sign up required.
- Zonal Redundancy Pricing: Default zonal redundancy offers resilience. Opting out for a lower price means best-effort failover for GPU workloads during zonal outages.
- Region Availability: Currently in 5 regions (us-central1, europe-west1, europe-west4, asia-southeast1, asia-south1), with more planned.
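For the zonal-redundancy trade-off above, the opt-out is set at deploy or update time. A hedged sketch, assuming the `--no-gpu-zonal-redundancy` flag on the beta track (`my-service` is a placeholder name):

```shell
# Trade zonal redundancy for a lower GPU price; failover during a zonal
# outage becomes best-effort rather than guaranteed.
gcloud beta run services update my-service \
  --region us-central1 \
  --no-gpu-zonal-redundancy
```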
🔬 The Dive
Serverless GPUs: No Longer a Pipe Dream. Developers have long loved Google Cloud Run for its simplicity, flexibility, and killer scalability. Now, Google Cloud is bringing that same magic to GPU-accelerated workloads. The general availability of NVIDIA GPU support for Cloud Run is a significant milestone, making powerful AI and compute tasks more accessible and cost-effective than ever.
Google showcased the incredible scalability of Cloud Run with GPUs in a live demo at Google Cloud Next ’25, scaling a Stable Diffusion service from 0 to 100 GPU instances in just four minutes. This kind of rapid, on-demand scaling is precisely what modern AI applications need to handle fluctuating loads efficiently.
Early adopters like vivo, Wayfair, and Midjourney are already reporting significant benefits, from reduced operational costs and faster iteration cycles to the ability to process millions of images efficiently. Wayfair, for instance, highlighted an 85% reduction in cost.
🎯 TL;DR: Google Cloud Run + NVIDIA GPUs = GA! Get pay-per-second, scale-to-zero AI power with no quota needed for L4s. Rapid startups, multi-region deploys, and GPU jobs (preview) make serverless AI simpler & cheaper. Go build something awesome!