NVIDIA GPUs on Google Cloud Run Now Generally Available!

Quick Take: Google just flipped the switch: NVIDIA GPU support for Cloud Run is now Generally Available! This means you can run your AI workloads on a serverless, pay-per-second platform that scales to zero, with no quota requests needed for NVIDIA L4 GPUs. Think rapid startup, full streaming support, multi-regional deployments, and now, GPU power for Cloud Run jobs (in preview) for tasks like model fine-tuning and batch inference.


🚀 The Crunch

🎯 Why This Matters: Google Cloud Run just made deploying GPU-accelerated AI workloads dead simple and seriously cost-effective. With NVIDIA L4 GPUs now GA, you get pay-per-second billing, scale-to-zero (no idle costs!), and rapid startup (under 5s for GPU instances). This means you can spin up AI inference, model fine-tuning, or batch processing jobs without wrestling with infrastructure or breaking the bank.

💰
Pay-Per-Second & Scale-to-Zero
Only pay for the GPU resources you consume, billed down to the second. Instances scale to zero when idle, eliminating costs for sporadic workloads. Huge for cost optimization!
⚡
Rapid Startup & Scaling
Go from zero to a running GPU instance (drivers installed) in under 5 seconds. Google reports a time-to-first-token of roughly 19 seconds for a gemma3:4b model from a cold start. Respond to demand instantly.
🚫
No Quota Needed for L4 GPUs
Get immediate access to NVIDIA L4 GPU acceleration for your Cloud Run services. Just pass --gpu 1 or check the box in the console. No waiting for quota approvals!

⚡ Developer Tip: Try it right away by deploying an open model server like Ollama across multiple regions: gcloud run deploy my-global-service --image ollama/ollama --port 11434 --gpu 1 --regions us-central1,europe-west1,asia-southeast1. The “no quota needed” policy for L4 GPUs is a massive green light to experiment with GPU acceleration on existing or new Cloud Run services without any upfront hassle.
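Before going multi-regional, a single-region deployment is the simplest way to kick the tires. A hedged sketch (service name and region are illustrative; the resource minimums in the comments reflect Cloud Run's documented GPU requirements, but double-check them for your gcloud version):

```shell
# Deploy Ollama with one NVIDIA L4 GPU to a single region (sketch).
# GPU services require generous CPU/memory allocations and CPU always
# allocated -- adjust to your workload.
gcloud run deploy my-gpu-service \
  --image ollama/ollama \
  --port 11434 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --region us-central1 \
  --cpu 4 \
  --memory 16Gi \
  --no-cpu-throttling \
  --max-instances 3
```

Setting --max-instances is a cheap guardrail while experimenting: scale-to-zero protects you when idle, but a cap protects you from a surprise traffic spike.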

Critical Caveats & Considerations

  • L4 GPUs Only for No-Quota: The “no quota request needed” policy applies specifically to NVIDIA L4 GPUs.
  • Cloud Run Jobs GPU in Preview: While services are GA, using GPUs with Cloud Run Jobs is currently in private preview – sign up required.
  • Zonal Redundancy Pricing: Default zonal redundancy offers resilience. Opting out for a lower price means best-effort failover for GPU workloads during zonal outages.
  • Region Availability: Currently in 5 regions (us-central1, europe-west1, europe-west4, asia-southeast1, asia-south1), with more planned.
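For the zonal-redundancy trade-off above, opting out is done at deploy time. A minimal sketch (the service name is a placeholder, and the exact flag name is an assumption based on the announcement; verify against `gcloud run deploy --help` for your gcloud version):

```shell
# Opt out of GPU zonal redundancy for a lower price; failover during a
# zonal outage becomes best-effort. Flag name assumed -- check your
# gcloud version before relying on it.
gcloud run deploy my-gpu-service \
  --image ollama/ollama \
  --port 11434 \
  --gpu 1 \
  --no-gpu-zonal-redundancy \
  --region europe-west1
```

Keep the default (redundant) setting for latency-sensitive production services; the opt-out is best suited to interruptible or batch-style workloads.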

🔬 The Dive

Serverless GPUs: No Longer a Pipe Dream. Developers have long loved Google Cloud Run for its simplicity, flexibility, and killer scalability. Now, Google Cloud is bringing that same magic to GPU-accelerated workloads. The general availability of NVIDIA GPU support for Cloud Run is a significant milestone, making powerful AI and compute tasks more accessible and cost-effective than ever.

Source: Live Demo

Google showcased the incredible scalability of Cloud Run with GPUs in a live demo at Google Cloud Next ’25, scaling a Stable Diffusion service from 0 to 100 GPU instances in just four minutes. This kind of rapid, on-demand scaling is precisely what modern AI applications need to handle fluctuating loads efficiently.
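Once a GPU-backed Ollama service like the one in the developer tip is live, clients can stream tokens from it over plain HTTP. A minimal Python sketch, assuming Ollama's standard /api/generate streaming endpoint (the service URL is a placeholder for whatever `gcloud run deploy` prints; gemma3:4b matches the model from the cold-start benchmark above):

```python
import json
import os

# Hypothetical service URL -- replace with the URL printed by
# `gcloud run deploy` for your own service.
SERVICE_URL = os.environ.get(
    "OLLAMA_SERVICE_URL", "https://my-global-service-example.run.app"
)


def build_generate_payload(model: str, prompt: str, stream: bool = True) -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}


def parse_stream_chunk(line: bytes) -> str:
    """Each streamed line is one JSON object; 'response' holds the next text chunk."""
    return json.loads(line).get("response", "")


if os.environ.get("OLLAMA_SERVICE_URL"):
    # Only runs against a real deployment; sketch of a streaming call.
    import urllib.request

    req = urllib.request.Request(
        SERVICE_URL + "/api/generate",
        data=json.dumps(
            build_generate_payload("gemma3:4b", "Why is the sky blue?")
        ).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            print(parse_stream_chunk(line), end="", flush=True)
```

Because responses stream line by line, the first token arrives as soon as the model produces it, which is exactly where the sub-5-second instance startup and ~19s time-to-first-token numbers pay off.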

Early adopters like vivo, Wayfair, and Midjourney are already reporting significant benefits, from reduced operational costs and faster iteration cycles to the ability to process millions of images efficiently. Wayfair, for instance, highlighted an 85% reduction in cost.

🎯 TL;DR: Google Cloud Run + NVIDIA GPUs = GA! Get pay-per-second, scale-to-zero AI power with no quota needed for L4s. Rapid startup, multi-region deploys, and GPU jobs (preview) make serverless AI simpler and cheaper. Go build something awesome!

Tom Furlanis
Researcher. Narrative designer. Wannabe Developer.
Twenty years ago, Tom was coding his first web applications in PHP. But then he left it all to pursue studies in the humanities. Now, two decades later, empowered by his coding assistants, a degree in AI ethics, and a plethora of unrealized dreams, Tom is determined to develop his apps. Developer heaven or bust? Stay tuned to discover!