Edge AI Without the Trade‑offs: A Practical Multi‑Cloud Pattern Using Azure Arc Edge RAG and Google Distributed Cloud
Hybrid and multi-cloud used to be about “where to run VMs.” In 2025, it’s increasingly about “where to run AI inference and retrieval.” Two announcements this year make that clear: Microsoft’s Edge RAG (Retrieval-Augmented Generation) arrived in public preview as an Azure Arc extension, designed to keep data and models local on Azure Local (formerly Azure Stack HCI). And Google made Gemini available on-premises through Google Distributed Cloud (GDC), including an air‑gapped option. Together they mark a turning point for edge AI architectures that need cloud-scale tooling without giving up control, privacy, or latency. (learn.microsoft.com)
This article is a hands-on blueprint for teams who want to design a pragmatic, multi-cloud edge AI pattern—mixing Azure Arc Edge RAG with GDC Gemini—without getting lost in vendor hype.
Why this matters now
- Edge RAG runs as an Arc-enabled Kubernetes extension, providing a turnkey RAG pipeline that keeps the data plane local and supports CPU or GPU. Microsoft’s docs state customer content stays on-prem and only system metadata (like subscription ID and cluster names) flows to Microsoft, addressing data control and sovereignty needs. It’s currently in preview and supported on Azure Local infrastructure. (learn.microsoft.com)
- Google announced Gemini “anywhere” on GDC: GA on air-gapped and preview on connected deployments, bringing managed Gemini endpoints inside your data center with NVIDIA Blackwell support. If you need top-tier models but can’t move data, this is a big deal. (cloud.google.com)
- Azure Stack HCI was rebranded as Azure Local, unifying Microsoft’s distributed infrastructure story and clarifying that you own the hardware and operations while using Azure tools for lifecycle and policy. (learn.microsoft.com)
- Multi-cloud connectivity is improving too: Google’s Cross‑Cloud Network highlights up to 40% performance gains versus the public internet—useful when your edge solution spans providers. (cloud.google.com)
The bottom line: you can now keep sensitive data at the edge, run modern LLMs on-prem, and still use cloud-native operations across providers.
The reference pattern in a nutshell
Here’s a simple, repeatable pattern that holds up under common U.S. enterprise constraints such as latency SLAs, data residency, and vendor diversity:
- Site A (Microsoft stack):
  - Azure Local instance with AKS enabled by Arc.
  - Edge RAG extension deployed to an Arc-enabled cluster.
  - Local embeddings and vector store, documents on an on-prem NFS share.
  - Optional “bring your own model” (BYOM) endpoint exposed via an OpenAI‑compatible API (Azure AI Foundry, Foundry Local, or Ollama). (learn.microsoft.com)
- Site B (Google stack):
  - Google Distributed Cloud (connected or air‑gapped).
  - Managed Gemini endpoint on-prem for high‑capability model access.
  - Optional Agentspace search in preview for unified on‑prem search. (cloud.google.com)
- Control and governance:
  - Azure Arc controls policy, RBAC, and extensions across Arc-connected clusters—including those running outside Azure—while GDC provides its own managed control plane on the Google side. (learn.microsoft.com)
  - Cross-cloud network paths sized for low-latency fallbacks and burst offload; use private networking where available. (cloud.google.com)
This gives you two self-sufficient edge stacks (one Microsoft‑aligned, one Google‑aligned) that can cooperate when needed, without a single point of failure in any one cloud.
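To make the cooperation concrete, here is a hypothetical routing sketch for the two-stack pattern: regulated queries stay on the local Edge RAG stack, and only non-sensitive, high-capability requests go to the GDC Gemini endpoint. The site names, flags, and policy here are illustrative assumptions, not part of either product.

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    contains_regulated_data: bool  # e.g., PHI/PII flagged upstream
    needs_frontier_model: bool     # task benefits from a top-tier model

def route(query: Query) -> str:
    """Hypothetical site-selection policy for the two-stack pattern.

    Regulated data never leaves Site A (Edge RAG on Azure Local);
    everything else may use Site B (Gemini on GDC) when the task
    warrants a larger model.
    """
    if query.contains_regulated_data:
        return "site-a-edge-rag"    # data plane stays local
    if query.needs_frontier_model:
        return "site-b-gdc-gemini"  # on-prem Gemini endpoint
    return "site-a-edge-rag"        # default: cheapest local path

# A regulated query stays on the Microsoft-aligned site even if it
# would benefit from a frontier model:
print(route(Query("summarize patient record", True, True)))  # site-a-edge-rag
```

In practice this policy lives in your gateway or orchestrator; the point is that site selection is an explicit, testable decision rather than an accident of deployment.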
What changed in 2025
- Edge RAG public preview: Microsoft packaged RAG for on-prem as an Arc extension—data stays local, works with CPU/GPU, and offers a ready-made UI and pipeline. Public preview landed in May 2025. (learn.microsoft.com)
- Azure Local rename: Azure Stack HCI is now Azure Local, emphasizing a unified distributed infrastructure portfolio managed with Arc—no service disruption, same pricing and APIs. (learn.microsoft.com)
- Gemini on-prem via GDC: Google moved from “coming soon” at Next ’25 to “available” for the air‑gapped flavor by August 28, 2025; connected is in preview. (cloud.google.com)
For architecture decisions, the preview/GA details matter: today, Edge RAG supports AKS on Azure Local; it isn’t a “run anywhere” extension for any Arc‑connected cluster. GDC provides managed Gemini endpoints and scales from single servers to racks, connected or air‑gapped. (learn.microsoft.com)
A step‑by‑step way to pilot
The fastest path to value is to deploy each stack where it’s strongest, then integrate.
1) Prepare Azure Local and Arc
- Confirm your Azure Local environment (the new name for Azure Stack HCI) and plan your AKS Arc cluster sizing. For Edge RAG, Microsoft documents minimum node pool sizes (e.g., three GPU nodes plus three CPU nodes, or a CPU‑only alternative) that many pilots overlook. (learn.microsoft.com)
- Attach your Kubernetes cluster to Azure Arc and enable secure cluster connect:

```shell
az extension add --name connectedk8s
az connectedk8s connect --name edge-west --resource-group rg-edge
# Secure remote access (Cluster Connect):
az connectedk8s proxy -n edge-west -g rg-edge
kubectl get pods -A
```

This deploys Arc agents and enables remote access through Azure RBAC and a reverse proxy, even behind firewalls, if you choose that pattern. (learn.microsoft.com)
2) Deploy Edge RAG locally
- Use the portal or CLI to install the Edge RAG extension; after deployment you should see extension types microsoft.arc.rag and microsoft.extensiondiagnostics. (learn.microsoft.com)
- Decide on your model path:
  - Microsoft‑provided (e.g., Phi/Mistral) for simplicity, or
  - BYOM via an OpenAI‑compatible endpoint. Microsoft provides a clear BYOM guide (Azure AI Foundry, Foundry Local, KAITO, Ollama) and the BYOM endpoint is configured during the Edge RAG extension setup. (learn.microsoft.com)
- Network and observability:
  - For load balancing and certificates, Microsoft’s guidance includes enabling MetalLB and installing cert/trust managers (or a bundled IoT operations extension) on the cluster. Follow these documented steps so your RAG UI and APIs are reachable and observable inside your site network. (learn.microsoft.com)
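Before wiring a BYOM endpoint into Edge RAG, it helps to confirm it speaks the OpenAI chat-completions shape. The sketch below assembles such a request; the base URL, API key, model name, and `api-key` header are placeholder assumptions — the exact fields Edge RAG expects come from Microsoft’s BYOM doc, not this snippet.

```python
import json

def build_chat_request(base_url: str, api_key: str, user_message: str,
                       model: str = "phi-3"):
    """Assemble an OpenAI-compatible /chat/completions request.

    Targets Azure AI Foundry, Foundry Local, or Ollama-style servers
    that expose the OpenAI chat schema. URL and model name here are
    illustrative placeholders.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,  # some servers expect "Authorization: Bearer ..." instead
    }
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return url, headers, json.dumps(body)

url, headers, body = build_chat_request("http://byom.internal:11434/v1", "<key>", "ping")
print(url)  # http://byom.internal:11434/v1/chat/completions
```

Posting this payload with any HTTP client and getting a well-formed `choices` array back is a quick smoke test that the endpoint is Edge RAG–compatible.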
3) Stand up Google Distributed Cloud with Gemini
- Choose connected or air‑gapped. Air‑gapped is GA (and authorized for U.S. Secret/Top Secret missions); connected is in preview as of August 28, 2025. (cloud.google.com)
- Provision a managed Gemini endpoint on GDC. Google’s Next ’25 blog introduced the roadmap in April; the August follow-up confirms availability and details like NVIDIA Blackwell support and confidential computing options. (cloud.google.com)
4) Keep data local; control egress
For compliance and cost, pin your data plane on‑prem and scope egress tightly. As a simple example for the Arc cluster:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-external-egress
  namespace: arc-rag
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        # Allow only your BYOM endpoint or internal services
        - namespaceSelector: { matchLabels: { name: default } }
        - ipBlock: { cidr: 10.0.0.0/8 } # internal ranges
```
If you must call a cloud endpoint (e.g., Azure AI Foundry or a GDC connector), enumerate only those FQDNs and ports via egress gateways or explicit allow‑lists, and prefer private links where available. For cross-cloud connectivity, leverage managed offerings; Google highlights performance benefits with Cross‑Cloud Network over public internet paths. (cloud.google.com)
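It is also worth unit-testing your allow-list logic before rollout. This small sketch checks a destination against internal ranges like those in the NetworkPolicy above, using Python’s standard ipaddress module; the CIDR is the same illustrative value, not a recommendation for your network.

```python
import ipaddress

# Illustrative allow-list mirroring the NetworkPolicy's internal range
ALLOWED_CIDRS = [ipaddress.ip_network("10.0.0.0/8")]

def egress_allowed(dest_ip: str) -> bool:
    """Return True if dest_ip falls inside an allowed internal range."""
    addr = ipaddress.ip_address(dest_ip)
    return any(addr in net for net in ALLOWED_CIDRS)

print(egress_allowed("10.1.2.3"))     # True  (internal range)
print(egress_allowed("142.250.0.1"))  # False (public internet)
```

Running checks like this in CI against your documented allow-list catches drift between what compliance approved and what the cluster actually permits.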
A few gotchas to plan for
- Edge RAG platform scope. Today, Edge RAG is documented to run on Azure Arc‑enabled Kubernetes on Azure Local. It’s not a generic Arc extension for any EKS/GKE cluster you attach to Arc. Plan your site where Azure Local makes sense, rather than assuming “Arc means anywhere” for this extension. (learn.microsoft.com)
- BYOM details matter. Edge RAG’s BYOM expects an OpenAI‑compatible chat completions endpoint. If you’re using Azure AI Foundry, Foundry Local, or Ollama, follow the BYOM doc and set API‑key auth accordingly. (learn.microsoft.com)
- Capacity sizing isn’t optional. The AKS Arc node pool guidance for Edge RAG (CPU vs. GPU counts and SKUs) affects stability and latency. Start with Microsoft’s documented minimums. (learn.microsoft.com)
- Data services at the edge are shifting. Azure SQL Edge retires September 30, 2025; Microsoft recommends options like SQL Server Express/Standard or SQL Managed Instance enabled by Azure Arc for an on‑prem managed experience. If your RAG pipeline relies on local tables or event storage, plan migrations early. (azure.microsoft.com)
- Patching and operations. For Windows Server 2025 at the edge, hotpatching is now GA for Arc‑connected machines (July 16, 2025) at $1.50 per CPU core/month—useful to remove reboot windows in distributed sites. (techcommunity.microsoft.com)
Minimal “hello RAG at the edge” workflow (Azure side)
1) Connect your cluster to Arc and verify:
```shell
az connectedk8s connect --name edge-west --resource-group rg-edge
az connectedk8s proxy -n edge-west -g rg-edge
kubectl get ns
```
2) Prep AKS Arc node pools per guidance (CPU‑only or CPU+GPU):
```powershell
# Example (PowerShell) from MS docs for CPU pool
$cpuPoolName="<CPU Pool Name>"
$k8scluster="<AKS Arc Cluster>"
$rg="<Resource Group>"
$cpuVmSku="Standard_D8s_v3"
$cpuNodeCount=6
az aksarc nodepool add --name $cpuPoolName --cluster-name $k8scluster -g $rg --node-count $cpuNodeCount --node-vm-size $cpuVmSku
```
3) Deploy Edge RAG and choose BYOM or a Microsoft‑provided model. After deployment, confirm the extension shows up as microsoft.arc.rag. (learn.microsoft.com)
4) If using BYOM, set up an OpenAI‑compatible endpoint (Azure AI Foundry, Foundry Local, Ollama) and plug its URL+API key into Edge RAG’s configuration. Microsoft’s doc provides exact formats (e.g., Azure AI Foundry chat completions URL). (learn.microsoft.com)
5) Configure MetalLB, certs/trust, and DNS so the local RAG UI/API are reachable within your site and through your SSO boundary (Microsoft Entra ID). (learn.microsoft.com)
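Step 3’s verification can be scripted: list the cluster’s extensions (e.g., `az k8s-extension list ... -o json`) and check that the documented extension types are present. The helper below assumes each entry in the JSON output carries an `extensionType` field — verify the shape your CLI version actually emits.

```python
import json

# Extension types the Edge RAG docs say should appear after deployment
EXPECTED = {"microsoft.arc.rag", "microsoft.extensiondiagnostics"}

def missing_extensions(az_json: str) -> set:
    """Given `az k8s-extension list -o json` output, return any expected
    Edge RAG extension types that are not yet installed.

    Assumes each entry has an "extensionType" field; adjust if your
    CLI version emits a different shape.
    """
    installed = {e.get("extensionType", "").lower() for e in json.loads(az_json)}
    return EXPECTED - installed

sample = json.dumps([{"extensionType": "microsoft.arc.rag"}])
print(missing_extensions(sample))  # {'microsoft.extensiondiagnostics'}
```

A check like this slots neatly into a site-onboarding pipeline, failing fast when a deployment only partially converged.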
Minimal “hello Gemini on-prem” workflow (Google side)
1) Pick GDC mode:
- Air‑gapped (GA), or
- Connected (preview).
2) Provision the managed Gemini endpoint on GDC and confirm NVIDIA hardware support if you need higher throughput. Confidential computing options are available. (cloud.google.com)
3) Optionally pilot Agentspace search (preview) for out‑of‑box, permission‑aware enterprise search over on‑prem data. (cloud.google.com)
4) Wire up cross‑cloud traffic only where necessary (e.g., for comparisons, observability export, or failover). Consider private or managed network options for predictable latency and cost. (cloud.google.com)
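Once the GDC Gemini endpoint is provisioned, requests follow the Gemini generateContent schema. The sketch below builds such a payload; the model name and endpoint path are placeholder assumptions — your GDC deployment’s console is the source of truth for the models, hostname, and auth it actually exposes.

```python
import json

def build_generate_content(prompt: str, model: str = "gemini-2.0-flash"):
    """Assemble a Gemini-style generateContent request path and body.

    Model name and path are illustrative; check your GDC deployment
    for what it actually serves.
    """
    path = f"/v1/models/{model}:generateContent"
    body = {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
    }
    return path, json.dumps(body)

path, body = build_generate_content("Summarize today's maintenance log.")
print(path)  # /v1/models/gemini-2.0-flash:generateContent
```

Because both stacks now expose HTTP endpoints with well-known schemas, a thin client layer can compare answers from Edge RAG and Gemini during the pilot without coupling the sites.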
Operating the pattern: a short checklist
- Identity and RBAC
  - Use Entra RBAC with Arc‑connected clusters where supported; scope to least privilege. (learn.microsoft.com)
- Patching
  - Turn on Windows Server hotpatching for Arc‑connected servers to avoid reboot windows at remote sites. (techcommunity.microsoft.com)
- Observability
  - Use Microsoft’s recommended cert/trust setup for the Arc cluster; on GDC, lean on managed logs and auditing. (learn.microsoft.com)
- Data store choices
  - If you were relying on Azure SQL Edge, plan migrations (SQL Server or Arc‑enabled SQL MI). Revisit resource limits and storage performance for vector databases used by RAG. (azure.microsoft.com)
- Networking
  - Validate egress allow‑lists and DNS early; combine with cross-cloud networking where truly needed. (cloud.google.com)
When to prefer which stack
- Choose Edge RAG on Azure Local when:
  - You need a turnkey, local RAG experience aligned with Azure identity/policy.
  - You want to host a small or mid‑sized model locally and keep costs predictable, with the option to plug in OpenAI‑compatible endpoints. (learn.microsoft.com)
- Choose GDC Gemini when:
  - You need Google’s latest managed models on‑prem, including GPU‑accelerated endpoints and confidential computing.
  - You’re building with Google’s Vertex and Agentspace patterns and want the on‑prem equivalent. (cloud.google.com)
In many enterprises, you’ll do both—matching each site’s constraints and skill sets.
Final thoughts
Edge AI in 2025 isn’t “cloud versus on‑prem.” It’s “cloud‑operated, locally executed”—with vendor choice per site. Microsoft’s Edge RAG gives you a managed, on‑prem RAG pipeline anchored in Azure Arc and Azure Local. Google’s GDC brings Gemini behind your firewall with options for connected or fully air‑gapped operations. With careful egress controls, identity, and consistent operations, you can deploy edge AI where it belongs—next to the data and users—without sacrificing enterprise guardrails or multi‑cloud flexibility. (learn.microsoft.com)
If you’re starting this month:
- Pick one pilot site for each stack.
- Keep data local, wire up just enough cross‑cloud to compare results and monitor.
- Bake in patching and migration plans (e.g., for SQL Edge) up front.
Small, well‑scoped wins will snowball—and you’ll avoid the lock‑in trap while delivering measurable value fast. (azure.microsoft.com)
Resources to bookmark:
- Edge RAG overview and “What’s new” pages. (learn.microsoft.com)
- Azure Local rename FAQ. (learn.microsoft.com)
- Arc‑enabled Kubernetes overview and quickstart. (learn.microsoft.com)
- GDC: Gemini on‑prem (April announcement, August availability). (cloud.google.com)
- Windows Server hotpatching for Arc‑connected servers. (techcommunity.microsoft.com)
This is an edge AI pattern you can actually run—and expand—today.