Over the years I have acquired a healthy scepticism about expecting things in IT to “just work”, but one of the joys of my job as a GCP Engineer is that I am pleasantly surprised from time to time, and often it is with GCP networking. And my surprise is always followed by an investigation to figure out why…
A recent example was with a large Australian company with GCP resources in the Sydney and Melbourne regions (actually australia-southeast1 and australia-southeast2). They have an interconnect back to their on-premises network in each region as shown below.
You will notice that the Sydney interconnect is down for maintenance, which is exactly as I found it when I started a project for this customer. I should clarify that the entire environment is pre-production and there are no public Internet addresses used in the GCP environment.
When I first heard that a Sydney interconnect was offline, I assumed that perhaps there was redundant interconnectivity available through another zone in the Sydney region, but that was not the case. I knew that no loss of connectivity had been reported, so it was obvious that the Google network must be routing traffic to and from the Sydney region via the Melbourne interconnect. This was clearly a good thing in my customer’s case, but I was still curious to know under what circumstances this “just works”. After all, customers who have their own expensive private WAN don’t necessarily want or expect to use Google Cloud as a WAN and incur egress charges for doing so. And what if there are many regions involved? Would you expect every interconnect in every region to be able to reach every other region over Google’s network? So it’s not necessarily a question of one simple behaviour being right for all situations.
First of all, I want to emphasise that, unlike other major cloud providers, Google’s VPCs are global and subnets are regional. No magic or effort is required to ensure that your VPC traffic routes correctly to subnets in regions all across the globe. But the question here is: under what circumstances will Google exchange routes with your on-premises network so that your hybrid network as a whole routes globally?
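To illustrate the global-VPC/regional-subnet model, a single VPC can hold subnets in both Australian regions with no extra routing configuration. A minimal sketch (all names and ranges here are illustrative, not from my customer’s environment):

```shell
# A VPC is a global object in GCP; no region is specified when creating it.
gcloud compute networks create demo-vpc --subnet-mode=custom

# Subnets, by contrast, are regional: one in Sydney, one in Melbourne.
# Traffic between them routes automatically over Google's backbone.
gcloud compute networks subnets create syd-subnet \
    --network=demo-vpc --region=australia-southeast1 --range=10.1.0.0/20
gcloud compute networks subnets create mel-subnet \
    --network=demo-vpc --region=australia-southeast2 --range=10.2.0.0/20
```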
The key to this behaviour is a Google VPC’s dynamic routing mode, which defaults to regional. This setting is used by Cloud Routers within the VPC to determine whether they should advertise all subnets via BGP or only the subnets in the same region as the Cloud Router. To clarify, a Cloud Router is not actually a router as such: it is a Google software service that provides a control plane, while the actual routing is built into the VPC itself. When you configure an interconnect, which uses BGP as its dynamic routing protocol, you need a Cloud Router in that VPC to peer with the router at the on-premises end of the interconnect link.
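You can inspect and change a VPC’s dynamic routing mode with gcloud. A sketch, assuming a VPC named demo-vpc:

```shell
# Show the VPC's current dynamic routing mode (REGIONAL by default).
gcloud compute networks describe demo-vpc \
    --format="value(routingConfig.routingMode)"

# Switch the VPC to global dynamic routing, so every Cloud Router in it
# advertises subnets from all regions, not just its own.
gcloud compute networks update demo-vpc --bgp-routing-mode=global
```

The same `--bgp-routing-mode` flag can be set at network creation time if you know up front that you want global routing.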
So if the dynamic routing mode for a VPC is set to regional, then every Cloud Router used by that VPC will advertise only the subnet routes from its own region. If my customer’s VPC had been running in the default regional mode, then the resilience I observed would not have worked. Instead, all of their VPCs were configured in global mode, which meant that the VPC subnets in both regions were advertised to the Melbourne on-premises router. This ensured that the Sydney on-premises router could learn from the Melbourne router that the Sydney GCP subnets were reachable via Melbourne. Of course, if the Sydney interconnect hadn’t been down, the Sydney on-premises router would have preferred it.
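A Cloud Router is created per region, which is why the routing mode matters: in regional mode a Melbourne router simply has nothing to say about Sydney subnets. A sketch of creating one and checking what it is actually advertising and learning (router name, ASN, and VPC name are illustrative):

```shell
# Create a Cloud Router in Melbourne to terminate BGP for that
# region's interconnect attachment.
gcloud compute routers create mel-router \
    --network=demo-vpc \
    --region=australia-southeast2 \
    --asn=64512

# Once BGP sessions are up, get-status shows the routes this router
# has learned from, and is advertising to, its on-premises peers.
gcloud compute routers get-status mel-router \
    --region=australia-southeast2
```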
Those who are familiar with BGP routing will also realise that, even in global mode, you can still restrict which routes are advertised and tweak path preferences in your BGP configuration. So global mode retains all the flexibility built into BGP itself.
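Cloud Router exposes this BGP flexibility directly. For example, you can replace the automatic subnet advertisements with a custom prefix list, or adjust the advertised route priority (BGP MED) so the on-premises side prefers one path over another. A sketch, with all names and ranges illustrative:

```shell
# Switch the router to custom advertisement mode: only the listed
# prefixes are advertised, replacing the automatic subnet routes.
gcloud compute routers update mel-router \
    --region=australia-southeast2 \
    --advertisement-mode=custom \
    --set-advertisement-ranges=10.1.0.0/20,10.2.0.0/20

# Raise the MED on one BGP peer (higher value = less preferred) so
# on-premises routers only use this path when the primary is down.
gcloud compute routers update-bgp-peer mel-router \
    --region=australia-southeast2 \
    --peer-name=onprem-peer \
    --advertised-route-priority=200
```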
The key takeaway for me from this experience was that a VPC’s dynamic routing mode can be essential to ensuring that network resiliency works as planned when you need it.