For most of the workloads I’ve deployed in Azure that required load balancing, the Azure Load Balancer (ALB) was used in its default, out of the box configuration. The load balancer service is great like that: for the majority of scenarios it just works. I’m sure this isn’t an Azure-only experience either. The other public cloud providers have great out of the box load balancing services that work with just about any workload without in-depth configuration.
You can see that I’ve been repetitive on the point around out of the box experience. This is where I think I’ve become complacent in thinking that this out of the box experience should work in the majority of circumstances.
My problem, as outlined in this blog post, is that I’ve experienced both the Azure Service Manager (ASM) load balancer and the Azure Resource Manager (ARM) load balancer not working as intended…
UPDATE 2018-07-13 - The circumstances in both of the examples given in this blog post assume that the workloads in the backend pool are configured correctly AND that the Azure Load Balancer is also configured correctly as per Microsoft recommendations. The specific solution that I’ve outlined came about after making sure that all the settings were checked, checked again and also had Microsoft Premier Support validate the config.
From what I’ve been told by Microsoft Premier Support, the Azure Load Balancer has had a 5-tuple distribution algorithm, based on source IP, source port, destination IP, destination port and protocol type, since its inception. However, that was certainly not the case, as I’ll explain in the next paragraph. In theory this 5-tuple mode should work well with just about any scenario, because at the end of the day the distribution is still round robin between endpoints. Where it can cause issues is when sessions need to stick to a particular endpoint.
In a blog post [1] from way back when, Microsoft outlines that the distribution mode options for the ALB were updated to accommodate RDS Gateway. There are a total of 3 distribution modes: 5-tuple (mentioned before), 3-tuple (source IP, destination IP and protocol) and 2-tuple (source IP and destination IP) [2].
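To make the three modes concrete, here is a minimal Python sketch of hash-based distribution. It is an illustration only: the hash the Azure Load Balancer actually uses is not covered in this post, so SHA-256, the `pick_backend` function, the `Flow` type and the sample addresses are all assumptions made for the example.

```python
import hashlib
from collections import namedtuple

# One inbound flow as a load balancer would see it (hypothetical type).
Flow = namedtuple("Flow", "src_ip src_port dst_ip dst_port protocol")

# Which fields feed the hash in each distribution mode.
MODE_FIELDS = {
    "5-tuple": ("src_ip", "src_port", "dst_ip", "dst_port", "protocol"),
    "3-tuple": ("src_ip", "dst_ip", "protocol"),
    "2-tuple": ("src_ip", "dst_ip"),
}


def pick_backend(flow, mode, backend_count):
    """Hash only the fields the mode uses and map the result to a backend index.

    SHA-256 is a stand-in; the real Azure hash is an assumption here.
    """
    key = "|".join(str(getattr(flow, field)) for field in MODE_FIELDS[mode])
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % backend_count


# Two connections from the same client that differ only in source port.
first = Flow("203.0.113.10", 50001, "10.0.0.4", 443, "tcp")
second = first._replace(src_port=50002)

for mode in MODE_FIELDS:
    print(mode, pick_backend(first, mode, 4), pick_backend(second, mode, 4))
```

In 3-tuple and 2-tuple mode the two connections always map to the same backend index, because the source port is not part of the key. In 5-tuple mode they may or may not land on the same one.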
Now that we have established the configuration options and roughly when they came about, let’s get stuck into the impact in two scenarios, months apart and in both ASM Classic and ARM deployment modes…
Earlier this year at a customer, we ran into a problem where a number of Azure workloads were hard reset. This was caused by either an outage in the region or some scheduled or unscheduled maintenance. Nothing too serious sounding, until we found that the Network Device Enrollment Service (NDES) server was not able to accept traffic from the Web Application Proxy (WAP) server that was inline and “north” of it. The WAP itself was in a Cloud Service (so an ASM/Classic environment here) where there were multiple WAP servers that leveraged Load Balanced Sets (or the Azure Internal Load Balancer) as part of the Cloud Service.
The odd thing was that since the outage or maintenance had happened, inbound NDES traffic via the WAP had suddenly become erratic. Certificates that were requested via NDES (from Intune in this circumstance) were for the most part not being completed. So ensued a long and enjoyable Microsoft Premier case that involved the Azure Product Group (sarcasm intended).
I’ll keep this short and sweet as I would rather save you, dear reader, from having to relive that incident. The outcome was as follows:
It was determined that the ASM Cloud Service Load Balanced Set (or Azure Load Balancer, Azure Internal Load Balancer) configuration was set to the out of the box default of 5-tuple distribution. While this implementation of the WAP + NDES solution had been in production for at least 2-3 years, working without fault or issue, it was not the correct configuration. It was determined that the correct configuration for this setup was to leverage either 2-tuple (source and destination IPs) distribution, or 3-tuple (source IP, destination IP and protocol). We went with the more specific 3-tuple and that resolved the connectivity issues.
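A rough way to picture the difference is to simulate one client opening many connections and count how many backend servers it ends up talking to. This is a sketch under assumptions: SHA-256 stands in for whatever hash the load balancer really uses, the addresses and ports are made up, and `backends_hit` is a hypothetical helper. It also assumes the workload needs successive connections from the same client to reach the same server.

```python
import hashlib


def backend_for(fields, backend_count):
    """Map a tuple of flow fields to a backend index (SHA-256 is a stand-in hash)."""
    key = "|".join(str(field) for field in fields).encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % backend_count


def backends_hit(mode, backend_count=2, connections=50):
    """Return the set of backend indexes one client reaches over many connections."""
    src_ip, dst_ip, dst_port, proto = "203.0.113.10", "10.0.0.4", 443, "tcp"
    hit = set()
    # Each new connection from the client uses a new ephemeral source port.
    for src_port in range(50000, 50000 + connections):
        if mode == "5-tuple":
            fields = (src_ip, src_port, dst_ip, dst_port, proto)
        elif mode == "3-tuple":
            fields = (src_ip, dst_ip, proto)
        else:  # "2-tuple"
            fields = (src_ip, dst_ip)
        hit.add(backend_for(fields, backend_count))
    return hit


for mode in ("5-tuple", "3-tuple", "2-tuple"):
    print(mode, "->", len(backends_hit(mode)), "backend(s) reached")
```

With 3-tuple or 2-tuple the client is pinned to a single backend no matter how many connections it opens, whereas 5-tuple lets the changing source port spread those connections around.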
The PowerShell to execute this solution (setting the ASM Cloud Service LBSet to 3-tuple: source IP + destination IP + protocol) is as follows:
Set-AzureLoadBalancedEndpoint -ServiceName "[CloudServiceX]" -LBSetName "[LBSetX]" -Protocol tcp -LoadBalancerDistribution "sourceIPProtocol"
The specific parameter (-LoadBalancerDistribution) that sets the load balancer distribution algorithm has the following valid values: sourceIP (2-tuple), sourceIPProtocol (3-tuple) and none (the default 5-tuple).
Recently, in another work stream, we ran into the same issue. However, the circumstances were slightly different. The problem parameters this time were:
Having gone through the load balancing distribution mode issue only a few months earlier, I had it fresh in my mind, so I suggested investigating that. After the parameter was changed in this second instance, we were able to resolve the issue again and get the workload working as intended via the Azure Load Balancer.
With Azure Resource Manager, there are a couple of ways you can go about the configuration change. The most common way would be to change the JSON template, which is quick and easy. The below is an example of the section around load balancing rules, which has the specific “loadDistribution” parameter that would need to be changed. The ARM load balancer has basically the same configuration options as its ASM counterpart for this setting: SourceIP and SourceIPProtocol. The only difference is that there is a “Default” option, which is the ARM equivalent of the ASM “None” (default = 5-tuple). See [New-AzureRmLoadBalancerRuleConfig](https://docs.microsoft.com/en-us/powershell/module/azurerm.network/new-azurermloadbalancerruleconfig?view=azurermps-6.4.0).
"loadBalancingRules": [
    {
        "name": "[concat(parameters('loadBalancers_EXAMPLE_name'))]",
        "etag": "W/\"[XXXXXXXXXXXXXXXXXXXXXXXXXXX]\"",
        "properties": {
            "provisioningState": "Succeeded",
            "frontendIPConfiguration": {
                "id": "[parameters('loadBalancers_EXAMPLE_id')]"
            },
            "frontendPort": XX,
            "backendPort": XX,
            "enableFloatingIP": false,
            "idleTimeoutInMinutes": X,
            "protocol": "TCP",
            "loadDistribution": "SourceIP",
            "backendAddressPool": {
                "id": "[parameters('loadBalancers_EXAMPLE_id_1')]"
            },
            "probe": {
                "id": "[parameters('loadBalancers_EXAMPLE_id_2')]"
            }
        }
    }
],
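If there are several rules or templates to update, the edit can be scripted rather than done by hand. The following is a minimal Python sketch, assuming the template has already been exported to plain JSON with its placeholders filled in. The `set_load_distribution` helper and the cut-down sample template are hypothetical and exist only for illustration.

```python
import json

# The three values the ARM loadDistribution setting accepts.
VALID_MODES = {"Default", "SourceIP", "SourceIPProtocol"}


def set_load_distribution(template, mode):
    """Set loadDistribution on every load balancing rule in the template.

    Returns the number of rules changed.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown distribution mode: {mode}")
    changed = 0
    for resource in template.get("resources", []):
        if resource.get("type") != "Microsoft.Network/loadBalancers":
            continue
        rules = resource.get("properties", {}).get("loadBalancingRules", [])
        for rule in rules:
            rule.setdefault("properties", {})["loadDistribution"] = mode
            changed += 1
    return changed


# A cut-down, hypothetical template used only to exercise the function.
template = json.loads("""
{
  "resources": [
    {
      "type": "Microsoft.Network/loadBalancers",
      "name": "example-lb",
      "properties": {
        "loadBalancingRules": [
          {
            "name": "rule-443",
            "properties": {"protocol": "Tcp", "loadDistribution": "Default"}
          }
        ]
      }
    }
  ]
}
""")

print("rules changed:", set_load_distribution(template, "SourceIPProtocol"))
print(json.dumps(template["resources"][0]["properties"]["loadBalancingRules"], indent=2))
```

Rejecting unknown values up front is a deliberate choice here, so that a typo in the mode name fails in the script rather than at deployment time.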
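The alternative option would be to just use PowerShell. The change is made on the load balancer object in memory and then has to be written back with Set-AzureRmLoadBalancer. To do that, you can execute the following (this assumes the rule to change is the first rule on the load balancer):
# Fetch the load balancer, change the distribution mode on its first rule, then commit the change
$lb = Get-AzureRmLoadBalancer -Name "[LBName]" -ResourceGroupName "[RGName]"
$lb.LoadBalancingRules[0].LoadDistribution = "[Parameter]"
Set-AzureRmLoadBalancer -LoadBalancer $lb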
For the most part I would usually go with the default for any configuration. Through this exercise I have come to question the load balancing requirements of each workload, and to be specific about the distribution mode, to avoid any possible fault. Certainly, this is a practice that should extend to every aspect of Azure. The only challenge is striking a balance between questioning every configuration and simply going with the defaults. Happy balancing! (Pun intended).