I recently attempted to stand up a secondary domain controller for my home network in Windows Azure. I used the cross-premise VPN connectivity option to establish an IPSec VPN between my home Juniper SRX210 gateway router (connected to Comcast cable modem service) and the Windows Azure network. On Azure, I used the vNet concept to extend my home network into Azure using RFC1918 address space.
Azure’s recommended SRX configuration guide includes this setting: “set security flow tcp-mss ipsec-vpn mss 1350”. The guide, however, doesn’t mention anything else about MTU, fragmentation, local MTU settings, etc.
When I went to perform directory replication, the new domain controller would hang for at least 5 minutes and then fail with a DC replication error similar to: “Active Directory could not replicate the directory partition CN=Configuration….from the remote domain controller ‘server name’. The remote procedure call was cancelled.” I also had issues transferring files over basic Windows file shares.
I googled the error, and various links pointed me to setting up different DC sites, putting the replica DC in its own site, manipulating the cost for that site (which increases the timeout), and so on, but nothing worked.
Eventually, I ran a packet capture to see whether the two servers were communicating and what traffic was being exchanged. This is when I realized that the server in Azure was seeing tons of TCP retransmissions.
I spent some more time troubleshooting and finally came to the conclusion that I had fragmentation issues, so I resorted to using ping to find the maximum packet size that wouldn’t be fragmented. My first assumption had been that the tcp-mss ipsec-vpn mss setting would alleviate any fragmentation issues and that there was no need to adjust my hosts’ MTU settings. Well, I was wrong.
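The ping test boils down to searching for the largest payload that still crosses the tunnel with the Don't Fragment bit set. Here is a minimal Python sketch of that search. The `probe` callback is hypothetical (it is not part of my setup): it should return True when a ping of the given payload size gets through unfragmented, for example by wrapping Windows' `ping -f -l <size> <host>`. The 28 bytes of overhead assume IPv4 with a 20-byte header (no options) plus the 8-byte ICMP header.

```python
from typing import Callable

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def largest_unfragmented_payload(probe: Callable[[int], bool],
                                 low: int = 0, high: int = 1472) -> int:
    """Binary-search the largest ICMP payload for which probe(size) is True.

    Assumes probe is monotonic: if a size passes, every smaller size passes.
    Returns -1 if even the smallest size fails.
    """
    if not probe(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid       # mid fits; try larger payloads
        else:
            high = mid - 1  # mid was too big; try smaller payloads
    return low


def path_mtu_from_payload(payload: int) -> int:
    """Convert an ICMP payload size into the size of the whole IP packet."""
    return payload + IPV4_HEADER + ICMP_HEADER
```

As a usage example with a made-up path limit of 1400 bytes, `largest_unfragmented_payload(lambda size: size + 28 <= 1400)` returns 1372, and `path_mtu_from_payload(1372)` gives back 1400. The real number for a given tunnel has to come from actual pings.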
What the tcp-mss ipsec-vpn command actually does is rewrite the TCP MSS value in the TCP SYN packet as it leaves the router. I suspect that my primary DC was sending packets larger than 1350 bytes, which resulted in fragmentation, and for whatever reason the fragments were not re-assembled correctly, were delayed in re-assembly, or were not transmitted properly, although I haven’t yet pinned down the exact technical issue.
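To make the relationship between MSS and MTU concrete, here is the arithmetic as a small Python sketch. It assumes IPv4 and TCP headers of 20 bytes each with no options, and the function names are only for illustration. It deliberately says nothing about IPSec overhead, which depends on how the tunnel is configured.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one IP packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def packet_size_for_mss(mss: int) -> int:
    """Size of the IP packet that carries a full segment of the given MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


print(mss_for_mtu(1500))          # 1460
print(packet_size_for_mss(1350))  # 1390
```

So a host on a standard 1500-byte Ethernet MTU would normally advertise an MSS of 1460, while a TCP segment clamped to an MSS of 1350 travels as a 1390-byte IP packet before the tunnel adds its own headers. The clamp only applies to TCP connections whose SYN passes through the router; it does not change the MTU the host itself uses.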
I decided to try changing the MTU on the servers themselves, and this fixed the problem and DC replication happened within minutes.
So, if you’re using Azure or IPSec and doing DC replication – make sure you set your MSS and your MTU!