I was recently asked by a customer for some help with their new VNX 5500 File + ESXi farm. This was a greenfield design, and so the customer had gone down a pretty cool route for the networking.
They had a single Extreme Networks Summit 10GbE switch, connected to both the hosts and the VNX. Two ports on each host were connected to the switch and configured as an active LACP group. The same was true for the VNX datamovers: two ports, active LACP.
Everything on the switch showed 'Up', as did the array and the hosts. The array and hosts were configured per best practices for IP hash load balancing, beacon probing, etc.
However, they had a problem: any time they tried to deploy to the array's datastore over NFS, or even just browse it, they'd get a pause, and then the datastore would go offline from the ESXi host's perspective. Looking at vmkernel.log, we saw the datastore getting marked All Paths Down (APD).
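If you want to confirm what the host is seeing, the ESXi shell is the quickest route. Something like the following works (the log path shown is typical for ESXi 5.x; older releases log to /var/log/vmkernel instead):

```shell
# Pull recent NFS / APD-related events out of the vmkernel log
grep -iE "apd|all paths down|nfs" /var/log/vmkernel.log | tail -20
```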
My first thought was that this was network related. Something flapping on and off line usually points to the network; if the link weren't up at all, it would never connect in the first place, and raw performance would be a totally different issue.
So, I checked jumbo frames first. The config looked correct, but to rule it out entirely, we knocked everything back down to 1500. No dice :(
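For anyone following along at home, the standard way to verify jumbo frames end-to-end from an ESXi host is vmkping with the don't-fragment flag. The datamover IP below is a placeholder:

```shell
# 8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers; -d sets
# "don't fragment", so this only succeeds if jumbo frames work end-to-end.
# 10.0.0.50 stands in for the VNX datamover's NFS interface.
vmkping -d -s 8972 10.0.0.50

# After dropping back to a 1500 MTU, the equivalent sanity check:
vmkping -d -s 1472 10.0.0.50
```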
Next we started looking at the LACP configs on the array, hosts, and switch. They all looked right at first glance. But when I looked carefully at the config for the LACP trunks on the switch side, they were set for L2 (MAC-based) load balancing. As astute readers will notice, that's NOT the same as IP hash (which is an L3 method), which we were using above.
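To see why the mismatch matters, here's a toy sketch of how the vSwitch's "Route based on IP hash" teaming picks a physical NIC. This is a simplified model with invented addresses, not VMware's actual code: it XORs the source and destination IPs and takes the result modulo the number of uplinks. An L2 hash on the switch uses MAC addresses as inputs instead, so the two ends of the trunk can disagree about which physical link a given flow belongs on.

```shell
#!/bin/bash
# Simplified model of "Route based on IP hash" teaming: the uplink for
# a flow is (src IP xor dst IP) mod (number of uplinks in the team).
# All addresses below are invented for illustration.

ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

uplinks=2
src=$(ip_to_int 10.0.0.21)                 # ESXi vmkernel port
dst=$(ip_to_int 10.0.0.50)                 # VNX datamover NFS interface
uplink=$(( (src ^ dst) % uplinks ))
echo "flow 10.0.0.21 -> 10.0.0.50 hashes to uplink $uplink"
```

A MAC-based (L2) hash fed the same two endpoints would compute its answer from the frame's MAC addresses, which is exactly how you end up with traffic leaving one link and the switch expecting it on another.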
So, we tore down the LACP group and rebuilt it with L3 load balancing. With that, everything came online, browsing was fast, and we were able to deploy VMs.
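On ExtremeXOS the rebuild amounts to something like the following. The port numbers here are placeholders; use the real master port and member list for the trunk in question:

```shell
# ExtremeXOS: tear down the existing LAG and recreate it with an
# L3 (IP-based) hash so it matches the vSwitch's IP-hash teaming.
# Port 1 is the assumed master port; ports 1-2 are the trunk members.
disable sharing 1
enable sharing 1 grouping 1,2 algorithm address-based L3 lacp

# Confirm the group shows the L3 algorithm and LACP is up
show sharing
```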
Moral of the story: Just because your network team says they followed instructions, doesn't mean they did :)