Exaforge

Cloud, DevOps, Evangelism

Part 5: Building A Large VMAX & NFS Environment

This is the last post in a series describing the design considerations of a VMAX + VG8 NFS environment for running a significant vSphere farm.  I've covered the requirements, the high-level design for the VMAX and VG8 components, and the implementation details for the VMAX.  This final post covers the same implementation details for the VG8 gateway.

Installation

Installation of the VG8 is very straightforward if you follow the instruction manual available on the support site.  Every step is prompted and reasonably well described.

If you've properly completed the zoning and masking steps described in the previous articles, the VG8 cluster will automatically find the control volumes and install to them.  One thing worth noting - the install process gives time estimates for the various parts of the process, and I've found these to be almost universally wildly incorrect.  Some steps that estimate 2 minutes complete in less than a tenth of a second, while some that estimate 1 minute take 15.  Be prepared for the entire install process to take at least 4 hours (although 95% of that is hands-off time you can head to lunch for).

I suspect a huge part of this is the simple fact that each HBA on each datamover will be seeing and scanning nearly 2,000 devices, and there are 16 HBAs.  Scanning roughly 32,000 devices is quite time consuming for any operating system.
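If you're curious what the datamovers are actually seeing during all that scanning, you can probe the SCSI devices directly.  A quick sketch (it only reports what the datamover can see, and with this many devices it is itself fairly slow):

# Probe the SCSI devices visible to a given datamover
# (slow when each HBA sees ~2,000 devices)
server_devconfig server_2 -probe -scsi -all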

Disk Marking

Once the install procedure has completed, the VG8 will begin the disk marking procedure.  This mostly went without a hitch, but there was one small hiccup.  One of our TDEVs masked to the system was in a Not Ready state (I believe I had failed to bind it to a pool), so the marking process choked on a disk it could see but couldn't perform any IO against.  For the more SCSI-minded of you, the device was returned in an INQ response, but responded with a Check Condition to any IO.
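If you hit the same thing, binding the offending TDEV to a thin pool clears the condition.  A minimal SYMCLI sketch - the SID, device number, and pool name below are placeholders, not values from this build:

# Bind the Not Ready thin device to a thin pool (placeholder values)
symconfigure -sid 1234 -cmd "bind tdev 0ABC to pool VG8_Pool;" commit

# Confirm the device now reports Ready
symdev -sid 1234 show 0ABC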

Once that small issue was sorted, we continued to volume creation.

Volume Creation

The first issue we found was that the system had automatically created a disk pool and added all the devices to it!  Well, we aren't going to use automatic volume management (AVM) on this array (why? - see the next section), so we don't want this pool.  When we attempted to delete it, however, we were met with an 'Invalid Operation' error.  A quick call to support instructed us to simply create our desired stripe/meta structures, and the devices would be removed from the pool automatically.  Once the pool was empty, it was deleted automatically.
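If you want to watch that happen, you can check the pool before and after building the manual volumes.  A quick sketch - the pool name is whatever nas_pool -list reports on your system:

# List the storage pools and note the automatically created one
nas_pool -list

# Check its size and membership before and after creating the manual stripes
nas_pool -size <pool_name_from_list>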

Why don't we want to use AVM? Because it has a small but real performance overhead, and it also has a limit of 8 devices in a stripe set...we can do better manually and squeeze the last bit of performance out of this system. Normally, I would recommend against fully manual volume management - it's too much of a hassle for a small (<10%) performance gain.  In this case, however, once we set up this gateway*, it's never going to change.  As a result, this one-time effort isn't too much pain.

*famous last words

To create the volumes, we first use the nas_volume command to build the raw disk volumes (dVols) into a stripe set of 17 members:

nas_volume -n stv_01 -S 262144 d500,d501,d502,d503,d504,d505,d506,d507,d508,d509,d510,d511,d512,d513,d514,d515,d516

This produces a stripe called stv_01 with a stripe size of 256KB from the 17 volumes listed.  You can determine the mapping between the dVol number and the Symmetrix device number using the nas_disk -l command. In this case, because all my devices are TDEVs virtualized in a FAST VP pool, they are all identical and I don't have to worry about spreading across buses, etc.  The Symmetrix handles that for us.
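It's worth double-checking that mapping and the resulting stripe before moving on.  A quick sketch of the checks I'd run (output omitted):

# Map dVol names (d500, d501, ...) to their Symmetrix device IDs
nas_disk -l

# Confirm the new stripe volume exists
nas_volume -list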

Next, we create a single metavolume out of that stripe.  Why bother?  Purely for the future: if we ever need to expand things, we'll have the option to grow the meta easily.

nas_volume -n mtv_01 -c stv_01

We create a meta called mtv_01 from the stripe stv_01.  Pretty straightforward stuff.

Last, we create a filesystem on the metavolume:

nas_fs -name fs_01 -create mtv_01 log_type=split -option nbpi=32768

To break this down:

  • -name fs_01 - the name for the filesystem.  This isn't the same as the mount point.
  • -create mtv_01 - create a new filesystem backed by volume mtv_01 (our meta)
  • log_type=split - this is a new option for 7.1 DART code that puts the intent log in the volumes themselves (rather than the control volumes) to prevent hot spots.
  • -option nbpi=32768 - this option changes the inode density.  The default is to create 1 inode for every 8KB (8192 bytes), which is reasonable for a filesystem that will contain Word documents and the like.  Ours, however, will contain giant 100GB VMDKs, so we don't need to waste metadata memory on inodes we'll never use; we relax it to 1 inode per 32KB (the quick math below shows the difference).
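To put the nbpi change in perspective, here's the rough math for a hypothetical 2TB filesystem (the size is illustrative, not from this build): at the default nbpi of 8192, 2TB / 8KB works out to roughly 268 million inodes; at nbpi=32768, 2TB / 32KB is roughly 67 million - a quarter as many, with correspondingly less metadata to create and track.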

Last, we mount the filesystem:

server_mount server_2 -option uncached fs_01 /fs_01

The uncached option prevents the gateway from attempting to perform write coalescing, which significantly improves response time on random IO workloads.
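Before the ESXi hosts can actually mount anything, the filesystem still needs to be exported over NFS.  A minimal sketch - the host names below are placeholders for your ESXi hosts:

# Verify the filesystem is mounted where we expect
server_mount server_2

# Export it over NFS to the ESXi hosts (esx01/esx02 are placeholders)
server_export server_2 -Protocol nfs -option root=esx01:esx02,access=esx01:esx02 /fs_01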

And with that, we are done - our VMAX is carved, our filesystems are mounted, and we are ready to access them using our regular connections.

I hope you've enjoyed this series, and if I've missed an important part, please let me know in the comments so I can add it in!