
ScaleIO @ Scale - Update

If you saw my recent post about pushing the limits of ScaleIO on AWS EC2, you'll know that I had a few more plans.  I wanted to push the node count even higher and run on some heavier-duty instances.

Well, the ScaleIO development team noticed what I had done and decided to take my code to the next level.  Using the methods I developed, they hit the numbers I had been hoping for: 1000 nodes across 5 protection domains, all managed by a single MDM in a single cluster.

[Screenshot: 995 SDSs, 400 SDCs, 100 volumes, 923,773 IOPS]

As you can see, the team got 100 volumes and 400(!) clients up and running.  Most impressively, using small instances (I believe these were m1.medium), they achieved 3.5 Gbytes/s (yes, bytes!), or 28 Gbit/s of throughput across the AWS cloud, along with very nearly 1 million IO/s (923,773).  Needless to say, I was floored when I saw these results.
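For anyone who wants to sanity-check the headline numbers, here is a small back-of-envelope calculation. It assumes decimal units (1 GB = 10^9 bytes) and, for the per-SDS figure, that the load was spread evenly across the 995 SDSs in the screenshot; neither assumption comes from the test itself, and the helper names are just for illustration.

```python
# Back-of-envelope check of the aggregate numbers quoted above.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 Gbit = 10**9 bits).

BITS_PER_BYTE = 8


def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert a throughput in GB/s to Gbit/s."""
    return gbytes_per_s * BITS_PER_BYTE


def per_node(total: float, nodes: int) -> float:
    """Average share of an aggregate figure per node (assumes an even spread)."""
    return total / nodes


aggregate_gbytes = 3.5   # GB/s across the cluster
total_iops = 923_773     # IOPS shown in the screenshot
sds_count = 995          # SDS nodes shown in the screenshot

print(f"{aggregate_gbytes} GB/s = {gbytes_to_gbits(aggregate_gbytes):.0f} Gbit/s")
print(f"~{per_node(total_iops, sds_count):.0f} IOPS per SDS")
```

Running it prints `3.5 GB/s = 28 Gbit/s` and `~928 IOPS per SDS`, which shows how modest the per-node load is even when the aggregate is enormous.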

Special thanks to the ScaleIO team (Saar, Alexei, Alex, Lior, Eran, and Dvir), who ran this test and produced these results with no help from me, a mean feat in itself given how undocumented my code was!

Lastly, I got my hands on 10 AWS hi1.4xlarge instances, which have local SSDs.  Unfortunately, I managed to delete most of the screenshots from my test, but I was able to achieve 3.5-4.0 Gbytes/s using 10 nodes on the same 10 Gbit switch.  Truly impressive.  And since a number of people have asked about latency: average latencies in that test were ~650 µs!  The one screenshot I was able to grab was taken during a rebuild, after I had removed and replaced a couple of nodes.
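The same kind of arithmetic puts the hi1.4xlarge result in per-node terms. This is a rough sketch that assumes the load was spread evenly over the 10 nodes, decimal units, and a 10 Gbit port per node; it ignores any replication or rebuild traffic between nodes, which would add to what each port carries. The function name is hypothetical.

```python
# Rough per-node share of the 10-node hi1.4xlarge result.
# Assumptions (not measured): even spread across nodes, decimal units,
# one 10 Gbit port per node, and no replication/rebuild traffic counted.


def per_node_share(aggregate_gbytes: float, nodes: int = 10, link_gbit: float = 10):
    """Return (MB/s per node, Gbit/s per node, fraction of one port used)."""
    per_node_mbytes = aggregate_gbytes * 1000 / nodes
    per_node_gbit = per_node_mbytes * 8 / 1000
    return per_node_mbytes, per_node_gbit, per_node_gbit / link_gbit


for total in (3.5, 4.0):
    mb, gbit, share = per_node_share(total)
    print(f"{total} GB/s total -> {mb:.0f} MB/s ({gbit:.1f} Gbit/s) per node, "
          f"{share:.0%} of a 10 Gbit port")
```

Under these assumptions each node moves roughly 350-400 MB/s (2.8-3.2 Gbit/s), so client traffic alone leaves plenty of headroom on each 10 Gbit port.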

[Screenshot taken 2013-11-14 at 8:22:38 PM, during the rebuild]


Rebuilding at 2.3GB/s is something you rarely see :).

I'm really happy to be able to share these cool updates from the team.  Feel free to ask questions.