If you saw my recent post about pushing the limits of ScaleIO on AWS EC2, you'll know that I had a few more plans: I wanted to push the node count even higher and run on some heavier-duty instances.
Well, the ScaleIO development team noticed what I had done and decided to take my code and push it to the next level. Using the methods I developed, they hit the numbers I had been hoping for: 1,000 nodes across 5 protection domains, all managed by a single MDM in a single cluster.
As you can see, the team was able to build 100 volumes and connect 400(!) clients. Most impressively, using minimal nodes (I believe these were m1.medium instances), they achieved 3.5 GBytes/s (yes, bytes!), or about 28 gigabits per second, of throughput across the AWS cloud, along with very nearly 1 million IO/s. Needless to say, I was floored when I saw these results.
Special thanks to the ScaleIO team, Saar, Alexei, Alex, Lior, Eran, and Dvir, who ran this test (with no help from me, which was no mean feat in itself given how undocumented my code was!) and produced these results.
Lastly, I also got my hands on 10 AWS hi1.4xlarge instances, which have local SSDs. Unfortunately, I managed to delete most of the screenshots from my test, but I was able to achieve 3.5-4.0 GBytes/sec using 10 nodes on the same 10 Gbit switch. Truly impressive. And, since a number of people have asked about latency: average latencies in that test were ~650 µsec! The one screenshot I was able to grab was taken during a rebuild, after I had removed and replaced a couple of nodes.
Rebuilding at 2.3 GB/s is something you rarely see :).
I'm really happy to be able to share these cool updates from the team. Feel free to ask questions.