{"id":3676,"date":"2017-09-19T10:55:29","date_gmt":"2017-09-19T09:55:29","guid":{"rendered":"https:\/\/www.devco.net\/?p=3676"},"modified":"2017-10-09T08:38:43","modified_gmt":"2017-10-09T07:38:43","slug":"load-testing-choria","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2017\/09\/19\/load-testing-choria.php","title":{"rendered":"Load testing Choria"},"content":{"rendered":"
Given that Choria is heading down the path of being a rewrite in Go, I am also taking the opportunity to look into much larger scale problems to meet some client needs.
In this and the following posts I'll write about the work I am doing to load test and validate Choria to hundreds of thousands of nodes, and the tooling I created to do that.
Choria uses the NATS middleware, which has no persistence features and is much lighter than the traditional brokers. Turns out that's exactly what typical MCollective needs, as it never really used the persistence features and those just made the associated middleware quite heavy.
To give you an idea, in the old days the community would suggest every ~1 000 nodes managed by MCollective required a single ActiveMQ instance. Want 5 500 MCollective nodes? That'll be 6 machines - physical hardware recommended - and 24 to 30 GB of RAM in a cluster just to run the middleware. We've had reports of much larger RabbitMQ networks on 4 or 5 servers - 50 000 managed nodes or more - but those would be big machines and they had quite a lot of performance issues.
There was a time when 5 500 nodes was A LOT, but now it's becoming a bit of an everyday thing, so I need to focus upward.
With NATS+Choria I am happily running 5 500 nodes on a single 2 CPU VM with 4 GB RAM. In fact, on a slightly bigger VM I am happily running 50 000 nodes, and NATS uses around 1 GB to 1.5 GB of RAM at peak.
Doing 100s of RPC requests in a row against 50 000 nodes, the response time is pretty solid at around 16 seconds for an RPC call to every node; it's stable, never drops a message and the performance stays level in the absence of Java GC issues. This is fast but also quite slow - the Ruby client manages only about 300 replies every 0.10 seconds due to the amount of protocol decoding etc. that is needed.
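As a rough sanity check on those numbers: 300 replies every 0.10 seconds is about 3 000 replies per second on the client side, and 50 000 / 3 000 is roughly 17 seconds, which lines up with the ~16 second runs above. In other words the ceiling here is client side protocol decoding, not the middleware.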
This brings with it a whole new level of problem: just how far can we take the client code, how do you determine when it's too big, and how do I know whether the client, broker and Federation work I am doing significantly improves things?
I've also significantly reworked the network protocol to support Federation, but the shipped code optimizes for code and config simplicity over, let's say, support for 20 000 Federation Collectives. When we are talking about truly gigantic Choria networks I need to be able to test scenarios involving tens of thousands of Federated Networks, each with tens of thousands of nodes in them. So I need tooling that lets me do this.

Getting to running 50 000 nodes

Not everyone just happens to have a 50 000 node network lying about that they can play with, so I had to improvise a bit.

As part of the rewrite I am building a Go framework with the Choria protocol, config parsing and network handling all built in Go. Unlike the Ruby code, I can instantiate many of these in memory and run them in Go routines.

This means I could write an emulator that can start a number of faked Choria daemons all in one process. They each have their own middleware connection, run a varying number of agents with a varying number of sub collectives, and generally behave like a normal MCollective machine. On my MacBook I can run 1 500 Choria instances quite easily.
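To make the idea concrete, here is a minimal sketch, not the actual emulator code, of running many emulated nodes in one process: each node is a goroutine with its own middleware connection that answers requests on a shared subject. The `emulator.discovery` subject and the identities are made up for illustration; only the NATS Go client calls are real.

```go
// A sketch of the emulator idea: many faked nodes in one process, each with
// its own NATS connection, answering requests with their identity.
package main

import (
	"fmt"
	"log"
	"runtime"
	"sync"

	nats "github.com/nats-io/nats.go" // the NATS Go client
)

func main() {
	const instances = 1500 // emulated nodes in this one process

	var wg sync.WaitGroup

	for i := 0; i < instances; i++ {
		wg.Add(1)

		go func(id int) {
			defer wg.Done()

			// every emulated node gets its own middleware connection
			nc, err := nats.Connect(nats.DefaultURL, nats.Name(fmt.Sprintf("emulated-%d", id)))
			if err != nil {
				log.Printf("node %d could not connect: %v", id, err)
				return
			}

			// "emulator.discovery" is a made-up subject for this sketch;
			// each node replies with its identity, like a discovery reply
			_, err = nc.Subscribe("emulator.discovery", func(m *nats.Msg) {
				if m.Reply != "" {
					nc.Publish(m.Reply, []byte(fmt.Sprintf("emulated-%d", id)))
				}
			})
			if err != nil {
				log.Printf("node %d could not subscribe: %v", id, err)
			}
		}(i)
	}

	wg.Wait()
	log.Printf("started %d emulated nodes on %d CPUs, waiting for requests", instances, runtime.NumCPU())
	select {} // block forever so the connections and subscriptions stay alive
}
```

The real emulator varies agents and sub collectives per instance, but the mechanics are the same: cheap in-memory instances multiplexed over goroutines rather than one heavy Ruby daemon per node.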
So with fewer than 60 machines I can emulate 50 000 MCollective nodes on a 3 node NATS cluster and have plenty of spare capacity. This is well within budget to run on AWS, and it's not uncommon these days to have that many dev machines around.
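A 3 node NATS cluster like the one mentioned above is, in broad strokes, just three gnatsd servers routing to each other. A minimal sketch of one member's configuration is below; the nats2/nats3.example.net hostnames and the ports are placeholders for illustration, not the actual setup used here.

```
# gnatsd configuration for one member of a 3 node cluster (illustrative only)
port: 4222   # client connections from the emulated Choria nodes
http: 8222   # monitoring endpoint

cluster {
  listen: 0.0.0.0:4248

  routes = [
    nats-route://nats2.example.net:4248
    nats-route://nats3.example.net:4248
  ]
}
```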
Follow-up Posts

In the following posts I'll cover bits about the emulator, what I look for when determining optimal network sizes, and how to use the emulator to test and validate the performance of different network topologies.