{"id":3676,"date":"2017-09-19T10:55:29","date_gmt":"2017-09-19T09:55:29","guid":{"rendered":"https:\/\/www.devco.net\/?p=3676"},"modified":"2017-10-09T08:38:43","modified_gmt":"2017-10-09T07:38:43","slug":"load-testing-choria","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2017\/09\/19\/load-testing-choria.php","title":{"rendered":"Load testing Choria"},"content":{"rendered":"
Given that Choria is heading down the path of being a rewrite in Go, I am also taking the opportunity to look into much larger scale problems to meet some client needs.
In this and the following posts I'll write about the work I am doing to load test and validate Choria to hundreds of thousands of nodes, and the tooling I created to do that.
Choria uses the NATS middleware, which has no persistence features and is much lighter than the traditional brokers. Turns out that's exactly what typical MCollective needs, as it never really used the persistence features and those just made the associated middleware quite heavy.
To give you an idea, in the old days the community would suggest every ~1 000 nodes managed by MCollective required a single ActiveMQ instance. Want 5 500 MCollective nodes? That'll be 6 machines - physical hardware recommended - and 24 to 30 GB of RAM in a cluster just to run the middleware. We've had reports of much larger RabbitMQ networks on 4 or 5 servers - 50 000 managed nodes or more - but those would be big machines and they had quite a lot of performance issues.
There was a time when 5 500 nodes was A LOT, but now it's becoming a bit of an everyday thing, so I need to focus upward.
With NATS+Choria I am happily running 5 500 nodes on a single 2 CPU VM with 4 GB RAM. In fact, on a slightly bigger VM I am happily running 50 000 nodes, and NATS uses around 1 GB to 1.5 GB of RAM at peak.
Doing 100s of RPC requests in a row against 50 000 nodes, the response time is pretty solid at around 16 seconds for an RPC call to every node; it's stable, never drops a message and the performance stays level in the absence of Java GC issues. This is fast but also quite slow - the Ruby client manages only about 300 replies every 0.10 seconds due to the amount of protocol decoding etc. that is needed.
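As a rough sanity check on those numbers: 300 replies every 0.10 seconds is about 3 000 replies per second on the client side, and 50 000 / 3 000 is roughly 17 seconds, which lines up with the ~16 second runs above. In other words the ceiling here is client side protocol decoding, not the middleware.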
This brings with it a whole new level of problem: just how far can we take the client code, how do you determine when it's too big, and how do I know whether the client, broker and Federation work I am doing significantly improves things?
I've also significantly reworked the network protocol to support Federation, but the shipped code optimizes for code and config simplicity over, let's say, support for 20 000 Federation Collectives. When we are talking about truly gigantic Choria networks I need to be able to test scenarios involving tens of thousands of Federated Networks, each with tens of thousands of nodes in them. So I need tooling that lets me do this.

Getting to running 50 000 nodes

Not everyone just happens to have a 50 000 node network lying about that they can play with, so I had to improvise a bit.

As part of the rewrite I am building a Go framework with the Choria protocol, config parsing and network handling all built in Go. Unlike the Ruby code, I can instantiate many of these in memory and run them in Go routines.

This means I could write an emulator that can start a number of faked Choria daemons all in one process. They each have their own middleware connection, run a varying number of agents with a varying number of sub collectives, and generally behave like a normal MCollective machine. On my MacBook I can run 1 500 Choria instances quite easily.
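To make the idea concrete, here is a minimal sketch, not the actual emulator code, of running many emulated nodes in one process: each node is a goroutine with its own middleware connection that answers requests on a shared subject. The `emulator.discovery` subject and the identities are made up for illustration; only the NATS Go client calls are real.

```go
// A sketch of the emulator idea: many faked nodes in one process, each with
// its own NATS connection, answering requests with their identity.
package main

import (
	"fmt"
	"log"
	"runtime"
	"sync"

	nats "github.com/nats-io/nats.go" // the NATS Go client
)

func main() {
	const instances = 1500 // emulated nodes in this one process

	var wg sync.WaitGroup

	for i := 0; i < instances; i++ {
		wg.Add(1)

		go func(id int) {
			defer wg.Done()

			// every emulated node gets its own middleware connection
			nc, err := nats.Connect(nats.DefaultURL, nats.Name(fmt.Sprintf("emulated-%d", id)))
			if err != nil {
				log.Printf("node %d could not connect: %v", id, err)
				return
			}

			// "emulator.discovery" is a made-up subject for this sketch;
			// each node replies with its identity, like a discovery reply
			_, err = nc.Subscribe("emulator.discovery", func(m *nats.Msg) {
				if m.Reply != "" {
					nc.Publish(m.Reply, []byte(fmt.Sprintf("emulated-%d", id)))
				}
			})
			if err != nil {
				log.Printf("node %d could not subscribe: %v", id, err)
			}
		}(i)
	}

	wg.Wait()
	log.Printf("started %d emulated nodes on %d CPUs, waiting for requests", instances, runtime.NumCPU())
	select {} // block forever so the connections and subscriptions stay alive
}
```

The real emulator varies agents and sub collectives per instance, but the mechanics are the same: cheap in-memory instances multiplexed over goroutines rather than one heavy Ruby daemon per node.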
So with fewer than 60 machines I can emulate 50 000 MCollective nodes on a 3 node NATS cluster and have plenty of spare capacity. This is well within budget to run on AWS, and it's not uncommon these days to have that many dev machines around.
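A 3 node NATS cluster like the one mentioned above is, in broad strokes, just three gnatsd servers routing to each other. A minimal sketch of one member's configuration is below; the nats2/nats3.example.net hostnames and the ports are placeholders for illustration, not the actual setup used here.

```
# gnatsd configuration for one member of a 3 node cluster (illustrative only)
port: 4222   # client connections from the emulated Choria nodes
http: 8222   # monitoring endpoint

cluster {
  listen: 0.0.0.0:4248

  routes = [
    nats-route://nats2.example.net:4248
    nats-route://nats3.example.net:4248
  ]
}
```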
Follow-up Posts

In the following posts I'll cover bits about the emulator, what I look for when determining optimal network sizes, and how to use the emulator to test and validate the performance of different network topologies.