by R.I. Pienaar | Aug 7, 2010 | Code
I often get asked about using MCollective from other programming languages. Thus far we only support Ruby, but my hope is that in time we'll be able to be more generic.
Initially I had a few requirements from serialization:
- It must retain data types
- Encoding the same data – like a hash – twice should give the same result from the point of view of md5()
That was about it really. This was while we used a pre-shared key to validate requests, so the encoded result had to be identical on the sender and the receiver after a decode/encode round trip. With YAML this was never the case, so I used Marshal.
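As a quick illustration of that second requirement, here is the kind of round trip that has to hash identically for a pre-shared-key signature to verify on the receiver. This is just a sketch, not MCollective's actual validation code:

require 'yaml'
require 'digest/md5'

# Re-encoding decoded data must produce exactly the bytes the sender signed,
# otherwise an MD5 signature computed with a pre-shared key on the sender
# cannot be verified by the receiver. Marshal gave me this property; YAML on
# Ruby 1.8 did not always.
data      = {"customer" => "acme", "environment" => "development"}
encoded   = YAML.dump(data)
reencoded = YAML.dump(YAML.load(encoded))

puts Digest::MD5.hexdigest(encoded) == Digest::MD5.hexdigest(reencoded)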
We recently had an SSL-based security plugin contributed that relaxed the 2nd requirement, so we can go back to using YAML. We could in theory relax the 1st requirement too, but it would inhibit the kind of tools you can build with MCollective quite a bit, so I'd strongly suggest it stays a must-have.
Today there are very few cross-language serializers that let you just deal with arbitrary data; YAML is one that seems to have a lot of language support. Prior to version 1.0.0 of MCollective the SSL security system only supported Marshal, but in 1.0.0 we'll support YAML in addition to Marshal.
This enabled me to write a Perl client that speaks to your standard Ruby collective (if it runs this new plugin).
You can see the Perl client here. The Perl code is roughly mc-find-hosts written in Perl, without command line options for filtering – though you can just adjust the filters in the code. It's been years since I wrote any Perl so it's just the first thing that worked for me.
The point is that someone should be able to take any language that has Syck YAML bindings and write a client library to talk with MCollective. I tried the non-Syck bindings in PHP and they're unusable; I suspect the PHP Syck bindings will work better but I didn't try them.
As mentioned on the user list, post 1.0.0 I intend to focus on long running and scheduled requests. I'll then also work on some kind of interface between MCollective and agents written in other languages, since that is more or less how long running scheduled tasks would work anyway. This will then use Ruby as a transport, hooking clients and agents in different languages together.
I can see that I'll enable this, but I am very unlikely to write the clients myself. I am therefore keen to hear from community members who want to talk to MCollective from languages like Python and who have some time to work on this.
by R.I. Pienaar | Aug 5, 2010 | Code
The typical Puppet use case is to run the daemon every 30 minutes or so and just let it manage your machines. Sometimes, though, you want to be able to run it on all your machines as quickly as your puppet master can handle.
This is tricky as you generally do not have a way to cap the concurrency and it's hard to orchestrate. I've extended the MCollective Puppet Agent to do this for you, so you can do a rapid run at rollout time and then go back to the more conservative slow pace once your window is over.
The basic logic I implemented is this (there's a code sketch of it after the list):
- Discover all nodes, sort them alphabetically
- Count how many nodes are active now, wait till it’s below threshold
- Run a node by just starting a --onetime background run
- Sleep a second
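Here is a minimal sketch of that loop. The discover_nodes, active_count and run_onetime methods are hypothetical stubs standing in for MCollective discovery and the puppetd agent actions, not the actual client code:

# Minimal sketch of the runall logic, not the real mc-puppetd code.
# discover_nodes, active_count and run_onetime are hypothetical stubs.
CONCURRENCY = Integer(ARGV.fetch(0, 2))

def discover_nodes; %w[dev1.one.net dev1.two.net dev2.two.net]; end # stub: real code uses MCollective discovery
def active_count;   0;                                          end # stub: real code asks puppetd for its status
def run_onetime(node); puts "Running #{node}";                  end # stub: real code starts a --onetime run

discover_nodes.sort.each do |node|            # discover all nodes, sorted alphabetically
  sleep 1 while active_count >= CONCURRENCY   # wait until running nodes drop below the threshold
  run_onetime(node)                           # start a background --onetime run on this node
  sleep 1                                     # sleep a second before moving on
end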
This should churn through your nodes very quickly without overwhelming the resources of your master. You can see it in action below: it started 3 nodes, and once it got to the 4th, 3 were already running, so it waited for one of them to finish:
% mc-puppetd -W /dev_server/ runall 2
Thu Aug 05 17:47:21 +0100 2010> Running all machines with a concurrency of 2
Thu Aug 05 17:47:21 +0100 2010> Discovering hosts to run
Thu Aug 05 17:47:23 +0100 2010> Found 4 hosts
Thu Aug 05 17:47:24 +0100 2010> Running dev1.one.net, concurrency is 0
Thu Aug 05 17:47:26 +0100 2010> dev1.one.net schedule status: OK
Thu Aug 05 17:47:28 +0100 2010> Running dev1.two.net, concurrency is 1
Thu Aug 05 17:47:30 +0100 2010> dev1.two.net schedule status: OK
Thu Aug 05 17:47:32 +0100 2010> Running dev2.two.net, concurrency is 2
Thu Aug 05 17:47:34 +0100 2010> dev2.two.net schedule status: OK
Thu Aug 05 17:47:35 +0100 2010> Currently 3 nodes running, waiting
Thu Aug 05 17:48:00 +0100 2010> Running dev3.two.net, concurrency is 2
Thu Aug 05 17:48:05 +0100 2010> dev3.two.net schedule status: OK
This is integrated into the existing mc-puppetd client script; you don't need to roll out anything new to your servers, just the client side.
Using this to run each of 47 machines with a concurrency of just 4, I was able to complete a cycle in 8 minutes. That doesn't sound too impressive, but my average run time is around 40 seconds per node, with some taking 90 to 150 seconds. My puppetmaster, which usually sits at a steady 0.2mbit/sec outbound, was serving a constant 2mbit/sec for the duration of this run.
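That number works out about right: 47 nodes at an average of 40 seconds each is roughly 1880 seconds of serial run time, which divided by a concurrency of 4 is about 470 seconds, so 8 minutes is close to the best you can expect at that concurrency.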
by R.I. Pienaar | Jul 29, 2010 | Uncategorized
I'm quite the fan of data, metadata and querying these to interact with my infrastructure, rather than interacting by hostname, and I wanted to show how far down this route I am.
This is more an iterative, ongoing process than a fully baked idea at this point, since the concept of hostnames is so heavily embedded in our sysadmin culture. Today I can't yet fully break away from it, since tools like Nagios still rely heavily on the hostname as the index, but these are things that will improve in time.
The background is that in the old days we attempted to capture a lot of metadata in hostnames, domain names and so forth. This was kind of OK since we had static networks with relatively small numbers of hosts. Today we do ever more complex work on our servers and we have more and more servers. The advent of cloud computing has also brought with it a whole new pain of unpredictable hostnames, rapidly changing infrastructures and a much bigger emphasis on role-based computing.
My metadata about my machines comes from 3 main sources:
- My Puppet manifests – classes and modules that get put on a machine
- Facter facts with the ability to add many per machine easily
- MCollective stores the metadata in MongoDB and lets me query the network in real time
Puppet manifests based on query
When setting up machines I keep some data, like database master hostnames, in extlookup, but in many cases I am now moving to a search-based approach to finding resources. Here's a sample manifest that will find the master database for a customer's development machines:
$masterdb = search_nodes("{'facts.customer': '${customer}', 'facts.environment':${environment}, classes: 'mysql::master'}")
This is a MongoDB query against my infrastructure database: for a given node it will find the name of a node that has the class mysql::master on it; by convention there should be only one per customer in my case. When using it in a template I can get back full objects with all the metadata for a node. Hopefully with Puppet 2.6 I can get full hashes into Puppet too!
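For the curious, this is roughly what that query amounts to when run directly with the Ruby MongoDB driver of the time. The database, collection and field names here (including fqdn) are assumptions for illustration, not necessarily how my registration data is laid out:

require 'rubygems'
require 'mongo'

# Hypothetical layout: node registration data lives in a "nodes" collection
# of a "puppet" database, with "facts", "classes" and "fqdn" fields per node.
nodes = Mongo::Connection.new("localhost").db("puppet").collection("nodes")

master = nodes.find_one("facts.customer"    => "acme",
                        "facts.environment" => "development",
                        "classes"           => "mysql::master")

puts master["fqdn"] if master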
Making Metadata Visible
With machines doing a lot of work and filling a lot of roles, and with more and more machines around, you need to be able to tell immediately what machine you are on.
I do this in several places, first my MOTD can look something like this:
Welcome to Synchronize Your Dogmas
hosted at Hetzner, Germany
Puppet Modules:
- apache
- iptables
- mcollective member
- xen dom0 skeleton
- mw1.xxx.net virtual machine
I build this up using snippets from my concat module; each important module, like apache, can just put something like this in:
motd::register{"Apache Web Server": }
Since it's managed by my snippet library, if you just remove the include line from the manifests the MOTD will update automatically.
With a big block of welcome done, I now also need to be able to show in my prompts what a machine does, who it's for and, importantly, what environment it is in.
Above is a shot of 2 prompts in different environments; you can see the customer name, environment and major modules. As with the MOTD, I have a prompt::register define that modules use to register into the prompt.
SSH Based on Metadata
With all this metadata in place, MCollective rolled out and everything integrated, it's now very easy to find and access machines based on it.
MCollective does real time resource discovery, so keeping with the mysql::master example from the Puppet manifest above:
$ mc-ssh -W "environment=development customer=acme mysql::master"
Running: ssh db1.acme.net
Last login: Thu Jul 29 00:22:58 2010 from xxxx
$
Here I am ssh'ing to a server based on a query; if more than one machine matched the query, a menu would be presented offering me a choice.
Monitoring Based on Metadata
Finally, setting up monitoring and keeping it in sync with reality can be a big challenge, especially in dynamic cloud-based environments. Again I deal with this through discovery based on metadata:
$ check-mc-nrpe -W "environment=development customer=acme mysql::master" check_load
check_load: OK: 1 WARNING: 0 CRITICAL: 0 UNKNOWN: 0|total=1 ok=1 warn=0 crit=0 unknown=0 checktime=0.612054
Summary
This is really the tip of the iceberg; there is a lot more that I already do – like scheduling Puppet runs on groups of machines based on metadata – but also a lot more to do, as these really are early days down this route. I am very keen to get views from others who are struggling with the shortcomings of hostname-based approaches and how they deal with them.
by R.I. Pienaar | Jul 28, 2010 | Uncategorized
I often see some confusion about the terminology in use in MCollective: what the major components are, where software needs to be installed, and so on.
I attempted to address this in a presentation and screencast covering:
- What middleware is and how we use it.
- The major components and correct terminology.
- Anatomy of a request life cycle.
- And an actual look inside the messages we send and receive.
You can grab the presentation from Slideshare or view a video of it on blip.tv. Below is an embedded version of the Slideshare deck, including audio. I suggest you view it full screen as there's some code in it.
by R.I. Pienaar | Jul 25, 2010 | Code
I have a number of ActiveMQ servers, 7 in total: 3 in a network of brokers and the rest standalone. For MCollective I use topics extensively, so I don't really need to monitor them much other than for availability. I do, though, also do a lot of queued work where lots of machines put data in a queue and others process the data.
In the queue scenario you absolutely need to monitor queue sizes, memory usage and such. You also need to graph things like message rates, consumer counts and memory use. I am busy writing a number of Nagios and Cacti plugins to help with this; you can find them on GitHub.
To use these you need to have the ActiveMQ Statistics Plugin enabled.
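For reference, the statistics plugin ships with ActiveMQ 5.3 and later and is switched on in the plugins section of activemq.xml, along these lines (brokerName here is just from my web3 example):

<broker xmlns="http://activemq.apache.org/schema/core" brokerName="web3">
  <!-- enable the broker statistics plugin that the Nagios and Cacti scripts query -->
  <plugins>
    <statisticsBrokerPlugin/>
  </plugins>
</broker>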
First we need to monitor queue sizes:
$ check_activemq_queue.rb --host localhost --user nagios --password passw0rd --queue exim.stats --queue-warn 1000 --queue-crit 2000
OK: ActiveMQ exim.stats has 1 messages
This will connect to localhost and monitor the queue exim.stats, warning you when it has 1000 messages and going critical at 2000.
I still need to add the ability to monitor memory usage to this; that will come over the next few days.
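If you want to wire the queue check into Nagios, a command and service definition along these lines should do it; the plugin path, the generic-service template and the host name are placeholders for your own setup:

# Hypothetical Nagios wiring for the queue size check; adjust paths,
# credentials and host names to suit your environment.
define command {
  command_name  check_activemq_queue
  command_line  /usr/local/nagios/libexec/check_activemq_queue.rb --host $HOSTADDRESS$ --user nagios --password passw0rd --queue $ARG1$ --queue-warn $ARG2$ --queue-crit $ARG3$
}

define service {
  use                  generic-service
  host_name            web3
  service_description  ActiveMQ exim.stats queue size
  check_command        check_activemq_queue!exim.stats!1000!2000
}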
I also have a plugin for Cacti; it can output stats for the broker as a whole and also for a specific queue. First the whole broker:
$ activemq-cacti-plugin.rb --host localhost --user nagios --password passw0rd --report broker
stomp+ssl:stomp+ssl storePercentUsage:81 size:5597 ssl:ssl vm:vm://web3 dataDirectory:/var/log/activemq/activemq-data dispatchCount:169533 brokerName:web3 openwire:tcp://web3:6166 storeUsage:869933776 memoryUsage:1564 tempUsage:0 averageEnqueueTime:1623.90502285799 enqueueCount:174080 minEnqueueTime:0.0 producerCount:0 memoryPercentUsage:0 tempLimit:104857600 messagesCached:0 consumerCount:2 memoryLimit:20971520 storeLimit:1073741824 inflightCount:9 dequeueCount:169525 brokerId:ID:web3-44651-1280002111036-0:0 tempPercentUsage:0 stomp:stomp://web3:6163 maxEnqueueTime:328585.0 expiredCount:0
Now a specific queue:
$ activemq-cacti-plugin.rb --host localhost --user nagios --password passw0rd --report exim.stats
size:0 dispatchCount:168951 memoryUsage:0 averageEnqueueTime:1629.42897052992 enqueueCount:168951 minEnqueueTime:0.0 consumerCount:1 producerCount:0 memoryPercentUsage:0 destinationName:queue://exim.stats messagesCached:0 memoryLimit:20971520 inflightCount:0 dequeueCount:168951 expiredCount:0 maxEnqueueTime:328585.0
Grab the code on GitHub and follow it there; I expect a few updates in the next few weeks.