I’ve moved a number of my more complex infrastructure components from being Puppet managed to being Docker managed. There are many reasons for this, the main one being that my Puppet code is ancient and, faced with a rewrite to make it Puppet 4 like or just rethinking things, I’m leaning towards rethinking. I don’t think CM is solving the right problem for me for certain aspects of my infrastructure, and new approaches can bring more value for my use case.
There are a lot of posts around talking about Docker that concentrate on the image building side of it or just the running of a container side – which I find quite uninteresting and in fact pretty terrible. The real benefit for me comes in the workflow, the API, the events out of the daemon and the container stats. People look at the image and container aspects in isolation and go on about how this is not new technology, but that’s missing the point.
Mainly a workflow problem
I’ll look at an example moving rbldnsd from Puppet to Docker managed and what I gain from that. Along the way I’ll also throw in some examples of a more complex migration I did for my bind servers. In case you don’t know, rbldnsd is a daemon that maintains DNS based RBLs using config files that look something like this:
```
$DATASET dnset senderhost
.digitalmarketeer.com :127.0.0.2:Connection rejected after user complaints.
```
You can then query it using the usual ways your MTA supports and decide policy based on that.
The life cycle of this service is typical of the ones I am targeting:
- A custom RPM had to be built and maintained and served from yet another piece of infrastructure.
- The basic package, config, service triplet. So vanilla it’s almost not worth looking at the code, it looks like all other package, config, service code.
- Requires ongoing data management – I add/remove hosts from the blacklists constantly. But this overlaps with the config part above.
- Requires the ability to test DNS queries work in development before even committing the change
- Requires rapid updating of configuration data
The last 3 points here deserve some explanation. When I am editing these configuration files I want to be able to test them right there in my shell without even committing them to git. This means starting up a rbldnsd instance and querying it with dig. This is pretty annoying to do with the Puppet workflow, which I won’t go into here as it’s a huge subject on its own. Suffice to say it doesn’t work for me and ends up not being production like at all.
When I am updating these config files on the running service there’s a daemon that will load them into its running memory. I need to be pretty sure the daemon I am testing against is identical to what’s in production now – ideally bit for bit identical. Again this is pretty hard, as many/most dev environments tend to be a few steps ahead of production. I need a way to say: give me the bits running production, throw this config at them and do an end to end test with no hassles and in 5 seconds.
I need a way to orchestrate that config data update to happen when I need it to happen – and not when Puppet runs again – and ideally it has to be quick, not at the pace that Puppet manages my 600 resources. Services should let me introspect them to figure out how to update their data and a generic updater should be able to update all my services that match this flow.
I’ve never really solved the last 3 points with my Puppet workflows for anything I work on; it’s a fiendishly complex problem to solve correctly. Everyone does it with Vagrant instances or ever more complex environments. Or they do their change, commit it, make sure there is test coverage and only get feedback later when something like Beaker runs. This is way too slow for me in this scenario – I just want to block 1 annoying host. Vagrant especially does not work for me as I refuse to run things on my desktop or laptop; I develop on VMs that are remote, so Vagrant isn’t an option. Additionally Vagrant environments become so complex, basically a whole new environment. Yet they are built in annoyingly different ways so that keeping a match with Production can be a challenge – or just prohibitively slow if you’re building them out with Puppet. So you end up again not testing in an environment that’s remotely production like.
These are pretty major things that I’ve never been able to solve to my liking with Puppet. I’ve first moved a bunch of my web sites then bind and now rbldnsd to Docker and think I’ve managed to come up with a workflow and toolchain that solves this for me.
Desired outcome
So maybe to demonstrate what I am after I should show what I want the outcome to look like. Here’s a rbldnsd dev session. I want to block *.mailingliststart.com, specifically I saw sh8.mailingliststart.com in my logs. I want to test the hosts are going to be blocked correctly before pushing to prod or even committing to git – it’s so embarrassing to make fix commits for obvious dumb things 🙂
So I add to the zones/bl file:
```
.mailingliststart.com :127.0.0.2:Excessive spam from this host
```
```
$ vi zones/bl
$ rake test:host
Host name to test: sh8.mailingliststart.com
Testing sh8.mailingliststart.com

Starting the rbldnsd container...

>>> Testing black list
docker exec rbldnsd dig -p 5301 +noall +answer any sh8.mailingliststart.com.senderhost.bl.rbl @localhost
sh8.mailingliststart.com.senderhost.bl.rbl. 2100 IN A 127.0.0.2
sh8.mailingliststart.com.senderhost.bl.rbl. 2100 IN TXT "Excessive spam from this host"

>>> Testing white list
.
.
.
Removing the rbldnsd container...

$ git commit zones -m 'block mailingliststart.com'
$ git push origin master
```
Here I added the bits to the config file and want to be sure the hostname I saw in my logs/headers will actually be blocked:
- It prepares the latest container by default and mounts my working directory into the container with -v ${PWD}:/service.
- The container starts up just like it would in production using the same bits that are running production – but reads the new uncommitted config
- It uses dig to query the running rbldnsd and runs any in-built validation steps the service has (this container has none yet)
- Cleans up everything
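A minimal sketch of the shape such a rake task could take – the container name, the `-v ${PWD}:/service` mount, the dig invocation and the `senderhost.bl.rbl` zone suffix come from the session above, but the helper itself and its defaults are assumptions, not the actual task:

```ruby
# Hypothetical sketch of the test:host flow: build the docker commands the
# task would run. In the real task each would be executed with system().
def rbl_commands(host, image: "rbldnsd", port: 5301)
  name = "#{host}.senderhost.bl.rbl"

  {
    # start a container from the same image that runs production, with the
    # working directory (uncommitted config included) mounted into it
    start: "docker run -d --name rbldnsd -v #{Dir.pwd}:/service #{image}",
    # query the daemon inside the container for the host being tested
    query: "docker exec rbldnsd dig -p #{port} +noall +answer any #{name} @localhost",
    # clean up everything afterwards
    clean: "docker rm -f rbldnsd"
  }
end

rbl_commands("sh8.mailingliststart.com").each_value { |cmd| puts cmd }
```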
The whole thing takes about 4 seconds on a virtual machine running in VirtualBox on a circa 2009 Mac. I saw the host was blacklisted and not somehow also whitelisted; looks good, commit and push.
Once pushed a webhook triggers my update orchestration and the running containers get the new config files only. The whole edit, test and deploy process takes less than a minute. The data though is in git which means tonight when my containers get rebuilt from fresh they will get this change baked in and rolled out as new instances.
There’s one more pretty mind blowing early feedback story I wanted to add here. My bind zones used to be made with Puppet defines:
```
bind::zone{"foo.com":
  owner    => "Bob",
  masterip => "1.2.3.4",
  type     => $server_type
}
```
I had no idea what this actually did by reading that line of code. I could guess, sure. But you only know with certainty when you run Puppet in production, since no matter what the hype says, you’ll only see the diff against the actual production file when that hits the production box using Puppet. Not OK. You also learn nothing this way; it’s always bothered me that Puppet ends up being a crutch, like a calculator. I have all these abstractions, so a junior using this define might never even know what it does or learn how bind works. Desirable in some cases, not for me.
In my Docker bind container I have a YAML file:
```yaml
zones:
  Bob:
    options:
      masterip: 1.2.3.4
    domains:
      - foo.com
```
It’s the same data I had in my manifest, just structured a bit differently. Same basic problem though: I have no idea what this does by looking at it. In the Docker world though you need to bake this YAML into bind config, and this has to be done during development so that a docker build can get to the final result. So I add a new domain, bar.com:
```
$ vi zones.yaml
$ rake construct:files
Reading specification file buildsettings.yaml
Reading scope file zones.yaml
Rendering conf/named_slave_zones with mode 644 using template templates/slave_zones.erb
Rendering conf/named_master_zones with mode 644 using template templates/master_zones.erb
 conf/named_master_zones | 10 ++++++++++
 conf/named_slave_zones  |  9 +++++++++
 2 files changed, 19 insertions(+)

$ git diff
+// Bob
+zone "bar.com" {
+  type slave;
+  file "/srv/named/zones/slave/bar.com";
+  masters {
+    1.2.3.4;
+  };
+};
```
The rake construct:files task just runs a bunch of ERB templates over the zones hash – it’s basically identical to the templates I had in Puppet with just a few var name changes and slightly different looping, no more or less complex.
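To make the templating step concrete, here is a stand-alone sketch of rendering that zones hash into slave zone config. The hash mirrors zones.yaml above and the output matches the git diff shown; the template itself is a simplified stand-in for the real templates/slave_zones.erb, not its actual contents:

```ruby
require "erb"

# The same data as zones.yaml above, expressed as a Ruby hash
zones = {
  "Bob" => {
    "options" => { "masterip" => "1.2.3.4" },
    "domains" => ["foo.com", "bar.com"]
  }
}

# Simplified stand-in for templates/slave_zones.erb
template = <<~ERB
  <% zones.each do |owner, spec| %>// <%= owner %>
  <% spec["domains"].each do |domain| %>zone "<%= domain %>" {
    type slave;
    file "/srv/named/zones/slave/<%= domain %>";
    masters { <%= spec["options"]["masterip"] %>; };
  };
  <% end %><% end %>
ERB

# Render the hash into bind config, exactly what construct:files does in spirit
config = ERB.new(template).result(binding)
puts config
```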
This is the actual change that will hit production. No ifs or buts, that’s what will change in prod. When I rake test here without committing this, this actual production change is tested against the actual bits in the named binary that runs production today.
```
$ time rake test
docker run -ti --rm -v /home/rip/work/docker_bind:/srv/named -e TEST=1 ripienaar/bind
>> Checking named.conf syntax in master mode
>> Checking named.conf syntax in slave mode
>> Checking zones..
rake test  0.18s user 0.33s system 7% cpu 3.858 total
```
Again my work dir is mounted into the container version currently running in production, and my uncommitted change is tested using the bit for bit identical version of bind currently in prod. This is a massive confidence boost and the feedback cycle is a few seconds long.

Implementation Details
I won’t go into all the Dockerfile details, it’s just normal stuff. The image building and running of containers is not exciting. The layout of the services is something like this:
```
/service/bin/start.sh
/service/bin/update.sh
/service/bin/validate.sh
/service/zones/{bl,gl,wl}
/opt/rbldnsd-0.997a/rbldnsd
```
What is exciting is that I can introspect a running container. The Dockerfile has lines like this:
```
ENV UPDATE_METHOD /service/bin/update.sh
ENV VALIDATE_METHOD /service/bin/validate.sh
```
And an external tool can find out how this container likes to be updated or validated – and later monitored:
```
$ docker inspect rbldnsd
.
.
"Env": [
    "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
    "UPDATE_METHOD=/service/bin/update.sh",
    "VALIDATE_METHOD=/service/bin/validate.sh",
    "GIT_REF=fa9dd19d93e6d6cb7d5b2ebdc57f99cd2906df6f"
],
```
My update webhook basically just does this:
```
mco rpc docker runtime_update container=rbldnsd -S container("rbldnsd").present=1 --batch 1
```
So I use mcollective to target an update operation on all machines that run the rbldnsd container – 1 at a time. The mcollective agent uses docker inspect to introspect the container. Once it knows how the container wants to be updated it calls that command using docker exec.
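The core of that agent logic can be sketched like this – parse the inspect output, find the advertised UPDATE_METHOD, then run it in the container. The JSON here is a trimmed-down copy of the inspect output shown above, not live output, and the code is a simplified sketch of the agent, not its actual implementation:

```ruby
require "json"

# Trimmed-down sample of `docker inspect rbldnsd` output; the real agent
# would capture this from the docker CLI or API
inspect_json = <<~JSON
  [{"Config": {"Env": [
    "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
    "UPDATE_METHOD=/service/bin/update.sh",
    "VALIDATE_METHOD=/service/bin/validate.sh"
  ]}}]
JSON

# Turn the "KEY=value" Env strings into a lookup hash
env = JSON.parse(inspect_json).first["Config"]["Env"]
settings = env.map { |e| e.split("=", 2) }.to_h

# The container has told us how it wants to be updated; the agent would
# execute this command rather than print it
update_cmd = "docker exec rbldnsd #{settings['UPDATE_METHOD']}"
puts update_cmd
```

The nice property is that the updater stays generic: any container that advertises an UPDATE_METHOD can be updated by the same agent without it knowing anything service specific.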
Outcome Summary
For me this turned out to be a huge win. I had to do a lot of work on the image building side of things, the orchestration, deployment etc – things I had to do with Puppet too anyway. But this basically ticks all the boxes I had at the beginning of this post and quite a few more:
- A reasonable facsimile of the package, config, service triplet that yields idempotent builds
- A comfortable way to develop and test my changes locally with instant feedback – like unit tests for normal code, but here integration tests of infrastructure components using the same bits as in production.
- Much better visibility over what’s actually going to change, especially in complex cases where config files are built using templates
- An approach where my services are standalone and all have to think about their run, update and validation cadences, with those being introspectable and callable from the outside.
- My services are standalone artefacts and versioned as a whole. Not spread around the place on machines, in package repos, in data and in CM code that attempts to tie it all together. It’s one thing, from one git repo, stored in one place with a version.
- With validation built into the container and the container being a runnable artefact I get to do this during CI before rolling anything out just like I do on my CLI. And always the actual bits in use or proposed to be used in Production are used.
- Overall I have a lot more confidence in my production changes now than I had with the Puppet workflow.
- Changes can be rolled out to running containers very rapidly – less than 10 seconds and not at the slow Puppet run pace.
- My dev environment is hugely simplified yet much more flexible as I can run current, past and future versions of anything. With less complexity.
- A very nice middle ground between immutable servers and the need for updating content. Containers are still rebuilt and redeployed every night on schedule and they are still disposable, but not at the cost of day to day updates.
I’ve built this process into a number of containers now, some like this one that are services, and even some web ones like my wiki where I edit markdown files and they get rolled out to the running containers immediately on push.
I still have some way to go with monitoring, and these services are standalone rather than complex multi-component ones, but I don’t foresee huge issues there.
I couldn’t have achieved all these outcomes without a rapid way to stand up and destroy production-like environments that are isolated from the machine I am developing on. Especially if the final service is some loosely coupled combination of parts from many different sources. I’d love to talk to people who think they have something approaching this without using Docker or similar and be proven wrong, but for now this is a huge step forward for me.
So Puppet and CM tools are irrelevant now?
Getting back to the Puppet part of this post. I could come up with some way to mix Puppet in here too. There are though other interesting aspects about the Docker life cycle that I might blog about later which I think make it a bit of a square peg in a round hole to combine these two tools. Especially I think people today who think they should use Puppet to build or configure containers are a bit misguided and missing out. I hope they keep working on that though and get somewhere interesting, because omfg Dockerfiles, but I don’t think the current attempts are interesting.
It kind of gets back to the old thing where it turns out Puppet is not a good choice to manage deployments of Applications but it’s OK for Infrastructure. I am reconsidering what is infrastructure and what are applications.
So I chose to rethink things from the ground up – how would a nameserver service look if I considered it an Application and not Infrastructure, and how should an Application development life cycle around that service look?
This is not a new realisation for me; I’ve often wished and expressed the desire that Puppet Labs would focus a lot more on the workflow and the development cycle, providing tools and hooks for that and thinking about how to make it better. I don’t think that’s really happened. So the conclusion for me was that for this Application or Service development and deployment life cycle Puppet was the wrong tool. I also realise I don’t even remotely resemble their paying target audience.
I am also not saying Puppet or other CM tools are irrelevant due to Docker – that’s just madness. I think there’s a place where the 2 worlds meet, and for me I am starting to notice that a lot of what I thought was Infrastructure are actually Applications, and these have different development and deployment needs which CM, and Puppet especially, do not address.
Soon there will not be a single mention of DNS related infrastructure in my Puppet code. The container and related files are about equal in complexity and lines of code to what was in Puppet, the final outcome is about the same and it’s as configurable to my environments. The workflow though is massively improved, because now I have the advantages that Application developers had for this piece of Infrastructure. Finally a much larger part of the Infrastructure As Code puzzle is falling together and it actually feels like I am working on code, with the same feedback cycles and single verifiable artefact outcomes. And that’s pretty huge. Infrastructure is still being CM managed – I just hope to have a radically reduced Infrastructure footprint.
The big take away here isn’t that Docker is some magical technological bullet killing off vast parts of the existing landscape or destroying a subset of tools like CM completely. It brings workflow and UX improvements that are pretty unique and well worth exploring. And this is especially an area the CM folk have basically just not focussed on. The single biggest win is probably the single artefact aspect, as this enables everything I mentioned here.
It also brings a lot of other things from the daemon side – the API, the events, the stats etc that I didn’t talk about here and those are very big deals too wrt what future work they enable. But that’s for future posts.
Technically I think I have a lot of bad things to say about almost every aspect of Docker, but those are outweighed by this rapid feedback and increased overall confidence in making change at the pace I would like to.