Sometimes processes just die. It’s unavoidable. In most cases, it’ll give your users a chance to see the snarky error page you made late one night or, more likely, the ubiquitous “502 Bad Gateway” page. The art of getting your servers to automatically recover from unexpectedly lost daemons is part of Process Management. It’s just one of the tools on the sysadmin’s utility belt, but it’s a fundamental one.
There are many different process management solutions out there to choose from. In this post, I’ll walk through some of the more popular options and point out the good, the bad, and the ugly of each.
Obligatory disclaimer: I’m not an expert in process management, and the following reflects my personal opinions and impressions of each solution.
At first glance, Upstart seems like it’s only meant to replace /etc/init.d and similar daemonization techniques. (It says so right on the project’s homepage.) Maybe you already know how to write a half-decent init.d script, so why bother with these newfangled .conf files? Because Upstart does a heck of a lot more than replacing simple start/stop/restart daemonization, including keeping daemons alive!
In many cases, a short Upstart script can replace a more verbose init.d script to achieve the same goal of being able to start, stop, restart, and get the status of a daemon process. Of course, the devil is in the details. The majority of a typical init.d script tends to be dedicated to validating the environment, reading configuration, setting environment variables, etc. Upstart supports all that too. The cookbook goes into detail about everything Upstart can do.
Note: While the cookbook is a great resource, it can be overwhelming for a beginner…it certainly was for me.
Here’s a simple configuration to manage a Django-Celery worker process:
It’s straightforward enough that I won’t go through it line by line, but it is important to note that in this example, Upstart will immediately start the daemon again if it unexpectedly goes away, courtesy of the respawn stanza.
Save it as /etc/init/celery.conf and then run initctl reload-configuration to let Upstart know about it. (The documentation claims changes are automatically discovered in the /etc/init directory, but that has proved unreliable for me.) The daemon will now start when the server boots, stop when it shuts down, and can be manually controlled by the likes of service celery start or stop celery, where the service name is whatever comes before the .conf in the filename, in this case “celery”.
Overall, I like Upstart a lot. Its functionality is rich enough to let power-users achieve complex tasks while still facilitating simple daemonization thanks to well-designed defaults. Since all processes managed by Upstart are subprocesses of a master Upstart process, daemons that exit unexpectedly are immediately detected and respawned.
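To make the parent/child point concrete, here is a toy respawn loop in Python. It is only a sketch of the general idea, not how Upstart is implemented: the name run_with_respawn and the max_respawns cap are made up for illustration, and it assumes an exit status of 0 means the daemon stopped on purpose.

```python
import subprocess
import sys


def run_with_respawn(argv, max_respawns=3):
    """Run argv as a child process and start it again after a non-zero exit.

    Returns the list of exit codes observed, one per run.
    max_respawns is a hypothetical safety cap so the loop cannot spin forever.
    """
    exit_codes = []
    while True:
        child = subprocess.Popen(argv)
        # wait() blocks until the child exits, so the parent learns about
        # the death right away; no polling interval is involved.
        code = child.wait()
        exit_codes.append(code)
        # Assumption: exit status 0 is a deliberate stop, anything else a crash.
        if code == 0 or len(exit_codes) > max_respawns:
            return exit_codes


if __name__ == "__main__":
    # Stand-in "daemon" that always crashes with exit status 1.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(run_with_respawn(crashing, max_respawns=2))
```

The important part is that the manager is the direct parent of the daemon, which is why it hears about the exit without having to go looking for it.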
I like that I don’t have to worry about the question “What if Upstart itself dies?” On Ubuntu and several other distributions, Upstart is used to manage most of the core system-level processes like networking, syslog, ssh, tty terminals, etc. This is comforting. If Upstart dies, you’ll have bigger problems on your hands than your web server daemon disappearing.
Unfortunately, Upstart lacks support for custom commands. The ability to send arbitrary signals to a process or execute arbitrary scripts, as init.d scripts allow, comes in handy. The convenience of custom commands like /etc/init.d/nginx configtest, which checks my nginx configuration syntax without affecting the running nginx service, is useful enough to keep me from migrating my nginx daemonization to Upstart.
Monit is an established player in the process management game. Its sole purpose is to monitor daemon processes, files, directories, filesystems, etc. on your server and respond with appropriate actions whenever something is not as it should be.
Here’s a simple configuration to monitor an SSH server daemon process:
What’s going on here? Given the daemon’s PID filepath and how to start/stop the daemon, Monit will check that the process exists every 60 seconds and start it anew if it is not found. What’s more, Monit will also attempt an SSH connection on port 22 and restart the ssh server process if the test fails.
That last bit is the true power of Monit. On top of simply checking that a process exists, it can perform network tests as well as check system resources like CPU usage, memory consumption, number of child processes, and many other things. This can aid greatly in determining if a webserver is correctly serving traffic on port 80. It’s also a great band-aid for a process known to have a memory leak.
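The two kinds of checks described above (does the process exist, and does it answer on its port) can be roughly sketched in Python. This is an illustration only, not Monit's code: pid_is_running and port_is_open are hypothetical helper names, the PID check assumes a POSIX system, and a bare TCP connect is a weaker test than the protocol-level checks Monit can perform.

```python
import os
import socket


def pid_is_running(pidfile):
    """Return True if pidfile names a process that currently exists (POSIX only)."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed PID file: treat as not running.
        return False
    if pid <= 0:
        return False
    try:
        # Signal 0 performs the existence/permission check without
        # actually delivering a signal to the process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A monitoring loop would call both on an interval and run the configured start or restart command when either one returns False.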
After Monit is initially configured and started, it’s controlled via the command line. Executing monit summary gives the state of all the processes it is monitoring. Monitoring for a given process can be temporarily disabled/enabled with the monit unmonitor and monit monitor commands. This can be useful as part of a deploy process for ensuring a daemon is stopped while a source code directory is rsynced or otherwise updated. A handful of other actions are also available.
By the way, a built-in web interface can be enabled to provide much of the same functionality available from the command line in a more user-friendly way. Just be sure to properly lock down the web interface from public access, since it exposes so many powerful and potentially dangerous functions.
Monit is a pretty solid and useful tool. It’s portable and easy to compile since it has very few library dependencies. Note that, unlike most of the other solutions described here, Monit does not daemonize processes; it only monitors them. This can be seen as either a positive or negative. On one hand, separation of concerns is a good way to keep things simple. On the other hand, it’s one more system to maintain.
Monit seems to have a good developer community around it with a fairly responsive mailing list. The wiki includes an amazing collection of configuration examples for just about every common service out there.
Monit can also be used to monitor the existence, contents, and other properties of arbitrary files or directories on the server. While this likely has some interesting use cases, I imagine these features have been somewhat superseded with the advent of configuration management tools like Puppet and Chef.
I ran into several gotchas while getting familiar with Monit.
When specifying the start program and stop program directives, don’t make any assumptions about the environment’s PATH variable; use absolute paths for all executables. This led me to more than a few head-banging moments.
Don’t forget about Monit when manually starting or stopping a daemon it is watching. This can lead to a process either inexplicably being resurrected shortly after stopping it or, even worse, a process left unmonitored once it is started again! Either commit yourself to never forgetting that Monit is running (good luck with that) or get in the habit of using the monit stop and monit start commands when manually controlling daemon processes.
Programmatically issuing commands like monit unmonitor and monit monitor in rapid succession will often lead to errors. To avoid this, use groups intelligently so that only one command is ever required at a time. If groups aren’t enough, adding a one second sleep between monit commands is a reasonable solution.
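For deploy scripts, the sleep workaround might look something like this in Python. It is a minimal sketch: run_spaced and its runner parameter are hypothetical names introduced here, and the one second default simply mirrors the pause suggested above.

```python
import subprocess
import time


def run_spaced(commands, pause_seconds=1.0, runner=subprocess.call):
    """Run each command in order, sleeping between consecutive commands.

    commands is a list of argv lists. runner defaults to subprocess.call and
    is only a parameter so the function can be exercised without Monit.
    Returns the list of exit statuses.
    """
    statuses = []
    for index, argv in enumerate(commands):
        if index > 0:
            # Give the previous command time to settle before the next one.
            time.sleep(pause_seconds)
        statuses.append(runner(argv))
    return statuses
```

A deploy script could then call run_spaced([["monit", "unmonitor", "celery"], ["monit", "monitor", "celery"]]), where "celery" stands in for whatever service name your Monit configuration uses.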
Supervisor is a Python-based process management solution. It’s one of the newer contenders in the space, and shares design principles and goals with Upstart. As such, it takes care of daemonization as well as process monitoring.
Here’s a configuration analogous to the one shown for Upstart:
Configuration is done in the familiar .ini format rather than a custom syntax as is the case for Upstart and Monit. While it is more verbose, it remains fairly readable. Configuring Supervisor itself is a bit more involved than some other solutions. Fortunately, the folks behind Supervisor realized this and included the echo_supervisord_conf command, which you can run on the command line to send a default configuration to stdout.
After everything is configured, start Supervisor by executing supervisord. The processes Supervisor manages are automatically started as subprocesses of the main supervisord process. As with Upstart, a managed process will be restarted immediately if it unexpectedly exits. Note that the autorestart configuration option lets you explicitly define what qualifies as “unexpectedly exiting”.
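As I understand the Supervisor documentation, autorestart accepts false, unexpected, or true, and the unexpected mode compares the exit status against the program's exitcodes setting. Here is a simplified model of that decision in Python. The name should_restart is hypothetical, the default of (0,) is an assumption (the default list differs between Supervisor versions, so check yours), and the real logic also depends on the process state, which this sketch ignores.

```python
def should_restart(autorestart, exit_code, expected_exit_codes=(0,)):
    """Model of the autorestart decision for a process that has exited.

    autorestart is one of the strings "true", "false", or "unexpected".
    expected_exit_codes plays the role of the exitcodes setting.
    """
    if autorestart == "true":
        # Always restart, whatever the exit status was.
        return True
    if autorestart == "false":
        # Never restart automatically.
        return False
    if autorestart == "unexpected":
        # Restart only when the exit status is not on the expected list.
        return exit_code not in expected_exit_codes
    raise ValueError("unknown autorestart value: %r" % (autorestart,))
```

So a worker that exits with status 1 would come back under unexpected, while one that exits cleanly with status 0 would stay down.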
Supervisor can be controlled through supervisorctl on the command line. Commands may be issued one at a time (e.g. supervisorctl restart celery) or by first starting a Supervisor terminal (by running supervisorctl with no action specified) and then issuing actions directly (e.g. restart celery).
Supervisor is an exciting prospect! Since it is written in Python, it’s arguably less intimidating for the average user to go source-code diving to investigate a bug or satisfy curiosity. Being a Python package also makes it trivial to install on any system that has easy_install or pip.
Again, since all monitored processes are subprocesses of the supervisord process, Supervisor is instantly notified of a process dying and acts to respawn it instantly. Instant is nice compared to the interval that Monit runs its checks on (which tends to be set to 60 seconds by default).
It’s not all roses. If Supervisor itself were to crash (this should be a very rare occurrence), all daemons it controlled would also go away. I’m not trying to scare anyone away; it’s just something to consider, since it’s every sysadmin’s job to be paranoid and consider all the what-ifs they can.
While the .ini configuration syntax is nice and has a lot of options available, I found it odd and constricting sometimes. Loading environment variables from a defaults file seems all but impossible (please correct me if I’m wrong!). I thought perhaps making a .ini version of a defaults file would work, but still no luck due to Supervisor using an older variant of Python’s ConfigParser class. The simplest way around this I could come up with is to wrap the daemon’s start command in a shell script that first loads any extra environment variables. A hack for sure, but not a horrible one.
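Here is a rough sketch of that wrapper idea, in Python rather than shell. The function names and the simple KEY=VALUE file format are assumptions made for illustration; a real defaults file that relies on shell expansion or multi-line values would need a real shell to source it.

```python
import os


def load_defaults(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments.

    An optional leading "export " is tolerated and surrounding quotes are
    stripped. No shell expansion is performed.
    """
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith("export "):
                line = line[len("export "):]
            key, sep, value = line.partition("=")
            if not sep:
                # Not a KEY=VALUE pair; ignore it.
                continue
            env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def exec_with_defaults(defaults_path, argv):
    """Replace the current process with argv, adding variables from the file."""
    env = dict(os.environ)
    env.update(load_defaults(defaults_path))
    # exec (rather than spawning a child) keeps the daemon as the direct
    # child of the process manager, so exits are still noticed.
    os.execvpe(argv[0], argv, env)
```

The program's command in the Supervisor configuration would then point at a small script that calls exec_with_defaults with the defaults file and the real daemon command.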
Daemonizing Supervisor itself is a task. I would recommend Upstart as a good candidate for doing so. ;-)
I’ve yet to investigate how well Supervisor is supported by Puppet or Chef. This is an important point since Supervisor isn’t the native daemonization facility on servers and will therefore require a custom plugin to properly start/stop/restart processes via a configuration management system. A third-party Puppet module exists and has a respectable following, but I have not tried it out just yet.
Circus is another Python-based process manager similar to Supervisor, but with a twist. In addition to managing processes, Circus can also create sockets and direct traffic to/from them. It’s clearly been designed and tailored with webserver stacks in mind and may not be the best choice for general purpose process management. Nonetheless, it is a relatively new project with some innovative concepts, so I’d say it’s worth looking into.
As it stands now, I have major apprehensions about Circus. After reading a bit about it, I was excited to try it out. It seemed like it could become a core part of all my web projects. It’s maintained by a group within Mozilla, and corporate backing of an open source project is usually a great indicator of good support and forward progress.
Unfortunately, I had a bad first experience with it.
While getting my “Hello, World!” of Circus up and running, I ran into several inaccuracies in the documentation.
The deployment page gives a sample Upstart script to daemonize Circus. Great! A quick copy/paste later, Circus was failing to start up. After looking closer, it turned out there were two typos. First, a newline was missing between the respawn and exec directives. Second, en dashes were used rather than double hyphens before the log-output and pidfile parameters. That was a particularly frustrating and difficult-to-spot typo!
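Typographic dashes are nearly invisible in a terminal, so a quick scan of anything pasted from a web page can save some head-scratching. A small Python sketch follows; find_suspect_dashes is a hypothetical helper, and the set of characters it looks for is just my guess at the usual culprits.

```python
# Characters that look like "-" but are not the ASCII hyphen-minus.
SUSPECT_DASHES = {
    "\u2013": "en dash",
    "\u2014": "em dash",
    "\u2212": "minus sign",
}


def find_suspect_dashes(text):
    """Return (line, column, name) for each look-alike dash in text.

    Line and column numbers are 1-based.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECT_DASHES:
                hits.append((lineno, col, SUSPECT_DASHES[ch]))
    return hits
```

Running it over a pasted config before saving the file would flag a line such as exec circusd followed by an en dash and log-output.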
I also found a quirk on the configuration page. The [env:program] block in the first example configuration is indented. While still correct syntax, it made me second guess myself since indentation typically is not used in .ini files.
I submitted a pull request and an issue to address these problems. They were accepted/addressed promptly and are now reflected on the website, so kudos to the Circus team for that!
These may seem like nitpicks, and on their own they probably are. However, they could suggest a larger issue if the documentation is rarely proofread or sanity-checked.
Part of the case for Circus is that it simplifies and unifies the management of a typical webserver stack. For example, it can outright replace a webserver like Gunicorn or uWSGI and take over the job of spawning web workers. I won’t try to explain any more; just read their description. This is exciting because the webserver and process manager become one and the same, which reduces the number of moving parts in a stack by one. This is at odds with the notion of separating concerns, but will likely be an elegant and simplified solution for most users.
After finally getting the Circus daemon up and running, I tried issuing commands to it using the circusctl command line tool. The commands available are similar to those for Supervisor’s supervisorctl command. Unfortunately, the commands would frequently time out or completely fail without any useful explanation as to why. What’s worse, after one command failed, all subsequent commands would fail until I fully restarted the Circus daemon. A couple of times, the commands would hang indefinitely, consuming 100% of the CPU, until I forcefully killed the circusctl process.
These issues were enough for me to lose my confidence in Circus being reliable in a production environment, so I curtailed my evaluation of it. Of course, it’s very possible I misconfigured Circus in a fundamental way, causing myself to experience these issues. Either way, the lack of helpful error messages was not particularly beginner-friendly.
There are lots of options in the field of Process Management. The ones I’ve described above are only those that I’ve had the opportunity to take for a test spin and, in some cases, deploy in production. There is a lot of discussion and debate about the right way to do this stuff out there, and the appropriate mix of tools will vary a lot depending on your scenario and requirements. If you are new to this, hopefully I’ve given you enough of an (opinionated) survey of the space to get started. If you already have some experience with process management, I would love to hear what you think!
By the way, if you’re looking for a job and this interests you, Crocodoc is hiring. We’re a fast-moving team and frequently get to evaluate new technologies that can improve our product…or just because we want to! I’d love to have you join us.
written by Matt Long, Crocodoc Cofounder and Lead Developer
This is not meant to be an exhaustive list of Process Management systems. There are many other great tools out there that I simply haven’t had the time to evaluate yet. I’ll do a follow up post on another batch of systems. Please let me know if you have any suggestions!