Friday, January 28, 2011

Shell script to execute something when one of the daemon dies?

I just had to set up two instances of a daemon running on different ports. Let's say they both serve critical roles for some applications.

How can I automatically run a task (e.g. a shell script) that checks both daemons and reacts when one of them fails to serve?

What kind of script can continuously check that the daemons are alive, and run some other job if one of them stops unexpectedly?

  • It depends a lot on the daemon.

    If the daemon has an API and you can communicate with it via either a TCP/IP socket or a UNIX socket, then you can check its health directly.

    If it listens on a TCP/IP port, for example, you can write a script that connects to it, expects a certain response, and also times how long it took to get that response. You can feed that to monitoring tools such as Nagios, Munin, or others. Whatever script you end up writing to carry out a single check can then easily be integrated into a Nagios plugin, for example, which is probably what you want to do once you've written your check script.

    If it provides nothing useful and does not listen on a port, all you can really do is check for the process using lsof, netstat, or something like that and make sure it exists, but you can't really do a health check.

    You'll really need to be more specific before anyone here can offer you something useful.

    Find the PID of the running daemon, then look under /proc/<PID>/fd to see which pipes, sockets and files the daemon is interacting with. That might help you get started.

    From Xerxes
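
A rough sketch of the single TCP check described in the first answer, assuming bash (for its /dev/tcp redirection), GNU date and coreutils' timeout. The function name check_tcp is hypothetical, and it only tests that a connection opens and how long that took; expecting a specific reply would need a protocol-specific read on top of it. It prints a Nagios-style status line and uses the plugin exit codes (0 OK, 2 CRITICAL), so the same script could later be wrapped as a Nagios check.

```shell
#!/bin/bash
# check_tcp HOST PORT [TIMEOUT]: hypothetical Nagios-style check that only
# tests whether a TCP connection can be opened, and how long that took.
# Exit codes follow the Nagios plugin convention: 0 = OK, 2 = CRITICAL.
check_tcp() {
    local host=$1 port=$2 limit=${3:-5}
    local start end elapsed
    start=$(date +%s%N)   # nanoseconds since the epoch (GNU date)
    # Bash's /dev/tcp/HOST/PORT redirection opens a TCP connection;
    # timeout(1) gives up if that takes longer than $limit seconds.
    if timeout "$limit" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        end=$(date +%s%N)
        # Milliseconds, including the cost of spawning the helper shell.
        elapsed=$(( (end - start) / 1000000 ))
        # Text before "|" is the status line; "time=...ms" is performance data.
        echo "OK - $host:$port connected in ${elapsed}ms | time=${elapsed}ms"
        return 0
    fi
    echo "CRITICAL - cannot connect to $host:$port"
    return 2
}
```

For example, 'check_tcp 127.0.0.1 1234' for an instance listening on port 1234 (a placeholder port). Because the timing includes starting the helper shell, treat it as a rough figure rather than a precise latency.
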
  • My advice would be to use your (hopefully) existing monitoring infrastructure.

    I use Nagios and I query my Linux machines with SNMP. The Linux MIB allows me to retrieve the names of all running processes, along with PID and arguments. I use this to monitor various daemons (like crond) that don't open ports.
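
A small sketch of the SNMP approach, assuming net-snmp's snmpwalk and an agent that exposes HOST-RESOURCES-MIB (where hrSWRunName holds the running process names). The function name snmp_has_process is hypothetical, and it only parses the walk output, so it can be tried without an agent.

```shell
#!/bin/bash
# snmp_has_process NAME: reads the output of
#   snmpwalk -v2c -c COMMUNITY HOST HOST-RESOURCES-MIB::hrSWRunName
# on stdin and succeeds if a process named exactly NAME is listed.
snmp_has_process() {
    # Lines look like:  HOST-RESOURCES-MIB::hrSWRunName.812 = STRING: "crond"
    # Including the closing quote means "crond" does not match "crond2".
    grep -qF -- "= STRING: \"$1\""
}
```

Usage would be something like 'snmpwalk -v2c -c public myhost HOST-RESOURCES-MIB::hrSWRunName | snmp_has_process crond || echo "crond is not running on myhost"', where the community string "public" and the host "myhost" are placeholders.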

  • monit is great in this case: it runs on localhost, so you don't need a network connection to restart your daemon (in case the network fails, or the daemon itself is responsible for networking). It also has a small footprint on the system, and you can use it to monitor your other daemons, disk space, etc. as well.

    Create a start/stop script (similar to those in /etc/init.d/) and create symlinks for it in the runlevels your system uses for normal operation, so that your daemon starts at boot and stops properly at shutdown. If your daemon doesn't create a pidfile, you can have start-stop-daemon create one.

    After that, install monit and create configuration for your daemon, something like this:

    check process daemond with pidfile /var/run/daemond.pid
        start program "/etc/init.d/daemond start"
        stop program "/etc/init.d/daemond stop"
        if failed port 1234 type TCP for 5 times within 10 cycles then restart
        if 3 restarts within 5 cycles then alert
    

    This configuration ensures that if the daemon stops responding on TCP port 1234, or stops running altogether, it will be restarted using the init script. monit will also send you an alert by email, or do other things, depending on how you configure it. Just check out the monit(1) man page.

    Gogonez: Thanks, monit is good enough for this. I now have control over both daemons and can restart them when something goes wrong.

    "check process sphnx-a with pidfile /usr/local/sphinx-a/var/run.pid"

    I haven't had time to read the whole monit man page yet (I'm in a rush). Can anyone tell me how to run a script (a PHP script) when monit detects a failure?
    Maciej Delmanowski: @Andri: You should call your PHP script from the init script's "start" action; from the shell it should be as simple as 'php path/to/script.php'. You need the php5-cli package for this to work.
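
For reference, here is roughly what a start/stop script that maintains its own pidfile can look like, written in plain bash rather than with start-stop-daemon (on Debian-style systems, start-stop-daemon's --background and --make-pidfile options do the pidfile part for you). DAEMON, DAEMON_ARGS and the default PIDFILE are hypothetical placeholders; the default pidfile matches the path in the monit example above, and the daemon must stay in the foreground so that $! is its real PID.

```shell
#!/bin/bash
# Minimal init-style start/stop script that writes its own pidfile.
# DAEMON, DAEMON_ARGS and PIDFILE are hypothetical placeholders.
DAEMON=${DAEMON:-/usr/local/bin/daemond}
DAEMON_ARGS=${DAEMON_ARGS:-}
PIDFILE=${PIDFILE:-/var/run/daemond.pid}

is_running() {
    # kill -0 sends no signal; it only checks that the PID exists and may be signalled.
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

start() {
    if is_running; then
        echo "already running (pid $(cat "$PIDFILE"))"
        return 0
    fi
    # DAEMON_ARGS is left unquoted on purpose so it splits into separate words.
    nohup "$DAEMON" $DAEMON_ARGS >/dev/null 2>&1 &
    echo "$!" > "$PIDFILE"
}

stop() {
    if is_running; then
        kill "$(cat "$PIDFILE")"
    fi
    rm -f "$PIDFILE"
}

# Dispatch only when called with an argument, so the functions can also be sourced.
if [ "$#" -gt 0 ]; then
    case "$1" in
        start) start ;;
        stop) stop ;;
        restart) stop; sleep 1; start ;;
        status) if is_running; then echo running; else echo stopped; exit 3; fi ;;
        *) echo "Usage: $0 {start|stop|restart|status}" >&2; exit 2 ;;
    esac
fi
```

Installed as /etc/init.d/daemond, this matches the start and stop programs in the monit config. Following the comment above, a line such as 'php path/to/script.php' could be added to the start function to run a PHP hook on every (re)start.
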
  • Well, I have a favorite, incredibly simple way to catch something when it dies. Just start it with:

    /path/to/daemon || echo "/path/to/daemon died; do X" | mail -s "daemon died" user@domain.com

    or:

    while true; do /path/to/daemon || echo "/path/to/daemon died; restarting" | mail -s "daemon died" user@domain.com; done

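    Or something like that. This works when the daemon runs in the foreground (rather than forking into the background) and exits with a non-zero status if there's a problem; in that case it gives you very easy and reliable monitoring.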

    From Outdoor
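
The restart loop above can be made a little safer. This sketch keeps the same idea (restart the daemon whenever it exits, send mail on a non-zero exit status) but pauses between restarts and gives up after a number of consecutive failures, so a daemon that dies instantly does not spin the CPU or flood your mailbox. The function name supervise, RESTART_DELAY, MAX_FAILURES and the recipient address are hypothetical, and as with the one-liners, the daemon has to run in the foreground.

```shell
#!/bin/bash
# supervise COMMAND [ARGS...]: restart COMMAND whenever it exits, mail on
# non-zero exits, and stop after MAX_FAILURES consecutive failures.
# RESTART_DELAY, MAX_FAILURES and the recipient are hypothetical settings.
RESTART_DELAY=${RESTART_DELAY:-5}
MAX_FAILURES=${MAX_FAILURES:-10}

notify() {
    # mail(1) reads the message body from stdin; -s sets the subject.
    echo "$1" | mail -s "daemon watchdog" user@domain.com
}

supervise() {
    local failures=0 status
    while [ "$failures" -lt "$MAX_FAILURES" ]; do
        if "$@"; then
            failures=0                 # clean exit: reset the failure counter
        else
            status=$?                  # exit status of the daemon itself
            failures=$((failures + 1))
            notify "$1 exited with status $status (failure $failures of $MAX_FAILURES)"
        fi
        sleep "$RESTART_DELAY"
    done
    notify "$1 failed $MAX_FAILURES times in a row; giving up"
    return 1
}
```

It would be started once, e.g. 'supervise /path/to/daemon' from a boot script under nohup. Giving up after repeated failures is a deliberate choice: it mirrors the rate limiting that init applies to respawned processes.
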
  • Maybe you could try starting the tasks using the init daemon (assuming this is Unix, as the other answers seem to have done). Check out the manual page for inittab, which should explain in detail how to do it. You can arrange for your process to start at boot time or in specific runlevels.

    If you use the respawn option, your process will be restarted automatically if it fails; i.e. the init daemon is a built-in process monitor as well as a process starter. However, init also has some "intelligence": if your daemon is restarted too often in too short a space of time, init will stop trying to start it again for a few minutes. This makes it harder for a rogue process to accidentally consume all the CPU on the machine, for example.

    init will normally be configured to log entries somewhere under /var/log as well, so you get logging for free.
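
Each inittab line has the form id:runlevels:action:process (see inittab(5)), so the two instances from the question would simply be two respawn lines with different ids. This sketch just prints such a line; the function name inittab_entry is hypothetical, as are the ids, runlevels, path and --port flag in the usage line. The daemon has to stay in the foreground: if it forks and the parent exits, init sees an exit right away and keeps respawning it until the rate limit described above kicks in.

```shell
#!/bin/bash
# inittab_entry ID RUNLEVELS COMMAND [ARGS...]: print an /etc/inittab line
# that makes init respawn COMMAND in the given runlevels.
inittab_entry() {
    local id=$1 runlevels=$2
    shift 2
    # inittab(5) limits the id field to 1-4 characters.
    if [ "${#id}" -lt 1 ] || [ "${#id}" -gt 4 ]; then
        echo "inittab_entry: id must be 1-4 characters" >&2
        return 1
    fi
    # "$*" joins the command and its arguments with single spaces.
    printf '%s:%s:respawn:%s\n' "$id" "$runlevels" "$*"
}
```

For example, 'inittab_entry dmA 2345 /usr/local/bin/daemond --port 1234 >> /etc/inittab' (and a second line with id dmB for the other port), followed by 'telinit q' so that init re-reads /etc/inittab.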

  • Not sure why no one has listed the obvious simple solution:

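    #!/bin/bash
    # The [p] bracket keeps grep from matching its own command line.
    if ! ps ax | grep -q "[p]rogie"; then
        # progie is not running: replace this line with your own action
        echo "progie is not running" >&2
    fi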
    
    From tylerl
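
A pidfile-based variant of the check above, aimed at the two-instance setup from the question. Identifying each instance by its pidfile rather than by name avoids the "grep matched something else" problem and tells two copies of the same binary apart. The pidfile paths and the on_dead action are hypothetical. kill -0 needs permission to signal the process, so run it as root or as the daemon's user; also note that a stale pidfile whose PID has been reused by another process would still look alive.

```shell
#!/bin/bash
# check-daemons PIDFILE...: report every instance whose pidfile is missing
# or points to a dead process. Exit status = number of dead instances.
on_dead() {
    # Hypothetical hook: replace with a restart command, mail(1) call, etc.
    echo "instance with pidfile $1 is not running" >&2
}

pid_alive() {
    local pid
    pid=$(cat "$1" 2>/dev/null) || return 1
    # kill -0 sends no signal; it only tests that the PID exists and may be signalled.
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

check_all() {
    local f dead=0
    for f in "$@"; do
        if ! pid_alive "$f"; then
            on_dead "$f"
            dead=$((dead + 1))
        fi
    done
    return "$dead"   # fine for a handful of instances (status is capped at 255)
}

# Run the checks only when pidfiles are given, so the functions can be sourced.
if [ "$#" -gt 0 ]; then
    check_all "$@"
fi
```

Saved as, say, /usr/local/bin/check-daemons, it could be run from cron every few minutes as 'check-daemons /var/run/daemond-a.pid /var/run/daemond-b.pid'.
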
  • I'd recommend monit or god for this kind of task over a network monitoring tool like Nagios, or over a custom script like the ones other posters recommend.

    Another poster included a config snippet for monit; god is similar in concept, but written in Ruby. Both tools are written to handle exactly this kind of situation. They provide far more flexibility than a custom script would (what if the process exists but is non-responsive? What if your grep matches a different process that you didn't anticipate?), and they provide thresholding and hysteresis support "out of the box".

    I might use Nagios as a separate alerting and notification path, but I wouldn't use it as the sole way of monitoring and restarting the process, for the same reason that I wouldn't use handwritten scripts: it doesn't provide enough flexibility for most monitoring situations. While you can trigger events from it (such as restarting a service), it doesn't have the flexibility that tools like monit do.
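
To make the "thresholding and hysteresis" point concrete: monit's 'if failed ... for 5 times within 10 cycles' acts only when enough recent checks have failed, not on a single bad result. A hand-rolled sketch of that windowed rule in bash follows; the function name threshold_exceeded, the state-file path and the K/N values are hypothetical.

```shell
#!/bin/bash
# threshold_exceeded STATEFILE RESULT K N
#   RESULT is 0 for a passed check, 1 for a failed one.
#   Succeeds when at least K of the last N recorded results were failures,
#   similar in spirit to monit's "for K times within N cycles".
threshold_exceeded() {
    local state=$1 result=$2 k=$3 n=$4 history failures
    # Append this cycle's result and keep only the last N entries.
    history=$( { cat "$state" 2>/dev/null; echo "$result"; } | tail -n "$n" )
    printf '%s\n' "$history" > "$state"
    # grep -c prints the number of failure lines (0 if none).
    failures=$(grep -c '^1$' "$state")
    [ "$failures" -ge "$k" ]
}
```

It would be called once per check cycle, e.g. 'if your_check; then r=0; else r=1; fi; threshold_exceeded /var/tmp/daemond.state "$r" 5 10 && /etc/init.d/daemond restart', where your_check stands for whatever test you use. Even so, a dedicated tool like monit or god handles this (plus alerting and restart limits) with far less code to maintain.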
