Error checking

Sat Apr 7 16:12:40 UTC 2012

At a previous workplace, an HP-UX machine running Sendmail suddenly started beeping regularly, at intervals of a few minutes. A closer look showed that the X server was no longer starting up properly.

Yes, I know, an X server is not supposed to run on a mail server. But that is not the problem I am talking about here.

Further examination showed that the X server failed to start because of wrong permissions in /etc: everything there was suddenly owned by the group mail and had 664 permissions. How could this have happened?

The culprit was easily found: a cron job along the lines of

13 * * * * cd /var/spool/mail; find . -type f -exec chgrp mail {} \; -exec chmod 664 {} \;

At some point, the remote mount of /var/spool/mail was down, making the cd command fail. However, because the commands were separated only by a semicolon, the cron job just continued to run and performed that nasty find command all over the root file system.
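
The failure mode can be reproduced harmlessly in a scratch directory. This is only a sketch: the names victim and missing are made up for illustration (missing stands in for the unavailable mount), and sh -c is used to mimic a shell being handed the whole command line.

```shell
#!/bin/sh
# Scratch directory with one file that must NOT be touched by accident.
scratch=$(mktemp -d) || exit 1
touch "$scratch/victim"

# Same shape as the cron job: cd fails, but ';' runs find anyway,
# so find operates on whatever the current directory happens to be.
unguarded=$(cd "$scratch" && sh -c 'cd ./missing 2>/dev/null; find . -type f')

# With '&&', find is only reached if cd succeeded.
guarded=$(cd "$scratch" && sh -c 'cd ./missing 2>/dev/null && find . -type f || echo "find skipped"')

echo "with ';' : $unguarded"   # prints: with ';' : ./victim
echo "with '&&': $guarded"     # prints: with '&&': find skipped

rm -rf "$scratch"
```

In the first case find happily lists a file it was never meant to see; replace the listing with chgrp and chmod and you get the incident described above.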

We were lucky that, because of the same outage, the server could not reach the user home directories either...

The Solution

First of all: don't resort to such weird hacks to work around a setup problem in, say, Sendmail. These hacks never help. In this case, the hack was only necessary because of a bug in the procmail setup. Typically, procmail gains root privileges through its setuid bit, then switches to the target user's context in order to run the commands from that user's .procmailrc file. For some reason, this mechanism was not working on the system, and the previous admins thought it a good idea to install such a cron job to work around the problem. As a side effect, any user had full access to every other user's mail simply by putting the right commands in their .procmailrc.

This specific problem, however, stemmed from carelessness in writing shell scripts. One should never run such a critical command on anything it is not meant for. In this case, this means one should either have passed an explicit path as the argument to find:

13 * * * * find /var/spool/mail -type f -exec chgrp mail {} \; -exec chmod 664 {} \;

Or, one should check the status of the cd command and abort on failure:

13 * * * * cd /var/spool/mail && find . -type f -exec chgrp mail {} \; -exec chmod 664 {} \;

Or, one could make a habit of letting POSIX shells abort implicitly on errors by using set -e:

13 * * * * set -e; cd /var/spool/mail; find . -type f -exec chgrp mail {} \; -exec chmod 664 {} \;
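
A small sketch of what set -e does here, using a path that is assumed not to exist (/no/such/dir is a placeholder, not a real location) and an echo standing in for the find command:

```shell
#!/bin/sh
# The inner shell runs with set -e: the failing cd terminates it
# immediately, so the echo standing in for find is never reached.
result=$(sh -c 'set -e; cd /no/such/dir 2>/dev/null; echo "find would run here"' \
         || echo "aborted before find")

echo "$result"   # prints: aborted before find
```

One caveat worth keeping in mind: set -e does not trigger for commands whose status is being tested, such as the condition of an if or the left-hand side of && and ||, so it is a safety net rather than a replacement for explicit checks.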

Any of these measures would have prevented the catastrophic failure that in the end led to the system being reinstalled in order to restore the permissions to normal.
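
If the job grows beyond a one-liner, the same ideas can be combined in a small script that cron calls. The following is a minimal sketch, not what was actually deployed: the function name fix_spool_perms and its error message are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: set group and mode on all regular files below a directory.
# Usage: fix_spool_perms DIRECTORY GROUP
fix_spool_perms() {
    spool=$1
    group=$2

    # Explicit check: refuse to run unless the target is an existing directory.
    if [ ! -d "$spool" ]; then
        echo "fix_spool_perms: $spool is not a directory, aborting" >&2
        return 1
    fi

    # Explicit start path: find never depends on the current directory.
    find "$spool" -type f -exec chgrp "$group" {} \; -exec chmod 664 {} \;
}

# The cron job would then boil down to something like:
#   fix_spool_perms /var/spool/mail mail || exit 1
```

Passing the directory as an argument to find means a failed cd can no longer redirect the command somewhere else, and the check up front turns a missing directory into a visible error instead of a silent one.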

Lesson learned: check for error conditions, and bail out to prevent worse things from happening.


Posted by OpBaI | File under: unix, shell