Five diagnostic questions for email

I used to get a lot of e-mail problem reports of this sort:

  • The e-mail is not working
  • Person X is not getting an important mail
  • There’s something wrong with the mail system
  • Please check that the mail servers are working – people are saying something’s wrong
  • Nobody can send mail
  • Nobody is getting mail
  • Person X is getting spam
  • Company X says they can’t send mail to any of our servers

Eventually I figured out what specific information actually helps to get to the bottom of the problem – and it is these five things:

  1. Sender e-mail address
  2. Recipient e-mail address
  3. Time of sending the mail
  4. SMTP server that handled the mail
  5. What was the error

Experience says that without all of these bits of information, or a reasonable facsimile for each, you have a pretty good chance of being led a merry dance – fishing for a problem, and maybe finding it, but probably not.  (The not-so-handy acronym “STRES” is not in the dictionary, and really doesn’t help much, so don’t bother with it.)

The reason for each of these questions is:

  1. Sender e-mail address: appears in log files, might be invalid in some way, might be blacklisted.  Actually, the envelope sender appears in the logs, and the From: header does not.
  2. Recipient e-mail address: appears in log files, might be invalid in some way, might be affected by DNS, IP routing, SMTP server failure, local delivery failure (e.g. long queue), spam filtering failure or “incorrect” success.  This is the most important part of the report: if you have to choose just one thing to know about the mail failure you are diagnosing, choose the recipient address.
  3. Time of sending: appears in the log files, tells you which particular log file to look in, tells you whether it’s possibly a transient failure, related to an outage, related to an update.
  4. SMTP server that handled the mail: you have to start somewhere – the last place the mail was seen alive is a good place – sometimes the mail has not even left the sending computer, and the answer to this question tells you that.
  5. What was the error: it is amazing how many times I get asked to solve a problem with nobody telling me what the actual problem is.  It’s obvious to the person asking the question, but somehow the telepathic message never gets through.  Knowing the actual error makes such a difference.  Sometimes the error is not even a failure – “the mail system is broken” can mean “I am getting my mail, and some of it is spam.”
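To show how the sender, recipient and time narrow down a log search, here is a minimal sketch. It assumes a made-up, simplified log format in which each line starts with an ISO timestamp and carries `from=<...>` and `to=<...>` on the same line; real mail servers log differently (often spreading sender and recipient across several lines), so the parsing would need adapting. The function name `matching_log_lines` and the 30-minute default window are likewise just choices for this example.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def matching_log_lines(lines: Iterable[str], sender: str, recipient: str,
                       sent_at: datetime,
                       window: timedelta = timedelta(minutes=30)) -> List[str]:
    """Return log lines for this sender/recipient within `window` of `sent_at`.

    Assumed (hypothetical) line format:
    2024-03-05T10:15:02 mx1 from=<a@example.com> to=<b@example.org> status=...
    """
    want_from = f"from=<{sender.lower()}>"
    want_to = f"to=<{recipient.lower()}>"
    hits = []
    for line in lines:
        stamp, _, rest = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # not a line in the assumed format
        if abs(when - sent_at) > window:
            continue  # outside the time the reporter gave us
        lowered = rest.lower()
        if want_from in lowered and want_to in lowered:
            hits.append(line)
    return hits
```

Without the time you would have to scan every log file, and without the addresses you would have nothing to match on, which is the point of asking for both.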

So what kind of errors mess up your mail?

  • Sometimes mail disappears into the void – although if you know which SMTP server last handled it, the void is less formless – especially if that server is under your administration.
  • Sometimes that server cannot deliver it because of blacklisting, greylisting or load.  Knowing the last SMTP server and the time can identify the problem.  Often enough, the error says what the problem was.
  • The weirdest failure is people who will not take mail from you because they cannot verify the sender address, which in turn fails because their system is misbehaving (e.g. it doesn’t speak SMTP properly, like exim’s verify_sender callout).
  • The same mail is received over and over: at the bottom of the pile you find that either it was sent over and over (not often though), or it was forwarded in some kind of loop, or the end-user system is downloading it over and over.  When you have the sender and recipient addresses, you can verify that there is only one of these mails in the end-user mailbox, and that it is marked as already-read.
  • DNS failures: the worst DNS failures are the ones where someone thinks they have two DNS servers, but they have just one overloaded virtual machine with two IP addresses and a congested and contended network interface.  When you have the sender and recipient addresses, you can test the DNS configuration for each of them.
  • Sometimes the system administrator deletes your mail.  Yep.  If you’re sending spam, you can expect that.  If there’s a large system failure (e.g. one server out of 17 fails for a week after building up a large backlog), then working through the backlog of mail can be impossible, because the sender and the recipients only have so much CPU time and network bandwidth available.  Late delivery generates its own problems and queries in any case.  It’s one of those “if it’s important they will phone again” moments.  Anti-spam systems discriminate against old mail as well – if it spent its time sitting in a mail queue somewhere, it may be because it overwhelmed that system’s capacity by its sheer spamminess (spamosity?).  When you have the last SMTP server you can properly assign blame.  If you are to blame, you can confess.
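For the "same mail over and over" case, one way to check whether the mailbox really holds several copies is to count Message-ID headers. The sketch below uses Python's standard `email` package and assumes you can get the raw messages as strings (how you fetch them from the mailbox is not shown); the function name `duplicate_message_ids` is made up for this example.

```python
from collections import Counter
from email import message_from_string
from typing import Dict, Iterable


def duplicate_message_ids(raw_messages: Iterable[str]) -> Dict[str, int]:
    """Map each Message-ID seen more than once to the number of copies."""
    counts: Counter = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        message_id = (msg.get("Message-ID") or "").strip()
        if message_id:  # messages without a Message-ID are skipped
            counts[message_id] += 1
    return {mid: n for mid, n in counts.items() if n > 1}
```

If this comes back empty while the user still sees repeats, the mailbox has one copy and the end-user system is probably downloading it repeatedly; if it reports copies, look for a forwarding loop or a sender that really did resend.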

Additionally, since these are the correct questions for almost any mail problem, you can reject all mail problem reports that do not include these details.  ☺  (Don’t try this at home, folks.)
