I have a fantasy. OK, so I live a lot of my life in software-development and software-use land, so it’s a kind of prosaic fantasy. But bear with me: here goes anyway.
One day, my fantasy goes, an email will arrive in my Inbox from the vendor of some piece of software I’m using (Intuit, for the sake of example) which will go something like this:
Our monitoring systems have detected that on 20th January 2015, you received an error message “Error 407: Unable to update bank transactions. Please try later or contact support.” We have now analysed the cause of this error and are glad to tell you that a fix was deployed in last night’s release.
We trust that this fix has been effective, but if the error should recur, please contact our developers at email@example.com quoting incident no. 123456789.
The Intuit Development team
Sadly, when I’ve woken up, reality is very different. What actually happens is this:
- Intuit certainly don’t proactively look at error messages they generate for me and deal with them on my behalf. What actually happens is that I phone the support line; when I’ve negotiated their IVR system, I get put through to an agent whose first reaction to all problems is to ask me to clear cookies and try again.
- Once it’s been verified that my error is unaffected by cookies (no surprises there), I get asked to uninstall and re-install as much of the system as possible.
- Once that’s failed, we’re into “it’s all terribly difficult, isn’t it: maybe you can try again tomorrow” territory.
- I then receive a survey asking me the now-ubiquitous “Net Promoter” question (the one that begins “on a scale of 0 to 10, would you recommend…”), followed by an email about the latest upgrade, which contains some delightful new feature set I didn’t ask for.
By the way, I’m not singling out Intuit here: their support line is actually one of the better ones I deal with. But the general tenor of the experience is common to most technology vendors that I’ve either worked in or whose products I’ve used: software houses prioritise cool new features over the simple business of eliminating errors.
What’s particularly striking is how bad software developers are at dealing with intermittent faults: if you can’t replicate the problem to order, that’s pretty much end of story in terms of getting anyone to take it seriously.
In my view, *any* error message is a bad thing. If it's the result of a software bug, there should be zero tolerance. If it's the result of user error, I should be asking "how could I have designed the interface better, so that the user would have been less likely to make that mistake?" Eventually, of course, a law of diminishing returns kicks in. But the vast majority of software, I would argue, is a country mile from the point where a straightforward analysis of how often error messages occur, and what most frequently causes them, would no longer yield a significant improvement in user experience.
And here’s an important thing: technically, it’s not all that difficult to keep logs with enough diagnostic information to enable a developer to find out what went wrong, even for the intermittent stuff. It comes down to a choice: do you or don’t you make the effort to log the data, and then make it someone’s job to look through the logs and find the root causes? The software companies that make engine management or process control systems keep this kind of log data as a matter of course: it’s completely understood that some particular vibration pattern might only happen once in a long test run, that testers can’t predict when it will happen, and that analysis needs to be done after the event.
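As a sketch of what I mean (the function and field names here are my own invention, not anyone's actual system), capturing enough context to diagnose a one-off error later is only a few lines of standard Python logging:

```python
import json
import logging
import traceback

# Illustrative sketch only: log each error as one JSON record with
# enough context (user, action, exception, stack trace) to diagnose
# it after the event -- even if it never recurs.
logger = logging.getLogger("app.errors")
logging.basicConfig(level=logging.ERROR)

def format_error_record(user_id, action, exc):
    """Build (and log) a one-line JSON record for a caught exception."""
    record = json.dumps({
        "user": user_id,                  # who hit the error
        "action": action,                 # what they were trying to do
        "error": repr(exc),               # the exception itself
        "trace": traceback.format_exc(),  # full stack trace
    })
    logger.error(record)
    return record
```

One JSON record per error means the logs can be grepped, counted and aggregated afterwards, which is exactly the after-the-event analysis the engine-management people take for granted.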
As well as the technology being there to keep and analyse logs, storage is now becoming so cheap that it’s possible to take logs in a lot more detail. The toughest issue, these days, is ensuring the privacy of all this log data – which is tricky, but not insurmountable.
So here’s my plea to all you providers of software and software-based systems:
- Analyse your incidence of error messages, and gather a metric along the lines of “number of errors per user per hour of usage”. Allocate more resources to reducing this metric than you do to providing the latest cool features.
- Adopt a zero-tolerance approach to bugs, including intermittent ones. Get rid of the “if you can’t replicate a bug, it doesn’t really exist” mentality, and replace it by “if a bug happens even once, we want to find out why and kill it”.
- Invest in instrumentation so that your developers can review logs of one-off events in enough detail to fix them.
- And if you really want to delight me, make my own crash data personally identifiable (with my permission, of course) so that you can proactively tell me about the good things you’ve done for me.
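To make the first of these pleas concrete: once errors are kept as structured records, the metric and the ranked list of most frequent causes fall out in a few lines. A sketch, with record fields invented for the example:

```python
from collections import Counter

# Sketch only: given a list of logged error records (field names are
# made up for this example) and total hours of usage across all users,
# compute the "errors per user-hour" metric and rank the causes.
def error_stats(error_records, total_usage_hours):
    rate = len(error_records) / total_usage_hours
    top_causes = Counter(r["error"] for r in error_records).most_common()
    return rate, top_causes

errors = [
    {"user": "alice", "error": "407: unable to update bank transactions"},
    {"user": "bob", "error": "407: unable to update bank transactions"},
    {"user": "alice", "error": "500: internal error"},
]
rate, causes = error_stats(errors, total_usage_hours=150.0)
# 3 errors over 150 user-hours of usage; the 407 tops the causes list
```

The point isn't the code, which is trivial; it's that the metric only exists if someone decides to collect it.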
After writing this, I made a resolution to put my money (well, time) where my mouth is, so on Friday I looked through the error logs on Bachtrack’s web server. Sure enough, there was a recurring “page not found” error that had been logged over a hundred times in March. That’s not a lot in the grand scheme of things (we get 200,000 page views a month), but it only took an hour or so to find and fix. If I can keep doing that for a few hours each week, that adds up to a lot of people whose user experience is going to improve. None of them, by the way, called in to complain.
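For anyone wanting to try the same exercise, it's just a scan and a tally. Here's a sketch assuming logs in the common "combined" web-server format (not necessarily what Bachtrack's server actually produces):

```python
import re
from collections import Counter

# Sketch only: tally "page not found" (404) hits by URL from an access
# log in the common combined format. Your server's format may differ.
LOG_LINE = re.compile(r'"[A-Z]+ (?P<url>\S+) [^"]*" (?P<status>\d{3})')

def count_404s(lines):
    counts = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m and m.group("status") == "404":
            counts[m.group("url")] += 1
    return counts.most_common()  # most frequently missing URLs first

sample = [
    '1.2.3.4 - - [01/Mar/2015:10:00:00 +0000] "GET /old-page HTTP/1.1" 404 512',
    '1.2.3.4 - - [01/Mar/2015:10:01:00 +0000] "GET /home HTTP/1.1" 200 1024',
    '5.6.7.8 - - [01/Mar/2015:10:02:00 +0000] "GET /old-page HTTP/1.1" 404 512',
]
```

Run over a month of logs, the top of that list is exactly the kind of hundred-times-repeated error that takes an hour to fix.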
As software suppliers, let’s all take this stuff a lot more seriously. It really will help the world out there.