Software makes mistakes. So do users. So let’s deal with it.

I have a fantasy. OK, so I live a lot of my life in software-development and software-use land, so it’s a kind of prosaic fantasy. But bear with me: here goes anyway.

One day, my fantasy goes, an email will arrive in my Inbox from the vendor of some piece of software I’m using (Intuit, for the sake of example) which will go something like this:

Dear davidkarlin,

Our monitoring systems have detected that on 20th January 2015, you received an error message “Error 407: Unable to update bank transactions. Please try later or contact support.” We have now analysed the cause of this error and are glad to tell you that a fix was deployed in last night’s release.

We trust that this fix has been effective, but if the error should recur, please contact our developers at development@intuit.com quoting incident no. 123456789.

Regards

The Intuit Development team

Sadly, when I’ve woken up, reality is very different. What actually happens is this:

  1. Intuit certainly don’t proactively look at error messages they generate for me and deal with them on my behalf. What actually happens is that I phone the support line; when I’ve negotiated their IVR system, I get put through to an agent whose first reaction to all problems is to ask me to clear cookies and try again.
  2. Once it’s been verified that my error is unaffected by cookies (no surprises there), I get asked to uninstall and re-install as much of the system as possible.
  3. Once that’s failed, we’re into “it’s all terribly difficult, isn’t it: maybe you can try again tomorrow” territory.
  4. I then receive a survey asking me the now-ubiquitous “Net Promoter” question (the one that begins “on a scale of 0 to 10, would you recommend…”), followed by an email about the latest upgrade, which contains some delightful new feature set I didn’t ask for.

By the way, I’m not singling out Intuit here: their support line is actually one of the better ones I deal with. But the general tenor of the experience is common to most technology vendors I’ve either worked for or bought from: software houses prioritise cool new features over the simple business of eliminating errors.

What’s particularly striking is how bad software developers are at dealing with intermittent faults: if you can’t replicate the problem to order, that’s pretty much the end of the story in terms of getting anyone to take it seriously.

In my view, *any* error message is a bad thing. If it’s the result of a software bug, there should be zero tolerance. If it’s the result of user error, I should be thinking “how could I have designed the interface better so that the user would have been less likely to make that mistake?” Eventually, of course, the law of diminishing returns kicks in. But the vast majority of software, I would argue, is a country mile from that point: a straightforward analysis of how often error messages are generated, and of their most frequent causes, would still yield a significant improvement in user experience.

And here’s an important thing: technically, it’s not all that difficult to keep logs of enough diagnostic information to enable a developer to find out what went wrong, even for the intermittent stuff. It comes down to a matter of choice: do you or do you not make the effort to log the data and then make it someone’s job to look through the logs and find the root causes? The software companies who make engine management or process control systems keep this kind of log data as a matter of course: it’s completely understood that some particular vibration pattern might only happen once in a long test run, that testers can’t predict when it will happen and that analysis needs to be done after the event.
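To give a flavour of how little code this takes, here is a minimal sketch in Python of recording one structured log entry per error. Everything in it is my own illustration rather than anything a particular vendor does: the function name `record_incident`, the field names and the `context` dictionary are all assumptions.

```python
import json
import logging
import traceback
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("diagnostics")


def record_incident(exc, user_id, context):
    """Log one structured record for an exception and return it.

    All field names here are illustrative assumptions, not a standard.
    Storing user_id makes the record personally identifiable, so it
    should only be kept with the user's permission.
    """
    incident = {
        # Random reference that support staff and users could quote later.
        "incident_id": uuid.uuid4().hex,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "error_type": type(exc).__name__,
        "message": str(exc),
        # Full stack trace, so a one-off fault can be analysed after the event.
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
        # Whatever the caller knows about what the user was doing.
        "context": context,
    }
    # default=str keeps logging from failing on values JSON cannot encode.
    logger.error(json.dumps(incident, default=str))
    return incident
```

A caller would wrap the risky operation in `try`/`except` and pass the caught exception to `record_incident`, together with whatever it knows about the user's action at the time. The point of the incident reference is that a developer can find the full record later without having to reproduce the fault.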

As well as the technology being there to keep and analyse logs, storage is now becoming so cheap that it’s possible to take logs in a lot more detail. The toughest issue, these days, is ensuring the privacy of all this log data – which is tricky, but not insurmountable.

So here’s my plea to all you providers of software and software-based systems:

  1. Analyse your incidence of error messages, and gather a metric along the lines of “number of errors per user per hour of usage”. Allocate more resources to reducing this metric than you do to providing the latest cool features.
  2. Adopt a zero-tolerance approach to bugs, including intermittent ones. Get rid of the “if you can’t replicate a bug, it doesn’t really exist” mentality, and replace it with “if a bug happens even once, we want to find out why and kill it”.
  3. Invest in instrumentation so that your developers can review logs of one-off events in enough detail to fix them.
  4. And if you really want to delight me, make my own crash data personally identifiable (with my permission, of course) so that you can proactively tell me about the good things you’ve done for me.
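
The metric in the first item is simple arithmetic once the data is being collected. As a rough sketch, assuming you can extract a list of (user, hours of usage) sessions and a list naming the user behind each error message shown (both input shapes and both function names are my own invention for illustration):

```python
from collections import defaultdict


def errors_per_user_hour(sessions, errors):
    """Return {user_id: errors per hour of usage}.

    sessions: iterable of (user_id, hours_used) pairs (assumed input shape).
    errors:   iterable of user_id values, one per error message shown.
    Users with errors but no recorded usage are left out, since no rate
    can be computed for them.
    """
    hours = defaultdict(float)
    for user_id, hours_used in sessions:
        hours[user_id] += hours_used

    counts = defaultdict(int)
    for user_id in errors:
        counts[user_id] += 1

    return {u: counts.get(u, 0) / h for u, h in hours.items() if h > 0}


def overall_error_rate(sessions, errors):
    """Return total errors divided by total hours of usage across all users."""
    total_hours = sum(hours_used for _, hours_used in sessions)
    total_errors = sum(1 for _ in errors)
    return total_errors / total_hours if total_hours > 0 else 0.0
```

The per-user figure shows who is suffering most; the overall figure is the one to track from release to release.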

After writing this, I made a resolution to put my money (well, time) where my mouth is, so on Friday, I looked through the error logs on Bachtrack’s web server. Sure enough, there was a recurring “page not found” entry that appeared over a hundred times in March. That’s not a lot, in the grand scale of things (we get 200,000 page views a month), but it only took an hour or so to find and fix it. If I can keep doing that for a few hours each week, that adds up to a lot of people whose user experience is going to be improved. None of them, by the way, called in to complain.
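
For anyone who wants to try the same exercise, here is a sketch of the kind of script that finds the most frequent “page not found” paths. It assumes the access log is in the widely used Common Log Format (or the “combined” variant), which may not match your server; the function name `top_not_found` is hypothetical.

```python
import re
from collections import Counter

# Matches the quoted request and the status code that follows it, e.g.
#   "GET /some/page HTTP/1.1" 404 512
# This assumes Common Log Format; adjust the pattern for other formats.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) ')


def top_not_found(lines, limit=10):
    """Return the most frequently requested paths that produced a 404."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts.most_common(limit)
```

Passing an open log file to `top_not_found` gives a list of (path, count) pairs, most frequent first, which is usually enough to see where the broken link or missing redirect is.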

As software suppliers, let’s all take this stuff a lot more seriously. It really will help the world out there.

Three questions you should ask your cloud-based software provider

Back in the day, if you were a software company pitching to investors, the first questions they asked you were much the ones you might expect: your turnover, margins, how many customers you have and so on. Smarter investors asked about things like retention rates and cost of customer acquisition. Around 2005 or so, all that changed: the question at the top of the list became “What’s your SaaS strategy?” A couple of years later, that morphed into “What’s your Cloud strategy?”

A few years later, I run a business which is small (9 employees) but complex (multi-currency, multi-lingual, multi-country). And indeed, pretty much everything that isn’t on our own server is run in the cloud: I finally moved our accounting system from Intuit’s Quickbooks desktop to Quickbooks Online eighteen months ago.

The move to Online has resulted in some small wins. The main one is that I don’t have to run a Windows Virtual Machine any more (I run Macs because I develop software and the tools require a Unix-family operating system). And it’s occasionally useful to be able to get some of the accounts done at home in the evening. But the truth is that most of the product works very similarly and, broadly speaking, going cloud hasn’t affected things much either way.

Except that I’m now terrified. For three reasons.

What happens, it’s fair to ask, if I do something really stupid with a transaction – of the sort that can’t be reversed? I’m accident-prone, after all, like anyone else. On the desktop product, it was easy to deal with: I would simply have reverted to the previous night’s or previous month’s backup and re-input a bunch of transactions. On the online product, backup and restore isn’t an option that’s provided. This isn’t unique to Intuit, by the way – the norm seems to be that most cloud vendors simply don’t offer this.

Lest you think this is unlikely to happen, I can tell you that when you advance payroll a month, there’s a large warning saying “This cannot be undone”: any mistakes and you’re toast. And when I have needed to work around bugs or omissions in Quickbooks, their technical support people have recommended with gay abandon that I do things that affect transactions in now-closed periods (i.e. would potentially make my VAT return illegal).

The next question for your vendor concerns their attitude to bugs. Not “technical support issues,” not “stray transactions that can be corrected,” but bugs – the real thing, where the system isn’t working. Perhaps intermittently, and perhaps just on your database. In desktop days, you had the option to simply not upgrade. Or to roll back an upgrade if it all went pear-shaped. In cloud days, you don’t. You really, really want your vendor to be completely committed to doing whatever it takes to bring you back on-line and running. And the truth is, these vendors are not. A missing feature deep in the multi-currency handling of Quickbooks Online kept my ledgers out of balance for most of a year until someone clever in Intuit figured out a workaround. Problems with my online banking interface are approaching their second birthday: the software worked fine when I evaluated it; two months in, Intuit deployed a rewrite which broke it. And there is no sign of them showing any commitment to getting it fixed: they work on it for a bit, and then give up. Fortunately, it’s only a time waster rather than a complete showstopper: because remember, I don’t have data portability of any viable sort. I have no easy way of exporting my data such that I could rapidly start again with another vendor.

The scariest problem (albeit the least frequent) is what happens if you or a vendor messes up your login credentials. You can all imagine the situation: you try to log in one morning and you get told that one of your passwords is wrong, or the software asks you to re-authenticate using one of your “memorable phrases,” and your phrase turns out to be less memorable than you thought.

With one of my cloud service vendors, that’s just what happened: I got locked out of certain areas of my account, and the vendor refused point blank to take the required steps to re-authenticate me. I was unable to satisfy them with the data they required in their online form, most probably because I couldn’t remember the month and year in which I originally joined the service, around a decade earlier, or which of my many email addresses I used at the time – but I can’t be sure.

And no, this wasn’t a small, fly-by-night operator: this was Microsoft. I actually had to stop using my old account (which still exists, by the way: they are unable/unwilling to delete it) and open a new one. Now losing a Skype account wasn’t the end of the world. I shudder to think how I would deal with the situation if this happened to my accounting system, or web host, or Gmail.

And that, by the way, is without considering the possibility of criminal malice: although, thank goodness, I’ve never personally had my identity stolen, I’ve watched it happen to one of my employees (who had a common first name and whose surname was Smith, which didn’t help) and I can assure you that it was a truly horrific experience.

So before you dive into the Cloud, here are three questions you should ask:

  1. What strategy do you support for me to back up and restore my data? (And while we’re on the subject, if I wish to move my data to another provider, how is that supported?)
  2. If I hit a bug in my installation, what guarantees and timescales can you provide me that you will (a) provide a fix to get me up and running, and (b) fix the problem permanently?
  3. What, if any, data do you require me to hold to guarantee that, in the event of my being denied access to the system (whether because of identity theft or just my own forgetfulness), you will accept or replace my user credentials?

The chances are that the answers to these will be something along the lines of (1) you don’t need to back up your data because we guarantee you 99.999% uptime; (2) our technical support team is available to help you 24/7 but we don’t provide specific guarantees; and (3) we don’t publish security-sensitive information of this sort. If they are, and you’re a large organisation, you will need to write a set of large, ugly items into your corporate risk register.

Or, if you’re a small business, just lose some sleep.