A Nerdy View of the World

Lessons Learned

We had a code release yesterday at work that didn't go great. Luckily, we had the main new feature behind a feature flag, so we could test it before we turned it on globally for our users. Front End (the dark side) had a few issues - one being that they had the wrong production URL for our new API in both the web and mobile apps. The mobile app, of course, is worse, because it had to be resubmitted for approval.

On the backend side (the light side), a regular expression that should have worked for CORS didn't. It allowed traffic from www, but not from the root domain in prod. We finally gave up and put the hard-coded values we needed into the config.
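To illustrate how a CORS pattern can accept www and still reject the root domain, here is a minimal sketch. It is not our actual pattern or config: example.com, both regexes, and the is_allowed helper are made up for illustration, and the real cause may well have been something else.

```python
import re

# Hypothetical pattern: the "." after the subdomain label is mandatory,
# so the bare root domain can never match.
SUBDOMAIN_ONLY = re.compile(r"^https://[a-z0-9-]+\.example\.com$")

# Making the whole "label." group optional also admits the root domain.
SUBDOMAIN_OPTIONAL = re.compile(r"^https://([a-z0-9-]+\.)?example\.com$")

# The fallback described above: an explicit list of origins, no regex.
ALLOWED_ORIGINS = frozenset({
    "https://example.com",
    "https://www.example.com",
})


def is_allowed(origin: str) -> bool:
    """Return True only for origins that appear in the explicit list."""
    return origin in ALLOWED_ORIGINS


if __name__ == "__main__":
    for origin in ("https://www.example.com", "https://example.com"):
        print(
            origin,
            bool(SUBDOMAIN_ONLY.match(origin)),
            bool(SUBDOMAIN_OPTIONAL.match(origin)),
            is_allowed(origin),
        )
```

The explicit list is less flexible than a pattern, but every allowed origin is visible at a glance, which makes a missing one much easier to spot in review.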

We also had a communication issue between FE and BE. One of my backend devs had just completed a big project where we transitioned from an expensive manual translation service to a more reasonably priced automated one. The tool we needed to generate the translations wasn't ready until a few days ago, but the second it was, I cheerfully ran the translations and told the teams I had done so.

What I did not do was be precise in my language. I basically said, "Hey! I generated the translations!" with a big smiley face in Slack. I didn't ask QA or anybody else to verify them, and I didn't go verify them in the UI myself. I saw the translations in the appropriate database tables and called it good, figuring acceptance testing would make sure nothing had been missed.

Something HAD been missed. Our new API uses one of our core libraries, which contains entities and services used throughout our system. The translations are one of those "magic" things that happen in the background, and to be honest, I never paid that much attention to them. I had worked on tracking down a couple of translation glitches when I first started at the company, and hadn't really looked at them since.

It turned out I needed a piece on the API side to make the magic work. I was able to get it sorted in a couple of hours, but that was super embarrassing.

On the DevOps side, our new API had been locked down behind a VPN while we did our testing. Because devs have to be behind the VPN to access a lot of our stuff, we never noticed that it wasn't accessible to the general public. Oops.

So, lessons learned:

  1. Be more precise in language. Ask a specific person or team to help check something if you can't do it yourself. Don't assume they will do it automatically because they normally do.
  2. Verify a feature all the way through the system - not just for my team's little corner of things.
  3. Regular expressions are still evil. We couldn't have seen this glitch until the feature was released to prod (prod runs on the root domain with no subdomain, and every one of our staging and testing environments has a subdomain).
  4. Dev and Staging areas really should match prod exactly. If our QA or acceptance testing environments had no subdomains, we would have caught this earlier.

Where I work, we don't typically save up features for a big bang release. This was a rare occasion when we did so, and it was quite obvious that we just didn't have the procedures and habits in place to prevent some of this nonsense. All we can do now is fix the stuff that is broken, and then do something to protect against this kind of stuff happening again. In other words, Kaizen, y'all.