Announcing:

LessMoney Conference will be June 7th in Tampa! Register today and make us smile super big!

Steven Bristol is a Fucking Idiot

written by Steven on August 06, 2010

I used to be a smart man. I used to be able to sleep at night. No longer. Yesterday, the latter caught up with the former. Here is my story:

I was installing a trivial piece of software on the main LessAccounting production server. I installed it in /var, and after configuring a few other services to work with it I discovered that I needed to upgrade one service to work with the latest version of the thing I was installing. Yum would only upgrade to a minor version behind the one I needed and I didn't care enough to compile, so I just installed the previous version of this trivial software, which does work with the older version of the service. All of this is trivial, minor shit, so it was fine. I checked and everything was working.

So I decided to clean up the newer, unused software and ran 'sudo rm -fr /var/trivialxxx.x.x.x', enter. Now this is a small directory and should have finished immediately, so after two seconds I looked at the command and realized that, due to using tab to complete, my tiredness and my newfound lack of intelligence, the command I actually ran was 'sudo rm -fr /var trivialxxx.x.x.x'. FUUUUUUUCK ME!!!! Ctrl-c halted the devastation and sure enough LessAccounting.com was no longer serving pages. I looked into /var and most everything was there. I looked deeper into /var to see how many of LessAccounting's pieces were missing. Everything of LessAccounting's kept in /var was missing.
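For anyone who has not been bitten by this yet, the whole disaster is one space. Here is a minimal sketch of what the shell hands to a command in each case. The show_args helper is hypothetical, made up purely for illustration, and the directory name is the same placeholder used above.

```shell
# Hypothetical helper (not a real tool): prints each argument exactly as the
# shell would pass it to a command such as rm.
show_args() {
  printf '%d argument(s)\n' "$#"
  for arg in "$@"; do
    printf '  [%s]\n' "$arg"
  done
}

# What I meant to run: one argument, the small directory.
show_args /var/trivialxxx.x.x.x

# What I actually ran: two arguments, and the first one is all of /var.
show_args /var trivialxxx.x.x.x
```

Swap show_args for 'rm -fr' and the second form starts chewing through /var itself.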

A bit of panic ensued, but not much. There was no data there so my first thought was to redeploy the missing pieces and then get the rest from backup and everything would be fine. Except it wasn't. Webistrano/Capistrano would not connect to the server: an ssh problem. Now the panic really started. None of my guys were around for support. Fuck me. If I can't ssh in, what else is wrong? How can I fix it? If I lose my one terminal session, how do I get back in? I completely panicked.

I video'd with Allan to let him know. Our video session was disconnected and I couldn't connect, via the browser, to some ancillary pages on the server and I shit my pants: the server needs to be rebuilt. It turned out to be a network issue on my side and I video'd him back. He said "There's nothing I can do to help." I pleaded "Just hold my hand and listen to me yell for a few minutes until I calm down." He stayed on for a few minutes, while I went to get the missing stuff from backup. Since I only have one terminal, I had to just sit and wait the ten minutes to retrieve the 5GB of files I needed, and then the 13 minutes to untar/unzip the file. During that wait time I'm searching for a hint to the sshd issue.

I finally call Rich Cavanaugh for help. He calmly starts walking me through the diagnosis. I copy the missing files back to /var and LessAccounting comes to life. Thankfully, nothing else needed to be done. Total down time about 32 minutes. Back to ssh. Rich suggests tailing /var/log/messages (yes, /var/log was unaffected) and it's obvious that sshd needs /var/empty/ssh/etc/ so it can symlink to the current timezone file. Creating these directories fixes sshd and I can connect from another terminal. It's over. Everything is over. I'm still shaking.

Understand, this wasn't just carelessness. I am very aware of this type of mistake. I always remind my guys to be careful of this sort of thing when they're in production and I'm always cognizant and very careful myself. The takeaway for me is that I used to be a smart man who slept at night, but now I am a fucking idiot. A very tired fucking idiot.

Afterword

Remember that this all started because I had two directories: /var/trivialxxxxx.3.2.1 and /var/trivialxxxxx.2.5.1 and I wanted to delete the higher versioned dir. After everything was fixed, I noticed that my terrible rm command had removed the 2.5.1 version, but not the 3.2.1 version. So I had to install 2.5.1 AND still run 'sudo rm -fr /var/trivialxxxxx.3.2.1'. After typing the command, but before running it, I copied and pasted the command into Campfire so someone could double check it. After approval, I ran the command successfully. Once again: Fuck me.
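When nobody is around to double check, the same pause can be roughly approximated in the shell. The sketch below is an assumption of mine and not what I ran that day: careful_rm is a hypothetical function that refuses anything other than exactly one path, shows what it is about to delete, and only deletes after an explicit "yes".

```shell
# Hypothetical helper, for illustration only: a second look before rm -rf.
careful_rm() {
  # A stray space turns one path into two arguments, so refuse that outright.
  if [ "$#" -ne 1 ]; then
    echo "careful_rm: expected exactly 1 path, got $#" >&2
    return 1
  fi
  if [ ! -e "$1" ]; then
    echo "careful_rm: no such path: $1" >&2
    return 1
  fi
  # Show the target itself, not its contents, so /var would be obvious.
  ls -ld -- "$1"
  printf 'Delete this? Type yes to confirm: '
  read -r answer
  if [ "$answer" = "yes" ]; then
    rm -rf -- "$1"
  else
    echo "careful_rm: aborted" >&2
    return 1
  fi
}
```

With this, 'careful_rm /var trivialxxxxx.3.2.1' would stop at the argument count instead of deleting anything. Being a shell function, it cannot be run through sudo directly, so it would have to live in a root shell or be turned into a script; that part is left out of the sketch.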

Learn how LessEverything built their consultancy to over $1,000,000 annual revenue at LessMoney Conference, June 7th in Tampa, Florida. Each attendee will get early access to our upcoming ebook as well.

14 Comments

Rich Cavanaugh said on August 06, 2010

No worries, we all do stupid shit every once in a while. I’ve done this and DROP (DATABASE|TABLE) in the wrong place and worse over the years.

It’s how you respond to these situations that matters. 32 minutes down. Sounds like you handled it nicely and you’re transparent about it to boot. Great work.

Eric Anderson said on August 06, 2010

Been there before too. Feel your pain. Luckily you were able to get things back up so fast. Thanks for your transparency.

This is a big reason I want a setup like Heroku to work so bad. The idea of just focusing on my app and not on network admin is so appealing. If only Heroku wasn’t so unstable. It has gotten better over the years but even their own status page details constant failures and I can tell you as a customer of theirs they do not report all their outages on their status page. Anybody know of an alternative that has a similar workflow but better uptime?

Rick DeNatale said on August 06, 2010

Too much head butting with Obie in your past perhaps?

I always try to do ls with the same arguments before, then edit the command from history to change ‘ls’ to ‘rm -rf’ after I’ve confirmed what I’m going to delete, particularly when the path starts with ‘/’.

Teflon Ted said on August 06, 2010

reaching out to rich was a good idea. it’s far too common for developers to try to hide in their shell and try to fix it themselves before somebody notices, and in their panic they make things worse.

Dave said on August 06, 2010

I’ve updated a huge customer table and set every zipcode to be the same by missing a where clause. Oracle’s select * as of 15 minutes ago saved me. Lack of sleep was at fault there too.

Jade Robbins said on August 06, 2010

Shit happens man, good work on being clear and open with people about it.

Now time to figure out why you can’t sleep. . . .

paul said on August 06, 2010

Sounds like a round trip to hell;( U handled it very well, and in MHP does not make u a FI. We all make mistakes sometimes, you’ve handled it smartly and with cool. You did everything right, and put it out for us to learn from it. So here is my daily affirmation: Steven Bristol is a smart and honest guy, who keeps his cool when the sh hits the fan.

Douglas F Shearer said on August 06, 2010

Been there, done that.

Shortly afterwards I found myself researching devops tools like Puppet and Chef. That and testing all changes on a clone have probably saved me from myself a few times now.

Ruben said on August 07, 2010

Wow. I can imagine how you must have been freaking out. I can totally relate; that’s not a fun feeling. Still, recovery from that in 30 mins is awesome. Some people lose it and can’t think clearly enough when things get stressful, so nice job on the recovery.

Kai said on August 08, 2010

As someone checking out one of your software programs (LessTimeSpent), seeing this title of your post linked on your software page is a big turnoff. My initial impression of the software and web site was that it looked very professional, and then I scrolled down to see this post linked to. I don’t really care what you talk like in your personal life, but it’s a poor reflection on the software and the company.

Me said on August 08, 2010

Completely agree with the above post. My first impression was that this product looked very promising. The language in the post is completely unprofessional and really does make you look like an idiot. Looking elsewhere for business….

Jim Gay said on August 15, 2010

At least it’s back, thank God.

I once spent hours recovering a backup of our Directory Server after a power outage and only saw 100 records listed in our app. I panicked and started the backup all over again beginning at 1am only to find in the end that the app only displayed 100 of the thousands by default. I did the recovery a second time for no reason.

Yardboy said on August 19, 2010

Gack, made me clench just reading it. Testament to your good backup/DR procedures that you were only down 32 mins.

Steven M Bristol said on March 02, 2011

This is the first item that shows when I google my name, cool…


About Steven
Steven Bristol has written code for the past 20 years. He likes green vegetables and kittens, oh and butterflies too. He loves to throw ninja stars at his enemies.

LessEverything Copyright 2011 LessEverything.com