Blog server implosion – fix and cause

I got an email on Saturday morning:

“I’m getting a message when I try and “post draft and edit online”.  See pictures attached of the messages.”

Blog error

Uh oh. Nothing had changed in the config of the web server for months – and adding extra disk space to the server wouldn’t cause this.

I looked at the Apache error logs – nothing. I couldn’t see anything that would be causing this. Typically it’s a permissions or xml-rpc problem that’s kicking up a complaint in Windows Live Writer.

Other blogs on the same server were working perfectly; I could upload via xml-rpc as well. Very strange.

Eventually I tracked down an alert in /var/log/warn that was flagging ‘cannot read inode bitmap’ – whenever I tried to upload an image via xml-rpc. Even stranger. This really didn’t make any sense – but it looked like early signs of a corrupt root filesystem and being unable to write to temp.
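A check like this could be automated rather than found by digging. What follows is a minimal sketch in Python, not the method used here: it scans a log for the ‘cannot read inode bitmap’ message. The function names are hypothetical, and the only things taken from the incident are the message text and the /var/log/warn path; the rest of the log format is an assumption.

```python
import re

# Message text as quoted above; matched case-insensitively because the exact
# capitalisation in the kernel log is an assumption.
PATTERN = re.compile(r"cannot read inode bitmap", re.IGNORECASE)


def find_fs_errors(lines):
    """Return (line_number, text) pairs for lines mentioning the inode bitmap error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


def scan_log(path="/var/log/warn"):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_fs_errors(handle)
```

Run from cron and mailing any hits, something along these lines would flag filesystem trouble before a user does.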

I unmounted everything and tried to fsck the disk – and then a world of pain unfolded. The entire root filesystem seemed to be full of junk, even though it’s ext3 and should be pretty robust. I’ve no idea what caused it – but the end result was that most of /etc was toasted and there were some 10,000 entries in lost+found.

The upside is that the MySQL and web data are all on separate disks – so it was really easy to reconstruct the server. I had backups of my PHP, MySQL and Apache confs – as well as all the data. The only slog was updating the Apache/PHP/MySQL stack to the correct (current) versions for my uses.

What I learned:

  • backups are great – but separating the data from the OS is a real winner
  • back up the config files for the core apps
  • document the correct versions of core apps. Currently Apache 2.2.10, PHP 5.3.2 and MySQL 5.1.3 – these all work together without problems
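The config backup point can be sketched in a few lines of Python. This is an illustration under assumptions, not the backup setup used here: the function name is hypothetical, and the list of config paths is whatever applies on a given server.

```python
import tarfile
from pathlib import Path


def backup_configs(paths, archive):
    """Write every existing path into a gzip-compressed tar archive.

    Returns the paths that were actually added, so that missing files are
    noticed at backup time rather than at restore time.
    """
    added = []
    with tarfile.open(archive, "w:gz") as tar:
        for path in map(Path, paths):
            if path.exists():
                # Store without the leading slash so the archive unpacks
                # relative to wherever it is extracted.
                tar.add(str(path), arcname=str(path).lstrip("/"))
                added.append(str(path))
    return added
```

The archive should live on one of the data disks, not on the root filesystem, otherwise it goes down with the OS it is meant to rebuild.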

Total downtime – about eight hours. Real time spent fixing this – about three hours.

I also moved several of the blogs to WordPress 3.0 RC1 – it’s been really stable so far on the main blog. I also had to do a latin1 to utf8 conversion on one of the older blogs. Always painful – but a one-time hit. I need to add that to the change control/validation for the next round of big updates.
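For the record, the core of a latin1 to utf8 conversion is just a decode and re-encode of the bytes. Below is a minimal Python sketch of that step applied to a dump file, assuming the data really is single-encoded latin1; it is not the procedure used on the blog, and the function names are hypothetical. One caveat: MySQL’s latin1 is closer to Windows-1252 than to strict ISO-8859-1, so passing "cp1252" as the source codec may be the better fit for some dumps.

```python
def latin1_to_utf8(data: bytes, source_codec: str = "latin-1") -> bytes:
    """Decode bytes from the source codec and re-encode them as UTF-8."""
    return data.decode(source_codec).encode("utf-8")


def convert_file(src, dst, source_codec="latin-1"):
    """Convert a whole file, writing to a new path so the original stays untouched."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(latin1_to_utf8(fin.read(), source_codec))
```

Any charset declarations inside the dump, and the table and connection character sets, still have to be changed separately; this only handles the bytes.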