Select Page

Infrastructure – heavy lifting and planning

A trio of projects before the year-end – all interwined.

  • Migration of mail from Google Apps to Hosted Exchange.
  • Migration of DNS from current service provider to ‘someone new’
  • Migration of blog/photos to ‘somewhere in the cloud’

Moving the mail isn’t that hard – it’s just making sure that mail doesn’t get dropped while the new MX and CNAMEs are propagating. The old mail will live on in Google Apps – the new stuff in hosted Exchange. The trickier part is making sure that ‘my customers’ get the right service – and can keep getting mail in Outlook or the web. Users eh.

Moving the DNS is part of the mid-term strategy to change ISP. Covad have been great to me since I moved to the US; sadly they are starting to show signs of decay. I need to support additional DNS records than the A, CNAME and MX records – no plans from Covad.

The final push is to move the blog servers out of the ‘home data centre’ and to a reliable, faster provider.

The ultimate aim is to divorce myself from Covad and the Static IP business DSL that has worked so well – and move to something that is much faster – but maybe without the SLA on the line itself.

xCache – PHP caching, performance and stability

I’ve been testing out xCache for a while – primarily as a PHP accelerator.

Early results were really promising – reducing page load times dramatically; and also reducing CPU load as common pages (i.e. the latest blog post and photos) were fed directly from the cache.

There seems to be some kind of memory leak/cache clean up issue with xCache 1.3 – I allocate some amount of RAM for cache (16MB, 64MB, 256MB – it really doesn’t matter) and at some point Apache/PHP starts eating up RAM, then starting to swap – and finally the server grinds to a halt.

xCache is off for now – I’ll keep investigating.

Overnight line problems

Any ideas?

Twice this week all connectivity has been lost – upstream of the CPE (on premise router).

The first was from 2100 to 0800:

line-outage-4-5-oct-10

The next from 2130 to 0430:

line-outage-6-7-oct10

It looks like some kind of maintenance window from the Qwest who actually provision the line.

Blog server implosion – fix and cause

I got an email on Saturday morning:

“I’m getting a message when I try and “post draft and edit online”.  See pictures attached of the messages.”

Blog error

Uh oh. Nothing had changed in the config of the web server for months – and adding extra disk space to the server wouldn’t cause this.

I looked at the Apache error logs – nothing. I couldn’t see anything that would be causing this. Typically it’s a permissions or xml-rpc problem that’s kicking up a complaint in Windows Live Writer.

Other blogs on the same server were working perfectly; I could upload via xml-rpc as well. Very strange.

Eventually I tracked down an alert in /var/log/warn that was flagging ‘cannot read inode bitmap’ – whenever I tried to upload an image via xml-rpc. Even stranger. This really didn’t make any sense – but it looked like early signs of a corrupt root filesystem and being unable to write to temp.

I dismounted everything and tried to fsck the disk – and then the world of pain unraveled. The entire root filesystem seemed to have junk – it’s ext3 so should be pretty robust. I’ve no idea what caused it – but the end result was that most of /etc was toasted and there were some 10,000 entries in lost+found.

The upside is that the mysql and web data are all on seperate disks – so really easy to reconstruct the server. I had backups of my PHP, mysql and Apache confs – as well as all the data. The only slog was updating the Apache/PHP/MySQL stack to the correct (current) versions for my uses.

What I learned:

  • backups are great – but separating the data from the OS is a real winner
  • backup the config files for the core apps
  • document the correct versions of core apps. Currently Apache 2.2.10, PHP 5.3.2 and MySQL 5.1.3 – these all work together without problems

Total downtime – about eight hours. Real time spent fixing this – about three hours.

I also moved several of the blogs to WordPress 3.0 RC1 – it’s been really stable so far on the main blog. I also had to do a latin1 to utf8 conversion on one of the older blogs. Always painful – but a one time hit. I need to add that to the change control/validation for the next round of big updates.