
Infrastructure changes – Covad and GoDaddy frustrations

I’m most of the way through the infrastructure changes at the moment.

A recap:

  • Migration of mail from Google Apps to Hosted Exchange.
  • Migration of DNS from current service provider to ‘someone new’
  • Migration of blog/photos to ‘somewhere in the cloud’

Step one, the mail switch, was relatively painless. It needed some careful planning, but there was zero downtime.

  • sign up for Hosted Exchange (actually the Microsoft Exchange Labs Friends and Family program)
  • DNS changes (mainly CNAME work to prove ownership)
  • family education (the hard part)
  • DNS changes (MX records and webmail A and CNAME records)
  • reconfiguration of email clients (Outlook, phones, devices etc)

As I wrote a couple of months ago – the old mail lives on at Google Apps. Everything new is in Exchange.

Step two of the move was more complex and painful. I decided to change the DNS hosting and, at the same time, consolidate the various registrars that I’d used over the past decade. What should have been a week-long process of sign-up, DNS unlock, auth code request and move ended up taking most of my time.

I moved away from register.com and Network Solutions (resold by Covad). Getting the domains unlocked and the auth codes issued was a snap with register.com – they were efficient, friendly and knowledgeable – and it took about five days. Covad was a nightmare: five weeks in total, with multiple escalations. During that time Covad also managed to completely screw up the zones.

Step three is mostly complete too. There is only one blog site left to move, and the photos are uploading right now. This is where my frustration with GoDaddy comes in. They have pretty good (i.e. I get what I pay for) hosting and infrastructure – but some of the grid hosting limitations, and the associated responses from support, are really frustrating.

The GoDaddy issue is that they either cycle the grid hosts (so an ssh/scp session is terminated) or they kill long-running processes. With four photo blogs and some insane number of photos – a total of some 80GB of data to move – I had to get creative.

Firstly, copying the data via non-secure ftp wasn’t really my idea of fun. I started off with scp – but the remote host kept killing the connection. Next I tarred up the needed files – and the connection was still killed. The final working solution to get the pictures up to GoDaddy was the convoluted tar – md5sum – split – scp – cat – md5sum – untar. Moving 80GB in 200MB chunks with a retry script at my end was not fun.
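
A minimal Python sketch of the split, checksum and reassemble steps in that chain. This is an illustration rather than the script used for the move (which was the plain command-line tools): the function names, the `.partNNNNN` naming and the `retry` helper are all assumptions, and the scp transfer itself is left out.

```python
import hashlib
import time
from pathlib import Path

CHUNK_SIZE = 200 * 1024 * 1024  # 200MB, the chunk size mentioned in the post


def md5_of(path, block=1024 * 1024):
    """Return the hex MD5 of a file, reading in blocks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for data in iter(lambda: fh.read(block), b""):
            digest.update(data)
    return digest.hexdigest()


def split_file(src, out_dir, chunk_size=CHUNK_SIZE):
    """Split src into numbered parts (much like `split`) and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(src, "rb") as fh:
        index = 0
        while True:
            data = fh.read(chunk_size)
            if not data:
                break
            # Zero-padded index so that name order equals chunk order.
            part = out_dir / f"{Path(src).name}.part{index:05d}"
            part.write_bytes(data)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, dest):
    """Concatenate the parts in name order (much like `cat part* > dest`)."""
    with open(dest, "wb") as out:
        for part in sorted(parts):
            out.write(Path(part).read_bytes())


def retry(action, attempts=5, delay=0.0):
    """Call action() until it succeeds or the attempts run out.

    Hypothetical helper: in practice action would wrap the upload of one chunk
    and raise OSError when the connection is dropped.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(delay)
```

Comparing `md5_of` on the original tarball with `md5_of` on the reassembled file at the far end is what shows that nothing was lost or reordered on the way, which matters when any single chunk may have been retried several times.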

The next issue was actually untarring these enormous tarballs. The first site unpacked just fine; the second kept being interrupted – i.e. tar was getting killed. There is no ‘nice’ on the server – so no way to fly below the radar. Turns out there is a process time limit of something like 180 seconds. This means that the practical limit to untar is about 13GB in size. My frustration with GoDaddy support was that they kept telling me to use ftp and that there was a limit of 100MB for tar. I spoke to GoDaddy support right at the start of this process and offered to PAY to ship a USB drive with the 80GB of tarballs to an admin to dump onto my space. I’d say there’s a value add here for GoDaddy.
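
As a rough sanity check on those numbers, the arithmetic can be sketched in a few lines of Python. The 180-second limit and the 13GB ceiling are the approximate figures above; treating 1GB as 1024MB and the helper names are assumptions made for illustration.

```python
import math

TIME_LIMIT_S = 180   # approximate process time limit
MAX_UNTAR_GB = 13    # approximate largest tarball that unpacks within that limit
TOTAL_GB = 80        # total data to move


def implied_throughput_mb_s(size_gb, seconds):
    """Average unpack rate implied by finishing size_gb within seconds."""
    return size_gb * 1024 / seconds


def archives_needed(total_gb, max_gb):
    """Smallest number of tarballs that each stay at or under max_gb."""
    return math.ceil(total_gb / max_gb)


if __name__ == "__main__":
    print(round(implied_throughput_mb_s(MAX_UNTAR_GB, TIME_LIMIT_S)), "MB/s")
    print(archives_needed(TOTAL_GB, MAX_UNTAR_GB), "tarballs")
```

On those figures the server unpacks at roughly 74MB/s, and 80GB needs at least seven tarballs if each one has to finish inside the time limit.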

Lessons learned:

Change control and planning are king. See previous posts. Nothing went wrong – but there were things that could have been smoother. What I guessed would take a few weeks turned into a two-month project.

Test with real-world datasets. Migrating a test blog with 200 photos isn’t a valid test.

First-line support people often repost from the knowledgebase. A limit of 100MB for tar is unrealistic. Tell people it’s a time-related kill rather than a size issue. We can figure it out and work around it.

Uploading mysql dumps to GoDaddy

Strictly a console guy – I’ve been struggling to get the big blog database dumps up to the new hosting. phpMyAdmin claims to support zipped dumps – but that doesn’t work. There are also timeouts in the console for the upload and import.

I finally fixed it by using scp to move the uncompressed dump to the hosting server, and then using the Hosting Control Center to restore the dump as if it were a backup.

It’s running right now – so hopefully I’ll have happy blogs again soon.

Infrastructure – heavy lifting and planning

A trio of projects before the year-end – all intertwined.

  • Migration of mail from Google Apps to Hosted Exchange.
  • Migration of DNS from current service provider to ‘someone new’
  • Migration of blog/photos to ‘somewhere in the cloud’

Moving the mail isn’t that hard – it’s just making sure that mail doesn’t get dropped while the new MX and CNAMEs are propagating. The old mail will live on in Google Apps – the new stuff in hosted Exchange. The trickier part is making sure that ‘my customers’ get the right service – and can keep getting mail in Outlook or the web. Users eh.

Moving the DNS is part of the mid-term strategy to change ISP. Covad have been great to me since I moved to the US; sadly they are starting to show signs of decay. I need to support more DNS record types than A, CNAME and MX, and Covad has no plans to offer them.

The final push is to move the blog servers out of the ‘home data centre’ and to a reliable, faster provider.

The ultimate aim is to divorce myself from Covad and the Static IP business DSL that has worked so well – and move to something that is much faster – but maybe without the SLA on the line itself.

xCache – PHP caching, performance and stability

I’ve been testing out xCache for a while – primarily as a PHP accelerator.

Early results were really promising – reducing page load times dramatically; and also reducing CPU load as common pages (i.e. the latest blog post and photos) were fed directly from the cache.

There seems to be some kind of memory leak/cache clean-up issue with xCache 1.3 – I allocate some amount of RAM for cache (16MB, 64MB, 256MB – it really doesn’t matter) and at some point Apache/PHP starts eating up RAM, then starts to swap – and finally the server grinds to a halt.

xCache is off for now – I’ll keep investigating.

Overnight line problems

Any ideas?

Twice this week all connectivity has been lost – upstream of the CPE (the on-premises router).

The first was from 2100 to 0800:

[Image: line-outage-4-5-oct-10]

The next from 2130 to 0430:

[Image: line-outage-6-7-oct10]

It looks like some kind of maintenance window from Qwest, who actually provision the line.

Blog server implosion – fix and cause

I got an email on Saturday morning:

“I’m getting a message when I try and “post draft and edit online”.  See pictures attached of the messages.”

[Image: Blog error]

Uh oh. Nothing had changed in the config of the web server for months – and adding extra disk space to the server wouldn’t cause this.

I looked at the Apache error logs – nothing. I couldn’t see anything that would be causing this. Typically it’s a permissions or xml-rpc problem that’s kicking up a complaint in Windows Live Writer.

Other blogs on the same server were working perfectly; I could upload via xml-rpc as well. Very strange.

Eventually I tracked down an alert in /var/log/warn that was flagging ‘cannot read inode bitmap’ – whenever I tried to upload an image via xml-rpc. Even stranger. This really didn’t make any sense – but it looked like early signs of a corrupt root filesystem and being unable to write to temp.

I unmounted everything and tried to fsck the disk – and then the world of pain unraveled. The entire root filesystem seemed to be full of junk – it’s ext3, so it should be pretty robust. I’ve no idea what caused it – but the end result was that most of /etc was toasted and there were some 10,000 entries in lost+found.

The upside is that the mysql and web data are all on separate disks – so it was really easy to reconstruct the server. I had backups of my PHP, mysql and Apache confs – as well as all the data. The only slog was updating the Apache/PHP/MySQL stack to the correct (current) versions for my uses.

What I learned:

  • backups are great – but separating the data from the OS is a real winner
  • back up the config files for the core apps
  • document the correct versions of core apps. Currently Apache 2.2.10, PHP 5.3.2 and MySQL 5.1.3 – these all work together without problems

Total downtime – about eight hours. Real time spent fixing this – about three hours.

I also moved several of the blogs to WordPress 3.0 RC1 – it’s been really stable so far on the main blog. I also had to do a latin1 to utf8 conversion on one of the older blogs. Always painful – but a one-time hit. I need to add that to the change control/validation for the next round of big updates.
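
For reference, one common form of the latin1/utf8 problem is text whose UTF-8 bytes were at some point read as latin1, so that an accented character shows up as two odd-looking ones. The Python sketch below reverses that for a single string. It is an assumption about the kind of damage involved, not a record of how this blog was converted, and `repair_mojibake` is a made-up name.

```python
def repair_mojibake(text, source_encoding="latin-1"):
    """Undo UTF-8 bytes that were decoded as source_encoding.

    Returns the text unchanged when the round trip does not apply, for
    example when the string was already correct.
    """
    try:
        return text.encode(source_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

MySQL’s latin1 character set is actually Windows cp1252, so passing `source_encoding="cp1252"` is often the better choice for mangled curly quotes. Either way, try it on a copy of the data first.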