It’s been a while since the last site update – and given the shenanigans that happened last week, I suppose one is deserved. However, I’ve been extremely busy trying to catch up on putting out content, so unfortunately, this post didn’t make it out before the end of the weekend.
A Downtime Out of Nowhere
On February 19, 2023 at 02:27 AM AEDT, I received a ping from my uptime monitor that told me the site was down. As I was in bed, and given the time, I had assumed it was ordinary server maintenance that sometimes occurs on such hosting services and went back to sleep.
Waking up at 6am that Sunday, I noticed the site was still down. Pinging the server was fine, but I couldn’t get any response from the HTTP/HTTPS server at all – all it did was time out. I tried the host’s own webpage … no dice. Uh oh. Usually the host keeps their own site up in preference to their customers’ sites – their own site is important as that’s their shopfront and also their ticketing system. But with that down as well, we were all in the dark.
Over the next hour, I scoured around and managed to find the host’s personal mobile number online – the only contact I could find. I gave that a ring at about 6:50am and no answer. I waited another 10 minutes and gave them another ring at 7am – an obviously groggy individual answered the line and I told them simply – “Everything’s down and timing out. Even your own site. Ticketing is down.” They asked for my name and number to get back to me and I said simply – “This is big, no need to contact me back specifically as I’ll be busy throughout the day – fix it for everyone and that will make everyone happy.”
By now, it’s 11am and I notice the site still isn’t up. Surely something could have been done about most issues within this timeframe, so something must have gone really wrong. Out of concern and a desire to learn an ETA on restoration, I phone up again to see what’s going on. I was told that I was the first to report the outage and that their own monitoring did not spot it because the server still responded to pings. The cause was a failed SSD RAID array that could not be repaired and they decided to just cut their losses and start fresh with a hardware replacement, reload of operating system and restore of backups from yesterday (to avoid any potential backup corruption). But to do this, they had to first contact their “multinational” upstream custom VPS provider to have someone at the Sydney datacentre work on their computer, with a full post-mortem to come later. I was assured that once the reload finishes, we should be back up.
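The gap in their monitoring is easy to illustrate: a ping only proves the machine’s network stack is alive, not that the web server is actually serving pages. Below is a minimal sketch of an application-level check, written in Python using only the standard library. The function name `http_is_up`, the 10-second timeout and the “anything below 500 counts as up” rule are my own assumptions for illustration – I have no idea what the host’s monitoring actually runs.

```python
import socket
import urllib.error
import urllib.request


def http_is_up(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True only if the web server answers with a non-5xx status in time.

    `opener` is injectable purely so the check can be tested without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # The server did answer, just with an error status (e.g. 404 or 503).
        return err.code < 500
    except (urllib.error.URLError, socket.timeout, OSError):
        # Refused connection, DNS failure or timeout: the service is down,
        # even if the machine itself still replies to pings.
        return False
```

Calling something like `http_is_up("https://example.com/")` every minute or so, alongside the usual ping, would have flagged a “pings fine, HTTP times out” failure such as this one.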
At this point I felt rather sorry for the bloke that I phoned up – it’s a Sunday morning and he surely didn’t need this stress. Devices fail, sometimes out of the blue. I’ve experienced an SSD RAID10 go down pretty much without warning, so it’s not like it’s impossible. But then again, this is the life of a sysadmin and service provider.
But things were not so rosy – by about 5pm that evening, their own front page came back up but my site was still down. Trying to get into cPanel was fruitless, and their own CRM was down too since the ionCube PHP Loader was somehow broken. It seems the reload had not gone to plan, but they were trying.
By now, it was Monday and I was busy at work – no time to hassle the staff! I noticed that their own site had gone down again, suggesting to me that they were having another shot at the problem. Without fanfare, full service appeared to be back, based on monitoring, at 12:42:21 PM AEDT on 20 February 2023, but with minor instability for a few hours afterward. This time, my site came up before theirs! Total downtime was about 1 day and 10 hours, instantly wiping out the host’s claimed “99.9% uptime guarantee”. Thankfully for them, this advertised “guarantee” was not backed by any SLA and is not codified in their terms, or else I’d be knocking at their doors for some compensation.
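To put the “99.9%” claim in perspective, here is a quick back-of-the-envelope calculation in Python using the two monitoring timestamps above. The assumptions are mine: that the guarantee would be measured over a 365-day year (the host’s advertising doesn’t say), and that the rest of the year sees no further outages.

```python
from datetime import datetime, timedelta

# Timestamps from my uptime monitor. Both are AEDT, so no timezone
# conversion is needed to take the difference.
down_at = datetime(2023, 2, 19, 2, 27, 0)
up_at = datetime(2023, 2, 20, 12, 42, 21)
downtime = up_at - down_at


def allowed_downtime(availability, period):
    """Downtime budget implied by an availability target over a period."""
    return period * (1 - availability)


year = timedelta(days=365)               # assumed measurement window
budget = allowed_downtime(0.999, year)   # what "99.9%" would permit
best_case = 1 - downtime / year          # if nothing else fails this year

print(f"Downtime:  {downtime.total_seconds() / 3600:.2f} hours")
print(f"Budget:    {budget.total_seconds() / 3600:.2f} hours per year")
print(f"Best case: {best_case:.3%} availability for the year")
```

On those assumptions, the outage comes to roughly 34.3 hours against a yearly budget of about 8.76 hours – nearly four times over – leaving a best-case figure of about 99.61% for the year.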
I did phone them up later just to follow up on a few things, and they revealed that their upstream VPS provider had supplied a supported OS image that was somehow broken and resulted in an installation that couldn’t be configured to their needs. Even getting the provider’s staff involved for over six hours of troubleshooting got them nowhere; instead, the provider gave them another image which thankfully worked. That misadventure cost a whole day – but a full block-backup was not a silver bullet either.
Unfortunately, there was no broader communication from the company to its customers about what happened – I was only able to glean such valuable insight thanks to phoning up the personal mobile of someone obviously very senior in the company. In the process, we also managed to deduce another issue – Telstra’s routing to the server was very sub-optimal, bouncing my traffic from Sydney to Singapore, to France, to London, to USA, to Tokyo, then back to Sydney for a 400+ms RTT.
This has been flagged for resolution and the route has now been “pruned” a little – Sydney to Singapore, to Tokyo, then back to Sydney for about 130ms RTT. It seems it’s all because Telstra refuses to peer with the providers connected to the hosting service used. Other providers (Optus, TPG/Vodafone, AARnet) that I’ve tested either peer directly or go over an interexchange, making them very fast (sub-20ms RTT). In the meantime, some minor configuration differences seem to exist in this new set-up – I guess I’ll need to lodge a ticket to get them to fix that.
In the end, it was the longest downtime in the history of the site. I’m not proud of it, but there wasn’t much I could do about it without throwing money at another host, frantically uploading all the data from a backup and trying to reconfigure all my services. I was prepared to do it if it seemed they wouldn’t be back anytime soon. But it just underscores the importance of having a disaster recovery plan (DRP) and making sure it works … and that includes users keeping their own private backups even when a host has daily backups. I’m glad they were able to bring me back online losing only a day of drafts (which really wasn’t anything major). Having to suddenly upload a 6GB chunk to restore my site while living on LTE would be a bit more of an inconvenience …
It has me wondering whether my experience was somehow prophetic – perhaps SSD reliability is going downhill, even in the enterprise space.
You may have noticed that I’ve not named the host in question. You may be able to deduce it from previous postings, but I’ve decided not to name them because I feel they did a decent job, all things considered, for the price I paid, and I’m afraid the host may be sensitive to such “reviews” – I’ve seen other sites get “evicted” from their hosts for daring to criticise. Needless to say, I’ll be happy to stick with these guys based on their experience and willingness to discuss the internals with me, as they are a smaller company. Such transparency would be hard to obtain from a larger host.
Hard Disk Corner & Optical Disc Corner Updates
It’s been a while, so I thought that as part of some data migration and refreshing, I would refresh the Hard Disk Corner with some new drives. It’s probably not something that would enthrall everyone, but I brought another 27 drives to the collection as part of a 2023 refresh. This was a bit of a marathon effort to shuffle data and test – they were done one-by-one just in case multiple drive operation caused interference in the form of vibration or bus contention.
- Fujitsu MHV2120BH PL (2.5″ 120GB 2006)
- Hitachi HDP725025GLA380 Deskstar (3.5″ 250GB 2009)
- Hitachi HDS5C3030BLE630 / Toshiba DT01ABA300 (3.5″ 3TB 2012)
- Hitachi HDS723030BLE640 / Toshiba DT01ACA300 (3.5″ 3TB 2012)
- Samsung HD204UI SpinPoint F4EG (3.5″ 2TB 2010)
- Samsung HD502IJ SpinPoint F1 (3.5″ 500GB 2009)
- Samsung MP0804H Spinpoint M40 (2.5″ 80GB 2005)
- Seagate ST1000LM024 HN-M101MBB Momentus (2.5″ 1TB 2014)
- Seagate ST2000DL003-9VT166 Barracuda Green (3.5″ 2TB 2011)
- Seagate ST32000542AS Barracuda LP (3.5″ 2TB 2011)
- Seagate ST380011A Barracuda 7200.7 (3.5″ 80GB 2004)
- Seagate ST4000DM000-1F2168 Barracuda LP (3.5″ 4TB 2014)
- Seagate ST5000DM000-1FK178 Desktop HDD (3.5″ 5TB 2015)
- Seagate ST500DM002-1BD142 Barracuda (3.5″ 500GB 2015)
- Seagate ST500LM021-1KJ152 Laptop Thin HDD (2.5″ 500GB 2017)
- Seagate ST9160821AS Momentus 5400.3 (2.5″ 160GB 2007)
- Seagate ST9200420AS Momentus 7200.2 (2.5″ 200GB 2009)
- WD WD10EACS-00ZJB0 Caviar SE16 GreenPower (3.5″ 1TB 2007)
- WD WD10EURX-63UY4Y0 GreenPower AV (3.5″ 1TB 2015)
- WD WD2002FAEX-007BA0 Caviar Black (3.5″ 2TB 2013)
- WD WD20EARX-00PASB0 Caviar Green (3.5″ 2TB 2012)
- WD WD30EZRX-00MMMB0 Caviar Green (3.5″ 3TB 2011)
- WD WD40EZRX-00SPEB0 Green (3.5″ 4TB 2014)
- WD WD5000AACS-00ZUB0 Caviar GP GreenPower (3.5″ 500GB 2008)
- WD WD5000AAKX-603CA0 Caviar Blue (3.5″ 500GB 2011)
- WD WD60EZRX-00MVLB1 Green (3.5″ 6TB 2015)
- WD WD7500BPVT-80HXZT3 Scorpio Blue (2.5″ 750GB 2012)
Of course, there are more drives that I still own but have not been able to test, but I am quite surprised just how many drives I’ve handled in my relatively “recent” past, considering I don’t work for a computer shop and don’t build computers all the time. That being said … I have really laid off buying hard drives. The last purchase was back in 2019 when I bought a 10TB Western Digital Elements external drive. Perhaps my drive appetite will thin out.
I’ve also had a chance encounter with two additions to the Optical Disc Corner in the form of two Imation DVD retail discs that I salvaged from being disposed of. No big deal by comparison to the hard disk updates but an update nonetheless.
Conclusion
There’s always lots of content to put up … but the time needed is enormous. The downtime really put a roadblock up last week, so now I’m playing catch-up, but things should well get back to normal as soon as I have the time. In the interim, work has been absurdly busy and there are reviews and tests happening in the background. I’m also part of a “Save the Bees” design challenge over at element14 which will take some more time away from me. As usual, I’ll be posting whenever I can, but chances are that it will be erratic.
In the meantime, sorry for the downtime, but unexpected things just happen and this was definitely not foreseen. It’s not something to be happy or proud about, but I’m glad that I didn’t have to do anything to have it come back up. I do feel a bit sorry for the crew having been given this stress on a Sunday morning … but I suppose that’s also part of the sysadmin life.






































