My server just crashed totally with Leaseweb - corrupted files - Updates hourly / daily
Posted by nookienoq, 10-23-2016, 02:16 AM |
Hi,
I have a server with Leaseweb and yesterday it crashed. I want to share my experience of how this is handled by Leaseweb, whether they are going to save my data, and what is going on..
Daily updates to come.
So, my biggest fear is here. My dedicated server crashed yesterday.
I was just surfing the web when suddenly Apache threw an error - "Cannot write to disk - filesystem read only".
I tried to access files and could not open any file on the server.
All websites threw errors that the MySQL database could not be accessed.
I'm currently in Australia but my server is in Germany, so bear the time differences in mind.
My server configuration:
Brand and model: HP DL380e Gen8 (8 LFF)
Processor: 2x Intel Hexa-Core Xeon E5-2420 [6 core(s)]
RAM: 16 GB
HDD: 4x 2 TB SATA
KVM: No
Hardware RAID: Yes
RAID level: RAID 1+0
Here is what has happened so far.
22/10/2016 - 14:10 Trying to access the portal; requesting a password reset.
22/10/2016 - 14:20 No password reset email received.
22/10/2016 - 14:21 Requesting a second password reset.
22/10/2016 - 14:30 Still no password reset email received.
22/10/2016 - 14:32 Sending an email to support asking them to reset my Leaseweb portal password.
22/10/2016 - 14:42 Email received to reset password.
22/10/2016 - 15:03 Rebooting the server
22/10/2016 - 15:25 Server is still not coming up
22/10/2016 - 15:30 Rebooting the server into rescue mode
22/10/2016 - 15:47 Creating ticket with Leaseweb that server is not coming up not even in rescue mode
22/10/2016 - 16:00 No reply yet from Leaseweb - calling support (I must say the support person was very helpful and he was trying to find out what was going on. He was going to call me back once he had more information).
22/10/2016 - 16:14 Getting a ticket reply.
22/10/2016 - 16:35 I'm getting the call from Leaseweb, and the information I'm receiving is:
Munin (software) has somehow damaged my boot files for Debian, so I have to reinstall the operating system. All my files have been mounted in rescue mode so I can take a backup of them.
Once I have taken the backup, I should call Leaseweb back and notify them to replace the faulty disk so we can do a clean install of the operating system.
This is not what I was hoping for, having to reinstall and configure EVERYTHING, but I was happy to at least have my websites saved... So I thought.
22/10/2016 - 17:00 -> 23/10/2016 - 03:00
I'm trying to take a backup of my files on the mounted drives, but there are A LOT of corrupted files.
Example of the errors I'm receiving: tar: c1_notes/project_members.MYD: Read error at byte 0, while reading 4096 bytes: Input/output error
I have been trying to save everything all night, but with so many files corrupted I don't really know what is missing and what is not.
A lot of databases in /var/lib/mysql are corrupted as well.
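For anyone in the same boat: one way to copy whatever is still readable off the rescue mount while keeping a list of the files that fail is something like this (/mnt/sda5 is where rescue mode mounted my data; the backup host name is just a placeholder):
# rsync skips files it cannot read, reports them, and carries on with the rest
rsync -aH --log-file=/root/salvage.log /mnt/sda5/ root@backup-host:/salvage/sda5/
# pull the I/O errors out of the log so you at least know which files are damaged
grep -i "input/output error" /root/salvage.log > /root/damaged-files.txt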
23/10/2016 - 03:00
The person who installed and configured my server comes online, and here is his feedback on the problem:
Hi,
I found some disturbing things about the situation we have now.
We have a RAID 1+0 plan, which means we have 4 drives of 2 TB each.
RAID 1+0 means they are connected in two groups:
- Group 1 is two drives (each 2 TB) and they are mirrors of each other.
When one fails, the other should be OK to be used for restore
operations (as is being done now with sda5)
- Group 2 does the same -> 2 drives of 2 TB each which mirror each other
The two groups are connected in a way that lets them work faster, but they
share all the data between each other, which means one file can be (and
usually is) partially in the first group and partially in the second....
http://cdn.ttgtmedia.com/rms/editori...ge_raid_10.png
All this is done by the hardware, and the software running on top of it will
never know it is running on RAID, so their explanation about monit, munin etc.
is very incorrect...
Now, in order to have a failure, both drives of one group have to fail....
And when that happens you lose most of the files you have...
I dunno how this happened at all with our plan... But what they have
done now is to put only one of the 2 TB drives (from group 1) in RAID 1+0
mode (you can get the info from the command line using:
hpacucli ctrl slot=2 pd all show detail
)
That drive's SN is MN1220F30ULTJD.
That means we only have (approximately) half of the data that was
written on the total 4 TB space... i.e. sda5 shows it is 4 TB but it is
actually only 2 TB...
Having in mind all of the above, any backup we take will be
(and is) incomplete, with missing (or garbled) files, no matter how
lucky we get with the files that happened to land entirely on this same pair
of disks (in RAID 1+0, files are written to stripes divided between the two
groups of physical disks).
This also means we cannot run fsck to fix errors, because it will delete
all the missing parts of files (and whole files)...
So I dunno what to do... Probably use what can be used from those
backups you made (but there is a big chance some of the files have
garbage inside them - like for example /mnt/sda5/var/log/syslog.1) and
start from a clean system...
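(For anyone reading along in a similar state: a minimal sketch of what one could run from rescue mode to see the controller's view of the disks and to gauge filesystem damage without writing anything. The slot number is taken from the technician's command above; everything else is a generic assumption.)
# physical and logical drive status as the Smart Array controller sees it
hpacucli ctrl slot=2 pd all show status
hpacucli ctrl slot=2 ld all show status
# fsck -n answers "no" to every repair prompt, so it only reports problems;
# run it against the unmounted partition to see how bad things are without changing anything
umount /mnt/sda5
fsck -n /dev/sda5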
23/10/2016 - 10:54 Sending the technician's reply to Leaseweb.
23/10/2016 - 16:00 Still no reply from Leaseweb; calling Leaseweb to tell them that they need to get my data back since I have a lot of corrupted data.
Leaseweb feedback: a technician in Germany is currently working on this.
23/10/2016 - 10:00 - Trying to install the recovered data in VirtualBox.
Installing LAMP to test the recovered data... approximately 70% of it works.
But the DB is corrupted in some cases. Restoring the data properly doesn't work.
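Since the tar error earlier was on a .MYD file, these are MyISAM tables, so a repair attempt in the test VM looks roughly like this (the data directory is the one from the thread, the database name comes from the tar error, and badly damaged tables may still be beyond saving):
# stop MySQL so myisamchk has exclusive access to the table files
service mysql stop
# rebuild indexes and recover whatever rows are still readable
myisamchk --recover --force /var/lib/mysql/c1_notes/*.MYI
service mysql start
# or, with the server running, let MySQL check and repair every table it can
mysqlcheck --all-databases --check --auto-repair -u root -p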
So this is my case..
I truly hope Leaseweb can do something about this and that I will get my data back.
Honestly, I bought this server with so much storage and RAID just to avoid these things. Now, I'm no expert in RAID whatsoever, but I thought that if one hard drive crashes you just replace it and everything picks up where it was, since the other hard drives hold the correct data...
Story continues... I will keep you guys posted on what is going on, my experience with Leaseweb, and whether I will get this sorted out.
|
Posted by madRoosterTony, 10-23-2016, 02:27 AM |
This is where I hope you have a secondary backup of the files. RAID is not a backup solution, and yes, as you have learned, if one disk fails it can cause corruption across all disks in the RAID. This is a very common problem with RAID and happens more often than most people like to admit, because everyone thinks that RAID is the savior for everything.
At this point, you would have been better off going 2x 2 TB in RAID 1 and then using the other two 2 TB drives for secondary backups of files and databases.
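Even something as simple as a nightly cron job covering the databases and web root would do; a rough sketch, assuming the second pair were mounted at /backup and the sites lived in /var/www (both assumptions):
#!/bin/sh
# /etc/cron.daily/local-backup - dump all databases and mirror the web root to the second drive pair
mysqldump --all-databases -u root -pPASSWORD | gzip > /backup/mysql/all-$(date +%F).sql.gz
rsync -a --delete /var/www/ /backup/www/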
|
Posted by SkylakeDC, 10-23-2016, 04:30 AM |
Sorry to hear about your issue. Never use RAID as your backup system. You should keep a 3rd-party backup solution alongside the server.
|
Posted by aeris, 10-23-2016, 05:13 AM |
Multiple disk failures do happen; RAID is no guarantee of data integrity.
You are responsible for keeping backups. The host has no obligation to restore your data, only to restore the server to a bootable state - which in this case would likely involve an OS reinstall. You would then have to restore the data from the external backup.
|
Posted by net, 10-23-2016, 05:44 AM |
Hardware does fail no matter what... even RAID.
So, if your data is important to you, you should keep a remote backup in the first place......
The best you can do is not touch the current server: just get a new server, then copy over whatever data you can still grab from the old one.
|
Posted by alanwoo, 10-23-2016, 10:52 AM |
Even if you have a 3rd-party backup, if the data is important you will need to regularly restore it and verify that the backed-up data is valid.
There is "silent data corruption" that can happen in a hard disk or RAID disk array; that is why enterprise disk arrays always come with "data checksum" and "data scrubbing" features, and regularly check each data block against its checksum to ensure the stored data is not corrupted.
|
Posted by AndriusPetkus, 10-24-2016, 01:57 AM |
Just for information - what is your SLA level with Leaseweb?
|
Posted by SenseiSteve, 10-24-2016, 01:53 PM |
Whatever SLA level the OP has with Leaseweb, it'll still only compensate him with service credits, which in no way compares to the data he's lost. To the OP: I certainly hope Leaseweb can help you sort all of this out.
|
Posted by AndriusPetkus, 10-24-2016, 02:40 PM |
Really?
Paying nothing but expecting platinum support... It is not going to happen.
Basic – 24x7x24 (incl.)
Response time: 24 hrs – Free phone 24/7 support – Hardware replacement in 24 hrs
Bronze – 24x7x4 (€ 29.00)
Response time: 4 hrs – Free phone 24/7 support – Hardware replacement in 4 hrs - 30 mins per month of Advanced Support
Silver – 24x7x2 (€ 49.00)
Response time: 2 hrs – Free phone 24/7 support – Hardware replacement in 3 hrs - 60 mins per month of Advanced Support
Gold – 24x7x1 (€ 79.00)
Response time: 1 hr – Free phone 24/7 support – Hardware replacement in 2 hrs - 90 mins per month of Advanced Support
Platinum – 24x7x½ (€ 119.00)
Response time: 30 mins – Free phone 24/7 support – Hardware replacement in 2 hrs - 120 mins per month of Advanced Support
|
Posted by SenseiSteve, 10-24-2016, 02:49 PM |
I agree with you there, but I had already read through all of the SLA levels. My point was that no matter what your level is, the only compensation offered if it's not met is a service credit. It's always best to have your own remote backups, as RAID is not a disaster recovery solution.
Last edited by SenseiSteve; 10-24-2016 at 02:55 PM.
|
Posted by SneakySysadmin, 10-24-2016, 03:03 PM |
I cut out almost all of your post because it essentially documents your ignorance of how servers work.
RAID is not a substitute for having frequent, up-to-date backups.
Correct. Now answer this question: when did the first drive in your RAID die? Yesterday? Or six weeks ago? You don't know, do you? For all you know, you've been limping along on a degraded RAID for weeks or months and the second drive finally gave up.
I had a colo customer who just today discovered the joys of multi-disk failure. His server did not come back up after he rebooted it (remotely), so he requested remote hands. His machine hadn't rebooted because it was stuck at the RAID controller's prompt saying "Hey, guess what? You've lost drives!" and two drives (12-bay chassis) were blinking red, faulted.
I will bet money he wasn't monitoring his RAID status, had lost one drive weeks or months ago, and lost the other one during the reboot, turning his server into a very large, very expensive paperweight with blinky lights on it.
The lesson to take away from this is:
1. BACK UP YOUR SERVER. Fully and frequently. RAID is not a substitute for backups.
2. Store those backups off site.
3. MONITOR your RAID status religiously. If a drive fails and you don't know about it because you aren't monitoring it, why have RAID in the first place? (A rough example of what that can look like is below.)
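On an HP Smart Array box like the OP's, it can be as small as a cron job wrapped around hpacucli; a rough sketch (the slot number, schedule and mail address are assumptions, a working local mailer is assumed as well, and software RAID users would query mdadm instead):
#!/bin/sh
# /etc/cron.hourly/raid-check - mail the admin if any physical drive stops reporting OK
STATUS=$(hpacucli ctrl slot=2 pd all show status)
if echo "$STATUS" | grep -i physicaldrive | grep -qv "OK$"; then
    echo "$STATUS" | mail -s "RAID problem on $(hostname)" admin@example.com
fi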
|