Sunday, November 01, 2009

Crazy weekend

So, what I was going to do this weekend was re-work how our git branches are organized, creating a testing branch called "too", and work on getting the new Funtoo/OpenRC network scripts finalized.

What I *actually* did was try to reload one of my servers -- and ran into lots of problems. This process actually started in the middle of last week, sucking up my after-work hours. I wanted to reload one of my Nehalem servers, and for some reason the thing just wouldn't boot. I tried rolling back firmware. I tried using known-good kernels. I tried using an exact replica of my *working*, identical Nehalem server, so the two boxes were set up the same way, and the thing still wouldn't boot. Whatever I tried, it complained about an unknown block device and that it was unable to mount the root filesystem. But the strange thing was -- sysrescuecd could access the disk just fine. And when I created a minimal initrd with "bash" and "mount" on it, I was able to mount the root filesystem with no extra modules required.

I found the solution around 2:30 AM last night - for some reason, the kernel's auto-detection of the filesystem type on the partition was consistently failing. If I pass a "rootfstype=ext3" kernel boot option, everything works. But if I leave my "root=/dev/sda3" boot option to fend for itself, the thing won't boot. I've never run into this behavior before.

But thinking about it, the "rootfstype" option makes a lot of sense. I don't want to rely on my kernel auto-detecting the type of my filesystem by itself - especially now that I've seen it fail so badly. Apparently some funky data on your block devices can cause this auto-detection to bite it pretty hard.
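
To make the fix concrete, here is a rough sketch in Python of what it amounts to: take the kernel command line and make sure it names the root filesystem type explicitly. The helper name ensure_rootfstype is made up for illustration and is not part of GRUB or any Funtoo tool; in practice you would just add the option to the kernel line in your boot loader config by hand.

```python
def ensure_rootfstype(cmdline: str, fstype: str = "ext3") -> str:
    """Return cmdline with rootfstype=<fstype> appended, unless one is already set.

    Hypothetical helper for illustration only; it just edits a string and
    does not touch any boot loader configuration.
    """
    params = cmdline.split()
    # Leave an existing rootfstype= setting alone rather than overriding it.
    if any(p.startswith("rootfstype=") for p in params):
        return " ".join(params)
    return " ".join(params + [f"rootfstype={fstype}"])


# The boot options from this post: root= alone, then with the explicit type.
print(ensure_rootfstype("root=/dev/sda3"))
# prints: root=/dev/sda3 rootfstype=ext3
```

With the type spelled out like this, the kernel does not have to guess what is on the partition.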

All that to say -- I had a crazy weekend. 20 hours of troubleshooting.

But there are some good things that are going to come out of it - I am going to fork the Gentoo GRUB package, which needs quite a bit of love. It's time to get GPT partitioning documented properly and grub-1.97 supported officially, as it is in Ubuntu. So expect to see some things related to that soon.

4 comments:

jsn said...

I suppose it's "File Systems" / "Advanced partition selection" checkbox in kernel config. Is it checked in yours? It screwed up fs type autodetect more than once for me.

Daniel Robbins said...

Yes, it's on. I need it for GPT partition table support. It's turned on on my other machine, and it's working fine over there without rootfstype=.

Casidiablo said...

Hi Daniel... I'd like to know why you don't have grub-9999 in Funtoo's Portage tree. Are you adding it in the near future, or is there a problem with it?

Thank you so much.

Daniel Robbins said...

Casidiablo, this is because we have a much-enhanced GRUB 2 userland - see http://www.funtoo.org/en/projects/grub/ - because of our patches, I am not offering an SVN version (too risky.)

I will be posting about the new GRUB soon.