Discussion:
FreeBSD 13.2-STABLE cannot boot from a damaged mirror AND the pool is stuck in "resilver" state even without new devices.
Lev Serebryakov
2024-01-05 17:28:55 UTC
Permalink
Hello!

I have a (remote) physical server with 2 SATA disks. These disks were partitioned with GPT into "freebsd-boot" (ada{0|1}p1, the legacy one, not EFI), "freebsd-swap" (ada{0|1}p2) and "freebsd-zfs" (ada{0|1}p3).
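
(For reference, a layout like that is typically created with gpart along these lines; the sizes and alignment below are illustrative assumptions, not the exact values used on this server:)

  gpart create -s gpt ada0
  gpart add -t freebsd-boot -s 512k -a 4k ada0   # ada0p1, legacy gptzfsboot stage
  gpart add -t freebsd-swap -s 8g -a 1m ada0     # ada0p2, swap
  gpart add -t freebsd-zfs -a 1m ada0            # ada0p3, rest of the disk for ZFS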

Both disks were 512/512 (this looks important).

I have only one ZFS pool, "zroot", a mirror of "ada0p3" and "ada1p3".

I have a very fresh "gptzfsboot" on both "ada0p1" and "ada1p1".
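
(That boot code would normally have been written with something like the following, using the stock FreeBSD paths:)

  gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
  gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1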

Now, ada0 failed. It was replaced by DC support with a new disk, which is 512/4096.

After that my server fails to boot; gptzfsboot from the second disk (ada1) reports several "zio_read error: 5" and

ZFS: i/o error - all block copies unavailable
ZFS: can't read MOS of pool zroot

after that.

I've booted into rescue Linux (unfortunately, there is NO rescue FreeBSD at Hetzner anymore), and Linux could import the (degraded) pool without a problem. But Linux has problems detecting the pool on a partition, so I didn't do anything under Linux.
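
(The import under the Linux rescue system was presumably along these lines; a sketch, the /mnt altroot is an assumption:)

  zpool import                   # list pools the rescue system can see
  zpool import -f -R /mnt zroot  # force-import the degraded pool under an alternate root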

I've checked the "live" disk under Linux, though: it reads, SMART is clean, everything is OK.

I've booted FreeBSD 13.2 from the installation ISO under qemu with the physical devices as disks. Then I partitioned the fresh HDD and started the disk replacement in the mirror. It worked, but the resilver was unbearably slow. I stopped the VM with FreeBSD to continue the process after a normal boot.
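
(The replacement step was roughly of this form; a sketch, assuming the new partition ends up at the same ada0p3 path:)

  zpool replace zroot ada0p3     # or: zpool replace zroot <old-vdev-guid> ada0p3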

NO LUCK. "zio_read error: 5", boot failed.

Then I've overwritten ada0 (the new disk) with the FreeBSD memstick IMG and booted it - it can import the pool from ada1p3 but, of course, the resilver is stopped.
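
(Writing the memstick image was something along these lines; the exact image file name here is an assumption:)

  dd if=FreeBSD-13.2-RELEASE-amd64-memstick.img of=/dev/ada0 bs=1m conv=sync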

I've removed all faulted components, effectively converting the mirror to a "simple" device. But "zpool status" shows that a resilver is still in progress!
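
(Detaching the faulted half of the mirror is done roughly like this; the device argument is illustrative, in practice it may have to be the GUID shown by "zpool status -g":)

  zpool status -g zroot          # show vdev GUIDs
  zpool detach zroot ada0p3      # or: zpool detach zroot <guid-of-faulted-vdev>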

And "gptzfsboot" still CAN NOT read this ZFS pool and find loader!

Ok, I've converted the swap partition to UFS and boot from UFS now. It works. It can use the pool as root. But the pool is still "resilvering".
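
(Turning the old swap partition into a small bootable UFS filesystem goes roughly like this; a sketch assuming it is ada1's swap partition, with the stock gptboot path:)

  gpart modify -i 2 -t freebsd-ufs ada1                     # retype the old swap partition
  newfs -U /dev/ada1p2                                      # create a UFS filesystem on it
  gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 ada1   # UFS-aware GPT boot code
  mount /dev/ada1p2 /mnt && cp -a /boot /mnt/               # populate /boot on the UFS partition
  # plus vfs.root.mountfrom="zfs:zroot" in the UFS copy of /boot/loader.conf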


Now I have a very strange situation:

(1) I have a ZFS pool with 1 device, which says:

% zpool status -v zroot
  pool: zroot
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jan 5 19:24:07 2024
        750G scanned at 472B/s, 40.5G issued at 25B/s, 974G total
        0B resilvered, 4.16% done, no estimated completion time
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          ada1p3    ONLINE       0     0     0

errors: No known data errors
%

(2) gptzfsboot from this very system version cannot read this pool or boot from it
(3) the kernel can use this pool as the source of the root (and all other) filesystems.
--
// Lev Serebryakov


Warner Losh
2024-01-07 18:34:06 UTC
Permalink
Post by Lev Serebryakov
ZFS: i/o error - all block copies unavailable
ZFS: can't read MOS of pool zroot
after that.
I've re-created pool from scratch
zpool create znewroot ada0p3 && zfs send zroot | zfs receive znewroot
&& zpool destroy zroot && zpool attach znewroot ada0p3 ada1p3
but gptzfsboot still can not boot from it with same diagnostics :-(
I must have missed it. What were the diagnostics?
How large are the disks in question?
2TB
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <HGST HUS726020ALE610 APGNTD05> ACS-2 ATA SATA 3.x device
ada0: Serial Number K5HPZZLD
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 1907729MB (3907029168 512 byte sectors)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD2000FYYZ-01UL1B1 01.01K02> ATA8-ACS SATA 3.x device
ada1: Serial Number WD-WMC1P0504169
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1907729MB (3907029168 512 byte sectors)
< 4294967296 sectors should be good. So these drives shouldn't see this
problem. The BIOS interfaces should have no trouble here.
As far as I could find on the internet, it is caused by the boot code (the later
stage, which is in a file in the /boot directory) being moved too far from the
beginning of the disk, so that some old BIOSes cannot allow the system to continue
booting.
Oh, that is a good hypothesis. It is a Haswell-era MSI board (an old Hetzner
EX40 instance)...
Yes. If the drives are > 2TB you lose. BIOS is not for you... Unless you
make special partitions that are in the first 2TB of the drive and only
boot off of those. Also, if the drives are 4k, you likely lose, though it's
hit or miss. Those are the hard limits of the BIOS ABI.
It can also be avoided if your machine supports EFI boot, but my HP
Microserver Gen 8 does not support it.
I'll try to switch to EFI, but it takes some luck to get into the BIOS with the
provided KVM; it is very unstable :-)
BIOS booting is dying. It will be unsupportable in not too many more years
and the code removed. The rapid proliferation of ZFS crypto and compression
types is hastening the race to see who can use up the most space in the
boot loader. We can do marginal things to make it better wrt the 640k
limit, sure, but then we hit other limits like the 2TB address space, like
not being able to reliably support 4k drives, etc. BIOS booting likely will
support an increasingly small subset of all possible booting methods as we
go forward. The current crazy mix of different alternative firmwares makes
it hard to know what will survive, but as we hit these limitations, it will
make it harder and harder to configure, deploy and manage these systems.

The Linux on ZFS root pages, btw, recommend having two pools on two
partitions on the disk. One that's a few GB, the boot pool, which has the
kernel in it, and the other, the rest of the disk, which is rpool for the
root pool. If people want to continue to support BIOS booting (or rather,
booting using the CSM interfaces), then somebody is going to need to step
up to the plate and implement a similar option in bsdinstall, bectl,
freebsd-update, etc.
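
(For concreteness, the two-pool idea translated into FreeBSD terms would look roughly like this; the pool names, partition indices and feature choices below are assumptions, not an existing installer option:)

  # small boot pool, features held down to what the boot code understands
  zpool create -d -o feature@lz4_compress=enabled -O compression=lz4 bpool ada0p2
  # the rest of the disk becomes the root pool, free to use any feature
  zpool create -O compression=lz4 rpool ada0p3

That way the CSM/BIOS path only ever has to read the small, simple pool near the start of the disk.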

Warner
Warner Losh
2024-01-07 21:06:26 UTC
Permalink
Post by Warner Losh
< 4294967296 sectors should be good. So these drives shouldn't see this
problem. the BIOS interfaces should have no trouble here.
[...]
Post by Warner Losh
Yes. If the drives are > 2TB you lose. BIOS is not for you... Unless
you make special partitions that are in the first 2TB of the drive and
only boot off of those. Also, if the drives are 4k, you likely lose,
though it's hit or miss. Those are the hard limits of the BIOS ABI.
It is not always that simple a calculation. As I wrote in my previous reply, my
pool was unbootable on one machine but boots fine on the other. Both
were Intel-based amd64 with BIOS, not EFI. I think there are some buggy
BIOSes where it cannot boot even on pools smaller than 2TB. (Or maybe
some improved BIOSes supporting larger boundaries than 2TB? I don't know
at what exact position the bootloader / kernel was on my 4TB pool.)
OK. If the problem is that int13 has only 32 bits in the ABI, the math is that simple.
The limit is 2^32 blocks, and there's no reliable provision for 4k sector sizes
(there are some BIOSes that will do it, others that won't... it's a bit muddled
looking at the problem reports, though we do try to support that). There's no
BIOS64 implementation that extends the int13 interfaces to do wider block sizes
that I've seen... It's just that it's so close that it's easy to gravitate to a
known issue...

If other weird things are happening, then that means that we may have a type
problem that's truncating the logical block number (which the BIOS doesn't care
about) to 32 bits (or maybe only sometimes), which then leads to weird things
happening. But... UEFI should suffer this same problem and we should hear about
it a lot, I'd think (though maybe how gptzfsboot is compiled might be the culprit,
since that's the only thing that's confined to the gpt boot blocks that's not
common binary code (we #include the implementation to make two different binary
things...)). It shouldn't care that the copy of /boot/loader is past the 2TB
logical limit, because the drives are smaller than 2TB and so none of their
LBAs will be > 2^32 and should all work. If that's indeed the issue, then there's
something weird about how we build it for gptzfsloader.

The other thing it could be, though, is that if there's a resilver in progress,
there's some subtle state that's confusing the simple reimplementation of ZFS
reading that's in the boot loader. Though I'd expect to have heard about that
before now, especially since this would hit UEFI booting as well.

Warner
Lev Serebryakov
2024-01-07 20:49:24 UTC
Permalink
Post by Warner Losh
I must have missed it. What were the diagnostics?
zio_read error: 5
zio_read error: 5
zio_read error: 5
ZFS: i/o error - all block copies unavailable
ZFS: can't read MOS of pool zroot


To be honest, I think there is something else, because the sequence of events was as follows (sorry, too long, but I think that every detail matters here):

(1) Update to 13.2 from 12.4. With installation of new gptzfsboot with gpart on both disks. It could place new /boot far away, but see (2)
(2) Reboot, which completed, but showed that ada0 has problems
(3) Replacement of ada0 by DC technicians, new disk is 512/4096, old disk is 512/512, pool has ashift=9
(4) Server refuses to boot from ada1 (ada0 is empty) with diagnostics (see above)
(5) Linux rescue system, passing 2 devices to qemu with FreeBSD (because Linux shows that ZFS is on whole disk, not on partition!).
(6) Re-creation of GPT on ada0, start of resilver (with sub-optimal ashift!).
(7) Interruption of resilver with reboot, because it is painfully slow under qemu.
(8) Wipe of ada0 (at this point the resilver status of the pool becomes crazy) to put a live FreeBSD image on it to boot somehow.
(9) Many tries to cancel the resilver and boot from the single-disk "historical" pool on ada1, no success. I've attributed it to the strange state of the pool: one component, no mirror, but "resilvering".
(10) Boot from small UFS partition (which replaces swap partition).
(11) The pool on ada1 (the old, live, 512/512 disk) is still "resilvering" without any additional components (with zero speed, of course).
(12) Prepare partitions on ada0 again, creating a new pool with ashift=12, then send|receive (see the sketch after this list).
(13) Removing the partition on ada1 (the old one, ashift=9, still resilvering after many, many reboots with only one device in it).
(14) Boot from the fresh ada0 pool - same errors from gptzfsboot, fail, and gptzfsboot reports the OLD pool name (which should not be available, as the GPT on ada1 was wiped out!!!!)
(15) Boot from UFS again.
(16) Adding the partition of ada1 as the second component of the new pool, resilvering successful.
(17) Boot with gptzfsboot still fails! With a brand-new ashift=12 pool! Now the bootloader reports the new pool name, but still fails to boot.
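
(Step 12, sketched out; the snapshot name is an assumption, znewroot is the pool name used earlier in the thread:)

  zpool create -o ashift=12 znewroot ada0p3
  zfs snapshot -r zroot@migrate
  zfs send -R zroot@migrate | zfs receive -F znewroot
  zpool set bootfs=znewroot/ROOT/default znewroot   # assuming the usual boot-environment layout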

You see, the buildworld update could have placed /boot too far away. But there was one last successful boot between (1) and (3)! And the state of the pool on the live disk ada1 was very strange: I could not cancel the resilver no matter what I tried until I zapped the GPT and started over.
Post by Warner Losh
If people want to continue to support BIOS booting (or rather, booting using the CSM interfaces), then somebody is going to need to step up to the plate and implement a similar option in bsdinstall, bectl, freebsd-update, etc.
I can use UEFI boot without problems, but I'm not sure whether it will work for me now.
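
(Switching this layout to UEFI would roughly mean adding an ESP and copying the EFI loader onto it; the partition index and size below are assumptions:)

  gpart add -t efi -s 260m ada0              # assumes free space for an ESP
  newfs_msdos -F 32 /dev/ada0p4
  mount -t msdosfs /dev/ada0p4 /mnt
  mkdir -p /mnt/efi/boot
  cp /boot/loader.efi /mnt/efi/boot/bootx64.efi
  umount /mnt
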
--
// Lev Serebryakov



Warner Losh
2024-01-07 21:15:14 UTC
Permalink
Post by Lev Serebryakov
Post by Warner Losh
I must have missed it. What were the diagnostics?
Oh, and two "nvlist inconsistency" before that vvvv
Post by Lev Serebryakov
zio_read error: 5
zio_read error: 5
zio_read error: 5
5 is EIO, which the loader uses internally for any error that the disk reports.
I've not read through all the code involved here, but I think that means there
might be read errors for real.

Though the nvlist inconsistency might be an issue.

So, if this is a mirror, then with ada0 blank and ada1 holding good data, in theory
you should be fine. However, perhaps ZFS is finding that there's an error from
ada1 for real. Does all of ada1 read with a simple dd?
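
(A whole-disk read test of that kind would be something like:)

  dd if=/dev/ada1 of=/dev/null bs=1m conv=noerror   # read everything, keep going past errors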

Not sure about the losing devices you described later on.
Post by Lev Serebryakov
ZFS: i/o error - all block copies unavailable
ZFS: can't read MOS of pool zroot
To be honest, I think there is something else.
Yea. There's something that's failing, which zio_read is woefully under-reporting
for our diagnostic efforts. And/or something is getting confused by the blank disk
and/or the partially resilvered disk.
Post by Lev Serebryakov
(1) Update to 13.2 from 12.4. With installation of new gptzfsboot with gpart on both disks. It could place new /boot far away, but see (2)
(2) Reboot, which completed, but showed that ada0 has problems
(3) Replacement of ada0 by DC technicians, new disk is 512/4096, old disk is 512/512, pool has ashift=9
(4) Server refuses to boot from ada1 (ada0 is empty) with diagnostics (see above)
(5) Linux rescue system, passing 2 devices to qemu with FreeBSD (because Linux shows that ZFS is on whole disk, not on partition!).
(6) Re-creation of GPT on ada0, start of resilver (with sub-optimal ashift!).
(7) Interruption of resilver with reboot, because it is painfully slow under qemu.
(8) Wipe of ada0 (at this point resilver status of pool becomes crazy) to put live FreeBSD image to boot somehow.
(9) Many tries to cancel resilver and boot from single-disk "historical" pool on ada1, no success: one component, no mirror, but "resilvering".
(10) Boot from small UFS partition (which replaces swap partition).
(11) Pool on ada1 (old, live, 512/512 disk) is still "resilvering" without any additional components (with zero speed, of course).
(12) Prepare partitions on ada0 again, creating new pool with ashift=12, send|receive.
(13) Removing partition table on ada1 (with old pool, ashift=9, still resilvering after many, many reboots with only one device in it).
And please note: this pool on ada1 (the old, live disk) was NOT upgraded after 12-STABLE. It was an old, 12-STABLE "level" pool with all new features disabled.
Yea, this isn't *THAT* *OTHER* problem :).

Warner
--
// Lev Serebryakov