Discussion:
improving nfs client & server performance
void
2024-10-21 13:46:38 UTC
I'm looking to try to improve nfs client & server performance.
The problem I'm seeing is that when clients access the server,
if more than one does it at once, the access speed is very
'spiky' and in /var/log/messages on the client, it has things like

Oct 20 07:21:55 kernel: nfs server 192.168.1.10:/zroot/share: not
responding
Oct 20 07:21:55 kernel: nfs server 192.168.1.10:/zroot/share: is alive
again
Oct 20 07:42:05 kernel: nfs server 192.168.1.10:/zroot/share: not
responding
Oct 20 07:42:05 kernel: nfs server 192.168.1.10:/zroot/share: is alive
again
Oct 20 08:29:54 kernel: nfs server 192.168.1.10:/zroot/share: not
responding

If one client is accessing one file at once, the transfer is very fast.
But syncing with rsync or webdav is very problematic and takes much longer than
it should.

The server is recent 14-stable and exports nfs via the zfs sharenfs property.
The clients are a mix of freebsd and linux (debian)

I note on the server there's lots of vfs.nfsd sysctl tunables but I'm not sure
if they are relevant in a zfs sharenfs context. There's even more vfs.zfs but
nothing pertaining directly to nfs.
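(For context, the export in question would typically have been set up with
something along these lines on the server; the exact sharenfs options below
are illustrative, not necessarily the poster's:

  zfs set sharenfs="-maproot=root -network 192.168.1.0 -mask 255.255.255.0" zroot/share
  zfs get sharenfs zroot/share

The sharenfs value is passed through as exports(5) options, so the vfs.nfsd
sysctls still apply even though /etc/exports isn't used.)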

On a freebsd client, it has these in rc.conf

# nfs client stuff
nfs_client_enable="YES"

Maybe it needs local locks (-L), but I'm unsure how to pass flags to the client
when it's started this way. How would I know if local locks were needed?
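(One way to get the -L behaviour without touching the rc script is per mount,
via the nolockd option for NFSv3 mounts; the paths and addresses below are
just an example:

  # /etc/fstab on the client
  192.168.1.10:/zroot/share  /mnt/share  nfs  rw,nfsv3,nolockd  0  0

  # or as a one-off
  mount -t nfs -o nfsv3,nolockd 192.168.1.10:/zroot/share /mnt/share

With nolockd, fcntl/lockf locks stay local to the client instead of being
forwarded to the server's rpc.lockd, which only matters if applications on
several clients need to see each other's locks.)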

I note in defaults/rc.conf there is
nfs_bufpackets="" # bufspace (in packets) for client

but I'm not sure what the effects would be.

I've so far set 'zfs set sync=disabled' on the particular dataset and
'sysctl vfs.nfsd.maxthreads=256' on the server, and am about to test this.
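(If those help, a sketch of making them persistent on the server; the dataset
name is taken from the log messages above. Note that sync=disabled means data
an NFS client believes committed can be lost if the server crashes or loses
power:

  zfs set sync=disabled zroot/share
  sysctl vfs.nfsd.maxthreads=256
  echo 'vfs.nfsd.maxthreads=256' >> /etc/sysctl.conf
)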
--


Rick Macklem
2024-10-21 16:17:05 UTC
On Mon, Oct 21, 2024 at 6:46 AM void <***@f-m.fm> wrote:
>
> I'm looking to try to improve nfs client & server performance.
> The problem I'm seeing is that when clients access the server,
> if more than one does it at once, the access speed is very
> 'spiky' and in /var/log/messages on the client, it has things like
>
> Oct 20 07:21:55 kernel: nfs server 192.168.1.10:/zroot/share: not
> responding
> Oct 20 07:21:55 kernel: nfs server 192.168.1.10:/zroot/share: is alive
> again
> Oct 20 07:42:05 kernel: nfs server 192.168.1.10:/zroot/share: not
> responding
> Oct 20 07:42:05 kernel: nfs server 192.168.1.10:/zroot/share: is alive
> again
> Oct 20 08:29:54 kernel: nfs server 192.168.1.10:/zroot/share: not
> responding
>
> If one client is accessing one file at once, the transfer is very fast.
> But syncing like rsync or webdav is very problematic and takes much longer than
> it should.
>
> The server is recent 14-stable and exports nfs via the zfs sharenfs property.
> The clients are a mix of freebsd and linux (debian)
>
> I note on the server there's lots of vfs.nfsd sysctl tunables but I'm not sure
> if they are relevant in a zfs sharenfs context. There's even more vfs.zfs but
> nothing pertaining directly to nfs.
>
> On a freebsd client, it has these in rc.conf
>
> # nfs client stuff
> nfs_client_enable="YES"
>
> Maybe it needs local locks (-L) unsure how to pass flags to the client started
> in this way. How would I know if local locks were needed?
>
> I note in defaults/rc.conf there is
> nfs_bufpackets="" # bufspace (in packets) for client
>
> but I'm not sure what the effects would be.
>
> I've so far set 'zfs set sync=disabled' for the particular vdev and
> 'sysctl vfs.nfsd.maxthreads=256' on the server, about to test this.
There are lots of possibilities, but here are a couple to try...
vfs.zfs.dmu_offset_next_sync=0 - this makes SEEK_HOLE/SEEK_DATA
much faster, but less reliable (as in it might miss finding a hole)
vfs.nfsd.cachetcp=0 - this disables the DRC cache for TCP connections
(if this helps, there are settings to try to tune the DRC).
Disabling the DRC for TCP means that there is a
slight chance of corruption, due to duplicate
non-idempotent RPCs being done after a TCP reconnect. Has no
effect on NFSv4.1/4.2 mounts.
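(A minimal sketch of applying these on the server, assuming both are
runtime-writable on this kernel, plus keeping them across reboots:

  sysctl vfs.zfs.dmu_offset_next_sync=0
  sysctl vfs.nfsd.cachetcp=0
  echo 'vfs.zfs.dmu_offset_next_sync=0' >> /etc/sysctl.conf
  echo 'vfs.nfsd.cachetcp=0' >> /etc/sysctl.conf
)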

rick

> --
>


void
2024-10-21 18:52:46 UTC
Hi Rick, thanks for replying

On Mon, Oct 21, 2024 at 09:17:05AM -0700, Rick Macklem wrote:

>There are lots of possibilities, but here are a couple to try...
>vfs.zfs.dmu_offset_next_sync=0 - this makes SEEK_HOLE/SEEK_DATA
>much faster, but less reliable (as in it might miss finding a hole)

>vfs.nfsd.cachetcp=0 - this disables the DRC cache for TCP connections
>(if this helps, there are settings to try to tune the DRC).
>Disabling the DRC for TCP means that there is a
>slight chance of corruption, due to duplicate
>non-idempotent RPCs being done after a TCP reconnect. Has no
>effect on NFSv4.1/4.2 mounts.

How can I tell what NFS version the mount is using?
rpcinfo nfs-server-ip shows versions 4,3,2,1
--


Rick Macklem
2024-10-21 22:52:12 UTC
On Mon, Oct 21, 2024 at 11:53 AM void <***@f-m.fm> wrote:
>
> Hi Rick, thanks for replying
>
> On Mon, Oct 21, 2024 at 09:17:05AM -0700, Rick Macklem wrote:
>
> >There are lots of possibilities, but here are a couple to try...
> >vfs.zfs.dmu_offset_next_sync=0 - this makes SEEK_HOLE/SEEK_DATA
> >much faster, but less reliable (as in it might miss finding a hole)
>
> >vfs.nfsd.cachetcp=0 - this disables the DRC cache for TCP connections
> >(if this helps, there are settings to try to tune the DRC).
> >Disabling the DRC for TCP means that there is a
> >slight chance of corruption, due to duplicate
> >non-idempotent RPCs being done after a TCP reconnect. Has no
> >effect on NFSv4.1/4.2 mounts.
>
> How can I tell what version NFS mount it has?
> rpcinfo nfs-server-ip shows versions 4,3,2,1
On the clients type:
# nfsstat -m
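(And if, after checking, you want to pin a particular version explicitly,
the mount options differ per client OS; the paths below are made up:

  # FreeBSD client, NFSv4.2
  mount -t nfs -o nfsv4,minorversion=2 192.168.1.10:/zroot/share /mnt/share

  # Debian/Linux client, NFSv4.2
  mount -t nfs -o vers=4.2 192.168.1.10:/zroot/share /mnt/share
)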

rick

> --
>


void
2024-10-28 23:40:59 UTC
On Mon, Oct 21, 2024 at 03:52:12PM -0700, Rick Macklem wrote:

(stuff)

Hi Rick, I tried the various things suggested and did notice a performance
improvement. What I'm seeing right now, though, is the following, in a
different context [1]:

login[some_pid] Getting pipebuf resource limit: Invalid argument

[1] context is root creating a tar on an nfs mount. The tar is huge,
over a TB, as it's backing up homedirs [2]. It seems to be running ok
though, for the moment. But I'm not seeing anything remotely
approaching maximums in netstat -m, and thought that with root doing
the backup its resources would be pretty much unlimited.
It's not spamming the console though.

[2] The SMR hd is showing signs it's ill. Everything is slow.
Trying to move all the data off there before a (cold) reboot.
The thing is, with these SMRs (I've had two) it seems nothing much
happens in the bad-blocks sense, it just gets *very* *slow*.
It's zfs, but a single disk, 8TB.
--



J David
2024-11-15 16:30:54 UTC
On Mon, Oct 28, 2024 at 7:41 PM void <***@f-m.fm> wrote:
> [1] context is root creating a tar on a nfs mount. The tar is huge,
> over a Tb,
[...]
> [2] The SMR hd

FWIW, writing extremely large files is pretty much the worst-case
scenario for SMR drives.

> it just gets *very* *slow*.

That's what SMR does.

> It's zfs, but single disk, 8Tb.

Perhaps I'm overstating the case, but I believe that using ZFS on SMR
disks is strongly discouraged. I haven't tried myself, mainly due to
the horror stories I've read. Stories that sound a lot like yours.


infoomatic
2024-11-15 16:34:29 UTC
From my personal experience, regarding the usage with ZFS:

* SMR disks are absolutely to be avoided; their performance is horrible.
* QLC SSDs are also horrible performance-wise - after a short burst
of speed they drop back to spinning-rust performance


On 15.11.24 17:30, J David wrote:
> On Mon, Oct 28, 2024 at 7:41 PM void <***@f-m.fm> wrote:
>> [1] context is root creating a tar on a nfs mount. The tar is huge,
>> over a Tb,
> [...]
>> [2] The SMR hd
>
> FWIW, writing extremely large files is pretty much the worst-case
> scenario for SMR drives.
>
>> it just gets *very* *slow*.
>
> That's what SMR does.
>
>> It's zfs, but single disk, 8Tb.
>
> Perhaps I'm overstating the case, but I believe that using ZFS on SMR
> disks is strongly discouraged. I haven't tried myself, mainly due to
> the horror stories I've read. Stories that sound a lot like yours.
>



Alan Somers
2024-11-15 16:40:17 UTC
On Fri, Nov 15, 2024 at 9:34 AM infoomatic <***@gmx.at> wrote:
>
> From my personal experience, regarding the usage with ZFS:
>
> * SMR disks are absolutely to be avoided; their performance is horrible.
> * QLC SSDs are also horrible performance-wise - after a short burst
> of speed they drop back to spinning-rust performance

To be clear, SMR disk performance is horrible ***on file systems that
weren't designed for them***. That is to say, on every single legacy
file system. Samsung claims to fully support SMR disks on F2FS, but
that's an overwriting file system. Microsoft claims to fully support
SMR disks on ReFS. But look more closely, and you'll see that not all
features are supported there. So I'm not aware of any CoW file system
that fully supports SMR. Any, that is, except for the one I wrote
myself. Alas, I don't have funding to finish it to a production-ready
state ...


J David
2024-12-02 20:23:08 UTC
On Sun, Dec 1, 2024 at 8:03 PM Rick Macklem <***@gmail.com> wrote:
> Well, this indicates the Debian server is broken. A bitmap and associated
> attribute values are required for a GETATTR reply of NFS4_OK.
> This clearly says they are not there.
>
> That would result in the client saying the RPC is bad.

Even if the response to that isn't "A problem that occurs only with
FreeBSD clients is a FreeBSD client problem; it shouldn't do the thing
that causes that to happen," it could take quite some time for any
change made by the linux-nfs crowd to filter through to a
production Debian release.

Is there a reasonable way to apply Postel's law here and modify the
client to warn on but accept this behavior rather than erroring out in
a way that renders the file structure unusable indefinitely?

Even refusing to cache this response if it is unusable would probably
be an improvement.

Thanks!


Rick Macklem
2024-12-07 22:42:05 UTC
On Mon, Dec 2, 2024 at 12:23 PM J David <***@gmail.com> wrote:
>
> On Sun, Dec 1, 2024 at 8:03 PM Rick Macklem <***@gmail.com> wrote:
> > Well, this indicates the Debian server is broken. A bitmap and associated
> > attribute values are required for a GETATTR reply of NFS4_OK.
> > This clearly says they are not there.
> >
> > That would result in the client saying the RPC is bad.
>
> Even if the response to that isn't "A problem that occurs only with
> FreeBSD clients is a FreeBSD client problem; it shouldn't do the thing
> that causes that to happen," it could take quite some time for any
> change made by the linux-nfs crowd to filter through to reaching a
> production Debian release.
>
> Is there a reasonable way to apply Postel's law here and modify the
> client to warn on but accept this behavior rather than erroring out in
> a way that renders the file structure unusable indefinitely?
Probably not.

First is the question of what failure went on-the-wire:
(A) - The record mark length for the message was correct, but the
message did not have any GETATTR reply data.
or
(B) - The record mark length was wrong and the GETATTR reply data
came after the end-of-record as indicated by the record mark
that precedes each RPC message.
If it is (B), the TCP connection is screwed up, since there is no way
to re-synchronize to the start of the next RPC message. All a client
can do in this case is create a new TCP connection and retry all
outstanding RPCs. (Your initial post suggested that this might be
happening?)

If it is (A), then for the specific case of GETATTR not receiving
valid data after a READDIR, it might be ok to ignore the failure.
However, GETATTRs happen a lot and there are many places
where no reply data is a serious problem. For example, the
client might not even know what type of file object (regular file,
directory,...) the object is.
--> The GETATTR replies are all processed in the same place
and, as such, it is not known that this reply comes after a
READDIR.
If there was one reproducible case where a widely used Linux
server was known to fail, it might be possible to come up with a
workaround hack. However, you are the only one reporting this
problem as far as I can recall and it appears to be intermittent.
(ie. It could be that GETATTRs fail to reply with proper data for
other cases, but it is this case that you captured packets for.)

Finally, why would you assume that putting a fix in the FreeBSD
client is somehow easier and less logistically time-consuming
than fixing a Linux server?

Note that I hinted at how you might isolate why/how the Linux
server is broken. In doing so, I did not intend to suggest that
it was even a software issue. I simply do not know.
(For example, have you looked hard for any evidence that there
is a hardware issue w.r.t. that server?)

rick

>
> Even refusing to cache this response if it is unusable would probably
> be an improvement.
>
> Thanks!


Rick Macklem
2024-12-07 22:44:22 UTC
On Sat, Dec 7, 2024 at 2:42 PM Rick Macklem <***@gmail.com> wrote:
>
> On Mon, Dec 2, 2024 at 12:23 PM J David <***@gmail.com> wrote:
> >
> > On Sun, Dec 1, 2024 at 8:03 PM Rick Macklem <***@gmail.com> wrote:
> > > Well, this indicates the Debian server is broken. A bitmap and associated
> > > attribute values are required for a GETATTR reply of NFS4_OK.
> > > This clearly says they are not there.
> > >
> > > That would result in the client saying the RPC is bad.
> >
> > Even if the response to that isn't "A problem that occurs only with
> > FreeBSD clients is a FreeBSD client problem; it shouldn't do the thing
> > that causes that to happen," it could take quite some time for any
> > change made by the linux-nfs crowd to filter through to reaching a
> > production Debian release.
> >
> > Is there a reasonable way to apply Postel's law here and modify the
> > client to warn on but accept this behavior rather than erroring out in
> > a way that renders the file structure unusable indefinitely?
> Probably not.
>
> First is the question of what failure went on-the-wire:
> (A) - The record mark length for the message was correct, but the
> message did not have any GETATTR reply data.
> or
> (B) - The record mark length was wrong and the GETATTR reply data
> came after the end-of-record as indicated by the record mark
> that precedes each RPC message.
Oh, and although it is not easy for the client to tell if the failure
is (A) vs (B), it can be determined by looking at the packet trace in
wireshark, as I described.

If you do not want to do this but are willing to provide the pcap file to me,
I can take a look and quickly determine if it is (A) vs (B).
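(For anyone else wanting to capture an equivalent trace, a sketch using the
same tcpdump invocation suggested elsewhere in this thread; the server name
is made up:

  # on the client, while reproducing the failure
  tcpdump -s 0 -w getattr.pcap host nfs-server

  # then open getattr.pcap in wireshark and filter on "nfs" to inspect the
  # RPC record-mark lengths and the READDIR/GETATTR replies
)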

rick

> If it is (B), the TCP connection is screwed up, since there is no way
> to re-synchronize to the start of the next RPC message. All a client
> can do in this case is create a new TCP connection and retry all
> outstanding RPCs. (Your initial post suggested that this might be
> happening?)
>
> If it is (A), then for the specific case of GETATTR not receiving
> valid data after a READDIR, it might be ok to ignore the failure.
> However, GETATTRs happen a lot and there are many places
> where no reply data is a serious problem. For example, the
> client might not even know what type of file object (regular file,
> directory,...) the object is.
> --> The GETATTR replies are all processed in the same place
> and, as such, it is not known that this reply comes after a
> READDIR.
> If there was one reproducible case where a widely used Linux
> server was known to fail, it might be possible to come up with a
> workaround hack. However, you are the only one reporting this
> problem as far as I can recall and it appears to be intermittent.
> (ie. It could be that GETATTRs fail to reply with proper data for
> other cases, but it is this case that you captured packets for,)
>
> Finally, why would you assume that putting a fix in the FreeBSD
> client is somehow easier and less logistically time consuming
> compared to fixing a Linux server.
>
> Note that I hinted at how you might isolate why/how the Linux
> server is broken. In doing so, I did not intend to suggest that
> it was even a software issue. I simply do not know.
> (For example, have you looked hard for any evidence that there
> is a hardware issue w.r.t. that server?)
>
> rick
>
> >
> > Even refusing to cache this response if it is unusable would probably
> > be an improvement.
> >
> > Thanks!


J David
2024-12-09 23:34:06 UTC
On Sat, Dec 7, 2024 at 5:42 PM Rick Macklem <***@gmail.com> wrote:
> Finally, why would you assume that putting a fix in the FreeBSD
> client is somehow easier and less logistically time consuming
> compared to fixing a Linux server.

Because if you or I could come up with a workaround or a way to not
cache the bad response so it would at least retry sooner, I could
apply it and rebuild from source. I can't do that on Linux. If there's
a way to do that with a patch from linux-nfs folks on a Debian system
at all, I have no idea what would be involved or how to even begin.

A fix on their end would, most likely, have to go through the complete
release process from linux-nfs, the Linux kernel group, and then the
Debian project.

> (For example, have you looked hard for any evidence that there
> is a hardware issue w.r.t. that server?)

There is no evidence that there is a hardware issue. Nor is it just
one specific server or one client. There are many clients and many
servers, and this can happen to any combination. This is just the case
where I was easily and reliably able to reproduce it. It's so reliable
I may even be able to reproduce it in a couple of VMs, which is what I
am waiting to have time to do before I reach out to linux-nfs.

I put the pcap file in a safe place and am happy to send you a copy. I
will do so as soon as I figure out where I put the safe place...

Thanks!


Rick Macklem
2024-12-12 21:43:51 UTC
On Mon, Dec 9, 2024 at 3:34 PM J David <***@gmail.com> wrote:
>
> On Sat, Dec 7, 2024 at 5:42 PM Rick Macklem <***@gmail.com> wrote:
> > Finally, why would you assume that putting a fix in the FreeBSD
> > client is somehow easier and less logistically time consuming
> > compared to fixing a Linux server.
>
> Because if you or I could come up with a workaround or a way to not
> cache the bad response so it would at least retry sooner, I could
> apply it and rebuild from source. I can't do that on Linux. If there's
> a way to do that with a patch from linux-nfs folks on a Debian system
> at all, I have no idea what would be involved or how to even begin.
>
> A fix on their end would, most likely, have to go through the complete
> release process from linux-nfs, the Linux kernel group, and then the
> Debian project.
>
> > (For example, have you looked hard for any evidence that there
> > is a hardware issue w.r.t. that server?)
>
> There is no evidence that there is a hardware issue. Nor is it just
> one specific server or one client. There are many clients and many
> servers, and this can happen to any combination. This is just the case
> where I was easily and reliably able to reproduce it. It's so reliable
> I may even be able to reproduce it in a couple of VMs, which is what I
> am waiting to have time to do before I reach out to linux-nfs.
>
> I put the pcap file in a safe place and am happy to send you a copy. I
> will do so as soon as I figure out where I put the safe place...
Just to bring the list up to date...
J. David did send me a packet trace. The problem is that the "length
of the GETATTR bitmap" is a word of 0 instead of 2, although the 2 words
of bits and the associated attributes are in the reply on-the-wire.

This wouldn't be an obvious Linux knfsd bug. It might be some sort
of runaway pointer or use after free bug.

There is no way the FreeBSD client can easily know that the reply
is corrupted in this way, so I think reporting "RPC struct is bad" is
reasonable.

I have sent a patch to J. David that modifies the NFSv4 Readdir RPC
to not do a GETATTR after the READDIR. It might work for him,
but I do not consider it appropriate for FreeBSD at this time.

rick

>
> Thanks!


Rick Macklem
2024-12-24 01:07:01 UTC
On Mon, Dec 2, 2024 at 12:23 PM J David <***@gmail.com> wrote:
>
> On Sun, Dec 1, 2024 at 8:03 PM Rick Macklem <***@gmail.com> wrote:
> > Well, this indicates the Debian server is broken. A bitmap and associated
> > attribute values are required for a GETATTR reply of NFS4_OK.
> > This clearly says they are not there.
> >
> > That would result in the client saying the RPC is bad.
>
> Even if the response to that isn't "A problem that occurs only with
> FreeBSD clients is a FreeBSD client problem; it shouldn't do the thing
> that causes that to happen," it could take quite some time for any
> change made by the linux-nfs crowd to filter through to reaching a
> production Debian release.
The bug has been isolated by Chuck Lever III and he has proposed a
patch to the Linux NFS project group, of which he is a member.
I have no idea how long it will take for this patch to find its way into
production release kernels.

I have created FreeBSD bugzilla pr#283538 with attachments, including
Chuck Lever's proposed patch and J. David's shell script to test for it.

I'll leave this bugzilla pr Open until the server patch has been widely
distributed, so that others can fairly easily find out what is going on.

Thanks go to J. David for figuring out how to reproduce the problem
and to Chuck Lever for figuring out how to fix it.
It does appear to be in most Linux knfsd kernel servers, so it is probably
in any Linux NFSv4 server you run, although the failure is a rare/oddball case.

rick

>
> Is there a reasonable way to apply Postel's law here and modify the
> client to warn on but accept this behavior rather than erroring out in
> a way that renders the file structure unusable indefinitely?
>
> Even refusing to cache this response if it is unusable would probably
> be an improvement.
>
> Thanks!


Alan Somers
2025-01-05 16:19:26 UTC
On Sun, Jan 5, 2025 at 5:47 AM Harry Schmalzbauer <***@omnilan.de> wrote:
>
> On 2025-01-04 22:53, Alan Somers wrote:
> > On Sat, Jan 4, 2025 at 2:39 PM Harry Schmalzbauer <***@omnilan.de> wrote:
> ....
> >> For now I set the setuid bit to JAILROOT/bin/mount_fusefs.
> >>
> >> **This works fine** (signing in via RDP as unprivileged user (with
> >> freerdp/remmina) allows me to access my shared remote-client directory
> >> in the jailed XFCE4 session).
> ...
> >
> > What is the value of enforce_statfs in your jail? It must be < 2 for
> > mounting within the jail to work.
>
> Thanks for your help. The jail config is fine (enforce_statfs is set to
> 1 in that case); as mentioned, using mount_fusefs(8) works as
> expected in my jail as long as the process invoking it is privileged.
>
> My issue is that vfs.usermount doesn't affect how mount requests from
> jails are handled.
> Even if setting vfs.usermount to 1 on my host would enable unprivileged
> users in my jail to mount_fusefs(8), this setting has unwanted side
> effects - I don't want users to mount anything on the host.
>
> *I don't know if it is intentional* that vfs.usermount is ignored for
> jailed processes.
> What we really would need is a jail-only setting allowing user mounts.
> Global for all jails might be sufficient, since you have to selectively
> allow.mount each fs-type separately.
> Per jail would be the best implementation.
>
> Maybe I'm overlooking some other security impact of allowing unprivileged
> processes to mount from/inside jails!?!
>
> For my current use case, I could tolerate vfs.usermount affecting the
> host security because no users other than the su(1)-permitted admin can
> sign in.
> But I'm not sure I can cope with the security implications of having the
> /sbin/mount_fusefs SUID permission bit set, which is my current solution
> (which makes user-mounting RDPDR fusefs working!).
>
> Thanks,
> -harry

Looking through the code, I see that revision
7533652025eb80bc769f019ba6cb82c4f500443d is the first that ever
allowed mounting from within a jail. But it only allowed mounting by
jailed privileged users. There's no public record of the code review,
so I don't know what was discussed. I'd be wary of granting extra
privileges to jails, though. Jail security can be tricky. There are
a number of ways, for example, for a jailed privileged user to
collaborate with an unjailed unprivileged user in order to gain root
outside of the jail.

I will note that there's another option. mac(9) can choose to allow
an operation that would otherwise be disallowed. So it would be
possible to write a rule that would allow a user (perhaps a specific
user, or all users, or a range, etc) to mount a file system.
mac_bsdextended doesn't have that ability, but it could be added.
mac_biba, mac_lomac, and mac_mls all do. However, I don't know those
well enough to write rules for them. You'll have to do some research
there.
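(For readers following along, the existing jail-side knobs mentioned above
look roughly like this in jail.conf; the jail name is hypothetical, and even
with these, only privileged processes inside the jail may mount:

  rdpjail {
          enforce_statfs = 1;
          allow.mount;
          allow.mount.fusefs;
  }
)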

Hope that helps,
-Alan


Peter 'PMc' Much
2025-01-06 16:37:49 UTC
On Mon, Jan 06, 2025 at 05:53:38AM -0800, Rick Macklem wrote:
! On Sun, Jan 5, 2025 at 8:45 PM Peter 'PMc' Much
! <***@citylink.dinoex.sub.org> wrote:

! > This doesn't look good. It goes on for hours. What can be done about it?
! > (13.4 client & server)
! >
! >
! > 44 processes: 4 running, 39 sleeping, 1 waiting
! > CPU: 0.4% user, 0.0% nice, 99.6% system, 0.0% interrupt, 0.0% idle
! > Mem: 21M Active, 198M Inact, 1190M Wired, 278M Buf, 3356M Free
! > ARC: 418M Total, 39M MFU, 327M MRU, 128K Anon, 7462K Header, 43M Other
! > 332M Compressed, 804M Uncompressed, 2.42:1 Ratio
! > Swap: 15G Total, 15G Free
! >
! > PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
! > 417 root 4 52 0 12M 2148K RUN 20:55 99.12% nfscbd
! Do you have delegations enabled on your server
! (vfs.nfsd.issue_delegations not 0)?

Not knowingly:

# sysctl vfs.nfsd.issue_delegations
vfs.nfsd.issue_delegations: 0

! (If you do not, I have no idea why the server would be doing
! callbacks, which is what nfscbd
! handles.)

Me neither. ;)

The good news at this point is, it is a single event. At first I
thought the whole cluster got slow (it is always too slow ;) ), but
it was only this node - the others have no cpu consumption on
nfscbd.

The bad thing is, I cannot remember why I did switch that thing
on.

! Also, "nfsstat -m" on the client shows you/us what your mount
! options are.

It had to be destroyed, as effects got worse.

What I figured is: it didn't issue any syscalls, and it didn't
act on kill -9.
Which means: most likely it found an infinite loop inside the
kernel, aka a never-returning syscall.

! The above suggests that there is still some activity on the client, but the
! info. is limited.

Yes, it got ever slower. The NFS mount is for /usr/ports, and I did
fix some ports there. At some point a "make clean" would start to
take minutes to complete, and there I noticed something is wrong.
Finally it didn't even echo on the console (I had only one cpu
available, and then when something is stuck within the kernel, all
depends on preemption).

! If the client is still in this state, you can collect more info via:
! # tcpdump -s 0 -w out.pcap host <nfs-server>
! run for a little while.

I had to destroy it. I tried to run dtrace to pinpoint exactly where
that thing was executing, but it didn't start up. At that point I didn't
consider it feasible to try further investigation.
These are temporary building guests, they get destroyed after
completion anyway.

So, as apparently it was a single event, I might suggest we just
remember that nfscbd /can do this/ (under yet unclear circumstances)
and otherwise hope for the best.

And probably I should get rid of that daemon altogether. I think I
read something about these delegations, and it looked suitable for the
usecase, but I didn't realize that it would need to be activated
on the server also.
(The usecase is, a snapshot + clone is created from the ports repo,
then switched to a desired tag/branch, and that filetree is then
used by a single guest, exclusively.)


Thanks for Your help!

cheerio,
PMc


Rick Macklem
2025-01-07 23:45:37 UTC
On Mon, Jan 6, 2025 at 8:45 AM Peter 'PMc' Much
<***@citylink.dinoex.sub.org> wrote:
>
> On Mon, Jan 06, 2025 at 05:53:38AM -0800, Rick Macklem wrote:
> ! On Sun, Jan 5, 2025 at 8:45 PM Peter 'PMc' Much
> ! <***@citylink.dinoex.sub.org> wrote:
>
> ! > This doesn't look good. It goes on for hours. What can be done about it?
> ! > (13.4 client & server)
> ! >
> ! >
> ! > 44 processes: 4 running, 39 sleeping, 1 waiting
> ! > CPU: 0.4% user, 0.0% nice, 99.6% system, 0.0% interrupt, 0.0% idle
> ! > Mem: 21M Active, 198M Inact, 1190M Wired, 278M Buf, 3356M Free
> ! > ARC: 418M Total, 39M MFU, 327M MRU, 128K Anon, 7462K Header, 43M Other
> ! > 332M Compressed, 804M Uncompressed, 2.42:1 Ratio
> ! > Swap: 15G Total, 15G Free
> ! >
> ! > PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
> ! > 417 root 4 52 0 12M 2148K RUN 20:55 99.12% nfscbd
> ! Do you have delegations enabled on your server
> ! (vfs.nfsd.issue_delegations not 0)?
>
> Not knowingly:
>
> # sysctl vfs.nfsd.issue_delegations
> vfs.nfsd.issue_delegations: 0
>
> ! (If you do not, I have no idea why the server would be doing
> ! callbacks, which is what nfscbd
> ! handles.)
>
> Me neither. ;)
The cpu being associated with nfscbd might just be a glitch.
NFS uses kernel threads and it is hard to know what process
they might get associated with for these stats.

When you do:
# ps axHl
it will show the kernel threads. If it happens again, it might turn
out that the thread(s) racking up CPU aren't actually doing callbacks.
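(A hedged suggestion if it does recur: procstat can dump the kernel stacks
of those threads, which should show where they are spinning; 417 is the PID
from the top output above:

  ps axHl | grep nfscbd
  procstat -kk 417
)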

rick

>
> The good news at this point is, it is a single event. At first I
> thought the whole cluster got slow (it is always too slow ;) ), but
> it was only this node - the others have no cpu consumption on
> nfscbd.
>
> The bad thing is, I cannot remember why I did switch that thing
> on.
>
> ! Also, "nfsstat -m" on the client shows you/us what your mount
> ! options are.
>
> It had to be destroyed, as effects got worse.
>
> What I figured is: it didn't issue any syscalls, and it didn't
> act on kill -9.
> Which means: most likely it found an infinite loop inside the
> kernel, aka a never-returning syscall.
>
> ! The above suggests that there is still some activity on the client, but the
> ! info. is limited.
>
> Yes, it got ever slower. The NFS mount is for /usr/ports, and I did
> fix some ports there. At some point a "make clean" would start to
> take minutes to complete, and there I noticed something is wrong.
> Finally it didn't even echo on the console (I had only one cpu
> available, and then when something is stuck within the kernel, all
> depends on preemption).
>
> ! If the client is still in this state, you can collect more info via:
> ! # tcpdump -s 0 -w out.pcap host <nfs-server>
> ! run for a little while.
>
> I had to destroy it. I tried to run dtrace to pinpoint exactly where
> that thing does execute, but it didn't startup. At that point I didn't
> consider it feasible to try further investigation.
> These are temporary building guests, they get destroyed after
> completion anyway.
>
> So, as apparently it was a single event, I might suggest we just
> remember that nfscbd /can do this/ (under yet unclear circumstances)
> and otherwise hope for the best.
>
> And probably I should get rid of that daemon altogether. I think I
> read something about these delegations, and it looked suitable for the
> usecase, but I didn't realize that it would need to be activated
> on the server also.
> (The usecase is, a snapshot + clone is created from the ports repo,
> then switched to a desired tag/branch, and that filetree is then
> used by a single guest, exclusively.)
>
>
> Thanks for Your help!
>
> cheerio,
> PMc


andy thomas
2025-01-11 10:39:50 UTC
But what could be increasing the size of the snapshot after it was
created? Right now, yesterday's snapshot made at 14:45 UTC (nearly 22
hours ago) has grown to 468MB:

***@clustor2:~ # date
Sat Jan 11 10:31:05 GMT 2025
***@clustor2:~ # zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
clustor2/***@2025-01-10_14.45.00 468M - 3.09T -

It's true the size of this file system reported by 'zfs list' has increased
from 3.09TB yesterday to 3.10TB now (it's used for storing user data in an
HPC that currently has about 700 jobs running on it), but it's very strange
that a snapshot supposedly "set in stone" at the time it is created should
continue to grow afterwards!

Andy

On Fri, 10 Jan 2025, heasley wrote:

> Fri, Jan 10, 2025 at 06:52:43PM +0000, andy thomas:
>> Is there a way to find out the status of a snapshot creation?
>
> it should be near-instantaneous. "USED" would only change if files in
> the snapshot are deleted or overwritten.
>
>


-----------------------------
Andy Thomas,
Time Domain Systems

Tel: +44 (0)7815 060872
https://www.time-domain.co.uk


Ronald Klop
2025-01-11 11:03:59 UTC
For example:
After creating a snapshot you delete a file in the original file system. The file is still in the snapshot, so the size of the file will be accounted to the snapshot.
Modifying a file will have a similar effect. The 'old' parts of the file are only available via the snapshot, so they will be accounted to the snapshot.

Regards,
Ronald.

From: andy thomas <***@time-domain.co.uk>
Date: 11 January 2025 11:40
To: heasley <***@shrubbery.net>
CC: freebsd-***@freebsd.org
Subject: Re: How can I tell when ZFS has finished creating a snapshot?

>
>
> But what could be increasing the size of the snapshot after it was created? Right now, yesterday's snapshot made at 14:45 UTC (nearly 22 hours ago) has grown to 468MB:
>
> ***@clustor2:~ # date
> Sat Jan 11 10:31:05 GMT 2025
> ***@clustor2:~ # zfs list -t snapshot
> NAME USED AVAIL REFER MOUNTPOINT
> clustor2/***@2025-01-10_14.45.00 468M - 3.09T -
>
> It's true the size of this file system reported by 'zfs list' has increased from 3.09TB yesterday to 3.10TB now (it's used for storing user data in an HPC that currently has about 700 jobs running on it), but it's very strange that a snapshot supposedly "set in stone" at the time it is created should continue to grow afterwards!
>
> Andy
>
> On Fri, 10 Jan 2025, heasley wrote:
>
> > Fri, Jan 10, 2025 at 06:52:43PM +0000, andy thomas:
> >> Is there a way to find out the status of a snapshot creation?
> >
> > it should be near-instantaneous. "USED" would only change if files in
> > the snapshot are deleted or overwritten.
> >
> >
>
>
> -----------------------------
> Andy Thomas,
> Time Domain Systems
>
> Tel: +44 (0)7815 060872
> https://www.time-domain.co.uk
>
>
>
>
>
Alexander Leidinger
2025-01-11 11:21:34 UTC
On 2025-01-11 12:03, Ronald Klop wrote:

> For example:
>
> After creating a snapshot you delete a file in the original file
> system. The file is still in the snapshot. So the size of the file will
> be accounted to the snapshot.
>
> Modifying a file will give similar effect. The 'old' parts of the file
> are only available via the snapshot so will be accounted to the
> snapshot.

Or in other words, the snapshot does not grow at all, it is read-only.
Space is attributed to a snapshot if the live dataset doesn't reference
a piece of data. Data which is removed or changed in the live dataset
since the snapshot was taken is what you see in USED. You can use "zfs
diff" to see where data has changed. I'm not sure if such data is
attributed to the oldest or the most recent snapshot, but it's one of
the two (I guess the oldest).
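(A small illustration of that accounting, with hypothetical dataset and file
names:

  zfs snapshot tank/data@before
  rm /tank/data/bigfile                 # bigfile's blocks now exist only in the snapshot
  zfs list -rt snapshot tank/data       # USED of tank/data@before grows accordingly
  zfs diff tank/data@before tank/data   # shows what changed since the snapshot
)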

Bye,
Alexander.

> Regards,
> Ronald.
>
> From: andy thomas <***@time-domain.co.uk>
> Date: 11 January 2025 11:40
> To: heasley <***@shrubbery.net>
> CC: freebsd-***@freebsd.org
> Subject: Re: How can I tell when ZFS has finished creating a
> snapshot?
>
>> But what could be increasing the size of the snapshot after it was
>> created? Right now, yesterday's snapshot made at 14:45 UTC (nearly 22
>> hours ago) has grown to 468MB:
>>
>> ***@clustor2:~ # date
>> Sat Jan 11 10:31:05 GMT 2025
>> ***@clustor2:~ # zfs list -t snapshot
>> NAME USED AVAIL REFER MOUNTPOINT
>> clustor2/***@2025-01-10_14.45.00 468M - 3.09T -
>>
>> It's true the size of this file system reported by 'zfs list' has
>> increased from 3.09TB yesterday to 3.10TB now (it's used for storing
>> user data in an HPC that currently has about 700 jobs running on it)
>> but it's very strange that a snapshot supposedly "set in stone" at the
>> time it is created should continue to grow afterwards!
>>
>> Andy
>>
>> On Fri, 10 Jan 2025, heasley wrote:
>>
>>> Fri, Jan 10, 2025 at 06:52:43PM +0000, andy thomas:
>>>> Is there a way to find out the status of a snapshot creation?
>>>
>>> it should be near-instantaneous. "USED" would only change if files
>>> in
>>> the snapshot are deleted or overwritten.
>>>
>>>
>>
>> -----------------------------
>> Andy Thomas,
>> Time Domain Systems
>>
>> Tel: +44 (0)7815 060872
>> https://www.time-domain.co.uk
>>
>> -------------------------


--
http://www.Leidinger.net ***@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org ***@FreeBSD.org : PGP 0x8F31830F9F2772BF
Warner Losh
2025-01-20 20:15:11 UTC
On Mon, Jan 20, 2025, 1:13 PM Stefan Esser <***@freebsd.org> wrote:

> On 20.01.25 at 20:29, Robert Clausecker wrote:
> > No, though 2027 is only two years away, so if we cannot join the OIN,
> > we only have to keep the driver out-of-tree for that long worst case.
>
> Having a driver ready (and internally tested without a public release)
> would be a good preparation for 2027.
>
> BTW: 2 of the relevant patents have been filed 2009-02-20:
>
> Hash based file name lookup:
>
> https://www.freepatentsonline.com/8321439.html
>
> Contiguous file allocation:
>
> https://www.freepatentsonline.com/8606830.html
>
> IIUC, these would not expire before 2029?
>
> > As for “hosted in Europe:” Europe does not have software patents,
> > so we can basically ignore the ExFAT patents if development is done
> > in Europe independently of any US entity.
>
> I'm not convinced that hosting in Europe would be safe, but I do remember
> that the patented IDEA crypto code had been imported, but was not built
> unless a non-standard build option was used.
>

We have contacts at Microsoft... maybe just ask?

Warner

> Regards,
>
> STefan
>
>
Konstantin Belousov
2025-01-21 00:11:25 UTC
On Mon, Jan 20, 2025 at 11:01:55PM +0000, Kirk McKusick wrote:
> It is possible to mount, read, and write exfat filesystems on FreeBSD
> using the fusefs-exfat-1.4.0_1 port/package running on the Fuse interface.
> My light testing of it shows that it all works as expected except that
> it does not seem to understand uids and gids (everything shows as
> root:wheel even though many of the files and directories on my test
> disk had other owners and groups). Attempts to change owner or group
> failed with "Operation not permitted".
>
> That said, having a Summer of Code project that added native exfat
> support to the existing msdos filesystem implementation could be useful
> down the road.

https://reviews.freebsd.org/D27376
Old ro Isilon code


Mark Millard
2025-01-31 15:49:51 UTC
On Jan 20, 2025, at 11:29, Robert Clausecker <***@freebsd.org> wrote:

> Hi Mark,
>
> On Mon, Jan 20, 2025 at 11:06:39AM -0800, Mark Millard wrote:
>> Robert Clausecker <fuz_at_freebsd.org> wrote on
>> Date: Mon, 20 Jan 2025 17:17:40 UTC :
>>
>>> With ExFAT being a common file system on external storage devices
>>> and the patent situation being less bad than a few years ago,
>>
>> Did I miss a status change? I do know that:
>>
>> https://patents.google.com/patent/US20090164440?oq=US2009164440
>>
>> reports: 2027-03-09 Adjusted expiration
>>
>> But, other than that:
>>
>> https://opensource.microsoft.com/blog/2019/08/28/exfat-linux-kernel/
>>
>> reported:
>>
>> QUOTE
>> We also support the eventual inclusion of a Linux kernel with exFAT support in a future revision of the Open Invention Network’s Linux System Definition, where, once accepted, the code will benefit from the defensive patent commitments of OIN’s 3040+ members and licensees.
>> END QUOTE
>>
>> Quoting https://openinventionnetwork.com/# :
>>
>> QUOTE
>> OIN is the largest patent non-aggression community in history. Together, we support freedom of action in Linux as a key element of Open Source & help members reduce patent risks.
>> END QUOTE
>>
>> So, apparently: Very specific to Linux as a context.
>>
>> To my knowledge FreeBSD is not and can not be a member of the
>> Open Invention Network in order to get FreeBSD itself covered.
>
> No, you did not miss anything. However, if Microsoft has given the
> patent to an open source patent pool, it seems likely that we can
> join said pool.
>
>> I'm less sure relative to the means of running Linux code in
>> a booted FreeBSD. May be a OIN membership could cover that
>> for exFAT and more? (No clue.)
>
> Yes, exactly that's what we should evaluate.
>
>>> it
>>> seems interesting to have a native ExFAT driver.
>>>
>>> The driver could be maintained out-of-tree and hosted in Europe
>>> (where the software patents are not enforceable) until we can
>>> merge it.
>>
>> May be the above is an implicit reference to the "2027-03-09
>> Adjusted expiration"?
>
> No, though 2027 is only two years away, so if we cannot join the OIN,
> we only have to keep the driver out-of-tree for that long worst case.

David Chisnall has provided more detailed notes about the patents involved:

https://lists.freebsd.org/archives/freebsd-hackers/2025-January/004264.html

Turns out there are some optimization-related patents that expire later than
the ones for the initial exFAT implementation. So David wrote at the end of
his note:

QUOTE
Next year, I believe, all patents on the original version of exFAT will have expired, which makes it possible to implement an exFAT driver that is not patent encumbered, though without many of the performance improvements.
END QUOTE

> As for “hosted in Europe:” Europe does not have software patents,
> so we can basically ignore the ExFAT patents if development is done
> in Europe independently of any US entity.

===
Mark Millard
marklmi at yahoo.com



Peter 'PMc' Much
2025-02-24 02:39:12 UTC
On Sun, Feb 23, 2025 at 04:24:59PM -0800, Rick Macklem wrote:
! > The cause is that the mtimes are not updated to those that were in the
! > tarball, but stay at the time of unpacking the tarball. And that
! > doesn't work well with make.
! >
! > Then I kill the nfscbd processes, and my timestamps are correct. I
! > start the nfscbd again, umount and mount, unpack an archive, and
! > it keeps the wrong (current) timestamps.
! > So this is certainly the cause - but what is going wrong?
! You found a bug in NFSv4.n when delegations are enabled (nfscbd
! running on the client and a server that has delegations enabled).
! I reproduced it. The mtime is actually correct on the file server.
! (if you look at the files on the server after unrolling the tarball.)
! At least this is what I see.

Yes, me too - I mounted to another client, and the correct time
appeared.

! However, the client is returning a stale mtime.
! Since I can reproduce it, I should be able to fix it fairly quickly.

relief here... and - you're incredibly fast!
Now take your time, this one is not an issue - it's genuine by-catch.

! You already know how to avoid it (don't run nfscbd or don't enable
! delegations on the server).
!
! Thanks for reporting it, rick
! ps: It would still be nice if you can answer the questions in my
! previous email.

Sure, just had to unravel the stack of work on top of which this one
appeared. (I originally wanted to get the smartphones properly into
my VPN)

Server & Client is FreeBSD 13.4,
Serverside vfs.nfsd.issue_delegations=1
Mount-options "nfsv4,readahead=1,rw,async", changing these didn't seem
to make a difference
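(For anyone hitting the same stale-mtime behaviour before a fix lands, the
avoidance Rick describes above amounts to something like this; using sysrc
is just one way to make it stick:

  # on the server: stop issuing delegations
  sysctl vfs.nfsd.issue_delegations=0

  # on the client: don't run the callback daemon
  service nfscbd stop
  sysrc nfscbd_enable=NO
)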


cheerio,
PMc


Peter 'PMc' Much
2025-02-24 03:04:26 UTC
On Sun, Feb 23, 2025 at 03:31:36PM -0800, Rick Macklem wrote:

! Also, what mount options are in use.
! # nfsstat -m
! on the client, lists what is actually being used. (Works for both Linux and
! FreeBSD.)

Forgot this:

nfsv4,minorversion=2,tcp,resvport,nconnect=1,hard,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=8388608,timeout=120,retrans=2147483647

