
Console View


Rob Norris
linux: detect if kernel defines intptr_t

Since Linux 6.7 the kernel has defined intptr_t. Clang has
-Wtypedef-redefinition by default, which causes the build to fail
because we also have a typedef for intptr_t.

Since it's better to use the kernel's if it exists, detect it and skip
our own.
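
For illustration, a minimal sketch of the consumer side, assuming the
configure test sets a HAVE_KERNEL_INTPTR_T macro (the macro name is an
assumption, not necessarily the actual change):

    /* Only supply our own typedef when the kernel headers don't
     * already provide one. */
    #include <linux/types.h>

    #ifndef HAVE_KERNEL_INTPTR_T
    typedef long intptr_t;
    #endif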

Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/

Pull-request: #16201 part 1/1
Rob Norris
vdev_queue: per-vdev stats

Add a set of gauges and counters showing in-flight and total IOs, with
per-class breakdowns and some aggregation counters.
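
As a hedged sketch of the shape these stats might take (kstat_named_t
and ZIO_PRIORITY_NUM_QUEUEABLE are real OpenZFS names; the struct and
field names are assumptions, not the actual implementation):

    /* Illustrative only: per-vdev queue stats of the kind described. */
    typedef struct vdev_queue_stats {
        kstat_named_t vqs_active;   /* gauge: IOs in flight */
        kstat_named_t vqs_total;    /* counter: IOs issued */
        /* per-class breakdowns, one slot per queueable priority */
        kstat_named_t vqs_class_active[ZIO_PRIORITY_NUM_QUEUEABLE];
        kstat_named_t vqs_class_total[ZIO_PRIORITY_NUM_QUEUEABLE];
        kstat_named_t vqs_aggregated;   /* counter: aggregated IOs */
    } vdev_queue_stats_t;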

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16200 part 5/5
Rob Norris
freebsd/kstat: replace existing kstat if name is reused

Normally, when trying to add a sysctl name that already exists, the
kernel rejects it with a warning. This changes the code to search for a
sysctl with the wanted name in the same root. If it exists, it is
destroyed, allowing the new one to go in.

Arguably, a collision like this shouldn't ever happen, but during
import multiple vdev_t (and so vdev_queue_t, and so vdev_queue stats)
can exist at the same time for the same guid. There's no real way to
tell which is which without substantial refactoring in the import and
vdev init codepaths, which is probably worthwhile but not for today.
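
A hedged sketch of the search-and-destroy step (sysctl_remove_oid(),
SYSCTL_FOREACH and SYSCTL_CHILDREN are real FreeBSD APIs; the
surrounding names are illustrative):

    struct sysctl_oid *oidp;

    /* Look for an existing OID with the same name under the parent;
     * if found, destroy it (and its children) so the new one can go
     * in. */
    SYSCTL_FOREACH(oidp, SYSCTL_CHILDREN(parent)) {
        if (strcmp(oidp->oid_name, name) == 0) {
            sysctl_remove_oid(oidp, 1 /* del */, 1 /* recurse */);
            break;
        }
    }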

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16200 part 4/5
Rob Norris
freebsd/kstat: allow multi-level module names

This extends the existing special-case for zfs/poolname to split and
create any number of intermediate sysctl names, so that multi-level
module names are possible.
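
A hedged sketch of the splitting idea (SYSCTL_ADD_NODE(), strsep() and
the kernel strdup()/free() are real FreeBSD APIs; the context, flags
and variable names are illustrative):

    struct sysctl_oid *node = root;
    char *buf = strdup(module_name, M_TEMP);
    char *next = buf, *comp;

    /* "zfs/mypool/vdev" -> zfs -> mypool -> vdev, one node each */
    while ((comp = strsep(&next, "/")) != NULL) {
        node = SYSCTL_ADD_NODE(ctx, SYSCTL_CHILDREN(node), OID_AUTO,
            comp, CTLFLAG_RW, 0, "");
    }
    free(buf, M_TEMP);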

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16200 part 3/5
Rob Norris
linux/kstat: replace existing kstat if name is reused

Previously, if a kstat proc name already existed, the old one would be
kept. This makes it so the old one is discarded and the new one kept.

Arguably, a collision like this shouldn't ever happen, but during
import multiple vdev_t (and so vdev_queue_t, and so vdev_queue stats)
can exist at the same time for the same guid. There's no real way to
tell which is which without substantial refactoring in the import and
vdev init codepaths, which is probably worthwhile but not for today.
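
A hedged sketch of the replacement step (remove_proc_entry() and
proc_create_data() are real procfs APIs; the existence check, parent
directory and ops structure are illustrative):

    /* If an entry with this name already exists, discard it first,
     * then create the new one in its place. */
    if (kstat_proc_entry_exists(module, ksp->ks_name)) /* hypothetical */
        remove_proc_entry(ksp->ks_name, module->ksm_proc);
    entry = proc_create_data(ksp->ks_name, 0644, module->ksm_proc,
        &proc_kstat_operations, ksp);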

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16200 part 2/5
Rob Norris
linux/kstat: allow multi-level module names

Module names are mapped directly to directory names in procfs, but
nothing is done to create the intermediate directories, or remove them.
This makes it impossible to sensibly present kstats about sub-objects.

This commit loops through the '/'-separated names in the full module
name, creates a separate module for each, hooks them up with a parent
pointer and child counter, and unrolls all of this on the other side
when deleting a module.
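
A hedged sketch of the create side (kstrdup(), strsep() and
proc_mkdir() are real Linux kernel APIs; the root directory and the
parent/child bookkeeping described above are elided):

    struct proc_dir_entry *parent = proc_root; /* assumed kstat root */
    char *buf = kstrdup(module_name, GFP_KERNEL);
    char *next = buf, *comp;

    /* "zfs/mypool/vdev" -> one procfs directory per component */
    while ((comp = strsep(&next, "/")) != NULL)
        parent = proc_mkdir(comp, parent);
    kfree(buf);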

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16200 part 1/5
Rich Ercolani
Empty

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>

Pull-request: #16198 part 2/2
Rich Ercolani
Correct level handling in zstream recompress.

sscanf returns the number of items parsed on success and EOF on failure.
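
A minimal illustration of the pitfall (not the actual zstream code):
success must be tested as "== 1", not as a negative error code.

    #include <stdio.h>
    #include <stdlib.h>

    static int
    parse_level(const char *arg)
    {
        int level;

        /* sscanf() returns the number of items converted, or EOF on
         * input failure; it never returns a -errno style value. */
        if (sscanf(arg, "%d", &level) != 1) {
            fprintf(stderr, "invalid level '%s'\n", arg);
            exit(1);
        }
        return (level);
    }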

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>

Pull-request: #16198 part 1/2
Tony Hutter
ZTS: Use QEMU for tests on Linux and FreeBSD

-----------------------------------------------------------------
Do not merge - this is my testing version
-Tony Hutter

Requires-builders: none
----------------------------------------------------------------

This commit adds functional tests for these systems:
- AlmaLinux 8, AlmaLinux 9
- ArchLinux
- CentOS Stream 8, CentOS Stream 9
- Fedora 38, Fedora 39
- Debian 11, Debian 12
- FreeBSD 13, FreeBSD 14, FreeBSD 15
- Ubuntu 22.04, Ubuntu 24.04

Workflow for each operating system:
- install QEMU on the GitHub runner
- download the cloud image for this system
- start and init that image via cloud-init
- install deps, build OpenZFS, load the module
- run the functional tests, hopefully < 5h

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>

Pull-request: #16195 part 1/1
omni
config/zfs-build.m4: add Alpine Linux bash-completion path

Signed-off-by: omni <omni+vagant@hack.org>

Pull-request: #16164 part 2/2
omni
config/zfs-build.m4: sort vendors

Signed-off-by: omni <omni+vagant@hack.org>

Pull-request: #16164 part 1/2
Rob Norris
ddt: lookup and log stats

Adds per-DDT stats counting lookups and where they were serviced from
(either log or backing zap), number of log entries in memory, and flow
rates.
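
As a hedged sketch, the counters might look something like this
(kstat_named_t is the real OpenZFS kstat value type; the struct and
field names are assumptions):

    /* Illustrative only: where each DDT lookup was serviced from. */
    typedef struct ddt_lookup_stats {
        kstat_named_t dls_lookups;      /* total lookups */
        kstat_named_t dls_log_hits;     /* hits in the in-memory log */
        kstat_named_t dls_zap_hits;     /* hits in the backing zap */
        kstat_named_t dls_log_entries;  /* log entries in memory */
    } ddt_lookup_stats_t;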

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15895 part 17/17
Rob Norris
dnode: allow storage class to be overridden by object type

spa_preferred_class() selects a storage class based on (among other
things) the DMU object type. This only works for old-style object types
that match only one specific kind of thing. For DMU_OTN_ types we need
another way to signal the storage class.

This commit allows the object type to be overridden in the IO policy for
the purposes of choosing a storage class. It then adds the ability to
set the storage type on a dnode hold, such that all writes generated
under that hold will get it.

This method has two shortcomings:

- it would be better if we could "name" a set of storage class
  preferences rather than it being implied by the object type.
- it would be better if this info were stored in the dnode on disk.

In the absence of those things, this seems like the smallest possible
change.
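
A hedged sketch of the usage pattern this enables (dnode_hold() and
dnode_rele() are real; the setter and the type value here are
illustrative assumptions):

    dnode_t *dn;

    VERIFY0(dnode_hold(os, object, FTAG, &dn));
    /* hypothetical: writes generated under this hold now carry the
     * overridden type when choosing a storage class */
    dnode_set_storage_type(dn, DMU_OT_PLAIN_FILE_CONTENTS);
    /* ... dirty data under the hold ... */
    dnode_rele(dn, FTAG);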

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15894 part 2/2
Rob Norris
spa_preferred_class: pass the entire zio

Rather than picking out specific values out of the properties, just pass
the entire zio in, to make it easier in the future to use more of that
info to decide on the storage class.

I would rather have just passed io_prop in, but having spa.h include
zio.h gets a bit tricky.
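
A hedged before/after sketch of the interface change (the old
parameter list is approximate):

    /* before: callers picked specific values out of the zio */
    metaslab_class_t *spa_preferred_class(spa_t *spa, uint64_t size,
        dmu_object_type_t objtype, uint_t level, uint_t smallblk);

    /* after: pass the whole zio, so future logic can use more of it */
    metaslab_class_t *spa_preferred_class(spa_t *spa, const zio_t *zio);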

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15894 part 1/2
Rob Norris
dnode: allow storage class to be overridden by object type

spa_preferred_class() selects a storage class based on (among other
things) the DMU object type. This only works for old-style object types
that match only one specific kind of thing. For DMU_OTN_ types we need
another way to signal the storage class.

This commit allows the object type to be overridden in the IO policy for
the purposes of choosing a storage class. It then adds the ability to
set the storage type on a dnode hold, such that all writes generated
under that hold will get it.

This method has two shortcomings:

- it would be better if we could "name" a set of storage class
  preferences rather than it being implied by the object type.
- it would be better if this info were stored in the dnode on disk.

In the absence of those things, this seems like the smallest possible
change.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15894 part 9/9
Rob Norris
ddt: add "flat phys" feature

Traditional dedup keeps a separate ddt_phys_t "type" for each possible
count of DVAs (that is, each value of the copies= parameter). Each of
these is tracked independently of the others, and has its own set of
DVAs. This leads to an (admittedly rare) situation where you can create
as many as six copies of the data by changing the copies= parameter
between copies. This is not only a waste of storage on disk, but also a
waste of space in the stored DDT entries, since no more than three DVAs
are ever needed to handle all possible values of copies=.

This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the
first ddt_phys_t is used. Each time a block is written with the dedup
bit set, this single phys is checked to see if it has enough DVAs to
fulfill the request. If it does, the block is filled with the saved DVAs
as normal. If not, an adjusted write is issued to create as many extra
copies as are needed to fulfill the request, which are then saved into
the entry too.
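
A hedged sketch of that decision in the write path (zp_copies is a
real zio property; the helper names here are assumptions):

    int have = ddt_phys_dva_count(ddp);   /* DVAs already saved */
    int want = zio->io_prop.zp_copies;    /* copies= for this write */

    if (have >= want) {
        /* enough saved DVAs: fill the block pointer as normal */
        ddt_bp_fill(ddp, bp, txg);
    } else {
        /* hypothetical: issue an adjusted write creating only the
         * missing copies, then save the new DVAs into the entry */
        zio_ddt_extend_entry(zio, ddp, want - have);
    }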

Because a single phys is no longer all-or-nothing, but can transition
from fewer to more DVAs, the write path now has to keep a
copy of the previous "known good" DVA set so we can revert to it in case
an error occurs. zio_ddt_write() has been restructured and heavily
commented to make it much easier to see what's happening.

Backwards compatibility is maintained simply by allocating four
ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys
selection macros to check the flag. In the old arrangement, each number
of copies gets a whole phys, which will always have either zero or all
necessary DVAs filled, with no in-between; the old behaviour thus
naturally falls out of the new code.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15893 part 7/7
Rob Norris
ddt: add FDT feature and support for legacy and new on-disk formats

This is the supporting infrastructure for the upcoming dedup features.

Traditionally, dedup objects live directly in the MOS root. While their
details vary (checksum, type and class), they are all the same "kind" of
thing - a store of dedup entries.

The new features are more varied than that, and are better thought of as
a set of related stores for the overall state of a dedup table.

This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will
cause new DDTs to be created as a ZAP in the MOS root, named
DDT-<checksum>. This is used as the root object for the normal type/class
store objects, but will also be a place for any storage required by new
features.

This commit adds two new fields to ddt_t, for version and flags. These
are intended to describe the structure and features of the overall dedup
table, and are stored as-is in the DDT root. In this commit, flags are
always zero, but the intent is that they can be used to hang optional
logic or state onto for new dedup features. Version is always 1.

For a "legacy" dedup table, where no DDT root directory exists, the
version will be 0.

ddt_configure() is expected to determine the version and flags
currently in operation based on whether or not the fast_dedup feature
is enabled, and from what's available on disk. In this way, it's
possible to support both old and new tables.
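
A hedged sketch of that selection (spa_feature_is_enabled() is real;
the ZAP helpers are hypothetical names for the mechanism described):

    if (!spa_feature_is_enabled(spa, SPA_FEATURE_FAST_DEDUP) ||
        !ddt_root_zap_exists(spa, ddt)) {       /* hypothetical */
        ddt->ddt_version = 0;   /* legacy: no DDT root directory */
        ddt->ddt_flags = 0;
    } else {
        ddt->ddt_version = 1;
        /* read flags back from the DDT-<checksum> root ZAP */
        ddt->ddt_flags = ddt_root_zap_get_flags(spa, ddt);
    }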

This also provides a migration path. A legacy setup can be upgraded to
FDT by creating the DDT root ZAP, moving the existing objects into it,
and setting version and flags appropriately. There's no support for that
here, but it would be straightforward to add later, and it opens the
possibility that newer features could be applied to existing dedup
tables.

Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15892 part 3/3
Rob Norris
ddt: add support for prefetching tables into the ARC

This change adds a new `zpool prefetch -t ddt $pool` command which
causes a pool's DDT to be loaded into the ARC. The primary goal is to
remove the need to "warm" a pool's cache before deduplication stops
slowing write performance. It may also provide a way to reload portions
of a DDT if they have been flushed due to inactivity.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Klara, Inc.

Co-authored-by: Will Andrews <will.andrews@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>

Pull-request: #15892 part 2/3
Don Brady
ddt: dedup table quota enforcement

This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool

When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (i.e., their
refcounts changed), as that reuses the space rather than requiring more.
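
A hedged sketch of that enforcement point (zp_dedup is a real zio
property; ddt_over_quota() is a hypothetical name):

    /* Only new entries are subject to the quota; updating an
     * existing entry's refcount reuses its space. */
    if (entry == NULL && ddt_over_quota(spa)) {
        /* refuse the entry: strip the dedup bit so the write
         * proceeds as a regular non-dedup write */
        zio->io_prop.zp_dedup = B_FALSE;
    }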

dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.

Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>

Pull-request: #15892 part 1/3
Rob Norris
ddt: add support for prefetching tables into the ARC

This change adds a new `zpool prefetch -t ddt $pool` command which
causes a pool's DDT to be loaded into the ARC. The primary goal is to
remove the need to "warm" a pool's cache before deduplication stops
slowing write performance. It may also provide a way to reload portions
of a DDT if they have been flushed due to inactivity.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Klara, Inc.

Co-authored-by: Will Andrews <will.andrews@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>

Pull-request: #15890 part 2/2
Don Brady
ddt: dedup table quota enforcement

This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool

When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (i.e., their
refcounts changed), as that reuses the space rather than requiring more.

dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.

Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>

Pull-request: #15889 part 1/1
Jason Lee
ZFS Interface for Accelerators (Z.I.A.)

The ZIO pipeline has been modified to allow for external,
alternative implementations of existing operations to be
used. The original ZFS functions remain in the code as
fallback in case the external implementation fails.

Definitions:
    Accelerator - an entity (usually hardware) that is
                  intended to accelerate operations
    Offloader   - synonym of accelerator; used interchangeably
    Data Processing Unit Services Module (DPUSM)
                - https://github.com/hpc/dpusm
                - defines a "provider API" for accelerator
                  vendors to set up
                - defines a "user API" for accelerator consumers
                  to call
                - maintains a list of providers and coordinates
                  interactions between providers and consumers.
    Provider    - a DPUSM wrapper for an accelerator's API
    Offload     - moving data from ZFS/memory to the accelerator
    Onload      - the opposite of offload

In order for Z.I.A. to be extensible, it does not directly
communicate with a fixed accelerator. Rather, Z.I.A. acquires
a handle to a DPUSM, which is then used to acquire handles
to providers.

Using ZFS with Z.I.A.:
    1. Build and start the DPUSM
    2. Implement, build, and register a provider with the DPUSM
    3. Reconfigure ZFS with '--with-zia=<DPUSM root>'
    4. Rebuild and start ZFS
    5. Create a zpool
    6. Select the provider
          zpool set zia_provider=<provider name> <zpool>
    7. Select operations to offload
          zpool set zia_<property>=on <zpool>

The operations that have been modified are:
    - compression
        - non-raw-writes only
    - decompression
    - checksum
        - not handling embedded checksums
        - checksum compute and checksum error call the same function
    - raidz
        - generation
        - reconstruction
    - vdev_file
        - open
        - write
        - close
    - vdev_disk
        - open
        - invalidate
        - write
        - flush
        - close

Successful operations do not bring data back into memory after
they complete, allowing subsequent offloader operations to
reuse the data. This results in only one data movement per ZIO:
the one at the beginning of the pipeline that is necessary to
get data from ZFS to the accelerator.

When errors occur and the offloaded data is still accessible,
the offloaded data will be onloaded (or dropped if it still
matches the in-memory copy) for that ZIO pipeline stage and
processed with ZFS. This will cause thrashing if a later
operation offloads the data again. This should not happen often,
as constant errors (and the resulting data movement) are not
expected to be the norm.

Unrecoverable errors such as hardware failures will trigger
pipeline restarts (if necessary) in order to complete the
original ZIO using the software path.
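
A hedged sketch of the offload-with-fallback pattern (every zia_*
name here is an illustrative stand-in, not the actual Z.I.A. API):

    static int
    zia_checksum_or_fallback(zio_t *zio, abd_t *abd, uint64_t size)
    {
        /* try the provider-backed implementation first */
        int err = zia_checksum_compute(zio, abd, size);

        if (err != 0) {
            /* onload the data (or drop it if it still matches the
             * in-memory copy), then redo the stage in software */
            zia_onload(&abd->abd_zia_handle, abd);
            err = zio_checksum_software(zio, abd, size);
        }
        return (err);
    }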

The modifications to ZFS can be thought of as two sets of changes:
    - The ZIO write pipeline
        - compression, checksum, RAIDZ generation, and write
        - Each stage starts by offloading data that was not
          previously offloaded
            - This allows for ZIOs to be offloaded at any point
              in the pipeline
    - Resilver
        - vdev_raidz_io_done (RAIDZ reconstruction, checksum, and
          RAIDZ generation), and write
        - Because the core of resilver is vdev_raidz_io_done, data
          is only offloaded once at the beginning of
          vdev_raidz_io_done
            - Errors cause data to be onloaded, but will not
              re-offload in subsequent steps within resilver
            - Write is a separate ZIO pipeline stage, so it will
              attempt to offload data

The zio_decompress function has been modified to allow for
offloading but the ZIO read pipeline as a whole has not, so it
is not part of the above list.

An example provider implementation can be found in
module/zia-software-provider
    - The provider's "hardware" is actually software - data is
      "offloaded" to memory not owned by ZFS
    - Calls ZFS functions in order to not reimplement operations
    - Has kernel module parameters that can be used to trigger
      ZIA_ACCELERATOR_DOWN states for testing pipeline restarts.

abd_t, raidz_row_t, and vdev_t have each been given an additional
"void *<prefix>_zia_handle" member. These opaque handles point to
data that is located on an offloader. abds are still allocated,
but their payloads are expected to diverge from the offloaded copy
as operations are run.
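
Illustrative only (the member name follows the text above; the
surrounding layout is elided):

    typedef struct abd {
        /* ... existing abd fields ... */
        void *abd_zia_handle;   /* data location on the offloader;
                                   NULL when nothing is offloaded */
    } abd_t;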

Encryption and deduplication are disabled for zpools with Z.I.A.
operations enabled.

Aggregation is disabled for offloaded abds.

RPMs will build with Z.I.A.

Signed-off-by: Jason Lee <jasonlee@lanl.gov>

Pull-request: #13628 part 1/1