
Console View


GitHub
Linux 6.10 compat: Remove second parameter of __assign_str()

In Linux kernel commit 2c92ca849fcc6ee7d0c358e9959abc9f58661aea,
the second parameter of __assign_str() was removed.
As that commit's instructions say, there is no need to pass the second
parameter. However, I didn't check whether simply removing the second
parameter would break anything.

Signed-off-by: Pinghigh <pinghigh24678@outlook.com>

Pull-request: #16390 part 1/1
Rob Norris
ZTS: remove skips for zvol_misc tests

Last commit should fix the underlying problem, so these should be
passing reliably again.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16364 part 4/4
Rob Norris
zvol: ensure device minors are properly cleaned up

Currently, if a minor is in use when we try to remove it, we'll skip it
and never come back to it again. Since the zvol state is hung off the
minor in the kernel, this can get us into weird situations if something
tries to use it after the removal fails. It's even worse at pool export,
as there's now a vestigial zvol state with no pool under it. It's
weirder again if the pool is subsequently reimported, as the zvol code
(reasonably) assumes the zvol state has been properly setup, when it's
actually left over from the previous import of the pool.

This commit attempts to tackle that by setting a flag on the zvol if its
minor can't be removed, and then checking that flag when a request is
made and rejecting it, thus stopping new work coming in.

The flag also causes a condvar to be signaled when the last client
finishes. For the case where a single minor is being removed (eg
changing volmode), it will wait for this signal before proceeding.
Meanwhile, when removing all minors, a background task is created for
each minor that couldn't be removed on the spot, and those tasks then
wake and clean up.

Since any new tasks are queued on to the pool's spa_zvol_taskq,
spa_export_common() will continue to wait at export until all minors are
removed.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16364 part 3/4
Rob Norris
linux/zvol_os: fix SET_ERROR with negative return codes

SET_ERROR is our facility for tracking errors internally. The negation
is there to match what the kernel expects from us; thus, the negation
should happen outside of the SET_ERROR.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16364 part 2/4
Rob Norris
zvol_impl: document and tidy flags

ZVOL_DUMPIFIED is a vestigial Solaris leftover, and not used anywhere.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16364 part 1/4
Brian Atkinson
Block cloning conditionally destroy ARC buffer

dmu_buf_will_clone() calls arc_buf_destroy() if there is an ARC buffer
associated with the dbuf. However, this can only be done conditionally.
If the previous dirty record's dr_data is pointed at db_buf, then
destroying it can lead to a NULL pointer dereference when syncing out
the previous dirty record.

This updates dmu_buf_will_clone() to only call arc_buf_destroy() if the
previous dirty record's dr_data is not pointing to db_buf. The block
clone will still set the dbuf's db_buf and db_data to NULL, but this
will not cause any issues, as any previous dirty record's dr_data will
still be pointing at the ARC buffer.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>

Pull-request: #16337 part 1/1
Rob Norris
abd: lift ABD zero scan from zio_compress_data() to abd_cmp_zero()

It's now the caller's responsibility to do any special handling for
holes, if that's something it wants.

This also makes zio_compress_data() and zio_decompress_data() properly
the inverse of each other.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16326 part 1/1
Rob Norris
wip zstd and such

Pull-request: #16323 part 2/2
Rob Norris
WIP compress abd

Pull-request: #16323 part 1/2
Rob Norris
libzpool/abd_os: iovec-based scatter abd

This is intended to be a simple userspace scatter abd based on struct
iovec. It's not very sophisticated as-is, but sets a base for something
much more interesting.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16253 part 5/5
Rob Norris
abd_os: break out platform-specific header parts

Removing the platform #ifdefs from shared headers in favour of
per-platform headers. Makes abd_t much leaner, among other things.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16253 part 4/5
Rob Norris
abd_os: split userspace and Linux kernel code

The Linux abd_os.c serves double-duty as the userspace scatter abd
implementation, by carrying an emulation of kernel scatterlists. This
commit lifts common and userspace-specific parts out into a separate
abd_os.c for libzpool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16253 part 3/5
Rob Norris
zio: no alloc canary in userspace

The alloc canary makes it harder to use memory debuggers like valgrind
directly, because they can't see canary overruns.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16253 part 2/5
Rob Norris
abd: remove ABD_FLAG_ZEROS

Nothing ever checks it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16253 part 1/5
Umer Saleem
JSON: Fix class values for mirrored special vdevs

This fixes things so mirrored special vdevs report themselves as
"class=special" rather than "class=normal".

This happens due to the way the vdev nvlists are constructed:

mirrored special devices - the 'mirror' vdev has allocation bias
"special", and its leaf vdevs are "normal"

single or RAID0 special devices - leaf vdevs have allocation bias
"special"

This commit adds code to check whether a leaf's parent is a "special"
vdev, and if so, the leaf also reports "special".

Signed-off-by: Tony Hutter <hutter2@llnl.gov>

Pull-request: #16217 part 9/9
Rob Norris
spl-proc: remove old taskq stats

These had minimal useful information for the admin, didn't work properly
in some places, and knew far too much about taskq internals.

With the new stats available, these should never be needed anymore.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16171 part 5/5
Rob Norris
spl-taskq: summary stats for all taskqs

This adds /proc/spl/kstat/taskq/summary, which attempts to show a
useful subset of stats for all taskqs in the system.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16171 part 4/5
Rob Norris
spl-taskq: task timing stats

For each task, record the time it reaches various points in the pipeline
(dispatched, on queue, off queue, executed, completed). When the entry
is destroyed, update rolling averages for each stage of the taskq. This
makes it possible to see where taskqs are struggling, eg, tasks are on
queue longer than they should be, or enqueue or dequeue is taking a long
time due to lock contention.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16171 part 3/5
Rob Norris
spl-taskq: per-taskq kstats

This exposes a variety of per-taskq stats under /proc/spl/kstat/taskq,
one file per taskq, named for the taskq name.instance.

These include a small amount of info about the taskq config, the current
state of the threads and queues, and various counters for thread and
queue activity since the taskq was created.

To assist with decrementing queue size counters, the list an entry is on
is encoded in spare bits in the entry flags.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16171 part 2/5
Rob Norris
spl-generic: bring up kstats subsystem before taskq

For spl-taskq to use the kstats infrastructure, it has to be available
first.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #16171 part 1/5
Rob Norris
linux: log a scary warning when used with an experimental kernel

Since the person using the kernel may not be the person who built it,
show a warning at module load too, in case they aren't aware that it
might be weird.

Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/

Pull-request: #15986 part 2/2
Rob Norris
config/kernel: introduce "maximum experimental" kernel version

META lists the maximum kernel version we consider to be fully supported.
However, we don't enforce this.

Sometimes we ship experimental patches for a newer kernel than we're
ready to support or, less often, we compile just fine against a newer
kernel. Invariably, something doesn't quite work properly, and it's
difficult for users to understand that they're actually running against
a kernel that we're not yet ready to support.

This commit tries to improve this situation. First, it simply enforces
Linux-Maximum, by having configure bail out if you try to compile
against a newer version than that.

Then, it adds the --enable-linux-experimental switch to configure. When
supplied, this disables enforcing the maximum version, allowing the user
to attempt to build against a kernel with version higher than
Linux-Maximum.

Finally, if the switch is supplied _and_ configure is run against a
higher kernel version, it shows a big warning message when configure
finishes, and defines HAVE_LINUX_EXPERIMENTAL for the build. This
allows us to add code to modify runtime behaviour as well.

Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/

Pull-request: #15986 part 1/2
Rob Norris
zio: log ZIO and ABD flags for debug

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Pull-request: #15945 part 8/8
Allan Jude
dnode: allow storage class to be overridden by object type

spa_preferred_class() selects a storage class based on (among other
things) the DMU object type. This only works for old-style object types
that match only one specific kind of thing. For DMU_OTN_ types we need
another way to signal the storage class.

This commit allows the object type to be overridden in the IO policy for
the purposes of choosing a storage class. It then adds the ability to
set the storage type on a dnode hold, such that all writes generated
under that hold will get it.

This method has two shortcomings:

- it would be better if we could "name" a set of storage class
  preferences rather than it being implied by the object type.
- it would be better if this info were stored in the dnode on disk.

In the absence of those things, this seems like the smallest possible
change.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15894 part 2/2
Allan Jude
spa_preferred_class: pass the entire zio

Rather than picking specific values out of the properties, just pass
the entire zio in, to make it easier in the future to use more of that
info to decide on the storage class.

I would rather have just passed io_prop in, but having spa.h include
zio.h gets a bit tricky.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15894 part 1/2
Allan Jude
ddt: add "flat phys" feature

Traditional dedup keeps a separate ddt_phys_t "type" for each possible
value of the copies= parameter (that is, the count of DVAs). Each of
these is tracked independently of the others, and has its own set of
DVAs. This leads to an (admittedly rare) situation where you can create
as many as six copies of the data by changing the copies= parameter
between writes. This is a waste of storage on disk, and also a waste of
space in the stored DDT entries, since there never need to be more than
three DVAs to handle all possible values of copies=.

This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the
first ddt_phys_t is used. Each time a block is written with the dedup
bit set, this single phys is checked to see if it has enough DVAs to
fulfill the request. If it does, the block is filled with the saved DVAs
as normal. If not, an adjusted write is issued to create as many extra
copies as are needed to fulfill the request, which are then saved into
the entry too.

Because a single phys is no longer an all-or-nothing, but can be
transitioning from fewer to more DVAs, the write path now has to keep a
copy of the previous "known good" DVA set so we can revert to it in case
an error occurs. zio_ddt_write() has been restructured and heavily
commented to make it much easier to see what's happening.

Backwards compatibility is maintained simply by allocating four
ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys
selection macros to check the flag. In the old arrangement, each number
of copies gets a whole phys, so it will always have either zero or all
necessary DVAs filled, with no in-between, so the old behaviour
naturally falls out of the new code.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15893 part 7/7
Allan Jude
ZTS: tests for dedup legacy/FDT tables

Very basic coverage to make sure things appear to work, have the right
format on disk, and pool upgrades and mixed table types work as
expected.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15892 part 3/3
Allan Jude
ddt: add FDT feature and support for legacy and new on-disk formats

This is the supporting infrastructure for the upcoming dedup features.

Traditionally, dedup objects live directly in the MOS root. While their
details vary (checksum, type and class), they are all the same "kind" of
thing - a store of dedup entries.

The new features are more varied than that, and are better thought of as
a set of related stores for the overall state of a dedup table.

This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will
cause new DDTs to be created as a ZAP in the MOS root, named
DDT-<checksum>. This is used as the root object for the normal type/class
store objects, but will also be a place for any storage required by new
features.

This commit adds two new fields to ddt_t, for version and flags. These
are intended to describe the structure and features of the overall dedup
table, and are stored as-is in the DDT root. In this commit, flags are
always zero, but the intent is that they can be used to hang optional
logic or state onto for new dedup features. Version is always 1.

For a "legacy" dedup table, where no DDT root directory exists, the
version will be 0.

ddt_configure() is expected to determine the version and flags features
currently in operation based on whether or not the fast_dedup feature is
enabled, and from what's available on disk. In this way, it's possible to
support both old and new tables.

This also provides a migration path. A legacy setup can be upgraded to
FDT by creating the DDT root ZAP, moving the existing objects into it,
and setting version and flags appropriately. There's no support for that
here, but it would be straightforward to add later and allows the
possibility that newer features could be applied to existing dedup
tables.

Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

Pull-request: #15892 part 2/3
Allan Jude
ddt: add support for prefetching tables into the ARC

This change adds a new `zpool prefetch -t ddt $pool` command which
causes a pool's DDT to be loaded into the ARC. The primary goal is to
remove the need to "warm" a pool's cache before deduplication stops
slowing write performance. It may also provide a way to reload portions
of a DDT if they have been flushed due to inactivity.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Klara, Inc.

Co-authored-by: Will Andrews <will.andrews@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>

Pull-request: #15890 part 1/1
Rob Norris
send/recv: open up additional stream feature flags

The docs for drr_versioninfo have marked the top 32 bits as "reserved"
since its introduction (illumos/illumos-gate@9e69d7d). There's no
indication of why they're reserved, so it seems uncontroversial to make
a lot more flags available.

I'm keeping the top eight reserved, and explicitly calling them out as
such, so we can extend the header further in the future if we run out of
flags or want to do some kind of change that isn't about feature flags.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #15454 part 1/1
Rob Norris
Chacha20-Poly1305 encryption

This commit implements the Chacha20-Poly1305 AEAD from RFC 8439 as a new
algorithm option for encrypted datasets.

AES (and particularly the default AES-GCM mode used in OpenZFS) is known
to be very slow on systems without hardware assistance. There are many
such machines out there that could make good use of OpenZFS, especially
low-cost machines and small boards that would otherwise make very nice
storage machines. The Raspberry Pi series of machines are a good
example.

The best option for these systems is an encryption option that performs
well in software. Chacha20-Poly1305 is the current "standard" option for
this in many contexts, and is a good choice for OpenZFS.

The core Chacha20 and Poly1305 implementations are taken from Loup
Vaillant's Monocypher. These were chosen because they are compact, easy
to read, easy to use and the author has written extensively about its
development, all of which give me confidence that there are unlikely to
be any surprises.

I've added a KCF-style module to the ICP to implement the AEAD. This
implements just enough for OpenZFS, and is not suitable as a
general-purpose KCF for Illumos (though it could be the starting point
for one).

For FreeBSD, which does not use the ICP, I've instead hooked it up to
FreeBSD's builtin crypto stack.

The rest is adding an enabling property value and a feature flag,
hooking it up to all the right touch points, and documentation updates.

The existing tests that cycle through the possible encryption options
have been extended to add one more.

I've added a test to ensure that raw receives of chacha20-poly1305
datasets do the right thing based on the state of the feature flag on
the receiving side.

There's also a test unit that runs the test vectors in RFC 8439 against
Chacha20, Poly1305 and the AEAD in the ICP that combines them. This is
most useful as a sanity check during future work to add alternate
(accelerated) implementations.

Finally, manual interop testing has been done to confirm that pools and
streams can be moved between Linux and FreeBSD correctly.

Light and uncontrolled performance testing on a Raspberry Pi 4B
(Broadcom BCM2711, no hardware AES) writing to a chacha20-poly1305
dataset was ~2.4x faster than aes-256-gcm on the same hardware. On a
Fitlet2 (Celeron J3455, AES-NI but no AVX (#10846)) it was ~1.3x faster.

Signed-off-by: Rob Norris <robn@despairlabs.com>

Pull-request: #14249 part 1/1
Brian Atkinson
Updating based on PR Feedback(3)

1. Unified the block cloning and Direct I/O code paths further. As part
  of this unification, it is important to outline that Direct I/O
  writes transition the db_state to DB_UNCACHED. This is used so that
  dbuf_unoverride() is called when dbuf_undirty() is called. This is
  needed to cleanup space accounting in a TXG. When a dbuf is redirtied
  through dbuf_redirty(), then dbuf_unoverride() is also called to
  clean up space accounting. This is a bit of a different approach than
  block cloning, which always calls dbuf_undirty().
2. As part of unifying the two, Direct I/O also performs the same check
  in dmu_buf_will_fill() so that on failure the previous contents of
  the dbuf are set correctly.
3. General code cleanup, removing checks that are no longer
  necessary.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>

Pull-request: #10018 part 5/5
Brian Atkinson
Updating based on PR Feedback(2)

Updating code based on PR code comments. I adjusted the following parts
of the code based on the comments:
1. Updated zfs_check_direct_enabled() so it now just returns an error.
  This removed the need for the added enum and cleaned up the code.
2. Moved acquiring the rangelock from zfs_fillpage() out to
  zfs_getpage(). This cleans up the code and gets rid of the need to
  pass a boolean into zfs_fillpage() to conditionally grab the
  rangelock.
3. Cleaned up the code in both zfs_uio_get_dio_pages() and
  zfs_uio_get_dio_pages_iov(). There was no need to have wanted and
  maxsize as they were the same thing. Also, since the previous
  commit cleaned up the call to zfs_uio_iov_step(), the code is much
  cleaner overall.
4. Removed dbuf_get_dirty_direct() function.
5. Unified dbuf_read() to account for both block clones and direct I/O
  writes. This removes redundant code from dbuf_read_impl() for
  grabbing the BP.
6. Removed zfs_map_page() and zfs_unmap_page() declarations from Linux
  headers as those were never called.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>

Pull-request: #10018 part 4/5
Brian Atkinson
Updating based on PR Feedback(1)

Updating code based on PR code comments. I adjusted the following parts
of code based on comments:
1.  Reverted dbuf_undirty() to its original logic and got rid of an
    unnecessary code change.
2.  Cleanup in abd_impl.h
3.  Cleanup in abd.h
4.  Got rid of duplicate declaration of dmu_buf_hold_noread() in dmu.h.
5.  Cleaned up comment for db_mtx in dmu_impl.h.
6.  Updated zfsprops man page to state correct ZFS version.
7.  Updated to correct cast in zfs_uio_page_aligned() calls to use
    uintptr_t.
8.  Cleaned up comment in FreeBSD uio code.
9.  Removed unnecessary format changes in comments in Linux abd code.
10.  Updated ZFS VFS hook for direct_IO to use PANIC().
11. Updated comment above dbuf_undirty to use double space again.
12. Converted module parameter zfs_vdev_direct_write_verify_pct to be
    OS-independent, and in doing so removed the unnecessary check for
    bounds.
13. Updated the casting in zfs_dio_page_aligned() to uintptr_t and
    added a kernel guard.
14. Updated zfs_dio_size_aligned() to use modulo math because
    dn->dn_datablksz is not required to be a power of 2.
15. Removed abd scatter stats update calls from all
    ABD_FLAG_FROM_PAGES.
16. Updated check in abd_alloc_from_pages() for the linear page. This
    way a single page, even one that is 4K, can be represented as an
    ABD_FLAG_LINEAR_PAGE.
17. Fixed types for UIO code. In FreeBSD, the vm code expects and
    returns ints for values. In Linux, get_user_pages_unlocked()
    returns a long, and the rest of the IOV interfaces return ints.
    Stuck with the worst case and used long for npages in Linux.
    Updated the uio npage struct to use the correct types so that type
    checking is consistent in the UIO code.
18. Updated comments about what zfs_uio_get_pages_alloc() is doing.
19. Updated error handling in zfs_uio_get_dio_pages_alloc() for Linux.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>

Pull-request: #10018 part 3/5
Brian Atkinson
Fixing race condition with rangelocks

There existed a race condition between when a Direct I/O write could
complete and when a sync operation was issued. This was due to the fact
that a Direct I/O write would sleep waiting on previous TXGs to sync out
their dirty records associated with a dbuf if there was an ARC buffer
associated with the dbuf. This was necessary to safely destroy the ARC
buffer in case a previous dirty record's dr_data pointed at the
db_buf. The main issue with this approach is that a Direct I/O write holds
the rangelock across the entire block, so when a sync on that same block
was issued and tried to grab the rangelock as reader, it would be
blocked indefinitely because the Direct I/O that was now sleeping was
holding that same rangelock as writer. This led to a complete deadlock.

This commit fixes this issue and removes the wait in
dmu_write_direct_done().

The way this is now handled is that the ARC buffer is destroyed, if
there is one associated with the dbuf, before ever issuing the Direct
I/O write. This implementation heavily borrows from the block cloning
implementation.

A new function dmu_buf_will_clone_or_dio() is called in both
dmu_write_direct() and dmu_brt_clone() that does the following:
1. Undirties a dirty record for that db if there is one currently
  associated with the current TXG.
2. Destroys the ARC buffer if the previous dirty record's dr_data does
  not point at the dbuf's ARC buffer (db_buf).
3. Sets the dbuf's data pointers to NULL.
4. Redirties the dbuf using db_state = DB_NOFILL.

As part of this commit, the dmu_write_direct_done() function was also
cleaned up. Now dmu_sync_done() is called before undirtying the dbuf
dirty record associated with a failed Direct I/O write. This is correct
logic and how it always should have been.

An additional benefit of these modifications is that there is no longer
a stall in a Direct I/O write if the user is mixing buffered and
O_DIRECT together. It also unifies the block cloning and Direct I/O
write paths, as they both need to call dbuf_fix_old_data() before
destroying the ARC buffer.

As part of this commit, there is also general code cleanup. Various
dbuf stats were removed because they are no longer necessary.
Additionally, useless functions were removed to make the code paths
cleaner for Direct I/O.

Below is the race condition stack trace that was being consistently
observed in the CI runs for the dio_random test case, and that prompted
these changes:
[ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than
120
seconds.
[ 9954.770512]      Tainted: P          OE    -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.773848] task:z_wr_int        state:D stack:0
pid:1051869
ppid:2      flags:0x80004080
[ 9954.775512] Call Trace:
[ 9954.776406]  __schedule+0x2d1/0x870
[ 9954.777386]  ? free_one_page+0x204/0x530
[ 9954.778466]  schedule+0x55/0xf0
[ 9954.779355]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.780491]  ? finish_wait+0x80/0x80
[ 9954.781450]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[ 9954.782889]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[ 9954.784255]  zio_done+0x373/0x1d50 [zfs]
[ 9954.785410]  zio_execute+0xee/0x210 [zfs]
[ 9954.786588]  taskq_thread+0x205/0x3f0 [spl]
[ 9954.787673]  ? wake_up_q+0x60/0x60
[ 9954.788571]  ? zio_execute_stack_check.constprop.1+0x10/0x10
[zfs]
[ 9954.790079]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 9954.791199]  kthread+0x134/0x150
[ 9954.792082]  ? set_kthread_struct+0x50/0x50
[ 9954.793189]  ret_from_fork+0x35/0x40
[ 9954.794108] INFO: task txg_sync:1051894 blocked for more than
120
seconds.
[ 9954.795535]      Tainted: P          OE    -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.798669] task:txg_sync        state:D stack:0
pid:1051894
ppid:2      flags:0x80004080
[ 9954.800267] Call Trace:
[ 9954.801096]  __schedule+0x2d1/0x870
[ 9954.801972]  ? __wake_up_common+0x7a/0x190
[ 9954.802963]  schedule+0x55/0xf0
[ 9954.803884]  schedule_timeout+0x19f/0x320
[ 9954.804837]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.805932]  ? taskq_dispatch+0xab/0x280 [spl]
[ 9954.806959]  io_schedule_timeout+0x19/0x40
[ 9954.807989]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.809110]  ? finish_wait+0x80/0x80
[ 9954.810068]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.811103]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.812255]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 9954.813442]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 9954.814648]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[ 9954.816023]  spa_sync+0x362/0x8f0 [zfs]
[ 9954.817110]  txg_sync_thread+0x27a/0x3b0 [zfs]
[ 9954.818267]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 9954.819510]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 9954.820643]  thread_generic_wrapper+0x63/0x90 [spl]
[ 9954.821709]  kthread+0x134/0x150
[ 9954.822590]  ? set_kthread_struct+0x50/0x50
[ 9954.823584]  ret_from_fork+0x35/0x40
[ 9954.824444] INFO: task fio:1055501 blocked for more than 120
seconds.
[ 9954.825781]      Tainted: P          OE    -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.828871] task:fio            state:D stack:0
pid:1055501
ppid:1055490 flags:0x00004080
[ 9954.830463] Call Trace:
[ 9954.831280]  __schedule+0x2d1/0x870
[ 9954.832159]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[ 9954.833396]  schedule+0x55/0xf0
[ 9954.834286]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.835291]  ? finish_wait+0x80/0x80
[ 9954.836235]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 9954.837543]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 9954.838838]  zfs_get_data+0x566/0x810 [zfs]
[ 9954.840034]  zil_lwb_commit+0x194/0x3f0 [zfs]
[ 9954.841154]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[ 9954.842367]  ? __list_add+0x12/0x30 [zfs]
[ 9954.843496]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.844665]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[ 9954.845852]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[ 9954.847203]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 9954.848380]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.849550]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.850640]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.851729]  do_fsync+0x38/0x70
[ 9954.852585]  __x64_sys_fsync+0x10/0x20
[ 9954.853486]  do_syscall_64+0x5b/0x1b0
[ 9954.854416]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.855466] RIP: 0033:0x7eff236bb057
[ 9954.856388] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d.
[ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057
[ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006
[ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000
[ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003
[ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8
[ 9954.866149] INFO: task fio:1055502 blocked for more than 120 seconds.
[ 9954.867490]      Tainted: P          OE    -------- -  - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.870571] task:fio            state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080
[ 9954.872162] Call Trace:
[ 9954.872947]  __schedule+0x2d1/0x870
[ 9954.873844]  schedule+0x55/0xf0
[ 9954.874716]  schedule_timeout+0x19f/0x320
[ 9954.875645]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.876722]  io_schedule_timeout+0x19/0x40
[ 9954.877677]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.878822]  ? finish_wait+0x80/0x80
[ 9954.879694]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.880763]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.881865]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 9954.883074]  dmu_write_uio_direct+0x79/0x100 [zfs]
[ 9954.884285]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[ 9954.885507]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 9954.886687]  zfs_write+0x581/0xe20 [zfs]
[ 9954.887822]  ? iov_iter_get_pages+0xe9/0x390
[ 9954.888862]  ? trylock_page+0xd/0x20 [zfs]
[ 9954.890005]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.891217]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 9954.892391]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[ 9954.893663]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.894764]  zpl_iter_write+0xd5/0x110 [zfs]
[ 9954.895911]  new_sync_write+0x112/0x160
[ 9954.896881]  vfs_write+0xa5/0x1b0
[ 9954.897701]  ksys_write+0x4f/0xb0
[ 9954.898569]  do_syscall_64+0x5b/0x1b0
[ 9954.899417]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.900515] RIP: 0033:0x7eff236baa47
[ 9954.901363] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d.
[ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47
[ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005
[ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000
[ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000
[ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8
[ 9954.911129] INFO: task fio:1055504 blocked for more than 120 seconds.
[ 9954.912381]      Tainted: P          OE    -------- -  - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.915434] task:fio            state:D stack:0 pid:1055504 ppid:1055493 flags:0x00000080
[ 9954.917082] Call Trace:
[ 9954.917773]  __schedule+0x2d1/0x870
[ 9954.918648]  ? zilog_dirty+0x4f/0xc0 [zfs]
[ 9954.919831]  schedule+0x55/0xf0
[ 9954.920717]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.921704]  ? finish_wait+0x80/0x80
[ 9954.922639]  zfs_rangelock_enter_writer+0x46/0x1c0 [zfs]
[ 9954.923940]  zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs]
[ 9954.925306]  zfs_write+0x703/0xe20 [zfs]
[ 9954.926406]  zpl_iter_write_buffered+0xb2/0x120 [zfs]
[ 9954.927687]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.928821]  zpl_iter_write+0xbe/0x110 [zfs]
[ 9954.930028]  new_sync_write+0x112/0x160
[ 9954.930913]  vfs_write+0xa5/0x1b0
[ 9954.931758]  ksys_write+0x4f/0xb0
[ 9954.932666]  do_syscall_64+0x5b/0x1b0
[ 9954.933544]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.934689] RIP: 0033:0x7fcaee8f0a47
[ 9954.935551] Code: Unable to access opcode bytes at RIP 0x7fcaee8f0a1d.
[ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fcaee8f0a47
[ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI: 0000000000000006
[ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09: 0000000000000000
[ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000001d000
[ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15: 0000557a2006bae8
[ 9954.945525] INFO: task fio:1055505 blocked for more than 120 seconds.
[ 9954.946819]      Tainted: P          OE    -------- -  - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.949959] task:fio            state:D stack:0 pid:1055505 ppid:1055493 flags:0x00004080
[ 9954.951653] Call Trace:
[ 9954.952417]  __schedule+0x2d1/0x870
[ 9954.953393]  ? finish_wait+0x3e/0x80
[ 9954.954315]  schedule+0x55/0xf0
[ 9954.955212]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.956211]  ? finish_wait+0x80/0x80
[ 9954.957159]  zil_commit_waiter+0xfa/0x3b0 [zfs]
[ 9954.958343]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.959524]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.960626]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.961763]  do_fsync+0x38/0x70
[ 9954.962638]  __x64_sys_fsync+0x10/0x20
[ 9954.963520]  do_syscall_64+0x5b/0x1b0
[ 9954.964470]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.965567] RIP: 0033:0x7fcaee8f1057
[ 9954.966490] Code: Unable to access opcode bytes at RIP 0x7fcaee8f102d.
[ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fcaee8f1057
[ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI: 0000000000000005
[ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09: 0000000000000000
[ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12: 0000000000000003
[ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15: 0000557a2006bae8
[10077.648150] INFO: task z_wr_int:1051869 blocked for more than 120 seconds.
[10077.649541]      Tainted: P          OE    -------- -  - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.652782] task:z_wr_int        state:D stack:0 pid:1051869 ppid:2      flags:0x80004080
[10077.654420] Call Trace:
[10077.655267]  __schedule+0x2d1/0x870
[10077.656179]  ? free_one_page+0x204/0x530
[10077.657192]  schedule+0x55/0xf0
[10077.658004]  cv_wait_common+0x16d/0x280 [spl]
[10077.659018]  ? finish_wait+0x80/0x80
[10077.660013]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[10077.661396]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[10077.662617]  zio_done+0x373/0x1d50 [zfs]
[10077.663783]  zio_execute+0xee/0x210 [zfs]
[10077.664921]  taskq_thread+0x205/0x3f0 [spl]
[10077.665982]  ? wake_up_q+0x60/0x60
[10077.666842]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[10077.668295]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[10077.669360]  kthread+0x134/0x150
[10077.670191]  ? set_kthread_struct+0x50/0x50
[10077.671209]  ret_from_fork+0x35/0x40
[10077.672076] INFO: task txg_sync:1051894 blocked for more than 120 seconds.
[10077.673467]      Tainted: P          OE    -------- -  - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.676612] task:txg_sync        state:D stack:0 pid:1051894 ppid:2      flags:0x80004080
[10077.678288] Call Trace:
[10077.679024]  __schedule+0x2d1/0x870
[10077.679948]  ? __wake_up_common+0x7a/0x190
[10077.681042]  schedule+0x55/0xf0
[10077.681899]  schedule_timeout+0x19f/0x320
[10077.682951]  ? __next_timer_interrupt+0xf0/0xf0
[10077.684005]  ? taskq_dispatch+0xab/0x280 [spl]
[10077.685085]  io_schedule_timeout+0x19/0x40
[10077.686080]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.687227]  ? finish_wait+0x80/0x80
[10077.688123]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.689206]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.690300]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[10077.691435]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[10077.692636]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[10077.693997]  spa_sync+0x362/0x8f0 [zfs]
[10077.695112]  txg_sync_thread+0x27a/0x3b0 [zfs]
[10077.696239]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[10077.697512]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[10077.698639]  thread_generic_wrapper+0x63/0x90 [spl]
[10077.699687]  kthread+0x134/0x150
[10077.700567]  ? set_kthread_struct+0x50/0x50
[10077.701502]  ret_from_fork+0x35/0x40
[10077.702430] INFO: task fio:1055501 blocked for more than 120 seconds.
[10077.703697]      Tainted: P          OE    -------- -  - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.706780] task:fio            state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080
[10077.708479] Call Trace:
[10077.709231]  __schedule+0x2d1/0x870
[10077.710190]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[10077.711368]  schedule+0x55/0xf0
[10077.712286]  cv_wait_common+0x16d/0x280 [spl]
[10077.713316]  ? finish_wait+0x80/0x80
[10077.714262]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[10077.715566]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[10077.716878]  zfs_get_data+0x566/0x810 [zfs]
[10077.718032]  zil_lwb_commit+0x194/0x3f0 [zfs]
[10077.719234]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[10077.720413]  ? __list_add+0x12/0x30 [zfs]
[10077.721525]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.722708]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[10077.723931]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[10077.725273]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[10077.726438]  zil_commit_impl+0x6d/0xd0 [zfs]
[10077.727586]  zfs_fsync+0x66/0x90 [zfs]
[10077.728675]  zpl_fsync+0xe5/0x140 [zfs]
[10077.729755]  do_fsync+0x38/0x70
[10077.730607]  __x64_sys_fsync+0x10/0x20
[10077.731482]  do_syscall_64+0x5b/0x1b0
[10077.732415]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.733487] RIP: 0033:0x7eff236bb057
[10077.734399] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d.
[10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057
[10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006
[10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000
[10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003
[10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8
[10077.744168] INFO: task fio:1055502 blocked for more than 120 seconds.
[10077.745505]      Tainted: P          OE    -------- -  - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.748642] task:fio            state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080
[10077.750233] Call Trace:
[10077.751011]  __schedule+0x2d1/0x870
[10077.751915]  schedule+0x55/0xf0
[10077.752811]  schedule_timeout+0x19f/0x320
[10077.753762]  ? __next_timer_interrupt+0xf0/0xf0
[10077.754824]  io_schedule_timeout+0x19/0x40
[10077.755782]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.756922]  ? finish_wait+0x80/0x80
[10077.757788]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.758845]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.759941]  dmu_write_abd+0x174/0x1c0 [zfs]
[10077.761144]  dmu_write_uio_direct+0x79/0x100 [zfs]
[10077.762327]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[10077.763523]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[10077.764749]  zfs_write+0x581/0xe20 [zfs]
[10077.765825]  ? iov_iter_get_pages+0xe9/0x390
[10077.766842]  ? trylock_page+0xd/0x20 [zfs]
[10077.767956]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.769189]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[10077.770343]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[10077.771570]  ? rrw_exit+0xc6/0x200 [zfs]
[10077.772674]  zpl_iter_write+0xd5/0x110 [zfs]
[10077.773834]  new_sync_write+0x112/0x160
[10077.774805]  vfs_write+0xa5/0x1b0
[10077.775634]  ksys_write+0x4f/0xb0
[10077.776526]  do_syscall_64+0x5b/0x1b0
[10077.777386]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.778488] RIP: 0033:0x7eff236baa47
[10077.779339] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d.
[10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47
[10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005
[10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000
[10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000
[10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>

Pull-request: #10018 part 2/5
Brian Atkinson
Adding Direct IO Support

Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.

O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other filesystems, O_DIRECT does not imply O_SYNC.
While data is written directly to VDEV disks, metadata will not be
synced until the associated TXG is synced.
For both O_DIRECT read and write requests, the offset and request
sizes must, at a minimum, be PAGE_SIZE aligned. In the event they are
not, EINVAL is returned unless the direct property is set to always
(see below).
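As a minimal sketch of the alignment rule above (the helper name is
hypothetical and this is not the actual ZFS code; PAGE_SIZE is shown
as 4096 for illustration):

```c
#include <errno.h>
#include <stdint.h>

#define PAGE_SIZE 4096	/* illustrative; the real value is platform-defined */

/*
 * Hypothetical helper: for O_DIRECT, both the offset and the request
 * size must be PAGE_SIZE aligned, otherwise EINVAL is returned
 * (the direct=always override is handled elsewhere).
 */
static int
dio_check_alignment(uint64_t offset, uint64_t size)
{
	if ((offset % PAGE_SIZE) != 0 || (size % PAGE_SIZE) != 0)
		return (EINVAL);
	return (0);
}
```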

For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
the request is block aligned and there is a cached copy of the buffer
in the ARC, the buffer will be discarded from the ARC, forcing all
further reads to retrieve the data from disk.

For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. In the event
that the requested data is buffered (in the ARC), it will just be
copied from the ARC into the user buffer.

For both O_DIRECT writes and reads, the O_DIRECT flag will be ignored
if the file's contents are mmap'ed. In this case, all requests that
are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request is not PAGE_SIZE aligned, however, EINVAL will
be returned as always, regardless of whether the file's contents are
mmap'ed.

Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Dedup
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection,
          so any data in the user buffers written directly down to
          the VDEV disks is guaranteed not to change. There is no
          concern with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
        protection. Because of this, if the user decides to
        manipulate the page contents while the write operation is
        occurring, data integrity can not be guaranteed. However,
        there is a module parameter
        `zfs_vdev_direct_write_verify_pct` that controls the
        percentage of O_DIRECT writes that can occur to a top-level
        VDEV before a checksum verify is run before the contents of
        the user buffers are committed to disk. In the event of a
        checksum verification failure the write will be redirected
        through the ARC. The default value for
        `zfs_vdev_direct_write_verify_pct` is 2 percent of Direct I/O
        writes to a top-level VDEV. The number of O_DIRECT write
        checksum verification errors can be observed by doing
        `zpool status -d`, which will list all verification errors
        that have occurred on a top-level VDEV. Along with
        `zpool status`, a ZED event will be issued as `dio_verify`
        when a checksum verification error occurs.
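One way such a percentage gate could look, purely as a sketch: with a
verify percentage of 2, roughly one in every 100/2 = 50 Direct I/O
writes to a top-level VDEV would get a checksum verify. The helper
name and per-vdev counter below are assumptions for illustration, not
the actual module logic.

```c
#include <stdint.h>

/*
 * Hypothetical sketch (not the real ZFS implementation): decide
 * whether the write_count-th O_DIRECT write to a top-level vdev
 * should have its user buffer checksum-verified before commit.
 */
static int
dio_should_verify(uint64_t write_count, int verify_pct)
{
	if (verify_pct <= 0)
		return (0);	/* verification disabled */
	/* verify_pct percent => one verify per (100 / verify_pct) writes */
	return ((write_count % (100 / verify_pct)) == 0);
}
```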

A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts the O_DIRECT flag, but silently ignores it and
           treats the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
           write/read IO requests when the O_DIRECT flag is used.
always   - Treats every write/read IO request as though it passed
           O_DIRECT, and will do O_DIRECT if the alignment
           restrictions are met; otherwise it will redirect through
           the ARC. This property will not allow a request to fail.
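For example, the property can be set and inspected with the usual
tools (the pool/dataset names here are illustrative, and these
commands of course require a live pool):

```shell
# Treat all I/O on this dataset as Direct I/O when alignment allows
zfs set direct=always tank/fio

# Confirm the current setting
zfs get direct tank/fio

# List Direct I/O checksum verification errors per top-level VDEV
zpool status -d tank
```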

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>

Pull-request: #10018 part 1/5
George Melikov
arc_hdr_authenticate: make explicit error

On compression we could be more explicit here for cases
where we cannot recompress the data.

Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>

Pull-request: #9416 part 3/3
George Melikov
ZLE compression: don't use BPE_PAYLOAD_SIZE

The ZLE compressor needs additional bytes to process the
d_len argument efficiently.
Don't use BPE_PAYLOAD_SIZE as d_len with it until the
ZLE compressor is reworked.

Signed-off-by: George Melikov <mail@gmelikov.ru>

Pull-request: #9416 part 2/3
George Melikov
zio_compress: introduce max size threshold

The default compression is now lz4, which can stop the
compression process by itself on incompressible data.
Additional size checks would only make our compressratio worse.

New usable compression thresholds are:
- less than BPE_PAYLOAD_SIZE (embedded_data feature);
- at least one saved sector.
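A minimal sketch of those two thresholds (the helper name, the
explicit parameters, and the constants below are illustrative
assumptions, not the actual zio_compress code):

```c
#include <stdint.h>

/*
 * Hypothetical sketch: a compressed result is worth keeping if it
 * either fits in a block pointer (embedded_data candidate, i.e.
 * smaller than the BPE payload size) or, after rounding up to a
 * whole sector, saves at least one full sector versus the source.
 */
static int
compress_worthwhile(uint64_t c_len, uint64_t s_len,
    uint64_t sector, uint64_t bpe_payload)
{
	if (c_len < bpe_payload)
		return (1);	/* small enough to embed in the BP */
	/* round the compressed size up to a whole sector */
	uint64_t rounded = ((c_len + sector - 1) / sector) * sector;
	return (rounded <= s_len - sector);
}
```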

The old 12.5% threshold is kept to minimize the effect
on existing user expectations of CPU utilization.

If data wasn't compressed, it will be saved as
ZIO_COMPRESS_OFF, so if we really need to recompress
data without ashift info and check anything,
we can just compress it with zero threshold.
So we don't need a new feature flag here!

Signed-off-by: George Melikov <mail@gmelikov.ru>

Pull-request: #9416 part 1/3