Changelog for kernel 6.1.92

admin-guide/hw-vuln/core-scheduling: fix return type of PR_SCHED_CORE_GET [+ + +]

Author: Thomas Weißschuh <linux@weissschuh.net>
Date:   Tue Apr 23 12:34:25 2024 +0200

    admin-guide/hw-vuln/core-scheduling: fix return type of PR_SCHED_CORE_GET
    
    commit 8af2d1ab78f2342f8c4c3740ca02d86f0ebfac5a upstream.
    
    sched_core_share_pid() copies the cookie to userspace with
    put_user(id, (u64 __user *)uaddr), expecting 64 bits of space.
    However, the "unsigned long" data type documented in core-scheduling.rst
    is only 32 bits wide on 32-bit architectures.
    
    Document "unsigned long long" as the correct data type, which is always
    64 bits wide.
    
    This matches what the selftest cs_prctl_test.c has been doing all along.
    
    Fixes: 0159bb020ca9 ("Documentation: Add usecases, design and interface for core scheduling")
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/util-linux/df7a25a0-7923-4f8b-a527-5e6f0064074d@t-8ch.de/
    Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
    Reviewed-by: Chris Hyser <chris.hyser@oracle.com>
    Signed-off-by: Jonathan Corbet <corbet@lwn.net>
    Link: https://lore.kernel.org/r/20240423-core-scheduling-cookie-v1-1-5753a35f8dfc@weissschuh.net
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
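
Illustration (not part of the upstream commit): a minimal userspace sketch of reading the cookie into a 64-bit buffer, as the corrected documentation describes. The helper name read_core_cookie() is hypothetical, and the fallback constant values are assumed to match include/uapi/linux/prctl.h.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/types.h>

/* Fallbacks for libc headers that predate core scheduling (values
 * assumed to match the kernel's include/uapi/linux/prctl.h). */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE 62
#endif
#ifndef PR_SCHED_CORE_GET
#define PR_SCHED_CORE_GET 0
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
#define PR_SCHED_CORE_SCOPE_THREAD 0
#endif

/*
 * Hypothetical helper: read the core-scheduling cookie of task `pid`
 * (0 means the calling thread). Returns 0 on success, or -1 with errno
 * set, for example when the kernel lacks core-scheduling support.
 */
int read_core_cookie(pid_t pid, unsigned long long *cookie)
{
    assert(cookie != NULL);
    /* The kernel stores 64 bits at this address, so the buffer must be
     * "unsigned long long" even on 32-bit architectures. */
    return prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET,
                 (unsigned long)pid,
                 (unsigned long)PR_SCHED_CORE_SCOPE_THREAD,
                 (unsigned long)cookie);
}
```

A caller would declare "unsigned long long cookie;" and pass its address. With an "unsigned long" buffer on a 32-bit build, the kernel's 64-bit store would run past the end of the variable.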

arm64: atomics: lse: remove stale dependency on JUMP_LABEL [+ + +]

Author: Mark Rutland <mark.rutland@arm.com>
Date:   Mon Nov 14 12:54:24 2022 +0000

    arm64: atomics: lse: remove stale dependency on JUMP_LABEL
    
    commit 657eef0a5420a02c02945ed8c87f2ddcbd255772 upstream.
    
    Currently CONFIG_ARM64_USE_LSE_ATOMICS depends upon CONFIG_JUMP_LABEL,
    as the inline atomics were indirected with a static branch.
    
    However, since commit:
    
      21fb26bfb01ffe0d ("arm64: alternatives: add alternative_has_feature_*()")
    
    ... we use an alternative_branch (which is always available) rather than
    a static branch, and hence the dependency is unnecessary.
    
    Remove the stale dependency, along with the stale include. This will
    allow the use of LSE atomics in kernels built with CONFIG_JUMP_LABEL=n,
    and reduces the risk of circular header dependencies via <asm/lse.h>.
    
    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Link: https://lore.kernel.org/r/20221114125424.2998268-1-mark.rutland@arm.com
    Signed-off-by: Will Deacon <will@kernel.org>
    Signed-off-by: Oleksandr Tymoshenko <ovt@google.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

binder: fix max_thread type inconsistency [+ + +]

Author: Carlos Llamas <cmllamas@google.com>
Date:   Sun Apr 21 17:37:49 2024 +0000

    binder: fix max_thread type inconsistency
    
    commit 42316941335644a98335f209daafa4c122f28983 upstream.
    
    The type defined for the BINDER_SET_MAX_THREADS ioctl was changed from
    size_t to __u32 in order to avoid incompatibility issues between 32 and
    64-bit kernels. However, the internal types used to copy from user and
    store the value were never updated. Use u32 to fix the inconsistency.
    
    Fixes: a9350fc859ae ("staging: android: binder: fix BINDER_SET_MAX_THREADS declaration")
    Reported-by: Arve Hjønnevåg <arve@android.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Carlos Llamas <cmllamas@google.com>
    Reviewed-by: Alice Ryhl <aliceryhl@google.com>
    Link: https://lore.kernel.org/r/20240421173750.3117808-1-cmllamas@google.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

docs: kernel_include.py: Cope with docutils 0.21 [+ + +]

Author: Akira Yokosawa <akiyks@gmail.com>
Date:   Wed May 1 12:16:11 2024 +0900

    docs: kernel_include.py: Cope with docutils 0.21
    
    commit d43ddd5c91802a46354fa4c4381416ef760676e2 upstream.
    
    Running "make htmldocs" on a newly installed Sphinx 7.3.7 ends up in
    a build error:
    
        Sphinx parallel build error:
        AttributeError: module 'docutils.nodes' has no attribute 'reprunicode'
    
    docutils 0.21 has removed nodes.reprunicode; quoting the release notes [1]:
    
      * Removed objects:
    
        docutils.nodes.reprunicode, docutils.nodes.ensure_str()
            Python 2 compatibility hacks
    
    Sphinx 7.3.0 supports docutils 0.21 [2].
    
    kernel_include.py, whose origin is misc.py of docutils, uses reprunicode.
    
    Upstream docutils removed the offending line from the corresponding file
    (docutils/docutils/parsers/rst/directives/misc.py) in January 2022.
    Quoting the changelog [3]:
    
        Deprecate `nodes.reprunicode` and `nodes.ensure_str()`.
    
        Drop uses of the deprecated constructs (not required with Python 3).
    
    Do the same for kernel_include.py.
    
    Tested against:
      - Sphinx 2.4.5 (docutils 0.17.1)
      - Sphinx 3.4.3 (docutils 0.17.1)
      - Sphinx 5.3.0 (docutils 0.18.1)
      - Sphinx 6.2.1 (docutils 0.19)
      - Sphinx 7.2.6 (docutils 0.20.1)
      - Sphinx 7.3.7 (docutils 0.21.2)
    
    Link: http://www.docutils.org/RELEASE-NOTES.html#release-0-21-2024-04-09 [1]
    Link: https://www.sphinx-doc.org/en/master/changes.html#release-7-3-0-released-apr-16-2024 [2]
    Link: https://github.com/docutils/docutils/commit/c8471ce47a24 [3]
    Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Jonathan Corbet <corbet@lwn.net>
    Link: https://lore.kernel.org/r/faf5fa45-2a9d-4573-9d2e-3930bdc1ed65@gmail.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

drm/amd/display: Fix division by zero in setup_dsc_config [+ + +]

Author: Jose Fernandez <josef@netflix.com>
Date:   Mon Apr 22 08:35:44 2024 -0600

    drm/amd/display: Fix division by zero in setup_dsc_config
    
    commit 130afc8a886183a94cf6eab7d24f300014ff87ba upstream.
    
    When slice_height is 0, the division by slice_height in the calculation
    of the number of slices will cause a division by zero driver crash. This
    leaves the kernel in a state that requires a reboot. This patch adds a
    check to avoid the division by zero.
    
    The stack trace below is for the 6.8.4 Kernel. I reproduced the issue on
    a Z16 Gen 2 Lenovo ThinkPad with an Apple Studio Display monitor
    connected via Thunderbolt. The amdgpu driver crashed with this exception
    when I rebooted the system with the monitor connected.
    
    kernel: ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
    kernel: ? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
    kernel: ? setup_dsc_config (drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dc_dsc.c:1053) amdgpu
    kernel: ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
    kernel: ? setup_dsc_config (drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dc_dsc.c:1053) amdgpu
    kernel: ? exc_divide_error (arch/x86/kernel/traps.c:194 (discriminator 2))
    kernel: ? setup_dsc_config (drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dc_dsc.c:1053) amdgpu
    kernel: ? asm_exc_divide_error (./arch/x86/include/asm/idtentry.h:548)
    kernel: ? setup_dsc_config (drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dc_dsc.c:1053) amdgpu
    kernel: dc_dsc_compute_config (drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dc_dsc.c:1109) amdgpu
    
    After applying this patch, the driver no longer crashes when the monitor
    is connected and the system is rebooted. I believe this is the same
    issue reported for 3113.
    
    Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
    Signed-off-by: Jose Fernandez <josef@netflix.com>
    Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3113
    Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: "Limonciello, Mario" <mario.limonciello@amd.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
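
Illustration (not part of the upstream commit): a minimal sketch of the kind of guard described above. The helper compute_num_slices_v() and its signature are hypothetical and are not the amdgpu code; the sketch only shows rejecting a zero slice height before dividing by it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive the vertical slice count from the picture
 * height. Returns false (configuration not usable) instead of dividing
 * when slice_height is 0.
 */
bool compute_num_slices_v(uint32_t pic_height, uint32_t slice_height,
                          uint32_t *num_slices_v)
{
    assert(num_slices_v != NULL);

    if (slice_height == 0)
        return false;   /* would otherwise raise a divide error */

    *num_slices_v = pic_height / slice_height;
    return true;
}
```

For example, a picture height of 2160 with a slice height of 108 yields 20 slices, while a slice height of 0 is reported as a failure that the caller can handle.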

drm/amdgpu: Fix possible NULL dereference in amdgpu_ras_query_error_status_helper() [+ + +]

Author: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Date:   Tue Dec 26 15:32:19 2023 +0530

    drm/amdgpu: Fix possible NULL dereference in amdgpu_ras_query_error_status_helper()
    
    commit b8d55a90fd55b767c25687747e2b24abd1ef8680 upstream.
    
    Return the error code -EINVAL for an invalid block id.
    
    Fixes the below:
    
    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1183 amdgpu_ras_query_error_status_helper() error: we previously assumed 'info' could be null (see line 1176)
    
    Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
    Cc: Tao Zhou <tao.zhou1@amd.com>
    Cc: Hawking Zhang <Hawking.Zhang@amd.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
    Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    [Ajay: applied AMDGPU_RAS_BLOCK_COUNT condition to amdgpu_ras_query_error_status()
           as amdgpu_ras_query_error_status_helper() not present in v6.6, v6.1
           amdgpu_ras_query_error_status_helper() was introduced in 8cc0f5669eb6]
    Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ice: pass VSI pointer into ice_vc_isvalid_q_id [+ + +]

Author: Jacob Keller <jacob.e.keller@intel.com>
Date:   Fri Feb 16 14:06:35 2024 -0800

    ice: pass VSI pointer into ice_vc_isvalid_q_id
    
    commit a21605993dd5dfd15edfa7f06705ede17b519026 upstream.
    
    The ice_vc_isvalid_q_id() function takes a VSI index and a queue ID. It
    looks up the VSI from its index, and then validates that the queue number
    is valid for that VSI.
    
    The VSI ID passed is typically a VSI index from the VF. This VSI number is
    validated by the PF to ensure that it matches the VSI associated with the
    VF already.
    
    In every flow where ice_vc_isvalid_q_id() is called, the PF driver already
    has a pointer to the VSI associated with the VF. This pointer is obtained
    using ice_get_vf_vsi(), rather than looking up the VSI using the index sent
    by the VF.
    
    Since we already know which VSI to operate on, we can modify
    ice_vc_isvalid_q_id() to take a VSI pointer instead of a VSI index. Pass
    the VSI we found from ice_get_vf_vsi() instead of re-doing the lookup. This
    removes some unnecessary computation and scanning of the VSI list.
    
    It also removes the last place where the driver directly used the VSI
    number from the VF. This will pave the way for refactoring to communicate
    relative VSI numbers to the VF instead of absolute numbers from the PF
    space.
    
    Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
    Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
    Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ice: remove unnecessary duplicate checks for VF VSI ID [+ + +]

Author: Jacob Keller <jacob.e.keller@intel.com>
Date:   Fri Feb 16 14:06:36 2024 -0800

    ice: remove unnecessary duplicate checks for VF VSI ID
    
    commit 363f689600dd010703ce6391bcfc729a97d21840 upstream.
    
    The ice_vc_fdir_param_check() function validates that the VSI ID of the
    virtchnl flow director command matches the VSI number of the VF. This is
    already checked by the call to ice_vc_isvalid_vsi_id() immediately
    following this.
    
    This check is unnecessary since ice_vc_isvalid_vsi_id() already confirms
    this by checking that the VSI ID can locate the VSI associated with the VF
    structure.
    
    Furthermore, a following change is going to refactor the ice driver to
    report VSI IDs using a relative index for each VF instead of reporting the
    PF VSI number. This additional check would break that logic since it
    enforces that the VSI ID matches the VSI number.
    
    Since this check duplicates the logic in ice_vc_isvalid_vsi_id() and gets
    in the way of refactoring that logic, remove it.
    
    Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
    Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
    Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

iomap: buffered write failure should not truncate the page cache [+ + +]

Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 1 11:40:53 2024 -0700

    iomap: buffered write failure should not truncate the page cache
    
    [ Upstream commit f43dc4dc3eff028b5ddddd99f3a66c5a6bdd4e78 ]
    
    iomap_file_buffered_write_punch_delalloc() currently invalidates the
    page cache over the unused range of the delalloc extent that was
    allocated. While the write allocated the delalloc extent, it does
    not own it exclusively as the write does not hold any locks that
    prevent either writeback or mmap page faults from changing the state
    of either the page cache or the extent state backing this range.
    
    Whilst xfs_bmap_punch_delalloc_range() already handles races in
    extent conversion - it will only punch out delalloc extents and it
    ignores any other type of extent - the page cache truncate does not
    discriminate between data written by this write or some other task.
    As a result, truncating the page cache can result in data corruption
    if the write races with mmap modifications to the file over the same
    range.
    
    generic/346 exercises this workload, and if we randomly fail writes
    (as will happen when iomap gets stale iomap detection later in the
    patchset), it will randomly corrupt the file data because it removes
    data written by mmap() in the same page as the write() that failed.
    
    Hence we do not want to punch out the page cache over the range of
    the extent we failed to write to - what we actually need to do is
    detect the ranges that have dirty data in cache over them and *not
    punch them out*.
    
    To do this, we have to walk the page cache over the range of the
    delalloc extent we want to remove. This is made complex by the fact
    we have to handle partially up-to-date folios correctly and this can
    happen even when the FSB size == PAGE_SIZE because we now support
    multi-page folios in the page cache.
    
    Because we are only interested in discovering the edges of data
    ranges in the page cache (i.e. hole-data boundaries) we can make use
    of mapping_seek_hole_data() to find those transitions in the page
    cache. As we hold the invalidate_lock, we know that the boundaries
    are not going to change while we walk the range. This interface is
    also byte-based and is sub-page block aware, so we can find the data
    ranges in the cache based on byte offsets rather than page, folio or
    fs block sized chunks. This greatly simplifies the logic of finding
    dirty cached ranges in the page cache.
    
    Once we've identified a range that contains cached data, we can then
    iterate the range folio by folio. This allows us to determine if the
    data is dirty and hence perform the correct delalloc extent punching
    operations. The seek interface we use to iterate data ranges will
    give us sub-folio start/end granularity, so we may end up looking up
    the same folio multiple times as the seek interface iterates across
    each discontiguous data region in the folio.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
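
Illustration (not part of the upstream commit): a simplified userspace sketch of the idea of punching out only the parts of a delalloc extent that have no dirty data in cache. The real code discovers data ranges with mapping_seek_hole_data() and inspects folio state; here the dirty ranges are simply assumed to be given as a sorted, non-overlapping array, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct byte_range {
    uint64_t start;   /* inclusive */
    uint64_t end;     /* exclusive */
};

/*
 * Compute the sub-ranges of [start, end) that are NOT covered by dirty
 * cached data and may therefore be punched out. `dirty` must be sorted
 * and non-overlapping. Stores at most out_cap ranges in `out` and
 * returns the number stored.
 */
size_t ranges_to_punch(uint64_t start, uint64_t end,
                       const struct byte_range *dirty, size_t ndirty,
                       struct byte_range *out, size_t out_cap)
{
    size_t n = 0;
    uint64_t pos = start;   /* first byte not yet classified */

    assert(out != NULL || out_cap == 0);

    for (size_t i = 0; i < ndirty && pos < end; i++) {
        uint64_t ds = dirty[i].start;
        uint64_t de = dirty[i].end < end ? dirty[i].end : end;

        if (de <= pos)
            continue;       /* dirty range lies before the cursor */
        if (ds >= end)
            break;          /* dirty range lies past the extent */
        if (ds > pos && n < out_cap)
            out[n++] = (struct byte_range){ pos, ds };
        pos = de;           /* skip the dirty bytes: they must stay */
    }
    if (pos < end && n < out_cap)
        out[n++] = (struct byte_range){ pos, end };
    return n;
}
```

With an extent covering bytes 0 to 100 and dirty data at [10, 20) and [50, 60), the sketch reports [0, 10), [20, 50) and [60, 100) as safe to punch, leaving the dirty bytes untouched.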

iomap: write iomap validity checks [+ + +]

Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 1 11:40:55 2024 -0700

    iomap: write iomap validity checks
    
    [ Upstream commit d7b64041164ca177170191d2ad775da074ab2926 ]
    
    A recent multithreaded write data corruption has been uncovered in
    the iomap write code. The core of the problem is partial folio
    writes can be flushed to disk while a new racing write can map it
    and fill the rest of the page:
    
    writeback                       new write
    
    allocate blocks
      blocks are unwritten
    submit IO
    .....
                                    map blocks
                                    iomap indicates UNWRITTEN range
                                    loop {
                                      lock folio
                                      copyin data
    .....
    IO completes
      runs unwritten extent conv
        blocks are marked written
                                      <iomap now stale>
                                      get next folio
                                    }
    
    Now add memory pressure such that memory reclaim evicts the
    partially written folio that has already been written to disk.
    
    When the new write finally gets to the last partial page of the new
    write, it does not find it in cache, so it instantiates a new page,
    sees the iomap is unwritten, and zeros the part of the page that
    it does not have data from. This overwrites the data on disk that
    was originally written.
    
    The full description of the corruption mechanism can be found here:
    
    https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
    
    To solve this problem, we need to check whether the iomap is still
    valid after we lock each folio during the write. We have to do it
    after we lock the page so that we don't end up with state changes
    occurring while we wait for the folio to be locked.
    
    Hence we need a mechanism to be able to check that the cached iomap
    is still valid (similar to what we already do in buffered
    writeback), and we need a way for ->begin_write to back out and
    tell the high level iomap iterator that we need to remap the
    remaining write range.
    
    The iomap needs to grow some storage for the validity cookie that
    the filesystem provides to travel with the iomap. XFS, in
    particular, also needs to know some more information about what the
    iomap maps (attribute extents rather than file data extents) to for
    the validity cookie to cover all the types of iomaps we might need
    to validate.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
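
Illustration (not part of the upstream commit): a toy model of the validity-cookie idea, assuming the filesystem exposes a sequence number that changes whenever the extent state changes. All types and function names here are hypothetical; the sketch only shows revalidating the cached mapping after each folio is "locked" and remapping when it has gone stale.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct extent_map { uint64_t seq; };                /* bumped on every change */
struct cached_iomap { uint64_t validity_cookie; };  /* seq sampled at map time */

static struct cached_iomap map_blocks(const struct extent_map *em)
{
    return (struct cached_iomap){ .validity_cookie = em->seq };
}

static bool iomap_still_valid(const struct cached_iomap *map,
                              const struct extent_map *em)
{
    return map->validity_cookie == em->seq;
}

/*
 * Write nfolios folios. Just before folio `change_before` is processed,
 * simulate a concurrent extent change (e.g. unwritten extent conversion).
 * Returns how many times the write had to remap.
 */
int write_folios(struct extent_map *em, int nfolios, int change_before)
{
    int remaps = 0;
    struct cached_iomap map;

    assert(em != NULL);
    map = map_blocks(em);

    for (int i = 0; i < nfolios; i++) {
        if (i == change_before)
            em->seq++;              /* racing writeback completes */

        /* folio i would be locked here; check the mapping afterwards */
        if (!iomap_still_valid(&map, em)) {
            map = map_blocks(em);   /* back out and remap the range */
            remaps++;
        }
        /* data would be copied into folio i using `map` here */
    }
    return remaps;
}
```

In this model a single extent change during a four-folio write causes exactly one remap, and a write with no concurrent change never remaps.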

KEYS: trusted: Do not use WARN when encode fails [+ + +]

Author: Jarkko Sakkinen <jarkko@kernel.org>
Date:   Mon May 13 21:19:04 2024 +0300

    KEYS: trusted: Do not use WARN when encode fails
    
    commit 050bf3c793a07f96bd1e2fd62e1447f731ed733b upstream.
    
    When asn1_encode_sequence() fails, WARN is not the correct solution.
    
    1. asn1_encode_sequence() is not an internal function (located
       in lib/asn1_encode.c).
    2. Location is known, which makes the stack trace useless.
    3. Results in a crash if panic_on_warn is set.
    
    It is also noteworthy that the use of WARN is undocumented, and it
    should be avoided unless there is a carefully considered rationale to
    use it.
    
    Replace WARN with pr_err, and print the return value instead, which is
    the only useful piece of information.
    
    Cc: stable@vger.kernel.org # v5.13+
    Fixes: f2219745250f ("security: keys: trusted: use ASN.1 TPM2 key format for the blobs")
    Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

KEYS: trusted: Fix memory leak in tpm2_key_encode() [+ + +]

Author: Jarkko Sakkinen <jarkko@kernel.org>
Date:   Mon May 20 02:31:53 2024 +0300

    KEYS: trusted: Fix memory leak in tpm2_key_encode()
    
    commit ffcaa2172cc1a85ddb8b783de96d38ca8855e248 upstream.
    
    'scratch' is never freed. Fix this by calling kfree() in both the
    success and the error case.
    
    Cc: stable@vger.kernel.org # +v5.13
    Fixes: f2219745250f ("security: keys: trusted: use ASN.1 TPM2 key format for the blobs")
    Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
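
Illustration (not part of the upstream commit): a userspace sketch of the single-exit cleanup pattern that frees a scratch buffer on both the success and the error path. malloc()/free() stand in for the kernel allocator, and encode_blob() with its allocation counter is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int live_allocs;   /* outstanding scratch buffers, for checking */

static void *scratch_alloc(size_t n)
{
    void *p = malloc(n);
    if (p)
        live_allocs++;
    return p;
}

static void scratch_free(void *p)
{
    if (p)
        live_allocs--;
    free(p);
}

int live_scratch_buffers(void)
{
    return live_allocs;
}

/*
 * Hypothetical encoder: stage `src` in a scratch buffer, then copy it to
 * `out`. Returns the number of bytes written or a negative errno value.
 * Every path after the allocation leaves through the `done` label.
 */
int encode_blob(const unsigned char *src, size_t len,
                unsigned char *out, size_t out_cap)
{
    int ret;
    unsigned char *scratch;

    assert(src != NULL && out != NULL);

    scratch = scratch_alloc(len ? len : 1);
    if (!scratch)
        return -ENOMEM;

    memcpy(scratch, src, len);
    if (len > out_cap) {
        ret = -EINVAL;          /* error path still reaches the free */
        goto done;
    }
    memcpy(out, scratch, len);
    ret = (int)len;
done:
    scratch_free(scratch);
    return ret;
}
```

After any mix of successful and failing calls, live_scratch_buffers() returns 0 in this sketch, which is the property the fix restores for 'scratch'.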

Linux: Linux 6.1.92 [+ + +]

Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Sat May 25 16:21:36 2024 +0200

    Linux 6.1.92
    
    Link: https://lore.kernel.org/r/20240523130332.496202557@linuxfoundation.org
    Tested-by: SeongJae Park <sj@kernel.org>
    Tested-by: Mark Brown <broonie@kernel.org>
    Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
    Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
    Tested-by: Pavel Machek (CIP) <pavel@denx.de>
    Tested-by: Shuah Khan <skhan@linuxfoundation.org>
    Tested-by: Jon Hunter <jonathanh@nvidia.com>
    Tested-by: Salvatore Bonaccorso <carnil@debian.org>
    Tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
    Tested-by: Ron Economos <re@w6rz.net>
    Tested-by: Kelsey Steele <kelseysteele@linux.microsoft.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mfd: stpmic1: Fix swapped mask/unmask in irq chip [+ + +]

Author: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
Date:   Sat Nov 12 15:18:32 2022 +0000

    mfd: stpmic1: Fix swapped mask/unmask in irq chip
    
    commit c79e387389d5add7cb967d2f7622c3bf5550927b upstream.
    
    The usual behavior of mask registers is writing a '1' bit to
    disable (mask) an interrupt; similarly, writing a '1' bit to
    an unmask register enables (unmasks) an interrupt.
    
    Due to a longstanding issue in regmap-irq, mask and unmask
    registers were inverted when both kinds of registers were
    present on the same chip, i.e. regmap-irq actually wrote '1's
    to the mask register to enable an IRQ and '1's to the unmask
    register to disable an IRQ.
    
    This was fixed by commit e8ffb12e7f06 ("regmap-irq: Fix
    inverted handling of unmask registers") but the fix is opt-in
    via mask_unmask_non_inverted = true because it requires manual
    changes for each affected driver. The new behavior will become
    the default once all drivers have been updated.
    
    The STPMIC1 has a normal mask register with separate set and
    clear registers. The driver intends to use the set & clear
    registers with regmap-irq and has compensated for regmap-irq's
    inverted behavior, and should currently be working properly.
    Thus, swap mask_base and unmask_base, and opt in to the new
    non-inverted behavior.
    
    Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
    Signed-off-by: Lee Jones <lee@kernel.org>
    Link: https://lore.kernel.org/r/20221112151835.39059-16-aidanmacdonald.0x0@gmail.com
    Cc: Yoann Congal <yoann.congal@smile.fr>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mmc: core: Add HS400 tuning in HS400es initialization [+ + +]

Author: Mengqi Zhang <mengqi.zhang@mediatek.com>
Date:   Mon Dec 25 17:38:40 2023 +0800

    mmc: core: Add HS400 tuning in HS400es initialization
    
    commit 77e01b49e35f24ebd1659096d5fc5c3b75975545 upstream.
    
    During the HS400es initialization stage, add an HS400 tuning flow as an
    optional process. For Mediatek IP, the HS400es mode requires a specific
    tuning to ensure the correct HS400 timing setting.
    
    Signed-off-by: Mengqi Zhang <mengqi.zhang@mediatek.com>
    Link: https://lore.kernel.org/r/20231225093839.22931-2-mengqi.zhang@mediatek.com
    Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
    Cc: "Lin Gui (桂林)" <Lin.Gui@mediatek.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

net: ks8851: Fix another TX stall caused by wrong ISR flag handling [+ + +]

Author: Ronald Wahl <ronald.wahl@raritan.com>
Date:   Mon May 13 16:39:22 2024 +0200

    net: ks8851: Fix another TX stall caused by wrong ISR flag handling
    
    commit 317a215d493230da361028ea8a4675de334bfa1a upstream.
    
    Under some circumstances it may happen that the ks8851 Ethernet driver
    stops sending data.
    
    Currently the interrupt handler resets the interrupt status flags in the
    hardware after handling TX. With this approach we may lose interrupts in
    the time window between handling the TX interrupt and resetting the TX
    interrupt status bit.
    
    When all of the three following conditions are true then transmitting
    data stops:
    
      - TX queue is stopped to wait for room in the hardware TX buffer
      - no queued SKBs in the driver (txq) waiting to be written to hw
      - hardware TX buffer is empty and the last TX interrupt was lost
    
    This is because reenabling the TX queue happens when handling the TX
    interrupt status but if the TX status bit has already been cleared then
    this interrupt will never come.
    
    With this commit the interrupt status flags will be cleared before they
    are handled. That way we stop losing interrupts.
    
    The wrong handling of the ISR flags was there from the beginning but
    with commit 3dc5d4454545 ("net: ks8851: Fix TX stall caused by TX
    buffer overrun") the issue becomes apparent.
    
    Fixes: 3dc5d4454545 ("net: ks8851: Fix TX stall caused by TX buffer overrun")
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Paolo Abeni <pabeni@redhat.com>
    Cc: Simon Horman <horms@kernel.org>
    Cc: netdev@vger.kernel.org
    Cc: stable@vger.kernel.org # 5.10+
    Signed-off-by: Ronald Wahl <ronald.wahl@raritan.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
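
Illustration (not part of the upstream commit): a small simulation of why acknowledging the interrupt status before handling it avoids losing an event. The fake_nic structure, the FAKE_IRQ_TX bit and the function name are hypothetical and do not reflect the ks8851 register layout; the second TX completion is assumed to arrive while the handler is still processing the first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_IRQ_TX 0x0001u   /* hypothetical "TX done" status bit */

struct fake_nic {
    uint16_t isr;             /* simulated interrupt status register */
};

/*
 * Simulate one pass of the interrupt handler. Returns true if the TX
 * event that arrives during handling is still pending afterwards.
 */
bool tx_irq_survives(struct fake_nic *nic, bool ack_first)
{
    uint16_t status;

    assert(nic != NULL);

    nic->isr = FAKE_IRQ_TX;   /* first TX completion raised the IRQ */
    status = nic->isr;        /* handler reads the status flags */

    if (ack_first)            /* new order: clear before handling */
        nic->isr = (uint16_t)(nic->isr & ~status);

    /* Handler processes TX; meanwhile another frame completes. */
    nic->isr |= FAKE_IRQ_TX;

    if (!ack_first)           /* old order: clears the new event too */
        nic->isr = (uint16_t)(nic->isr & ~status);

    return (nic->isr & FAKE_IRQ_TX) != 0;
}
```

With ack_first set, the second event remains pending and would trigger the handler again; with the old ordering it is wiped out, which corresponds to the lost interrupt that leaves the TX queue stopped.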

net: usb: ax88179_178a: fix link status when link is set to down/up [+ + +]

Author: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Date:   Fri May 10 11:08:28 2024 +0200

    net: usb: ax88179_178a: fix link status when link is set to down/up
    
    commit ecf848eb934b03959918f5269f64c0e52bc23998 upstream.
    
    The idea was to keep only one reset at initialization stage in order to
    reduce the total delay: either the reset from usbnet_probe or the reset
    from usbnet_open.
    
    I have seen that restarting from usbnet_probe is necessary to avoid doing
    too complex things. But when the link is set to down/up (for example to
    configure a different mac address) the link is not correctly recovered
    unless a reset is commanded from usbnet_open.
    
    So, detect the initialization stage (first call) to not reset from
    usbnet_open after the reset from usbnet_probe and after this stage, always
    reset from usbnet_open too (when the link needs to be rechecked).
    
    Apply this to all possible devices; the behavior will now be the same.
    
    cc: stable@vger.kernel.org # 6.6+
    Fixes: 56f78615bcb1 ("net: usb: ax88179_178a: avoid writing the mac address before first reading")
    Reported-by: Isaac Ganoung <inventor500@vivaldi.net>
    Reported-by: Yongqin Liu <yongqin.liu@linaro.org>
    Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Link: https://lore.kernel.org/r/20240510090846.328201-1-jtornosm@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

nfsd: don't allow nfsd threads to be signalled. [+ + +]

Author: NeilBrown <neilb@suse.de>
Date:   Tue Jul 18 16:38:08 2023 +1000

    nfsd: don't allow nfsd threads to be signalled.
    
    commit 3903902401451b1cd9d797a8c79769eb26ac7fe5 upstream.
    
    The original implementation of nfsd used signals to stop threads during
    shutdown.
    In Linux 2.3.46pre5 nfsd gained the ability to shut down threads
    internally if it was asked to run "0" threads.  After this, user-space
    transitioned to using "rpc.nfsd 0" to stop nfsd, and sending signals to
    threads was no longer an important part of the API.
    
    In commit 3ebdbe5203a8 ("SUNRPC: discard svo_setup and rename
    svc_set_num_threads_sync()") (v5.17-rc1~75^2~41) we finally removed the
    use of signals for stopping threads, using kthread_stop() instead.
    
    This patch makes the "obvious" next step and removes the ability to
    signal nfsd threads - or any svc threads.  nfsd stops allowing signals
    and we don't check for their delivery any more.
    
    This will allow for some simplification in later patches.
    
    A change worth noting is in nfsd4_ssc_setup_dul().  There was previously
    a signal_pending() check which would only succeed when the thread was
    being shut down.  It should really have tested kthread_should_stop() as
    well.  Now it just does the latter, not the former.
    
    Signed-off-by: NeilBrown <neilb@suse.de>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

pinctrl: core: handle radix_tree_insert() errors in pinctrl_register_one_pin() [+ + +]

Author: Sergey Shtylyov <s.shtylyov@omp.ru>
Date:   Wed Jul 19 23:22:52 2023 +0300

    pinctrl: core: handle radix_tree_insert() errors in pinctrl_register_one_pin()
    
    commit ecfe9a015d3e1e46504d5b3de7eef1f2d186194a upstream.
    
    pinctrl_register_one_pin() doesn't check the result of radix_tree_insert()
    even though both may return a negative error code.  Linus Walleij said he
    has copied the radix tree code from kernel/irq/ where the functions calling
    radix_tree_insert() are *void* themselves; I think it makes more sense to
    propagate the errors from radix_tree_insert() upstream if we can do that...
    
    Found by Linux Verification Center (linuxtesting.org) with the Svace static
    analysis tool.
    
    Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
    Link: https://lore.kernel.org/r/20230719202253.13469-3-s.shtylyov@omp.ru
    Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
    Cc: "Hemdan, Hagar Gamal Halim" <hagarhem@amazon.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

remoteproc: mediatek: Make sure IPI buffer fits in L2TCM [+ + +]

Author: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Date:   Thu Mar 21 09:46:13 2024 +0100

    remoteproc: mediatek: Make sure IPI buffer fits in L2TCM
    
    commit 331f91d86f71d0bb89a44217cc0b2a22810bbd42 upstream.
    
    The IPI buffer location is read from the firmware that we load to the
    System Companion Processor, and it's not guaranteed that the SRAM
    (L2TCM) size that is defined in the devicetree node is large enough
    for that. While this is especially true for multi-core SCP, it's
    still useful to check on single-core variants as well.
    
    Failing to perform this check may make this driver perform R/W
    operations out of the L2TCM boundary, resulting (at best) in a
    kernel panic.
    
    To fix that, check that the IPI buffer fits, otherwise return a
    failure and refuse to boot the relevant SCP core (or the SCP at
    all, if this is single core).
    
    Fixes: 3efa0ea743b7 ("remoteproc/mediatek: read IPI buffer offset from FW")
    Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20240321084614.45253-2-angelogioacchino.delregno@collabora.com
    Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

serial: kgdboc: Fix NMI-safety problems from keyboard reset code [+ + +]

Author: Daniel Thompson <daniel.thompson@linaro.org>
Date:   Wed Apr 24 15:21:41 2024 +0100

    serial: kgdboc: Fix NMI-safety problems from keyboard reset code
    
    commit b2aba15ad6f908d1a620fd97f6af5620c3639742 upstream.
    
    Currently, when kdb is compiled with keyboard support, then we will use
    schedule_work() to provoke reset of the keyboard status.  Unfortunately
    schedule_work() gets called from the kgdboc post-debug-exception
    handler.  That risks deadlock since schedule_work() is not NMI-safe and,
    even on platforms where the NMI is not directly used for debugging, the
    debug trap can have NMI-like behaviour depending on where breakpoints
    are placed.
    
    Fix this by using the irq work system, which is NMI-safe, to defer the
    call to schedule_work() to a point when it is safe to call.
    
    Reported-by: Liuye <liu.yeC@h3c.com>
    Closes: https://lore.kernel.org/all/20240228025602.3087748-1-liu.yeC@h3c.com/
    Cc: stable@vger.kernel.org
    Reviewed-by: Douglas Anderson <dianders@chromium.org>
    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Link: https://lore.kernel.org/r/20240424-kgdboc_fix_schedule_work-v2-1-50f5a490aec5@linaro.org
    Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

usb: dwc3: Wait unconditionally after issuing EndXfer command [+ + +]

Author: Prashanth K <quic_prashk@quicinc.com>
Date:   Thu May 2 10:11:03 2024 +0530

    usb: dwc3: Wait unconditionally after issuing EndXfer command
    
    commit 1d26ba0944d398f88aaf997bda3544646cf21945 upstream.
    
    Currently all controller IP/revisions except DWC_usb3 >= 3.10a
    wait 1ms unconditionally for ENDXFER completion when IOC is not
    set. This is because DWC_usb3 controller revisions >= 3.10a
    support the GUCTL2[14: Rst_actbitlater] bit, which allows polling
    the CMDACT bit to know whether the ENDXFER command has completed.
    
    Consider a case where an IN request was queued and, in parallel,
    soft_disconnect was called (due to ffs_epfile_release). This
    eventually calls stop_active_transfer with IOC cleared, hence
    send_gadget_ep_cmd() skips waiting for CMDACT to clear during
    EndXfer. For DWC_usb3 controllers with revisions >= 3.10a, we don't
    forcefully wait for 1ms either, and we proceed by unmapping the
    requests. If ENDXFER hasn't completed by this time, it leads to
    SMMU faults since the controller would still be accessing those
    requests.
    
    Fix this by ensuring ENDXFER completion by adding 1ms delay in
    __dwc3_stop_active_transfer() unconditionally.
    
    Cc: stable@vger.kernel.org
    Fixes: b353eb6dc285 ("usb: dwc3: gadget: Skip waiting for CMDACT cleared during endxfer")
    Signed-off-by: Prashanth K <quic_prashk@quicinc.com>
    Acked-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
    Link: https://lore.kernel.org/r/20240502044103.1066350-1-quic_prashk@quicinc.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

usb: typec: tipd: fix event checking for tps6598x [+ + +]

Author: Javier Carrasco <javier.carrasco@wolfvision.net>
Date:   Mon Apr 29 15:35:58 2024 +0200

    usb: typec: tipd: fix event checking for tps6598x
    
    commit 409c1cfb5a803f3cf2d17aeaf75c25c4be951b07 upstream.
    
    The current interrupt service routine of the tps6598x only reads the
    first 64 bits of the INT_EVENT1 and INT_EVENT2 registers, which means
    that any event above that range will be ignored, leaving interrupts
    unattended. Moreover, those events will not be cleared, and the device
    will keep the interrupt enabled.
    
    This issue has been observed while attempting to load patches, and the
    'ReadyForPatch' field (bit 81) of INT_EVENT1 was set.
    
    Given that older versions of the tps6598x (1, 2 and 6) provide 8-byte
    registers, a mechanism based on the upper byte of the version register
    (0x0F) has been included. The manufacturer has confirmed [1] that this
    byte is always 0 for older versions, and either 0xF7 (DH parts) or 0xF9
    (DK parts) is returned in newer versions (7 and 8).
    
    Read the complete INT_EVENT registers to handle all interrupts generated
    by the device and account for the hardware version to select the
    register size.
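    
    The version-based selection can be sketched in plain C as follows.
    This is an illustration only: the helper names are hypothetical, the
    version register is assumed to be read as a 32-bit value, the 11-byte
    length for newer parts is an assumption (the smallest size that covers
    bit 81), and the event buffer is assumed to be little-endian.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_INTEVENT_LEN_OLD 8   /* from the text: older parts use 8-byte registers */
#define DEMO_INTEVENT_LEN_NEW 11  /* assumption: enough bytes to reach bit 81 */

/*
 * Hypothetical helper: choose the INT_EVENT register length from the
 * upper byte of the version register.  Per the commit message that byte
 * is 0 on older parts and 0xF7 or 0xF9 on newer ones.
 */
size_t demo_intevent_len(uint32_t version)
{
	uint8_t upper = (uint8_t)(version >> 24);

	return upper == 0 ? DEMO_INTEVENT_LEN_OLD : DEMO_INTEVENT_LEN_NEW;
}

/*
 * Hypothetical helper: test one event bit in a little-endian byte buffer.
 * Bits beyond the buffer length read as 0, which is exactly how events
 * above bit 63 get lost when only 8 bytes are read.
 */
int demo_intevent_bit_set(const uint8_t *buf, size_t len, unsigned int bit)
{
	size_t byte = bit / 8;

	if (byte >= len)
		return 0;
	return (buf[byte] >> (bit % 8)) & 1;
}
```

    With an 8-byte read, demo_intevent_bit_set() can never observe bit 81;
    with the longer read it can.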
    
    Link: https://e2e.ti.com/support/power-management-group/power-management/f/power-management-forum/1346521/tps65987d-register-command-to-distinguish-between-tps6591-2-6-and-tps65987-8 [1]
    Fixes: 0a4c005bd171 ("usb: typec: driver for TI TPS6598x USB Power Delivery controllers")
    Cc: stable@vger.kernel.org
    Signed-off-by: Javier Carrasco <javier.carrasco@wolfvision.net>
    Link: https://lore.kernel.org/r/20240429-tps6598x_fix_event_handling-v3-2-4e8e58dce489@wolfvision.net
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

usb: typec: ucsi: displayport: Fix potential deadlock [+ + +]

Author: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Date:   Tue May 7 16:43:16 2024 +0300

    usb: typec: ucsi: displayport: Fix potential deadlock
    
    commit b791a67f68121d69108640d4a3e591d210ffe850 upstream.
    
    The function ucsi_displayport_work() does not access the
    connector, so it also must not acquire the connector lock.
    
    This fixes a potential deadlock scenario:
    
    ucsi_displayport_work() -> lock(&con->lock)
    typec_altmode_vdm()
    dp_altmode_vdm()
    dp_altmode_work()
    typec_altmode_enter()
    ucsi_displayport_enter() -> lock(&con->lock)
    
    Reported-by: Mathias Nyman <mathias.nyman@linux.intel.com>
    Fixes: af8622f6a585 ("usb: typec: ucsi: Support for DisplayPort alt mode")
    Cc: stable@vger.kernel.org
    Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
    Link: https://lore.kernel.org/r/20240507134316.161999-1-heikki.krogerus@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs,iomap: move delalloc punching to iomap [+ + +]

Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 1 11:40:52 2024 -0700

    xfs,iomap: move delalloc punching to iomap
    
    [ Upstream commit 9c7babf94a0d686b552e53aded8d4703d1b8b92b ]
    
    Because that's what Christoph wants for this error handling path
    only XFS uses.
    
    It requires a new iomap export for handling errors over delalloc
    ranges. This is basically the XFS code as it stands, but even though
    Christoph wants this as iomap functionality, we still have
    to call it from the filesystem specific ->iomap_end callback, and
    call into the iomap code with yet another filesystem specific
    callback to punch the delalloc extent within the defined ranges.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: allow inode inactivation during a ro mount log recovery [+ + +]

Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed May 1 11:41:09 2024 -0700

    xfs: allow inode inactivation during a ro mount log recovery
    
    [ Upstream commit 76e589013fec672c3587d6314f2d1f0aeddc26d9 ]
    
    In the next patch, we're going to prohibit log recovery if the primary
    superblock contains an unrecognized rocompat feature bit even on
    readonly mounts.  This requires removing all the code in the log
    mounting process that temporarily disables the readonly state.
    
    Unfortunately, inode inactivation disables itself on readonly mounts.
    Clearing the iunlinked lists after log recovery needs inactivation to
    run to free the unreferenced inodes, which (AFAICT) is the only reason
    why log mounting plays games with the readonly state in the first place.
    
    Therefore, change the inactivation predicates to allow inactivation
    during log recovery of a readonly mount.
    
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: attach dquots to inode before reading data/cow fork mappings [+ + +]

Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed May 1 11:41:03 2024 -0700

    xfs: attach dquots to inode before reading data/cow fork mappings
    
    [ Upstream commit 4c6dbfd2756bd83a0085ed804e2bb7be9cc16bc5 ]
    
    I've been running near-continuous integration testing of online fsck,
    and I've noticed that once a day, one of the ARM VMs will fail the test
    with out of order records in the data fork.
    
    xfs/804 races fsstress with online scrub (aka scan but do not change
    anything), so I think this might be a bug in the core xfs code.  This
    also only seems to trigger if one runs the test for more than ~6 minutes
    via TIME_FACTOR=13 or something.
    https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf
    
    I added a debugging patch to the kernel to check the data fork extents
    after taking the ILOCK, before dropping ILOCK, and before and after each
    bmapping operation.  So far I've narrowed it down to the delalloc code
    inserting a record in the wrong place in the iext tree:
    
    xfs_bmap_add_extent_hole_delay, near line 2691:
    
            case 0:
                    /*
                     * New allocation is not contiguous with another
                     * delayed allocation.
                     * Insert a new entry.
                     */
                    oldlen = newlen = 0;
                    xfs_iunlock_check_datafork(ip);         <-- ok here
                    xfs_iext_insert(ip, icur, new, state);
                    xfs_iunlock_check_datafork(ip);         <-- bad here
                    break;
            }
    
    I recorded the state of the data fork mappings and iext cursor state
    when a corrupt data fork is detected immediately after the
    xfs_bmap_add_extent_hole_delay call in xfs_bmapi_reserve_delalloc:
    
    ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork:
        ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1
        ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0
        ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0
    
    Here we see that a delalloc extent was inserted into the wrong position
    in the iext leaf, same as all the other times.  The extra trace data I
    collected are as follows:
    
    ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000
        ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0
        ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0
        ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0
        ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0
        ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00
        ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00
        ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00
        ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00
    
    The first line shows that xfs_bmapi_reserve_delalloc was called with
    whichfork=XFS_DATA_FORK, off=0xe6, len=0x4, prealloc=6.
    
    The second line ("oldgot") shows the contents of @got at the beginning
    of the call, which are the results of the first iext lookup in
    xfs_buffered_write_iomap_begin.
    
    Line 3 ("crapgot") is the result of duplicating the cursor at the start
    of the body of xfs_bmapi_reserve_delalloc and performing a fresh lookup
    at @off.
    
    Line 4 ("freshgot") is the result of a new xfs_iext_get_extent right
    before the call to xfs_bmap_add_extent_hole_delay.  Totally garbage.
    
    Line 5 ("nowgot") is the contents of @got after the
    xfs_bmap_add_extent_hole_delay call.
    
    Line 6 is the contents of @icur at the beginning of the call.  Lines 7-9
    are the contents of the iext cursors at the point where the block
    mappings were sampled.
    
    I think @oldgot is a HOLESTARTBLOCK extent because the first lookup
    didn't find anything, so we filled in imap with "fake hole until the
    end".  At the time of the first lookup, I suspect that there's only one
    32-block unwritten extent in the mapping (hence oldicurpos==1) but by
    the time we get to recording crapgot, crapicurpos==2.
    
    Dave then added:
    
    Ok, that's much simpler to reason about, and implies the smoke is
    coming from xfs_buffered_write_iomap_begin() or
    xfs_bmapi_reserve_delalloc(). I suspect the former - it does a lot
    of stuff with the ILOCK_EXCL held.....
    
    .... including calling xfs_qm_dqattach_locked().
    
    xfs_buffered_write_iomap_begin
      ILOCK_EXCL
      look up icur
      xfs_qm_dqattach_locked
        xfs_qm_dqattach_one
          xfs_qm_dqget_inode
            dquot cache miss
            xfs_iunlock(ip, XFS_ILOCK_EXCL);
            error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp);
            xfs_ilock(ip, XFS_ILOCK_EXCL);
      ....
      xfs_bmapi_reserve_delalloc(icur)
    
    Yup, that's what is letting the magic smoke out -
    xfs_qm_dqattach_locked() can cycle the ILOCK. If that happens, we
    can pass a stale icur to xfs_bmapi_reserve_delalloc() and it all
    goes downhill from there.
    
    Back to Darrick now:
    
    So.  Fix this by moving the dqattach_locked call up before we take the
    ILOCK, like all the other callers in that file.
    
    Fixes: a526c85c2236 ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: drop write error injection is unfixable, remove it [+ + +]

Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 1 11:40:57 2024 -0700

    xfs: drop write error injection is unfixable, remove it
    
    [ Upstream commit 6e8af15ccdc4e138a5b529c1901a0013e1dcaa09 ]
    
    With the changes to scan the page cache for dirty data to avoid data
    corruptions from partial write cleanup racing with other page cache
    operations, the drop writes error injection no longer works the same
    way it used to and causes xfs/196 to fail. This is because xfs/196
    writes to the file and populates the page cache before it turns on
    the error injection and starts failing -overwrites-.
    
    The result is that the original drop-writes code failed writes only
    -after- overwriting the data in the cache, then invalidated the
    cached data, and then punched out the delalloc extent from under
    that data.
    
    On the surface, this looks fine. The problem is that page cache
    invalidation *doesn't guarantee that it removes anything from the
    page cache* and it doesn't change the dirty state of the folio. When
    block size == page size and we do page aligned IO (as xfs/196 does)
    everything happens to align perfectly and page cache invalidation
    removes the single page folios that span the written data. Hence the
    followup delalloc punch pass does not find cached data over that
    range and it can punch the extent out.
    
    IOWs, xfs/196 "works" for block size == page size with the new
    code. I say "works", because it actually only works for the case
    where IO is page aligned, and no data was read from disk before
    writes occur. Because the moment we actually read data first, the
    readahead code allocates multipage folios and suddenly the
    invalidate code goes back to zeroing subfolio ranges without
    changing dirty state.
    
    Hence, with multipage folios in play, block size == page size is
    functionally identical to block size < page size behaviour, and
    drop-writes is manifestly broken w.r.t. this case. Invalidation of
    a subfolio range doesn't result in the folio being removed from the
    cache, just the range gets zeroed. Hence after we've sequentially
    walked over a folio that we've dirtied (via write data) and then
    invalidated, we end up with a dirty folio full of zeroed data.
    
    And because the new code skips punching ranges that have dirty
    folios covering them, we end up leaving the delalloc range intact
    after failing all the writes. Hence failed writes now end up
    writing zeroes to disk in the cases where invalidation zeroes folios
    rather than removing them from cache.
    
    This is a fundamental change of behaviour that is needed to avoid
    the data corruption vectors that exist in the old write fail path,
    and it renders the drop-writes injection non-functional and
    unworkable as it stands.
    
    As it is, I think the error injection is also now unnecessary, as
    partial writes that need delalloc extent are going to be a lot more
    common with stale iomap detection in place. Hence this patch removes
    the drop-writes error injection completely. xfs/196 can remain for
    testing kernels that don't have this data corruption fix, but those
    that do will report:
    
    xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: estimate post-merge refcounts correctly [+ + +]

Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed May 1 11:41:07 2024 -0700

    xfs: estimate post-merge refcounts correctly
    
    [ Upstream commit b25d1984aa884fc91a73a5a407b9ac976d441e9b ]
    
    Upon enabling fsdax + reflink for XFS, xfs/179 began to report refcount
    metadata corruptions after being run.  Specifically, xfs_repair noticed
    single-block refcount records that could be combined but had not been.
    
    The root cause of this is improper MAXREFCOUNT edge case handling in
    xfs_refcount_merge_extents.  When we're trying to find candidates for a
    refcount btree record merge, we compute the refcount attribute of the
    merged record, but we fail to account for the fact that once a record
    hits rc_refcount == MAXREFCOUNT, it is pinned that way forever.  Hence
    the computed refcount is wrong, and we fail to merge the extents.
    
    Fix this by adjusting the merge predicates to compute the adjusted
    refcount correctly.
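    
    The pinning behaviour can be sketched in user-space C.  This is an
    illustration under stated assumptions, not the kernel's code:
    DEMO_MAXREFCOUNT is a stand-in constant and demo_merged_refcount() is a
    hypothetical name.

```c
#include <stdint.h>

/* Illustrative stand-in for the on-disk maximum refcount. */
#define DEMO_MAXREFCOUNT UINT32_MAX

/*
 * Hypothetical helper: compute the refcount a record would have after
 * applying adjust (+1 or -1).  A record that has already reached the
 * maximum stays pinned there, so the adjustment is ignored.
 */
uint32_t demo_merged_refcount(uint32_t refcount, int adjust)
{
	if (refcount == DEMO_MAXREFCOUNT)
		return DEMO_MAXREFCOUNT;
	/* unsigned arithmetic: adding (uint32_t)-1 subtracts one */
	return refcount + (uint32_t)adjust;
}
```

    A merge predicate that computed refcount + adjust without the pinning
    check would see a value one less than the pinned maximum for a record
    being decremented, and so would compare it wrongly against its
    neighbours.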
    
    Fixes: 3172725814f9 ("xfs: adjust refcount of an extent of blocks in refcount btree")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: fix incorrect error-out in xfs_remove [+ + +]

Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed May 1 11:40:59 2024 -0700

    xfs: fix incorrect error-out in xfs_remove
    
    [ Upstream commit 2653d53345bda90604f673bb211dd060a5a5c232 ]
    
    Clean up resources if resetting the dotdot entry doesn't succeed.
    Observed through code inspection.
    
    Fixes: 5838d0356bb3 ("xfs: reset child dir '..' entry when unlinking child")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: fix incorrect i_nlink caused by inode racing [+ + +]

Author: Long Li <leo.lilong@huawei.com>
Date:   Wed May 1 11:41:01 2024 -0700

    xfs: fix incorrect i_nlink caused by inode racing
    
    [ Upstream commit 28b4b0596343d19d140da059eee0e5c2b5328731 ]
    
    The following error occurred during the fsstress test:
    
    XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452
    
    The problem is that an inode race condition causes an incorrect i_nlink
    to be written to disk and then read back into memory. Consider the
    following call graph: for inodes that are marked as both XFS_IFLUSHING
    and XFS_IRECLAIMABLE, i_nlink will be reset to 1 and then restored to its
    original value in xfs_reinit_inode(). Therefore, the i_nlink of a
    directory on disk may be set to 1.
    
      xfsaild
          xfs_inode_item_push
              xfs_iflush_cluster
                  xfs_iflush
                      xfs_inode_to_disk
    
      xfs_iget
          xfs_iget_cache_hit
              xfs_iget_recycle
                  xfs_reinit_inode
                      inode_init_always
    
    xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal
    inode state and can race with other RCU protected inode lookups. On the
    read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu +
    ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from
    racing inode updates (during transactions) by that lock.
    
    Fixes: ff7bebeb91f8 ("xfs: refactor the inode recycling code") # goes further back than this
    Signed-off-by: Long Li <leo.lilong@huawei.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: fix log recovery when unknown rocompat bits are set [+ + +]

Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed May 1 11:41:10 2024 -0700

    xfs: fix log recovery when unknown rocompat bits are set
    
    [ Upstream commit 74ad4693b6473950e971b3dc525b5ee7570e05d0 ]
    
    Log recovery has always run on read only mounts, even where the primary
    superblock advertises unknown rocompat bits.  Due to a misunderstanding
    between Eric and Darrick back in 2018, we accidentally changed the
    superblock write verifier to shutdown the fs over that exact scenario.
    As a result, the log cleaning that occurs at the end of the mounting
    process fails if there are unknown rocompat bits set.
    
    As we now allow writing of the superblock if there are unknown rocompat
    bits set on a RO mount, we no longer want to turn off RO state to allow
    log recovery to succeed on a RO mount.  Hence we also remove all the
    (now unnecessary) RO state toggling from the log recovery path.
    
    Fixes: 9e037cb7972f ("xfs: check for unknown v5 feature bits in superblock write verifier")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: fix off-by-one-block in xfs_discard_folio() [+ + +]

Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 1 11:40:58 2024 -0700

    xfs: fix off-by-one-block in xfs_discard_folio()
    
    [ Upstream commit 8ac5b996bf5199f15b7687ceae989f8b2a410dda ]
    
    The recent writeback corruption fixes changed the code in
    xfs_discard_folio() to calculate a byte range for punching
    delalloc extents. A mistake was made in using round_up(pos) for the
    end offset, because when pos points at the first byte of a block, it
    does not get rounded up to point to the end byte of the block. Hence
    the punch range is short, and this leads to unexpected behaviour in
    certain cases in xfs_bmap_punch_delalloc_range.
    
    e.g. pos = 0 means we call xfs_bmap_punch_delalloc_range(0,0), so
    there is no previous extent and it rounds up the punch to the end of
    the delalloc extent it found at offset 0, not the end of the range
    given to xfs_bmap_punch_delalloc_range().
    
    Fix this by handling the zero block offset case correctly.
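    
    The arithmetic behind the off-by-one can be shown with a small
    user-space sketch.  Both helpers are hypothetical: demo_round_up() is
    modelled on a power-of-two round-up, and demo_block_end() is just one
    way to compute the end of the block containing pos.  It is not
    necessarily what the actual patch does.

```c
#include <stdint.h>

/* Round x up to a multiple of align (align must be a power of two). */
uint64_t demo_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * End offset of the block that contains pos: round down to the block
 * start, then add one block.  Unlike demo_round_up(), this never
 * returns pos itself when pos is already block aligned.
 */
uint64_t demo_block_end(uint64_t pos, uint64_t blksz)
{
	return (pos & ~(blksz - 1)) + blksz;
}
```

    For a 4096-byte block, demo_round_up(0, 4096) is 0, giving the empty
    range (0,0) described above, while demo_block_end(0, 4096) is 4096.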
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=217030
    Link: https://lore.kernel.org/linux-xfs/Y+vOfaxIWX1c%2Fyy9@bfoster/
    Fixes: 7348b322332d ("xfs: xfs_bmap_punch_delalloc_range() should take a byte range")
    Reported-by: Pengfei Xu <pengfei.xu@intel.com>
    Found-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: fix sb write verify for lazysbcount [+ + +]

Author: Long Li <leo.lilong@huawei.com>
Date:   Wed May 1 11:41:00 2024 -0700

    xfs: fix sb write verify for lazysbcount
    
    [ Upstream commit 59f6ab40fd8735c9a1a15401610a31cc06a0bbd6 ]
    
    When lazysbcount is enabled, fsstress and loop mount/unmount test report
    the following problems:
    
    XFS (loop0): SB summary counter sanity check failed
    XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
            xfs_sb block 0x0
    XFS (loop0): Unmount and run xfs_repair
    XFS (loop0): First 128 bytes of corrupted metadata buffer:
    00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
    00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
    00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
    00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
    00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
    00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
    00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
    XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
            +0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
    XFS (loop0): Please unmount the filesystem and rectify the problem(s)
    XFS (loop0): log mount/recovery failed: error -117
    XFS (loop0): log mount failed
    
    This corruption will shut down the file system and the file system will
    no longer be mountable. The following script can reproduce the problem,
    but it may take a long time.
    
     #!/bin/bash
    
     device=/dev/sda
     testdir=/mnt/test
     round=0
    
     function fail()
     {
             echo "$*"
             exit 1
     }
    
     mkdir -p $testdir
     while [ $round -lt 10000 ]
     do
             echo "******* round $round ********"
             mkfs.xfs -f $device
             mount $device $testdir || fail "mount failed!"
             fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
             sleep 4
             killall -w fsstress
             umount $testdir
             xfs_repair -e $device > /dev/null
             if [ $? -eq 2 ];then
                     echo "ERR CODE 2: Dirty log exception during repair."
                     exit 1
             fi
             round=$(($round+1))
     done
    
    With lazysbcount enabled, there is no additional lock protection for
    reading m_ifree and m_icount in xfs_log_sb(). If another CPU modifies
    m_ifree, this can make m_ifree greater than m_icount. For example,
    consider the following sequence, where ifreedelta is positive:
    
     CPU0                            CPU1
     xfs_log_sb                      xfs_trans_unreserve_and_mod_sb
     ----------                      ------------------------------
     percpu_counter_sum(&mp->m_icount)
                                     percpu_counter_add_batch(&mp->m_icount,
                                                    idelta, XFS_ICOUNT_BATCH)
                                     percpu_counter_add(&mp->m_ifree, ifreedelta);
     percpu_counter_sum(&mp->m_ifree)
    
    After this, an incorrect inode count (sb_ifree > sb_icount) will be written
    to the log. On the subsequent write of the sb, the incorrect inode count
    (sb_ifree > sb_icount) will fail the boundary check in
    xfs_validate_sb_write(), which causes the file system to shut down.
    
    When lazysbcount is enabled, we don't need to guarantee that Lazy sb
    counters are completely correct, but we do need to guarantee that sb_ifree
    <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount
    must be satisfied any time that there /cannot/ be other threads allocating
    or freeing inode chunks. If the constraint is violated under these
    circumstances, sb_i{count,free} (the ondisk superblock inode counters)
    may be incorrect and need to be marked sick at unmount; the count will
    be rebuilt on the next mount.
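    
    One way to keep the sb_ifree <= sb_icount guarantee when the two
    counters are sampled without a common lock is to clamp the second
    sample against the first.  The sketch below is a user-space model under
    that assumption; struct demo_sb and demo_log_sb() are hypothetical
    names, and the two sums stand in for the percpu counter reads shown in
    the sequence above.

```c
#include <stdint.h>

/* Minimal stand-in for the two superblock inode counters. */
struct demo_sb {
	uint64_t icount;
	uint64_t ifree;
};

/*
 * Hypothetical model of logging the superblock counters: icount_sum and
 * ifree_sum were sampled at different times, so ifree_sum may exceed
 * icount_sum.  Clamping keeps the logged values self-consistent even if
 * they are not exact.
 */
void demo_log_sb(struct demo_sb *sb, uint64_t icount_sum, uint64_t ifree_sum)
{
	sb->icount = icount_sum;
	sb->ifree = ifree_sum < icount_sum ? ifree_sum : icount_sum;
}
```

    The clamped value is only approximately right, which matches the point
    above that lazy counters need not be completely correct as long as the
    ordering constraint holds.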
    
    Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
    Signed-off-by: Long Li <leo.lilong@huawei.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: fix super block buf log item UAF during force shutdown [+ + +]

Author: Guo Xuenan <guoxuenan@huawei.com>
Date:   Wed May 1 11:41:05 2024 -0700

    xfs: fix super block buf log item UAF during force shutdown
    
    [ Upstream commit 575689fc0ffa6c4bb4e72fd18e31a6525a6124e0 ]
    
    An xfs log I/O error will trigger xlog shutdown, and the end_io worker
    calls xlog_state_shutdown_callbacks to unpin and release the buf log
    item. The race condition is that when some thread is doing a transaction
    commit and happens not to be intercepted by xlog_is_shutdown, its log
    items will be inserted into the CIL; when these buf log items are then
    unpinned and released, a UAF will occur. BTW, adding a delay before
    `xlog_cil_commit` can increase the probability of reproducing this.
    
    The following call graph actually encountered this bad situation.
    fsstress                    io end worker kworker/0:1H-216
                                xlog_ioend_work
                                  ->xlog_force_shutdown
                                    ->xlog_state_shutdown_callbacks
                                      ->xlog_cil_process_committed
                                        ->xlog_cil_committed
                                          ->xfs_trans_committed_bulk
    ->xfs_trans_apply_sb_deltas             ->li_ops->iop_unpin(lip, 1);
      ->xfs_trans_getsb
        ->_xfs_trans_bjoin
          ->xfs_buf_item_init
            ->if (bip) { return 0;} //relog
    ->xlog_cil_commit
      ->xlog_cil_insert_items //insert into CIL
                                               ->xfs_buf_ioend_fail(bp);
                                                 ->xfs_buf_ioend
                                                   ->xfs_buf_item_done
                                                     ->xfs_buf_item_relse
                                                       ->xfs_buf_item_free
    
    When the CIL push worker gathers the percpu CIL and inserts the super
    block buf log item into ctx->log_items, the UAF occurs.
    
    ==================================================================
    BUG: KASAN: use-after-free in xlog_cil_push_work+0x1c8f/0x22f0
    Write of size 8 at addr ffff88801800f3f0 by task kworker/u4:4/105
    
    CPU: 0 PID: 105 Comm: kworker/u4:4 Tainted: G W
    6.1.0-rc1-00001-g274115149b42 #136
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    1.13.0-1ubuntu1.1 04/01/2014
    Workqueue: xfs-cil/sda xlog_cil_push_work
    Call Trace:
     <TASK>
     dump_stack_lvl+0x4d/0x66
     print_report+0x171/0x4a6
     kasan_report+0xb3/0x130
     xlog_cil_push_work+0x1c8f/0x22f0
     process_one_work+0x6f9/0xf70
     worker_thread+0x578/0xf30
     kthread+0x28c/0x330
     ret_from_fork+0x1f/0x30
     </TASK>
    
    Allocated by task 2145:
     kasan_save_stack+0x1e/0x40
     kasan_set_track+0x21/0x30
     __kasan_slab_alloc+0x54/0x60
     kmem_cache_alloc+0x14a/0x510
     xfs_buf_item_init+0x160/0x6d0
     _xfs_trans_bjoin+0x7f/0x2e0
     xfs_trans_getsb+0xb6/0x3f0
     xfs_trans_apply_sb_deltas+0x1f/0x8c0
     __xfs_trans_commit+0xa25/0xe10
     xfs_symlink+0xe23/0x1660
     xfs_vn_symlink+0x157/0x280
     vfs_symlink+0x491/0x790
     do_symlinkat+0x128/0x220
     __x64_sys_symlink+0x7a/0x90
     do_syscall_64+0x35/0x80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
    
    Freed by task 216:
     kasan_save_stack+0x1e/0x40
     kasan_set_track+0x21/0x30
     kasan_save_free_info+0x2a/0x40
     __kasan_slab_free+0x105/0x1a0
     kmem_cache_free+0xb6/0x460
     xfs_buf_ioend+0x1e9/0x11f0
     xfs_buf_item_unpin+0x3d6/0x840
     xfs_trans_committed_bulk+0x4c2/0x7c0
     xlog_cil_committed+0xab6/0xfb0
     xlog_cil_process_committed+0x117/0x1e0
     xlog_state_shutdown_callbacks+0x208/0x440
     xlog_force_shutdown+0x1b3/0x3a0
     xlog_ioend_work+0xef/0x1d0
     process_one_work+0x6f9/0xf70
     worker_thread+0x578/0xf30
     kthread+0x28c/0x330
     ret_from_fork+0x1f/0x30
    
    The buggy address belongs to the object at ffff88801800f388
     which belongs to the cache xfs_buf_item of size 272
    The buggy address is located 104 bytes inside of
     272-byte region [ffff88801800f388, ffff88801800f498)
    
    The buggy address belongs to the physical page:
    page:ffffea0000600380 refcount:1 mapcount:0 mapping:0000000000000000
    index:0xffff88801800f208 pfn:0x1800e
    head:ffffea0000600380 order:1 compound_mapcount:0 compound_pincount:0
    flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
    raw: 001fffff80010200 ffffea0000699788 ffff88801319db50 ffff88800fb50640
    raw: ffff88801800f208 000000000015000a 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected
    
    Memory state around the buggy address:
     ffff88801800f280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
     ffff88801800f300: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff88801800f380: fc fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                                 ^
     ffff88801800f400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
     ffff88801800f480: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
    ==================================================================
    Disabling lock debugging due to kernel taint
    
    Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: get root inode correctly at bulkstat [+ + +]

Author: Hironori Shiina <shiina.hironori@gmail.com>
Date:   Wed May 1 11:41:11 2024 -0700

    xfs: get root inode correctly at bulkstat
    
    [ Upstream commit 817644fa4525258992f17fecf4f1d6cdd2e1b731 ]
    
    The root inode number should be set to `breq->startino` for getting stat
    information of the root when XFS_BULK_IREQ_SPECIAL_ROOT is used.
    Otherwise, the inode search is started from 1
    (XFS_BULK_IREQ_SPECIAL_ROOT) and the inode with the lowest number in a
    filesystem is returned.
    
    Fixes: bf3cb3944792 ("xfs: allow single bulkstat of special inodes")
    Signed-off-by: Hironori Shiina <shiina.hironori@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
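
The root-inode substitution described above can be sketched in user-space C. This is a simplified model, not the kernel code: `struct bulk_req`, `resolve_startino()` and the `BULK_IREQ_*` names are hypothetical stand-ins, and the flag bit value is illustrative. Only the magic value 1 for the special root comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel definitions. The flag bit is
 * illustrative; SPECIAL_ROOT == 1 is the value named in the commit text. */
#define BULK_IREQ_SPECIAL       (1u << 1)
#define BULK_IREQ_SPECIAL_ROOT  1

struct bulk_req {
    uint32_t flags;   /* request flags supplied by userspace */
    uint64_t ino;     /* inode number (or magic value) supplied by userspace */
};

/* Resolve the inode number at which the stat lookup should start. */
static uint64_t resolve_startino(const struct bulk_req *req, uint64_t rootino)
{
    /* Translate the magic "special root" value into the real root inode
     * number. Without this step the search would start at inode 1 and
     * return whichever inode has the lowest number in the filesystem. */
    if ((req->flags & BULK_IREQ_SPECIAL) &&
        req->ino == BULK_IREQ_SPECIAL_ROOT)
        return rootino;

    /* Ordinary requests keep the caller's starting inode. */
    return req->ino;
}
```

With a root inode number of 128, a special-root request resolves to 128, while a plain request for inode 1 still starts at 1.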

xfs: hoist refcount record merge predicates [+ + +]

Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed May 1 11:41:06 2024 -0700

    xfs: hoist refcount record merge predicates
    
    [ Upstream commit 9d720a5a658f5135861773f26e927449bef93d61 ]
    
    Hoist these multiline conditionals into separate static inline helpers
    to improve readability and set the stage for corruption fixes that will
    be introduced in the next patch.
    
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: invalidate block device page cache during unmount [+ + +]

Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed May 1 11:41:02 2024 -0700

    xfs: invalidate block device page cache during unmount
    
    [ Upstream commit 032e160305f6872e590c77f11896fb28365c6d6c ]
    
    Every now and then I see fstests failures on aarch64 (64k pages) that
    trigger on the following sequence:
    
    mkfs.xfs $dev
    mount $dev $mnt
    touch $mnt/a
    umount $mnt
    xfs_db -c 'path /a' -c 'print' $dev
    
    99% of the time this succeeds, but every now and then xfs_db cannot find
    /a and fails.  This turns out to be a race involving udev/blkid, the
    page cache for the block device, and the xfs_db process.
    
    udev is triggered whenever anyone closes a block device or unmounts it.
    The default udev rules invoke blkid to read the fs super and create
    symlinks to the bdev under /dev/disk.  For this, it uses buffered reads
    through the page cache.
    
    xfs_db also uses buffered reads to examine metadata.  There is no
    coordination between xfs_db and udev, which means that they can run
    concurrently.  Note there is no coordination between the kernel and
    blkid either.
    
    On a system with 64k pages, the page cache can cache the superblock and
    the root inode (and hence the root dir) with the same 64k page.  If
    udev spawns blkid after the mkfs and the system is busy enough that it
    is still running when xfs_db starts up, they'll both read from the same
    page in the pagecache.
    
    The unmount writes updated inode metadata to disk directly.  The XFS
    buffer cache does not use the bdev pagecache, nor does it invalidate the
    pagecache on umount.  If the above scenario occurs, the pagecache no
    longer reflects what's on disk, xfs_db reads the stale metadata, and
    fails to find /a.  Most of the time this succeeds because closing a bdev
    invalidates the page cache, but when processes race, everyone loses.
    
    Fix the problem by invalidating the bdev pagecache after flushing the
    bdev, so that xfs_db will see up to date metadata.
    
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: invalidate xfs_bufs when allocating cow extents [+ + +]

Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed May 1 11:41:08 2024 -0700

    xfs: invalidate xfs_bufs when allocating cow extents
    
    [ Upstream commit ddfdd530e43fcb3f7a0a69966e5f6c33497b4ae3 ]
    
    While investigating test failures in xfs/17[1-3] in alwayscow mode, I
    noticed through code inspection that xfs_bmap_alloc_userdata isn't
    setting XFS_ALLOC_USERDATA when allocating extents for a file's CoW
    fork.  COW staging extents should be flagged as USERDATA, since user
    data are persisted to these blocks before being remapped into a file.
    
    This mis-classification has a few impacts on the behavior of the system.
    First, the filestreams allocator is supposed to keep allocating from a
    chosen AG until it runs out of space in that AG.  However, it only does
    that for USERDATA allocations, which means that COW allocations aren't
    tied to the filestreams AG.  Fortunately, few people use filestreams, so
    nobody's noticed.
    
    A more serious problem is that xfs_alloc_ag_vextent_small looks for a
    buffer to invalidate *if* the USERDATA flag is set and the AG is so full
    that the allocation had to come from the AGFL because the cntbt is
    empty.  The consequences of not invalidating the buffer are severe --
    if the AIL incorrectly checkpoints a buffer that is now being used to
    store user data, that action will clobber the user's written data.
    
    Fix filestreams and yet another data corruption vector by flagging COW
    allocations as USERDATA.
    
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: punching delalloc extents on write failure is racy [+ + +]

Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 1 11:40:50 2024 -0700

    xfs: punching delalloc extents on write failure is racy
    
    [ Upstream commit 198dd8aedee6a7d2de0dfa739f9a008a938f6848 ]
    
    xfs_buffered_write_iomap_end() has a comment about the safety of
    punching delalloc extents based on holding the IOLOCK_EXCL. This
    comment is wrong, and punching delalloc extents is not race free.
    
    When we punch out a delalloc extent after a write failure in
    xfs_buffered_write_iomap_end(), we punch out the page cache with
    truncate_pagecache_range() before we punch out the delalloc extents.
    At this point, we only hold the IOLOCK_EXCL, so there is nothing
    stopping mmap() write faults racing with this cleanup operation,
    reinstantiating a folio over the range we are about to punch and
    hence requiring the delalloc extent to be kept.
    
    If this race condition is hit, we can end up with a dirty page in
    the page cache that has no delalloc extent or space reservation
    backing it. This leads to bad things happening at writeback time.
    
    To avoid this race condition, we need the page cache truncation to
    be atomic w.r.t. the extent manipulation. We can do this by holding
    the mapping->invalidate_lock exclusively across this operation -
    this will prevent new pages from being inserted into the page cache
    whilst we are removing the pages and the backing extent and space
    reservation.
    
    Taking the mapping->invalidate_lock exclusively in the buffered
    write IO path is safe - it naturally nests inside the IOLOCK (see
    truncate and fallocate paths). iomap_zero_range() can be called from
    under the mapping->invalidate_lock (from the truncate path via
    either xfs_zero_eof() or xfs_truncate_page()), but iomap_zero_iter()
    will not instantiate new delalloc pages (because it skips holes) and
    hence will not ever need to punch out delalloc extents on failure.
    
    Fix the locking issue, and clean up the code logic a little to avoid
    unnecessary work if we didn't allocate the delalloc extent or wrote
    the entire region we allocated.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: short circuit xfs_growfs_data_private() if delta is zero [+ + +]

Author: Eric Sandeen <sandeen@redhat.com>
Date:   Wed May 1 11:41:12 2024 -0700

    xfs: short circuit xfs_growfs_data_private() if delta is zero
    
    [ Upstream commit 84712492e6dab803bf595fb8494d11098b74a652 ]
    
    Although xfs_growfs_data() doesn't call xfs_growfs_data_private()
    if in->newblocks == mp->m_sb.sb_dblocks, xfs_growfs_data_private()
    further massages the new block count so that we don't e.g. try
    to create a too-small new AG.
    
    This may lead to a delta of "0" in xfs_growfs_data_private(), so
    we end up in the shrink case and emit the EXPERIMENTAL warning
    even if we're not changing anything at all.
    
    Fix this by returning straightaway if the block delta is zero.
    
    (nb: in older kernels, the result of entering the shrink case
    with delta == 0 may actually let an -ENOSPC escape to userspace,
    which is confusing for users.)
    
    Fixes: fb2fc1720185 ("xfs: support shrinking unused space in the last AG")
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
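
A minimal user-space sketch of why the delta can end up zero, and of the early return that handles it. The rounding rule shown here (dropping a trailing partial AG smaller than a minimum size) is a simplified assumption, and `classify_grow()`, `enum grow_action` and all parameter names are hypothetical rather than the kernel's.

```c
#include <assert.h>
#include <stdint.h>

enum grow_action { GROW_NOOP, GROW_EXTEND, GROW_SHRINK };

/* Decide what a grow request amounts to after the block count has been
 * adjusted so that no too-small trailing AG would be created. */
static enum grow_action classify_grow(uint64_t requested, uint64_t current,
                                      uint64_t ag_blocks,
                                      uint64_t min_ag_blocks)
{
    assert(ag_blocks > 0);

    /* Assumed adjustment: a partial last AG below the minimum size is
     * dropped from the request. */
    uint64_t tail = requested % ag_blocks;
    if (tail != 0 && tail < min_ag_blocks)
        requested -= tail;

    int64_t delta = (int64_t)(requested - current);

    /* The short circuit: nothing changes, so take neither the grow nor
     * the shrink path. */
    if (delta == 0)
        return GROW_NOOP;

    return delta > 0 ? GROW_EXTEND : GROW_SHRINK;
}
```

For a 1000-block filesystem with 1000-block AGs and a 64-block minimum, a request for 1010 blocks is adjusted back to 1000 and classified as a no-op instead of falling into the shrink case.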

xfs: use byte ranges for write cleanup ranges [+ + +]

Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 1 11:40:51 2024 -0700

    xfs: use byte ranges for write cleanup ranges
    
    [ Upstream commit b71f889c18ada210a97aa3eb5e00c0de552234c6 ]
    
    xfs_buffered_write_iomap_end() currently converts the byte ranges
    passed to it to filesystem blocks to pass them to the bmap code to
    punch out delalloc blocks, but then has to convert filesystem
    blocks back to byte ranges for page cache truncate.
    
    We're about to make the page cache truncate go away and replace it
    with a page cache walk, so having to convert everything to/from/to
    filesystem blocks is messy and error-prone. It is much easier to
    pass around byte ranges and convert to page indexes and/or
    filesystem blocks only where those units are needed.
    
    In preparation for the page cache walk being added, add a helper
    that converts byte ranges to filesystem blocks and calls
    xfs_bmap_punch_delalloc_range() and convert
    xfs_buffered_write_iomap_end() to calculate limits in byte ranges.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
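
The kind of helper this commit describes can be sketched as follows. It is an illustration under stated assumptions, not the kernel helper: `bytes_to_fsb_range()` and `struct fsb_range` are hypothetical names, the range is treated as half-open, and the start is assumed to round down while the end rounds up so that every block touched by the byte range is covered.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open range of filesystem blocks: [start_fsb, end_fsb). */
struct fsb_range {
    uint64_t start_fsb;
    uint64_t end_fsb;
};

/* Convert the byte range [start_byte, end_byte) into filesystem blocks.
 * blocklog is log2 of the filesystem block size (12 for 4096 bytes). */
static struct fsb_range bytes_to_fsb_range(uint64_t start_byte,
                                           uint64_t end_byte,
                                           unsigned int blocklog)
{
    assert(start_byte <= end_byte);
    assert(blocklog < 64);

    uint64_t mask = ((uint64_t)1 << blocklog) - 1;
    struct fsb_range r;

    /* Round the start down to the block containing it. */
    r.start_fsb = start_byte >> blocklog;

    /* Round the end up so a partially covered last block is included. */
    r.end_fsb = (end_byte >> blocklog) + ((end_byte & mask) != 0);
    return r;
}
```

Callers can then keep all of their limits in bytes and convert only at the point where block units are needed, which is the design the commit message argues for.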

xfs: use iomap_valid method to detect stale cached iomaps [+ + +]

Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 1 11:40:56 2024 -0700

    xfs: use iomap_valid method to detect stale cached iomaps
    
    [ Upstream commit 304a68b9c63bbfc1f6e159d68e8892fc54a06067 ]
    
    Now that iomap supports a mechanism to validate cached iomaps for
    buffered write operations, hook it up to the XFS buffered write ops
    so that we can avoid data corruptions that result from stale cached
    iomaps. See:
    
    https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
    
    or the ->iomap_valid() introduction commit for exact details of the
    corruption vector.
    
    The validity cookie we store in the iomap is based on the type of
    iomap we return. It is expected that the iomap->flags we set in
    xfs_bmbt_to_iomap() are not perturbed by the iomap core and are
    returned to us in the iomap passed via the .iomap_valid() callback.
    This ensures that the validity cookie is always checking the correct
    inode fork sequence numbers to detect potential changes that affect
    the extent cached by the iomap.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
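
The validity-cookie idea can be modelled in a few lines of user-space C. This is a simplified sketch, not the XFS implementation: the struct layouts, `MAP_F_SHARED`, `fork_sequence()`, `make_map()` and `map_is_valid()` are hypothetical, and packing two 32-bit sequence counters into one 64-bit cookie is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: the mapping also depends on the COW fork. */
#define MAP_F_SHARED (1u << 0)

struct inode_model {
    uint32_t data_seq;  /* bumped whenever the data fork extent map changes */
    uint32_t cow_seq;   /* bumped whenever the COW fork extent map changes */
};

struct cached_map {
    uint32_t flags;     /* must come back unmodified from the caller */
    uint64_t cookie;    /* fork sequence numbers sampled at mapping time */
};

/* Build the cookie from the forks that this kind of mapping depends on. */
static uint64_t fork_sequence(const struct inode_model *ip, uint32_t flags)
{
    uint64_t cookie = 0;

    if (flags & MAP_F_SHARED)
        cookie = (uint64_t)ip->cow_seq << 32;
    return cookie | ip->data_seq;
}

/* Record the cookie when the mapping is handed out. */
static struct cached_map make_map(const struct inode_model *ip, uint32_t flags)
{
    struct cached_map map = { flags, fork_sequence(ip, flags) };
    return map;
}

/* Validation callback: the cached mapping is stale if any fork it
 * depends on has changed since the cookie was sampled. */
static bool map_is_valid(const struct inode_model *ip,
                         const struct cached_map *map)
{
    return map->cookie == fork_sequence(ip, map->flags);
}
```

Because the cookie is recomputed from the stored flags, the check only works if those flags are returned unchanged, which matches the expectation stated in the commit message.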

xfs: wait iclog complete before tearing down AIL [+ + +]

Author: Guo Xuenan <guoxuenan@huawei.com>
Date:   Wed May 1 11:41:04 2024 -0700

    xfs: wait iclog complete before tearing down AIL
    
    [ Upstream commit 1eb52a6a71981b80f9acbd915acd6a05a5037196 ]
    
    Fix a UAF in xfs_trans_ail_delete during xlog force shutdown.
    Commit cd6f79d1fb32 ("xfs: run callbacks before waking waiters in
    xlog_state_shutdown_callbacks") changed the order of running callbacks
    and waiting for iclog completion so that the unmount path would not
    destroy the AIL prematurely. That turns out not to be enough to ensure
    this; adding an mdelay in `xfs_buf_item_unpin` demonstrates it.
    
    To destroy the AIL safely, we should wait for all xlog ioend workers to
    finish and sync the AIL. The KASAN report from the reproduction follows.
    
    ==================================================================
    BUG: KASAN: use-after-free in xfs_trans_ail_delete+0x240/0x2a0
    Read of size 8 at addr ffff888023169400 by task kworker/1:1H/43
    
    CPU: 1 PID: 43 Comm: kworker/1:1H Tainted: G        W
    6.1.0-rc1-00002-gc28266863c4a #137
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    1.13.0-1ubuntu1.1 04/01/2014
    Workqueue: xfs-log/sda xlog_ioend_work
    Call Trace:
     <TASK>
     dump_stack_lvl+0x4d/0x66
     print_report+0x171/0x4a6
     kasan_report+0xb3/0x130
     xfs_trans_ail_delete+0x240/0x2a0
     xfs_buf_item_done+0x7b/0xa0
     xfs_buf_ioend+0x1e9/0x11f0
     xfs_buf_item_unpin+0x4c8/0x860
     xfs_trans_committed_bulk+0x4c2/0x7c0
     xlog_cil_committed+0xab6/0xfb0
     xlog_cil_process_committed+0x117/0x1e0
     xlog_state_shutdown_callbacks+0x208/0x440
     xlog_force_shutdown+0x1b3/0x3a0
     xlog_ioend_work+0xef/0x1d0
     process_one_work+0x6f9/0xf70
     worker_thread+0x578/0xf30
     kthread+0x28c/0x330
     ret_from_fork+0x1f/0x30
     </TASK>
    
    Allocated by task 9606:
     kasan_save_stack+0x1e/0x40
     kasan_set_track+0x21/0x30
     __kasan_kmalloc+0x7a/0x90
     __kmalloc+0x59/0x140
     kmem_alloc+0xb2/0x2f0
     xfs_trans_ail_init+0x20/0x320
     xfs_log_mount+0x37e/0x690
     xfs_mountfs+0xe36/0x1b40
     xfs_fs_fill_super+0xc5c/0x1a70
     get_tree_bdev+0x3c5/0x6c0
     vfs_get_tree+0x85/0x250
     path_mount+0xec3/0x1830
     do_mount+0xef/0x110
     __x64_sys_mount+0x150/0x1f0
     do_syscall_64+0x35/0x80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
    
    Freed by task 9662:
     kasan_save_stack+0x1e/0x40
     kasan_set_track+0x21/0x30
     kasan_save_free_info+0x2a/0x40
     __kasan_slab_free+0x105/0x1a0
     __kmem_cache_free+0x99/0x2d0
     kvfree+0x3a/0x40
     xfs_log_unmount+0x60/0xf0
     xfs_unmountfs+0xf3/0x1d0
     xfs_fs_put_super+0x78/0x300
     generic_shutdown_super+0x151/0x400
     kill_block_super+0x9a/0xe0
     deactivate_locked_super+0x82/0xe0
     deactivate_super+0x91/0xb0
     cleanup_mnt+0x32a/0x4a0
     task_work_run+0x15f/0x240
     exit_to_user_mode_prepare+0x188/0x190
     syscall_exit_to_user_mode+0x12/0x30
     do_syscall_64+0x42/0x80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
    
    The buggy address belongs to the object at ffff888023169400
     which belongs to the cache kmalloc-128 of size 128
    The buggy address is located 0 bytes inside of
     128-byte region [ffff888023169400, ffff888023169480)
    
    The buggy address belongs to the physical page:
    page:ffffea00008c5a00 refcount:1 mapcount:0 mapping:0000000000000000
    index:0xffff888023168f80 pfn:0x23168
    head:ffffea00008c5a00 order:1 compound_mapcount:0 compound_pincount:0
    flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
    raw: 001fffff80010200 ffffea00006b3988 ffffea0000577a88 ffff88800f842ac0
    raw: ffff888023168f80 0000000000150007 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected
    
    Memory state around the buggy address:
     ffff888023169300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     ffff888023169380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff888023169400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                       ^
     ffff888023169480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     ffff888023169500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ==================================================================
    Disabling lock debugging due to kernel taint
    
    Fixes: cd6f79d1fb32 ("xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks")
    Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: write page faults in iomap are not buffered writes [+ + +]

Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 1 11:40:49 2024 -0700

    xfs: write page faults in iomap are not buffered writes
    
    [ Upstream commit 118e021b4b66f758f8e8f21dc0e5e0a4c721e69e ]
    
    When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
    we mark the iomap as IOMAP_F_NEW so that the write context
    understands that it allocated the delalloc region.
    
    If we then fail that buffered write, xfs_buffered_write_iomap_end()
    checks for the IOMAP_F_NEW flag and if it is set, it punches out
    the unused delalloc region that was allocated for the write.
    
    The assumption this code makes is that all buffered write operations
    that can allocate space are run under an exclusive lock (i_rwsem).
    This is an invalid assumption: page faults in mmap()d regions call
    through this same function pair to map the file range being faulted
    and this runs only holding the inode->i_mapping->invalidate_lock in
    shared mode.
    
    IOWs, we can have races between page faults and write() calls that
    fail the nested page cache write operation that result in data loss.
    That is, the failing iomap_end call will punch out the data that
    the other racing iomap iteration brought into the page cache. This
    can be reproduced with generic/34[46] if we arbitrarily fail page
    cache copy-in operations from write() syscalls.
    
    Code analysis tells us that the iomap_page_mkwrite() function holds
    the already instantiated and uptodate folio locked across the iomap
    mapping iterations. Hence the folio cannot be removed from memory
    whilst we are mapping the range it covers, and as such we do not
    care if the mapping changes state underneath the iomap iteration
    loop:
    
    1. if the folio is not already dirty, there are no writeback races
       possible.
    2. if we allocated the mapping (delalloc or unwritten), the folio
       cannot already be dirty. See #1.
    3. If the folio is already dirty, it must be up to date. As we hold
       it locked, it cannot be reclaimed from memory. Hence we always
       have valid data in the page cache while iterating the mapping.
    4. Valid data in the page cache can exist when the underlying
       mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping
       change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not
       change the data in the page - it only affects actions if we are
       initialising a new page. Hence #3 applies and we don't care
       about these extent map transitions racing with
       iomap_page_mkwrite().
    5. iomap_page_mkwrite() checks for page invalidation races
       (truncate, hole punch, etc) after it locks the folio. We also
       hold the mapping->invalidate_lock here, and hence the mapping
       cannot change due to extent removal operations while we are
       iterating the folio.
    
    As such, filesystems that don't use bufferheads will never fail
    the iomap_folio_mkwrite_iter() operation on the current mapping,
    regardless of whether the iomap should be considered stale.
    
    Further, the range we are asked to iterate is limited to the range
    inside EOF that the folio spans. Hence, for XFS, we will only map
    the exact range we are asked for, and we will only do speculative
    preallocation with delalloc if we are mapping a hole at the EOF
    page. The iterator will consume the entire range of the folio that
    is within EOF, and anything beyond the EOF block cannot be accessed.
    We never need to truncate this post-EOF speculative prealloc away in
    the context of the iomap_page_mkwrite() iterator because if it
    remains unused we'll remove it when the last reference to the inode
    goes away.
    
    Hence we don't actually need an .iomap_end() cleanup/error handling
    path at all for iomap_page_mkwrite() for XFS. This means we can
    separate the page fault processing from the complexity of the
    .iomap_end() processing in the buffered write path. This also means
    that the buffered write path will also be able to take the
    mapping->invalidate_lock as necessary.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: xfs_bmap_punch_delalloc_range() should take a byte range [+ + +]

Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 1 11:40:54 2024 -0700

    xfs: xfs_bmap_punch_delalloc_range() should take a byte range
    
    [ Upstream commit 7348b322332d8602a4133f0b861334ea021b134a ]
    
    All the callers of xfs_bmap_punch_delalloc_range() jump through
    hoops to convert a byte range to filesystem blocks before calling
    xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
    xfs_bmap_punch_delalloc_range() and have it do the conversion to
    filesystem blocks internally.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Changelog for Linux 6.1.92