author	Chenglei Xie <Chenglei.Xie@amd.com>
	Thu, 7 Aug 2025 20:52:34 +0000 (16:52 -0400)
committer	Alex Deucher <alexander.deucher@amd.com>
	Fri, 15 Aug 2025 17:07:30 +0000 (13:07 -0400)
commit	d2fa0ec6e0aea6ffbd41939d0c7671db16991ca4
tree	1b32b60d80016dba1fb45df8fa1b35e4b6388a9d
parent	fc4e990a326e608eb8937eba737908c660b7a410

drm/amdgpu: refactor bad_page_work for corner case handling

When poison is consumed on the guest before the guest receives the host's poison creation message, a corner case can occur in which poison_handler completes processing earlier than it should. The guest then hangs waiting for the req_bad_pages reply during a VF FLR, and the VM becomes inaccessible in stress tests.

To fix this issue, this patch refactors the mailbox sequence by separating bad_page_work into two parts: req_bad_pages_work and handle_bad_pages_work.
Old sequence:
  1. Stop data exchange work
  2. Guest sends MB_REQ_RAS_BAD_PAGES to host and keeps polling for IDH_RAS_BAD_PAGES_READY
  3. If IDH_RAS_BAD_PAGES_READY arrives within the timeout limit, re-init the data exchange region for updated bad page info;
     otherwise, time out with an error message
New sequence:
req_bad_pages_work:
  1. Stop data exchange work
  2. Guest sends MB_REQ_RAS_BAD_PAGES to host
Once the guest receives the IDH_RAS_BAD_PAGES_READY event:
handle_bad_pages_work:
  3. Re-init the data exchange region for updated bad page info
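
The split in the new sequence can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the driver code: struct vf_state, its fields, mailbox_event() and the numeric message values are hypothetical. Only the message names and the two work-item names come from the description above. The point it shows is that the request half returns without polling, and the re-init half runs only when the ready event is delivered.

```c
#include <assert.h>
#include <stdbool.h>

/* Message names are from the commit text; the numeric values are made up. */
enum mb_msg { MB_REQ_RAS_BAD_PAGES = 1, IDH_RAS_BAD_PAGES_READY = 2 };

/* Hypothetical guest-side state; not the real amdgpu structures. */
struct vf_state {
	bool data_exchange_running;
	bool request_pending;	/* request sent, ready event not yet seen */
	int last_sent;		/* last message sent to the host */
	int reinit_count;	/* re-inits of the data exchange region */
};

/* Part 1 (steps 1 and 2): returns immediately, no polling for the reply. */
static void req_bad_pages_work(struct vf_state *vf)
{
	vf->data_exchange_running = false;	/* 1. stop data exchange work */
	vf->last_sent = MB_REQ_RAS_BAD_PAGES;	/* 2. send request to host */
	vf->request_pending = true;
}

/* Part 2 (step 3): runs only in response to the host's ready event. */
static void handle_bad_pages_work(struct vf_state *vf)
{
	assert(vf->request_pending);
	vf->reinit_count++;	/* 3. re-init the data exchange region */
	vf->request_pending = false;
}

/*
 * Hypothetical mailbox dispatch. Ignoring a ready event that arrives with
 * no request outstanding is an assumption of this sketch.
 */
static void mailbox_event(struct vf_state *vf, enum mb_msg msg)
{
	if (msg == IDH_RAS_BAD_PAGES_READY && vf->request_pending)
		handle_bad_pages_work(vf);
}
```

In this model, calling req_bad_pages_work() and then delivering IDH_RAS_BAD_PAGES_READY through mailbox_event() re-initializes the region exactly once, and no code path blocks while waiting for the host.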

Signed-off-by: Chenglei Xie <Chenglei.Xie@amd.com>
Reviewed-by: Shravan Kumar Gande <Shravankumar.Gande@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
drivers/gpu/drm/amd/amdgpu/soc15.c