Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid VK_EXT_surface_maintenance1 behavior in presence of multiple ICD (nvidia + mesa-vulkan-drivers) #1557

Closed
romanl-g opened this issue Sep 20, 2024 · 17 comments · Fixed by #1559
Labels
bug Something isn't working

Comments

@romanl-g
Copy link

Describe the bug
According to @ShabbyX this appears to be a loader bug, was previously discussed with LunarG.

Environment (please complete the following information):

  • OS: Linux with nvidia AND mesa-vulkan-drivers
  • Graphics Driver: observed on nvidia 535.183.01, 525.147.5
  • SDK or header version if building from repo: latest

To Reproduce

  1. both nvidia AND mesa-vulkan-drivers needs to be installed
  2. vulkaninfo will show two GPU ids
	Devices: count = 2
		GPU id = 0 (Quadro P1000)
		Layer-Device Extensions: count = 0

		GPU id = 1 (llvmpipe (LLVM 17.0.6, 256 bits))
		Layer-Device Extensions: count = 0
  1. VK_EXT_surface_maintenance1 info is correct for id=1:
GPU id : 1 (llvmpipe (LLVM 17.0.6, 256 bits)):
...
	VK_EXT_surface_maintenance1:
	----------------------------
		PRESENT_MODE_IMMEDIATE_KHR:
			VkSurfacePresentScalingCapabilitiesEXT:
				supportedPresentScaling:
					None
				supportedPresentGravityX:
					None
				supportedPresentGravityY:
					None
				minScaledImageExtent:
					width  = 256
					height = 256
				maxScaledImageExtent:
					width  = 256
					height = 256
			VkSurfacePresentModeCompatibilityEXT:
				presentModeCount                = 1
				pPresentModes: count = 1
					PRESENT_MODE_IMMEDIATE_KHR
  1. VK_EXT_surface_maintenance1 is displayed for id=0, uninitialized (it's not actually supported by the driver)
	VK_EXT_surface_maintenance1:
	----------------------------
		PRESENT_MODE_FIFO_KHR:
			VkSurfacePresentScalingCapabilitiesEXT:
				supportedPresentScaling:
					None
				supportedPresentGravityX:
					None
				supportedPresentGravityY:
					None
				minScaledImageExtent:
					width  = 0
					height = 0
				maxScaledImageExtent:
					width  = 0
					height = 0
			VkSurfacePresentModeCompatibilityEXT:
				presentModeCount                = 0
				pPresentModes                   = NULL
		PRESENT_MODE_FIFO_RELAXED_KHR:
			VkSurfacePresentScalingCapabilitiesEXT:
				supportedPresentScaling:
					None
				supportedPresentGravityX:
					None
				supportedPresentGravityY:
					None
				minScaledImageExtent:
					width  = 0
					height = 0
				maxScaledImageExtent:
					width  = 0
					height = 0
			VkSurfacePresentModeCompatibilityEXT:
				presentModeCount                = 0
				pPresentModes                   = NULL

This results in VK_EXT_surface_maintenance1 reported to be supported by the driver but vkGetPhysicalDeviceSurfaceCapabilities2KHR not updating the data structure. In ANGLE we hit it here:
https://crsrc.org/c/third_party/angle/src/libANGLE/renderer/vulkan/SurfaceVk.cpp;drc=1b4d6185c4de95f1fed15b08cc7d044eba78ece3;l=1684

Workaround1: VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json

Workaround2: sudo apt remove mesa-vulkan-drivers

@romanl-g romanl-g added the bug Something isn't working label Sep 20, 2024
@romanl-g
Copy link
Author

To clarify, in the step (4) above vkGetPhysicalDeviceSurfaceCapabilities2KHR doesn't set everything to 0. It doesn't do anything at all. Everything was probably zero-initialized by the caller (vulkaninfo). In ANGLE we use non-zero values and everything remains unchanged (which makes sense given the driver doesn't actually support the call)

@charles-lunarg
Copy link
Collaborator

I'm sorry, but excuse my ignorance. What exactly is invalid?

When I tried to reproduce this locally, with nvidia driver 550 I found that vkGetPhysicalDeviceSurfaceFormats2KHR would crash in the nvidia driver when querying the wayland_surface formats. Version 550 supports surface_maintenance1. Downgrading to 535 and the crash goes away. 535 does NOT support surface_maintenance1.

I do reproduce that the struct is empty, but isn't that to be expected if the driver version doesn't support an extension?

GPU id : 1 (NVIDIA GeForce GTX 1060 6GB):
	Surface types: count = 2
		VK_KHR_xcb_surface
		VK_KHR_xlib_surface
...
	VK_EXT_surface_maintenance1:
	----------------------------
		PRESENT_MODE_FIFO_KHR:
			minImageCount = 2
			maxImageCount = 8
			VkSurfacePresentScalingCapabilitiesEXT:
				supportedPresentScaling:
					None
				supportedPresentGravityX:
					None
				supportedPresentGravityY:
					None
				minScaledImageExtent:
					width  = 0
					height = 0
				maxScaledImageExtent:
					width  = 0
					height = 0
			VkSurfacePresentModeCompatibilityEXT:
				presentModeCount                = 0
				pPresentModes                   = NULL

The way I confirmed this is by setting VK_LOADER_DRIVERS_SELECT=nvidia and running vulkaninfo --summary to show me the list of available instance extensions with the 535 nvidia driver.

cdgiessen:~ $ VK_LOADER_DRIVERS_SELECT=*nvidia* vulkaninfo --summary
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.290


Instance Extensions: count = 21
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Whereas with the 550 driver the above command would have VK_EXT_surface_maintenance1 in that list.

@romanl-g
Copy link
Author

Focusing on version 535, without --summary:

% VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json vulkaninfo | egrep '^GPU id|VK_EXT_surface_maintenance1'
GPU id : 0 (Quadro P1000):

That makes sense: VK_EXT_surface_maintenance1 is not listed because it is not supported by the driver.

% vulkaninfo | egrep '^GPU id|VK_EXT_surface_maintenance1'                  
	VK_EXT_surface_maintenance1            : extension revision 1
GPU id : 0 (Quadro P1000):
	VK_EXT_surface_maintenance1:
GPU id : 1 (llvmpipe (LLVM 17.0.6, 256 bits)):
	VK_EXT_surface_maintenance1:

This does not make sense: VK_EXT_surface_maintenance1 is now listed for both llvmpipe (correct) and Quadro P1000 (incorrect).

We think this behavior is incorrect because:

  • support for VK_EXT_surface_maintenance1 is advertised (VK_EXT_SURFACE_MAINTENANCE_1_EXTENSION_NAME is present in the list of extensions obtained by vkEnumerateInstanceExtensionProperties)
  • however, vkGetPhysicalDeviceSurfaceCapabilities2KHR call does not update VkSurfacePresentModeCompatibilityEXT data (for example, presentModeCount=1 does not get set to 0 regardless of pPresentModes being nullptr or not)

Note that removing llvmpipe from the system (or using the VK_ICD_FILENAMES workaround) results in correct behavior: support for VK_EXT_surface_maintenance1 is now not advertised, and we do not expect VkSurfacePresentModeCompatibilityEXT to be updated.

@romanl-g
Copy link
Author

Hmm, I wonder if there is an issue with the way we are querying supported extensions. What would be the vk api equivalent of VK_LOADER_DRIVERS_SELECT?

@charles-lunarg
Copy link
Collaborator

I think the bug is both in the spec - making it impossible for an application to know if an instance extension is supported by a particular physical device, and with vulkaninfo - where it is printing the struct regardless of the values returned. Vulkaninfo could check if the structs were filled out, and if not, not print the structs. That said, the crash in the 550 driver is still concerning and may be a loader bug so I don't want to rule that out.

I do wonder if this is all because the spec has an instance extensions which may not be supported by all physical devices. The loader typically "fixes" any instance level functions when called on drivers that don't support it. The loader could do the same here by removing the surface maintenance1 structs from pnext chains, but that's not ideal as modifying pnext chains is likely to break more things in the process. Maybe there should be a way to query if a specific physical device supports an instance extension so applications can do that checking themselves? That puts the onus on app devs which is unfortunate (yet another thing to have to manage).

In short I think the output of vulkaninfo is misleading - the nvidia 535 driver doesn't support the surface maintenance extension yet vulkaninfo makes it appear that it does.

@romanl-g
Copy link
Author

Vulkaninfo could check if the structs were filled out, and if not, not print the structs

Checking whether a struct was filled or not really feels like a hack though, probably because this isn't something APIs normally have to rely on and there isn't an obvious way to do it. If the spec required that a field has to change in a specific manner in response to queries (0->1 or any other way) that would be well defined. Otherwise, a zero initialized struct might be a correct response as well as non-zero-initialized, but unchanged value that was passed in.

In short I think the output of vulkaninfo is misleading

I don't see a good way how vulkaninfo would be supposed to handle this in the general case. Also, we hit this same issue in ANGLE. We could workaround with hacks like checking for unchanged out-of-range values but is there something we (Khronos) could do to improve the situation in the longer run, for all users?

@ShabbyX
Copy link
Contributor

ShabbyX commented Sep 23, 2024

The loader could do the same here by removing the surface maintenance1 structs from pnext chains, but that's not ideal as modifying pnext chains is likely to break more things in the process.

There is a much easier solution for this particular problem. Removing the struct from the pnext chain doesn't help unsuspecting apps, because they queried and were ensured surface_maint1 exists, but after selecting the device the query doesn't work.

The problem in particular stems from the fact that the loader takes the instance extensions supported by all the devices (like nvidia and lavapipe), and advertises the union to the application.

The solution is easy, the loader can fill this struct itself if the actual device doesn't support the extension. VkSurfacePresentModeCompatibilityEXT asks what the compatible present modes are to a given present mode. For drivers that don't support the extension, the answer is simple: Only the input present mode is compatible with itself, so the loader itself can simply set presentModeCount to 1, and pPresentModes[0] to the input VkSurfacePresentModeEXT::presentMode.

@ShabbyX
Copy link
Contributor

ShabbyX commented Sep 23, 2024

As to the general problem of instance extensions, either the loader should advertise the intersection as opposed to the union of instance extensions advertised by ICDs (pessimistic behavior), or it should make sure it has a meaningful implementation itself if an ICD is chosen that doesn't advertise the extension.

My preference is the latter where possible.

@romanl-g
Copy link
Author

Is it even meaningful to query extensions without selecting a driver first? What is the purpose of this aggregation?

@ShabbyX
Copy link
Contributor

ShabbyX commented Sep 23, 2024

Is it even meaningful to query extensions without selecting a driver first? What is the purpose of this aggregation?

That's unfortunately how Vulkan is, the instance extensions are supposed to be independent of the driver, or at least that's what the intention originally was (as I understand). In this case, technically surface_maint1 should be something that depends on the window system alone (like wayland or X11), but is somehow implemented by the driver instead of some other driver-agnostic component in the system (which is the way Android's implementation is).

@charles-lunarg
Copy link
Collaborator

Seems that this needs a loader fix to look for surface maintenance 1 structs in the pNext chain, and fill them out with appropriate fall back info.

And yes, this is unfortunately the outcome of the choice to make surfaces an "instance" level object. A similar issue comes up with vkCreateDisplayPlaneSurfaceKHR, where the function has a given physical device that it is operating on, but due to the use of a surface, it is being called on every driver. #1541 for more details.

@charles-lunarg
Copy link
Collaborator

While implementing the emulation of VK_EXT_surface_maintenance1 in the loader, I've hit a bit of a roadblock.

VkSurfacePresentModeCompatibilityEXT is simple as it was described above. The only present mode that is compatible is the present mode that is being queried.

VkSurfacePresentScalingCapabilitiesEXT::supportedPresentScaling, VkSurfacePresentScalingCapabilitiesEXT::supportedPresentGravityX, and VkSurfacePresentScalingCapabilitiesEXT::supportedPresentGravityY are similarly easy, return 0 for "not supported". But the VkSurfacePresentScalingCapabilitiesEXT::maxScaledImageExtent and VkSurfacePresentScalingCapabilitiesEXT::minScaledImageExtent have no values for 'not supported', nor any reasonable default that I can see.

Maybe I missed something, but in the meantime I am simply setting the values to 0 (which acts as memset'ting the struct to 0).

@ShabbyX
Copy link
Contributor

ShabbyX commented Sep 24, 2024

Oh that one is easy too, just populate them with the min/maxImageExtent for the surface.

@charles-lunarg
Copy link
Collaborator

I have the emulation PR up.

While I have locally validated that it fixes the issue of vulkaninfo reporting erroneous values, additional testing is always welcome.

@ShabbyX
Copy link
Contributor

ShabbyX commented Sep 26, 2024

Thank you, @romanl-g would you do the honors? :)

@charles-lunarg, FWIW, you should be able to run CTS over your implementation, hopefully the tests don't make any assumptions about swapchain_maint1 being supported at the same time!

@romanl-g
Copy link
Author

ANGLE: I didn't test thoroughly but in a basic test with that PR presentModeCount got set to 1, we don't hit the assertion and the test passed. Seems to be working!

@charles-lunarg
Copy link
Collaborator

After running through local CTS testing, I didn't encounter any glaring failures. All the tests which weren't skipped passed, and on multiple drivers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants