panic: Assertion vmap != NULL failed at /usr/src/sys/dev/drm/freebsd/drm_os_freebsd.c:370 #2097
Comments
So DRM maintains a global linked list of VMA structures that reference mapped GEM objects. The VM object corresponding to a GEM object keeps a pointer to the GEM object in The VM object and GEM object look fine, i.e., they're valid, haven't been destroyed. The refcount on the VM object is 2. Interestingly, at the time of the panic, a child process of plasmashell was in the middle of exec'ing, so it was busy freeing its mappings and dropping VM object references. In
So in the Thus my questions are:
|
I am trying to figure it out. I inserted printfs and opened chromium (10 tabs); so far it does not reach any of these printf paths.
Note there are around 300 entries in the drm_vma_head when chromium is open. |
In my experience, sudden termination of the display server can cause unhappiness in the kernel when other applications bump into the mess left behind. I wonder if it could be the case that something else triggered a bug in the display server .. and then |
It could be an out-of-memory situation that is not handled correctly. We don't have the IOMMU enabled (the IOMMU code exists and works). I got another panic during vm_page_reclaim_contig(), which panfrost calls when it can't allocate large chunks of contiguous memory
|
I just hit this on my new morello system, will debug a bit. |
Well, I don't really understand why we use an OBJT_MGTDEVICE object for panfrost GEM objects. They are for containing fictitious pages (i.e., device memory or stolen pages of RAM), but the panfrost driver is allocating regular physical RAM. It should be OBJT_PHYS or OBJT_SWAP. The latter if we care about being able to remove all mappings of a given page, but I don't see anything in panfrost that relies on that...? |
New panic today:
|
@rwatson @markjdb @bukinr I found the bug and have a C program that can reproduce the panic. The C program also triggers the bug fairly reliably without this patch if this sleep() call is commented out: https://github.com/CTSRD-CHERI/pffm2/blob/c50cb472bc278114d82ae328382243c493467278/misc/issue2097/main.c#L48 In summary, the problem is that, during an mmap() call that maps a GEM GPU buffer object, the check for whether a corresponding VMA object already exists can be positive (meaning the object exists) even though another thread is in a code path that will free the VMA object. With the interleaving described below, the check in Here is the interleaving that causes the panic with the C program that I mentioned above:
I don't have a fully thought through idea for a fix yet. Maybe a ref counter for VMA objects, like @markjdb suggested, would fix it. |
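For illustration only, here is a minimal userspace sketch of the reference-counting idea, assuming a single lock protects both the list and the counters. It is not the DRM code and not the patch discussed in this thread: `vma_entry`, `vma_acquire`, `vma_release`, and `vma_count` are hypothetical names, and a pthread mutex stands in for whatever kernel lock would be used. The point it shows is that the lookup and the reference bump happen under the same lock, so an entry that a lookup found cannot be freed by another thread before the caller holds a reference.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical model of one tracked mapping; not the real DRM structure. */
struct vma_entry {
	void *handle;              /* stands in for the VM object handle */
	int refs;                  /* number of live mappings using this entry */
	struct vma_entry *next;
};

static struct vma_entry *vma_head;
static pthread_mutex_t vma_lock = PTHREAD_MUTEX_INITIALIZER;

/* Find the entry for handle and take a reference, or create it with refs == 1. */
static struct vma_entry *
vma_acquire(void *handle)
{
	struct vma_entry *e;

	pthread_mutex_lock(&vma_lock);
	for (e = vma_head; e != NULL; e = e->next) {
		if (e->handle == handle) {
			e->refs++;   /* bumped under the same lock as the lookup */
			pthread_mutex_unlock(&vma_lock);
			return (e);
		}
	}
	e = calloc(1, sizeof(*e));
	if (e != NULL) {
		e->handle = handle;
		e->refs = 1;
		e->next = vma_head;
		vma_head = e;
	}
	pthread_mutex_unlock(&vma_lock);
	return (e);
}

/* Drop one reference; unlink and free on the last one. Returns 1 if freed. */
static int
vma_release(void *handle)
{
	struct vma_entry **pp, *e;
	int freed = 0;

	pthread_mutex_lock(&vma_lock);
	for (pp = &vma_head; (e = *pp) != NULL; pp = &e->next) {
		if (e->handle == handle) {
			if (--e->refs == 0) {
				*pp = e->next;
				free(e);
				freed = 1;
			}
			break;
		}
	}
	pthread_mutex_unlock(&vma_lock);
	return (freed);
}

/* Count the entries currently on the list (for inspection only). */
static int
vma_count(void)
{
	struct vma_entry *e;
	int n = 0;

	pthread_mutex_lock(&vma_lock);
	for (e = vma_head; e != NULL; e = e->next)
		n++;
	pthread_mutex_unlock(&vma_lock);
	return (n);
}
```

In this model, a second mapping of the same handle reuses the entry and raises its count, and the entry only leaves the list when the last mapping is released. Whether the real fix should count references per VMA entry or restructure the tracking entirely is a separate question.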
Thank you for the thorough investigation. The links to your reproducers don't work, however - could you please update them? The DRM code currently assumes that there is a 1-1 mapping between VMAs and GEM objects, which is wrong since it's apparently possible for multiple processes to share mappings of a GEM object (and presumably it's possible for a single process to map a GEM object multiple times). I think the solution is to rework Note that there are several DRM object types potentially affected by this change, corresponding to: I'll try to implement this today. |
I updated the links to the reproducer. They were pointing into a private repository. I also updated the paragraph that starts with "In summary,..." to make it a bit clearer.
Note: I removed the patch file because I noticed a mistake in it. |
Here is the patch that I mentioned yesterday, which adds a reference counter for the VMA objects. I'd be happy to test it with the latest version of CheriBSD (I'm using a fork of the 2022 release of CheriBSD with some backported DRM fixes) and create a PR if we want to use it in the interim. |
Running with a kernel/userlevel from #2080, I saw this kernel panic when starting an aarch64 Chromium web browser within an otherwise entirely purecap (kernel, userlevel, desktop) environment:
panic: Assertion vmap != NULL failed at /usr/src/sys/dev/drm/freebsd/drm_os_freebsd.c:370
The kernel build was:
FreeBSD cheri-blossom.sec.cl.cam.ac.uk 15.0-CURRENT FreeBSD 15.0-CURRENT #19 c18n_procstat-n268168-8e6f163a2c50: Tue Apr 9 02:29:44 UTC 2024 robert@cheri-blossom.sec.cl.cam.ac.uk:/usr/obj/usr/src/arm64.aarch64c/sys/GENERIC-MORELLO-PURECAP arm64
Async revocation and default enabled c18n are both turned on:
Console output:
KGDB on the crashdump reports:
The process in question was plasmashell.