Headline: Linux unmap_mapping_range() Race Condition
For VM_PFNMAP VMAs, there is a race between unmap_mapping_range() and munmap() that can lead to a page being freed by a device driver while the page still has stale TLB entries.
Linux: unmap_mapping_range() race with munmap() on VM_PFNMAP mappings leads to stale TLB entry

There are drivers (in particular GPU drivers) that create VM_PFNMAP VMAs containing PTEs that point to normal pages from the page allocator. VM_PFNMAP means that the core kernel won't track this using the page mapcounts; instead, the driver is responsible for holding references to the page as long as it is mapped into userspace.

Some of these drivers have codepaths that can remove userspace mappings of such pages using unmap_mapping_range(), then give these pages back to the page allocator. For example, i915 has a shrinker callback i915_gem_shrink() that does this.

To make this driver behavior correct, it is necessary that by the time unmap_mapping_range() returns, all the PTEs in the specified range have been removed and the corresponding TLB flushes have been executed.

However, munmap() ends up in unmap_region(), which does this:

        struct mmu_gather tlb;

        lru_add_drain();
        tlb_gather_mmu(&tlb, mm);
        update_hiwater_rss(mm);
        unmap_vmas(&tlb, vma, start, end);
        free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
                      next ? next->vm_start : USER_PGTABLES_CEILING);
        tlb_finish_mmu(&tlb);

unmap_vmas() removes all PTEs in the range, but does not necessarily perform a TLB flush yet.
free_pgtables() then removes the VMA from the mapping's rbtree (unlink_file_vma()) before tearing down page tables in the range:

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
                unsigned long floor, unsigned long ceiling)
{
        while (vma) {
                struct vm_area_struct *next = vma->vm_next;
                unsigned long addr = vma->vm_start;

                /*
                 * Hide vma from rmap and truncate_pagecache before freeing
                 * pgtables
                 */
                unlink_anon_vmas(vma);
                unlink_file_vma(vma);

                if (is_vm_hugetlb_page(vma)) {
                        [...]
                } else {
                        [... irrelevant optimization ...]
                        free_pgd_range(tlb, addr, vma->vm_end,
                                floor, next ? next->vm_start : ceiling);
                }
                vma = next;
        }
}

The TLB flush corresponding to the PTEs that were removed in unmap_vmas() might only happen afterwards, in tlb_finish_mmu().

This is bad because starting at unlink_file_vma(), the VMA won't be visible to unmap_mapping_range() anymore. If the driver calls unmap_mapping_range() directly after munmap() called unlink_file_vma(), unmap_mapping_range() won't notice the existence of this VMA, it might return while there are still stale TLB entries pointing to this page, and the driver could then free the page while userspace can still read/write it through the stale TLB entry.

It would be a pain to actually hit this bug through the i915 driver though, since the only time it ever uses unmap_mapping_range() like this is in the i915_gem_shrink() shrinker callback. Instead, I wrote a reproducer against some out-of-tree GPU driver where the unmap_mapping_range() path can be triggered directly from userspace, and on a system with CONFIG_PAGE_POISONING, I managed to read PAGE_POISON (0xaa) out of the stale PTE from userspace after a few iterations. So sadly I don't have a nice reproducer for this issue that works upstream.

I guess if we want to avoid having extra TLB flushes for non-PFNMAP/MIXEDMAP VMAs, a possible fix would be to add a new bit in struct mmu_gather to track the existence of PTEs without struct page, and then conditionally flush before free_pgtables() if either that bit is set or mm_tlb_flush_nested() is true?
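To make the driver pattern described above more concrete, here is a minimal, hypothetical sketch of it. All names (demo_obj, demo_vm_fault(), demo_mmap(), demo_evict()) are invented for illustration and are not taken from i915 or any other real driver; the point is only the shape of the pattern: a VM_PFNMAP mapping whose PTE points at a normal page-allocator page, with the driver holding the only reference.

/*
 * Hypothetical driver pattern (invented names): a VM_PFNMAP mapping whose
 * PTE points at a normal page-allocator page. The core MM takes no
 * mapcount/refcount for it; the driver owns the page.
 */
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/mm.h>

struct demo_obj {
        struct page *page;             /* allocated with alloc_page() */
        struct address_space *mapping; /* the file's f_mapping */
};

/* Fault handler: insert the PFN directly; no struct page reference is taken. */
static vm_fault_t demo_vm_fault(struct vm_fault *vmf)
{
        struct demo_obj *obj = vmf->vma->vm_private_data;

        return vmf_insert_pfn(vmf->vma, vmf->address, page_to_pfn(obj->page));
}

static const struct vm_operations_struct demo_vm_ops = {
        .fault = demo_vm_fault,
};

static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
        vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
        vma->vm_ops = &demo_vm_ops;
        vma->vm_private_data = file->private_data;
        return 0;
}

/*
 * Eviction path (what a shrinker like i915_gem_shrink() conceptually does):
 * zap all userspace PTEs of the object, then give the page back to the page
 * allocator. This is only correct if no stale TLB entries for the page can
 * remain once unmap_mapping_range() has returned.
 */
static void demo_evict(struct demo_obj *obj)
{
        unmap_mapping_range(obj->mapping, 0, PAGE_SIZE, 1);
        __free_page(obj->page);
        obj->page = NULL;
}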
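For illustration, the general shape such a reproducer might take is sketched below. The device node /dev/demo_gpu and the DEMO_IOC_EVICT ioctl are invented placeholders standing in for the (unnamed, out-of-tree) driver's way of triggering the unmap_mapping_range()+free path from userspace; as noted above, this does not work against an unmodified upstream kernel. One thread hammers the mapping to keep a TLB entry live on its CPU, while the main thread races munmap() against the eviction path and the reader looks for PAGE_POISON (0xaa).

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_IOC_EVICT _IO('D', 0)  /* invented ioctl: zap PTEs + free page */

static volatile unsigned char *map;
static sigjmp_buf segv_env;

static void segv_handler(int sig)
{
        /* A read after the (delayed) TLB flush faults; end this iteration. */
        siglongjmp(segv_env, 1);
}

/* Keep a TLB entry for the page hot on this CPU and watch for PAGE_POISON. */
static void *reader(void *arg)
{
        signal(SIGSEGV, segv_handler);
        if (sigsetjmp(segv_env, 1) == 0) {
                for (;;) {
                        if (map[0] == 0xaa) {
                                printf("read PAGE_POISON through stale TLB entry\n");
                                exit(0);
                        }
                }
        }
        return NULL;
}

int main(void)
{
        int fd = open("/dev/demo_gpu", O_RDWR);

        for (;;) {
                pthread_t t;

                map = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED,
                           fd, 0);
                map[0] = 0x41;          /* fault in the PTE and a TLB entry */
                pthread_create(&t, NULL, reader, NULL);
                usleep(1000);           /* let the reader start hammering */

                /*
                 * The race: munmap() zaps the PTE and unlinks the VMA but may
                 * defer the TLB flush; the eviction ioctl then misses the VMA
                 * in unmap_mapping_range() and frees the page while the
                 * reader's CPU can still hold a stale TLB entry.
                 */
                munmap((void *)map, 0x1000);
                ioctl(fd, DEMO_IOC_EVICT);

                pthread_join(t, NULL);
        }
}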
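Roughly, that fix idea could look like the modified unmap_region() snippet below. This is only a sketch of the suggestion, not a tested patch: unmapped_special_pte is a hypothetical new mmu_gather field that zap_pte_range() would set when it clears a PTE with no struct page, while tlb_flush_mmu() and mm_tlb_flush_nested() are existing helpers.

        struct mmu_gather tlb;

        lru_add_drain();
        tlb_gather_mmu(&tlb, mm);
        update_hiwater_rss(mm);
        unmap_vmas(&tlb, vma, start, end);

        /*
         * Hypothetical: flush the deferred TLB batch before the VMAs become
         * invisible to unmap_mapping_range() in free_pgtables(), but only if
         * the zapped range contained PTEs without struct page (new
         * mmu_gather bit) or another task is unmapping concurrently.
         */
        if (tlb.unmapped_special_pte || mm_tlb_flush_nested(mm))
                tlb_flush_mmu(&tlb);

        free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
                      next ? next->vm_start : USER_PGTABLES_CEILING);
        tlb_finish_mmu(&tlb);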
This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2022-10-04.

Found by: [email protected]