ARM Linux Fast Address-Space Switching (FASS) This is current (hopefully) as of patch diff-2.4.18-rmk4-fass4 it gives an overview of the design and describes bits of the implementation, its structure and status. If you have any comments/corrections please let me know at Cheers, Adam Basic Idea: Ok the basic Idea is as follows. Rather then switch page directories (pg_dir) which requires the cache and TLB flushes a single pg_dir is used. I call this global pg_dir the "Caching Page Directory" or CPD. It can be thought of as a software TLB. The CPD stores pg_dir entries from multiple user page directories as well as the kernel's pg_dir entries. The pg_dir entries are tagged with ARM domains which are assigned uniquely to user tasks as well as a small set being reserved for tagging kernel mappings. A context switch from one user process to another is carried out by disabling the domain of the old process and enabling the domain of the new process. The set of domains allocated to the kernel mappings are manipulated as in the unmodified ARM Linux kernel. Using this technique the caches and TLBs need only be flushed if one of two events happen: 1) An entry in the CPD is replace by one owned by a different user process, hence tagged with a different domain. 2) A domain is currently owned by one user process is reallocated to another one. These issues both present a challenge to the design. 1) is a problem since the Unix address-space layout is typically the same for every process. 2) is a problem since typically we have many more processes then ARM domains which are fixed at 16 by the architecture. These problems are solved as follows: 1) A dual approach is used here to handle the two types of memory regions used in a user process: a) All memory regions with run time address bindings, ie those mapped in by mmap() without the MAP_FIXED flag, are grouped into 1MB regions (because pg_dir entries cover 1MB) that are already being used by the process. If the new mapping doesn't fit in an existing 1MB region allocated to the user process. A new one will be allocated, the newly allocated region will be one that causes the least contention with other processes using that same 1MB region, or the virtual address-space. Multiple processes can use the same 1MB region safely because of the domain tags and CPD mechanism. 'Least contention' can be defined as either least number of processes sharing a domain (likelihood of a contention) or least number of previous conflicts (history of contention) or some combination of the these. b) All memory regions with compile time address bindings, the text/data/bss/etc which are mapped in by mmap() with the MAP_FIXED flag, are as much as possible compiled to be within the first 32MB, ie below 0x02000000. The mappings are then relocated using the ARM PID register into one of the 32MB slots in the first 2GBs of the address-space. The relocation staggers 64 user processes uniquely then PIDs well have to be shared. Again the overlapping is safe as the CPD mechanism handles the contention. As there are more PIDs then domains the number shouldn't be an issue. The stack is also set to the top of the 32MB region, even though this could be placed anywhere, hence, staggered by 1a) it is placed in the PID relocation area because fork()ing will keep the same stack pointer casuing a collision between the two processes. While this will limit the size of brk() to be inside the 32MB area it does not limit malloc() as it will use mmap() to allocate new heaps if brk() fails. If the stack is simply staggered by 1a) the brk() grow past the PID relocation area as the PID relocation is transparent to the user process and most of the kernel. Only the CPD mechanism is aware of it. Actual allocation of PIDs still up in the air. At the moment it simply round-robins through them. Smarter allocation schemes might take into account the number of address-spaces using the PID and allocate the one with the least usage or with the least busy address-spaces, etc. To minimise collisions due to fork mappings flagged MAP_PRIVATE are as much as possible allocated in the 32MB relocation area. 2) At the moment the approach is to revoke what causes the least ammount of work, first we look for a clean domain to revoke. This means no cache flush and only invalidating the CPD entries tagged with the victim domain. If no clean domains are avaliable then a dirty one is used. In both cases the one with the least number of entries in the CPD is used. This might not be the most optimal way of doing it as tasks newly allocated a domain may be preempted first in times of domain thrashing. Data Cache Aliasing: Aliases occur when the same physical frame is mapped to more then one virtual address. This can happen within the same address-space or between different address-spaces. It is only a problem for writable pages as the different address lines will map to different cache lines. If one is modified the other will not see the change so accesses to that virtual address will access stale data. The problem can be address in one of two ways: 1) Mark the shared mapping as uncachable. This slows down accesses to the shared mappings but never requires cache flushing. ARM Linux currently uses this method for shared mappings within the same address-space. 2) Only activate one mapping at a time. When another mapping is accessed it will fault. At that stage the page in question is cleaned/flushed from the DCache, the old mapping is deactivated, and the accessed mapping is activated. faster accesses on the shared mappings but will cause slow downs when the frame changes mappings. I plan to implement and benchmark both solutions. Or at least analise it further, possibly have a flag to set which one is used. If the shared mapping uses the same virtual address in each address-space no cache flushing or uncacheable setting is needed. Although this will cause a CPD conflict anyway so no gain. Seperate domains for unique sharing groups could solve this but is definately future work. Lazy Cache/TLB Flushing: All the cache and tlb flushing routines used to flush user mappings make use of the following fact. If the CPD entry/entries mapping in the flush range are not tagged with the domain tag of the address-space being flushed, the caches cannot contain any data from that address-space and the TLBs cannot contain any mappings for that address-space. This is used to avoid many cache flushes. Another bit of information can be used. When the caches/TLBs are flushed all domains are marked as clean with respect to them. They remain clean until the domain is activated. This way an address-space with CPD entries in the flush region may still be clean if none of it's mappings have been touched. This technique of lazy flushing could be used by normal ARM Linux using the single user domain. Shared Domains: As well as using domains as a kind of address-space identifier (ASID) tagging address-spaces uniquely they can be used to tagged shared mappings that map to the same virtual addresses with the same size/permissions in each of the sharing address-spaces. This is known as shared domains. To support this every vm_area_struct (Linux mapping representation) is allocated a region, one is also allocated to the address-space. Mappings which are private to the address-space simply use the address-spaces region. Domains are now mapped to region's not address spaces and when we switch to a new address space we enabled the domains, if any, allocated to the regions accessable by the address-space. Implimentation Details: The implementation is spread around a LOT of the ARM specific files. However FASS slides nicely into Linux with only ARM arch specific modifications. The main part of FASS, the CPD mechanism, fits in the following way: * Entries are added into the CPD by ARM Data/Prefetch Aborts in arch/arm/mm/fault-armv.c the abort handlers are modifed and the functions do_cpd_fault() and do_domain_fault() are added. This functions go on to call CPD management functions in arch/arm/mm/cpd.c * Entries are removed from the CPD by the functions in arch/arm/mm/cpd.c this includes cpd_set_pmd() which updates the CPD to reflect changes in the user pg_dir. CPD entries are also removed in cpd_load() due to a conflict (replacement) and in domain_unallocate() due to domain preemption or the address-space being torn down. * The cache/TLB flush routines in include/asm-arm/proc-armv/cache.h all call an equivalent cpd_*() functions in arch/arm/mm/cpd.c to implement lazy flushing of caches/TLBs. * ARM PID management code is in arch/arm/mm/pid.c and include/asm-arm/proc-armv/pid.h * ARM domain management code is in arch/arm/mm/cpd.c , include/asm-arm/proc-armv/cpd.h and include/asm-arm/proc-armv/domain.h * An address-space mmu_context is defined in include/asm-arm/mmu.h and include/asm-arm/mmu_context.h defining a domain/PID allocated to the address-space. * Some architecture independent code changes have been made to the code that deals with vma's to add a new field 'vm_sharing_data' which points to the region data structure in ARM and should be used for similar TLB sharing support in other architectures like IA-64. 'vm_private_data' was not used as it is unclear how this field is used by other Linux code such as drivers. All other changes should be clear from the patch. To do: * arch/arm/mm/pid.c:pid_allocate() simply finds a PID with the lowest number of address-spaces using it. There could possibly be a better way of doing this. It also needs to decide when a ARM PID of 0, ie no VM relocation is used, this might have to be decided by the caller of pid_allocate(). Binaries located outside of the relocation area cause problems. * arch/arm/mm/fault-armv.c:update_mmu_cache() allocates region_structs to shared mapping to support shared domains. However at the moment it doesn't check the size or protection rights of the mapping allowing the max of each out the process set to be accessable to all address-spaces sharing the mapping. This is incorrect behaviour and needs to be delt fixed up some how. * arch/arm/mm/cpd.c:cpd_get_domain()/cpd_get_active_domain(), these functions should be updated to use the address-spaces private region's domain if the shared region has a count of one. This is to minimise domain thrashing, it may not provide much benifit and is a little tricky to do correctly. * The code in cpd.c could probably be split up into seperate files, it really should only have the cpd_load function and some helpers. Future Work: * Using sub-pages to deal with Cache Aliasing. This is another experimental idea. If you look at the StrongARM-1 Core the cache is made up of 1KB colours, if you don't understand cache colouring ignore this. But basically we use the (de)activation technique 2) for dealing with cache aliases except we (de)activate the 1KB sub-pages and flush 1KB sub-pages at a time. This better targets the flush and should reduce the overall cost. The implementation might be tricky though. * Modifying the architectural independent scheduling goodness() function to schedule around flushes, ie delay the flush as long as possible within a scheduling epoch. This is maybe not worth the trouble. * Domain-less CPD entries. This is only applicable to write-back caches that do not require a TLB access to retrieve the physical address of a write-back such at the ARM9 with it's TAG-RAM. What we can do is tag the CPD entry with a inactive domain (reserved for this task) effectively disabling the region and allowing the entries to get cleaned by normal opperation. These domain-list CPD entries can then be cleared removed on a cache flush. Very experimental idea, needs a lot more thought and probably won't work. Needs an ARM9 core too. This actually works on the StrongARM as well. I need to think about it more though. * Anything I've forgotton, basically tuning the implementation and experimenting with idea's for choosing a domain to preempt for recycling, choosing an ARM PID, and choosing 1MB regions for mmap().