A Brief Look at Memory Management in the Linux Kernel

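[Ten Months Past]: A Brief Look at Memory Management in the Linux Kernel

Why the name "Ten Months Past"? It commemorates my old blog.

For some reason I suddenly wanted a fresh start, and that blog happened to be exactly ten months old, with ten months of posts in it.

Among those ten months there are always a few things worth keeping, so I moved them here.

This article was written on 09.04.20 and 09.04.21.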

Jason Lee

————————————–cut-line

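1. Basic framework (this part mainly covers paged memory management)

4G is a somewhat sensitive number: until recently it was the upper limit on memory for most machines (or rather, operating systems). Why?

I say "until recently" because 64-bit computers are now common. On a 32-bit computer, paging works as follows. The address (strictly, the linear address) has this layout:

Bits 0-11: offset within the page (OFFSET)

Bits 12-21: page table index (PT)

Bits 22-31: page directory index (PGD)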

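The address is resolved as follows:

1) The paging hardware takes the pointer (base address) of the current page directory from register CR3;

2) base address + page directory index -> page table pointer (base address);

3) page table pointer + page table index -> base address of the memory page;

4) page base address + offset within the page -> the physical memory location.

Clearly a 12-bit offset can address 4 KB, so one page is 4 KB, and the total addressable memory is 2^10 * 2^10 * 2^12 = 4 GB. This is why the memory limit on a 32-bit machine is generally 4 GB.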
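The bit layout above can be sketched in a few lines of C. This is only an illustration of the arithmetic, not kernel code: the structure `linear_addr_parts` and the helper `split_linear_addr` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths of the two-level layout described above. */
#define OFFSET_BITS 12u
#define PT_BITS     10u
#define PGD_BITS    10u

struct linear_addr_parts {
    uint32_t pgd_index;  /* bits 22-31 */
    uint32_t pt_index;   /* bits 12-21 */
    uint32_t offset;     /* bits 0-11  */
};

/* Hypothetical helper: split a 32-bit linear address into its three fields. */
static struct linear_addr_parts split_linear_addr(uint32_t addr)
{
    struct linear_addr_parts p;

    p.offset    = addr & ((1u << OFFSET_BITS) - 1u);
    p.pt_index  = (addr >> OFFSET_BITS) & ((1u << PT_BITS) - 1u);
    p.pgd_index = addr >> (OFFSET_BITS + PT_BITS);
    assert(p.pgd_index < (1u << PGD_BITS));
    return p;
}
```

For example, the address 0xC0101234 splits into page directory index 0x300, page table index 0x101 and offset 0x234.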

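An operating system has to support different platforms, 32-bit and 64-bit among them. So Linux uniformly uses a three-level paging model: PGD, PMD, PT, plus OFFSET.

PAE is short for Physical Address Extension. When memory management is set to PAE mode, three levels of mapping are really needed.

How does a three-level framework implement two-level mapping? Linux quietly cheats a little. It is as if the boss told Linux to give three-level mapping an important position, but on the turf of a 32-bit machine Linux complies in name only and hands it a title with no power. So how is this nominal post implemented?

First, a machine with PAE enabled genuinely needs three-level mapping, so there the mechanism does real work rather than holding a nominal post. A 32-bit machine without PAE does not need three-level mapping and prefers two-level mapping. So the first step is to decide when three-level mapping gets only the nominal post: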

    /*
     * The Linux x86 paging architecture is 'compile-time dual-mode', it
     * implements both the traditional 2-level x86 page tables and the
     * newer 3-level PAE-mode page tables.
     */
    #ifndef __ASSEMBLY__
    #if CONFIG_X86_PAE
    # include <asm/pgtable-3level.h>

    /*
     * Need to initialise the X86 PAE caches
     */
    extern void pgtable_cache_init(void);

    #else
    # include <asm/pgtable-2level.h>

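The first comment tells us that the Linux x86 paging mechanism chooses at compile time between the traditional two-level mapping and the newer three-level mapping of PAE mode. The code that follows shows how: if CONFIG_X86_PAE is set, that is, PAE mode is enabled, pgtable-3level.h is included and a function for initialising the X86 PAE caches is declared; otherwise pgtable-2level.h is included and two-level mapping is used.

Two-level mapping as implemented in pgtable-2level.h: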

     4  /*
     5   * traditional i386 two-level paging structure:
     6   */
     7
     8  #define PGDIR_SHIFT     22
     9  #define PTRS_PER_PGD    1024
    10
    11  /*
    12   * the i386 is two-level, so we don't really have any
    13   * PMD directory physically.
    14   */
    15  #define PMD_SHIFT       22
    16  #define PTRS_PER_PMD    1

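The comment on lines 11 to 14 tells us that the PMD does not physically exist here. PGDIR_SHIFT is the shift of the PGD field, meaning the bit position within the 32-bit address at which the field starts: bit 22, i.e. the 23rd bit. PTRS_PER_PGD means "pointers per PGD", the number of pointers one PGD can hold. Here it is 1024, which clearly needs 10 bits, so the PGD field runs from bit 22 to bit 31, i.e. from the 23rd to the 32nd bit.

So the PMD is plainly a dummy here, holding a nominal post: PTRS_PER_PMD is 1, so the field occupies 0 bits, since 2^0 = 1.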
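A small sketch may make the "nominal post" concrete. The constants for the PGD and PMD are copied from the header quoted above; PAGE_SHIFT and PTRS_PER_PTE are assumptions not shown in that excerpt, and the three index helpers are hypothetical names written the way a generic three-level walk might use them.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT    12     /* assumption: 4 KB pages */
#define PGDIR_SHIFT   22
#define PTRS_PER_PGD  1024
#define PMD_SHIFT     22
#define PTRS_PER_PMD  1
#define PTRS_PER_PTE  1024   /* assumption: not part of the excerpt */

/* Hypothetical helpers: table index at each of the three levels. */
static uint32_t pgd_index(uint32_t addr)
{
    return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}

static uint32_t pmd_index(uint32_t addr)
{
    /* PTRS_PER_PMD - 1 == 0, so the mask removes every bit. */
    return (addr >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static uint32_t pte_index(uint32_t addr)
{
    return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
}

/* With the PMD folded away, the middle level never contributes an index. */
static void check_pmd_folded(uint32_t addr)
{
    assert(pmd_index(addr) == 0);
}
```

Code written against three levels therefore still works: the middle step always selects entry 0 and the walk effectively has two levels.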

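By now we know on whose turf three-level mapping gets a nominal post, and how that post is set up. When three-level mapping does real work, it is essentially the same as two-level mapping, only with a few more bits.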

————————————–cut-line

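2. Data structures and functions

As is well known, Linux has many data types that are not part of ANSI C, such as pid_t. These types are built through one or more layers of typedef. A major reason for doing so is portability, and a side effect is that the type name tells you at a glance what it is for: pid_t is obviously the type of a process ID. Another consequence is that the kernel has to be compiled with a matching gcc.

So what are the PGD, PMD, PT and so on mentioned in part (1)? include/asm-i386/page.h contains the following code: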

    /*
     * These are used to make use of C type-checking..
     */
    #if CONFIG_X86_PAE
    typedef struct { unsigned long pte_low, pte_high; } pte_t;
    typedef struct { unsigned long long pmd; } pmd_t;
    typedef struct { unsigned long long pgd; } pgd_t;
    #define pte_val(x)      ((x).pte_low | ((unsigned long long)(x).pte_high << 32))

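With PAE enabled, pgd_t and pmd_t each wrap an unsigned long long, while pte_t is split into two parts, pte_low and pte_high. PTE stands for page table entry: one specific entry that points to one specific memory page. In the 32-bit case an entry does not need all 32 bits for the address. Every page is 4 KB, so starting from address 0 there is a page every 4 KB, and the low 12 bits of any page's base address are all 0. The high 20 bits are enough to point to a page base, and the low 12 bits are used for page status and permissions. There is also a macro, pte_val, that reads the value held in a pte_t.

Without PAE the definitions are: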

    #else
    typedef struct { unsigned long pte_low; } pte_t;
    typedef struct { unsigned long pmd; } pmd_t;
    typedef struct { unsigned long pgd; } pgd_t;
    #define pte_val(x)      ((x).pte_low)
    #endif

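With structures such as the PMD in place there is somewhere to store address information. How is that information read back? See the following macros: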

    54  #define pmd_val(x)      ((x).pmd)
    55  #define pgd_val(x)      ((x).pgd)
    56  #define pgprot_val(x)   ((x).pgprot)
    57
    58  #define __pte(x) ((pte_t) { (x) } )
    59  #define __pmd(x) ((pmd_t) { (x) } )
    60  #define __pgd(x) ((pgd_t) { (x) } )
    61  #define __pgprot(x)     ((pgprot_t) { (x) } )

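Lines 54 to 56 are macros that read the member variable, and lines 58 to 61 perform type conversion. A new name appears here, pgprot, short for page protection. pgprot corresponds to the page status and permissions mentioned above and is what implements page protection: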
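The point of wrapping a plain integer in a one-member struct is that the compiler then treats each kind of value as a distinct type. The sketch below copies the non-PAE definitions quoted above into a user-space file; `pte_is_zero` and `pte_roundtrip` are hypothetical helpers added for illustration.

```c
#include <assert.h>

typedef struct { unsigned long pte_low; } pte_t;
typedef struct { unsigned long pgprot; } pgprot_t;

#define pte_val(x)      ((x).pte_low)
#define pgprot_val(x)   ((x).pgprot)
#define __pte(x)        ((pte_t) { (x) } )
#define __pgprot(x)     ((pgprot_t) { (x) } )

/* Accepts only a pte_t: passing a pgprot_t or a bare integer fails to compile. */
static int pte_is_zero(pte_t pte)
{
    return pte_val(pte) == 0;
}

/* Wrap a raw value and unwrap it again; the value itself is unchanged. */
static unsigned long pte_roundtrip(unsigned long raw)
{
    pte_t pte = __pte(raw);

    assert(pte_val(pte) == raw);
    return pte_val(pte);
}
```

The wrapper should cost nothing at run time, since the struct has the same size as its single member; the benefit is purely the compile-time check.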

    typedef struct { unsigned long pgprot; } pgprot_t;

The individual bits a pgprot_t can carry are defined in /include/asm-i386/pgtable.h:

    #define _PAGE_PRESENT   0x001
    #define _PAGE_RW        0x002
    #define _PAGE_USER      0x004
    #define _PAGE_PWT       0x008
    #define _PAGE_PCD       0x010
    #define _PAGE_ACCESSED  0x020
    #define _PAGE_DIRTY     0x040
    #define _PAGE_PSE       0x080   /* 4 MB (or 2MB) page, Pentium+, if present.. */
    #define _PAGE_GLOBAL    0x100   /* Global TLB entry PPro+ */

    #define _PAGE_PROTNONE  0x080   /* If not present */

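Clearly the pgprot_t bits all lie in the low 12 bits, while the pointer part of a PTE is the high 20 bits; together they make up 32 bits. So how are the two combined into a 32-bit page table entry? The natural idea is to shift the 20-bit value left by 12 bits and OR it with the low 12 bits of the pgprot_t. In pgtable.h this is done by the macro mk_pte: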

    #define mk_pte(page, pgprot) __mk_pte((page) - mem_map, (pgprot))

This naturally leads us to __mk_pte. What is it? In /include/asm-i386/pgtable-2level.h it is a macro:

    63  #define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot))

The above is line 63, a single line. PAGE_SHIFT is defined as a macro in /include/asm-i386/page.h:

    #define PAGE_SHIFT      12

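So the macro shifts the page number left by 12 bits and ORs it with the protection field pgprot to produce the page table entry. __pte(), which appears above, is defined as #define __pte(x) ((pte_t) { (x) } ), a type conversion. pgprot_val(pgprot) is defined as #define pgprot_val(x) ((x).pgprot); set against typedef struct { unsigned long pgprot; } pgprot_t; it evidently reads the member pgprot of a variable of type pgprot_t.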
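The shift-and-OR can be tried out in user space. The sketch below mirrors what __mk_pte does for the two-level case; `make_pte`, `pte_frame` and `pte_flags` are hypothetical names, and only three of the flag bits listed earlier are reproduced.

```c
#include <assert.h>

#define PAGE_SHIFT      12
#define _PAGE_PRESENT   0x001UL
#define _PAGE_RW        0x002UL
#define _PAGE_USER      0x004UL

typedef struct { unsigned long pte_low; } pte_t;
typedef struct { unsigned long pgprot; } pgprot_t;

/* Combine a page number with protection bits, as __mk_pte does. */
static pte_t make_pte(unsigned long page_nr, pgprot_t prot)
{
    pte_t pte;

    /* The protection bits must fit in the low 12 bits. */
    assert((prot.pgprot >> PAGE_SHIFT) == 0);
    pte.pte_low = (page_nr << PAGE_SHIFT) | prot.pgprot;
    return pte;
}

/* High 20 bits: the page number. */
static unsigned long pte_frame(pte_t pte)
{
    return pte.pte_low >> PAGE_SHIFT;
}

/* Low 12 bits: status and permission flags. */
static unsigned long pte_flags(pte_t pte)
{
    return pte.pte_low & ((1UL << PAGE_SHIFT) - 1);
}
```

Page number 0x12345 with the flags _PAGE_PRESENT | _PAGE_RW gives the entry 0x12345003, and both parts can be recovered from it unchanged.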

That leaves mem_map. Let us first look at the page structure in /include/linux/mm.h.

It begins with an introductory comment:

    /*
     * Each physical page in the system has a struct page associated with
     * it to keep track of whatever it is we are using the page for at the
     * moment. Note that we have no way to track which tasks are using
     * a page.
     *
     * Try to keep the most commonly accessed fields in single cache lines
     * here (16 bytes or greater).  This ordering should be particularly
     * beneficial on 32-bit processors.
     *
     * The first line is data used in page cache lookup, the second line
     * is used for linear searches (eg. clock algorithm scans).
     *
     * TODO: make this structure smaller, it could be as small as 32 bytes.
     */

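In short, a page structure is associated with each physical page and tracks its state. In addition, the most frequently accessed members should be kept within a single cache line of 16 bytes or more, which obviously helps fast access. Next comes the definition of the page structure: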

    154  typedef struct page {
    155      struct list_head list;          /* ->mapping has some page lists. */
    156      struct address_space *mapping;  /* The inode (or ...) we belong to. */
    157      unsigned long index;            /* Our offset within mapping. */
    158      struct page *next_hash;         /* Next page sharing our hash bucket in
    159                                         the pagecache hash table. */
    160      atomic_t count;                 /* Usage count, see below. */
    161      unsigned long flags;            /* atomic flags, some possibly
    162                                         updated asynchronously */
    163      struct list_head lru;           /* Pageout list, eg. active_list;
    164                                         protected by pagemap_lru_lock !! */
    165      struct page **pprev_hash;       /* Complement to *next_hash. */
    166      struct buffer_head * buffers;   /* Buffer maps us to a disk block. */
    167
    168      /*
    169       * On machines where all RAM is mapped into kernel address space,
    170       * we can simply calculate the virtual address. On machines with
    171       * highmem some memory is mapped into kernel virtual memory
    172       * dynamically, so we need a place to store that address.
    173       * Note that this field could be 16 bits on x86 ... ;)
    174       *
    175       * Architectures with slow multiplication can define
    176       * WANT_PAGE_VIRTUAL in asm/page.h
    177       */
    178  #if defined(CONFIG_HIGHMEM) || defined(WANT_PAGE_VIRTUAL)
    179      void *virtual;                  /* Kernel virtual address (NULL if
    180                                         not kmapped, ie. highmem) */
    181  #endif /* CONFIG_HIGMEM || WANT_PAGE_VIRTUAL */
    182  } mem_map_t;

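The last line (line 182) brings a moment of recognition: mem_map_t. We naturally guess that mem_map is a variable of this type.

In fact mem_map is a global variable (so far, at least), a pointer to an array of page structures. The system creates this array at initialization according to the amount of physical memory, and each array element corresponds to one physical page. From the software side, the high 20 bits of a page table entry are the number of the physical page, that is, the index into the mem_map array, and through that index we reach the page structure of the physical page. From the hardware side, the high 20 bits of a page table entry followed by 12 zero bits form a 32-bit value, the base address of the physical page.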
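The relation between a physical address, the page number and the mem_map array comes down to a shift and an array index. The following is a toy user-space model under stated assumptions: the array has only 8 entries, `struct page` is reduced to one field, and `phys_to_page` / `page_to_phys` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define NR_PAGES   8    /* toy size, for illustration only */

struct page { unsigned long flags; };   /* reduced stand-in for the real structure */

static struct page mem_map_storage[NR_PAGES];
static struct page *mem_map = mem_map_storage;

/* Physical address -> descriptor of the page that contains it. */
static struct page *phys_to_page(unsigned long paddr)
{
    unsigned long pfn = paddr >> PAGE_SHIFT;   /* page number = high bits */

    assert(pfn < NR_PAGES);
    return mem_map + pfn;
}

/* Descriptor -> base physical address of its page (low 12 bits are zero). */
static unsigned long page_to_phys(const struct page *page)
{
    return (unsigned long)(page - mem_map) << PAGE_SHIFT;
}
```

Note that `(page) - mem_map` in page_to_phys is the same pointer subtraction that mk_pte uses to obtain the page number.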

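mem_map covers all physical pages, and is itself divided into zones such as ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM. ZONE_DMA is for DMA; ZONE_HIGHMEM handles physical memory beyond roughly 1 GB.

In fact the three zones are laid out like this: 0 to 16 MB goes to ZONE_DMA, 16 to 896 MB goes to ZONE_NORMAL, and everything above 896 MB goes to ZONE_HIGHMEM. Why this split? Some hardware can only perform DMA on addresses in the 0 to 16 MB range. On some machine configurations the kernel cannot keep every physical page permanently mapped into its address space, and those pages need ZONE_HIGHMEM and dynamic mapping. The rest is memory that can be mapped normally.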
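As a rough sketch of the split just described, the function below classifies a physical address by zone. The boundaries are the ones given in the text for a PC; `zone_of_paddr` and `zone_name` are hypothetical helpers, and the name strings are only labels chosen for this example.

```c
#include <assert.h>

enum zone_id { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

#define MB (1024UL * 1024UL)

/* Which zone a physical address falls into, using the PC layout above. */
static enum zone_id zone_of_paddr(unsigned long paddr)
{
    if (paddr < 16 * MB)
        return ZONE_DMA;
    if (paddr < 896 * MB)
        return ZONE_NORMAL;
    return ZONE_HIGHMEM;
}

/* Human-readable label for a zone (labels are illustrative). */
static const char *zone_name(enum zone_id id)
{
    static const char *const names[] = { "DMA", "Normal", "HighMem" };

    assert((int)id >= 0 && (int)id <= (int)ZONE_HIGHMEM);
    return names[id];
}
```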

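Then why 896 MB here rather than the 1 GB mentioned above? Because within its address space the kernel reserves virtual memory not only for highmem mappings but also for fixmap and vmalloc.

OK, so what is a virtual address in the kernel? A virtual address is simply a logical address, the counterpart of a physical address.

Let us look at how physical addresses and kernel virtual addresses are related in kernel space: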

    #define PAGE_OFFSET             ((unsigned long)__PAGE_OFFSET)
    …
    #define __pa(x)                 ((unsigned long)(x)-PAGE_OFFSET)
    #define __va(x)                 ((void *)((unsigned long)(x)+PAGE_OFFSET))

pa stands for physical address and va for virtual address. This sends us to __PAGE_OFFSET:

    /*
     * This handles the memory map.. We could make this a config
     * option, but too many people screw it up, and too few need
     * it.
     *
     * A __PAGE_OFFSET of 0xC0000000 means that the kernel has
     * a virtual address space of one gigabyte, which limits the
     * amount of physical memory you can use to about 950MB.
     *
     * If you want more physical memory than this then see the CONFIG_HIGHMEM4G
     * and CONFIG_HIGHMEM64G options in the kernel configuration.
     */

    #define __PAGE_OFFSET           (0xC0000000)

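A long comment, and a one-line macro definition. On a 32-bit machine the kernel's paging provides 4 GB of logical (virtual) address space. Of these 4 GB, the topmost 1 GB, from 0xC0000000 to 0xFFFFFFFF, is used by the kernel itself and is called "kernel space"; the lower 3 GB are user space. Note that these are virtual, logical addresses.

So __PAGE_OFFSET is the boundary between user space and kernel space in the virtual address range. Physical addresses, however, always start at 0x00000000, so for kernel space pa and va differ by exactly PAGE_OFFSET. PAGE_OFFSET is at the same time the upper limit of user space.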
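The conversion is plain addition and subtraction of a constant, which a short sketch can show. To stay runnable on any host, it works on 32-bit integers rather than real pointers; `virt_to_phys32`, `phys_to_virt32` and `is_kernel_address` are hypothetical names, not the kernel's macros.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u

/* Kernel virtual address -> physical address (direct-mapped region only). */
static uint32_t virt_to_phys32(uint32_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET);
    return vaddr - PAGE_OFFSET;
}

/* Physical address -> kernel virtual address in the direct mapping. */
static uint32_t phys_to_virt32(uint32_t paddr)
{
    return paddr + PAGE_OFFSET;
}

/* PAGE_OFFSET is also the boundary between user space and kernel space. */
static int is_kernel_address(uint32_t vaddr)
{
    return vaddr >= PAGE_OFFSET;
}
```

For instance, kernel virtual address 0xC0100000 corresponds to physical address 0x00100000, while 0xBFFFFFFF is the last byte of user space.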

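At this point we see that kernel space can "linearly map" at most 1 GB of physical addresses; without ZONE_HIGHMEM to manage the physical memory above that, the memory would be wasted. So at initialization the system reserves 128 MB of virtual address space for mappings that may be needed later, which leaves 1 GB - 128 MB = 896 MB for the direct mapping. All this applies to the x86 architecture; on architectures where all physical memory can be mapped, ZONE_HIGHMEM is empty.

Now back to the zones. /include/linux/mmzone.h has the following data structure for a zone:

(The code is rather long, so we take it in pieces.)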

    /*
     * On machines where it is needed (eg PCs) we divide physical memory
     * into multiple physical zones. On a PC we have 3 zones:
     *
     * ZONE_DMA       < 16 MB       ISA DMA capable memory
     * ZONE_NORMAL  16-896 MB       direct mapped by the kernel
     * ZONE_HIGHMEM  > 896 MB       only page cache and user processes
     */

This introductory comment describes how the three zones are laid out.

    typedef struct zone_struct {
        /*
         * Commonly accessed fields:
         */
        spinlock_t              lock;
        unsigned long           free_pages;

These are the frequently accessed fields. Here we meet the data type spinlock_t, which is defined in /include/asm-i386/spinlock.h:

    /*
     * Your basic SMP spinlocks, allowing only a single CPU anywhere
     */

    typedef struct {
        volatile unsigned int lock;
    #if SPINLOCK_DEBUG
        unsigned magic;
    #endif
    } spinlock_t;

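The comment tells us that this is the basic lock for SMP: it lets only a single CPU into the protected region at a time.

free_pages is the number of free pages the zone currently has.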

        /*
         * We don't know if the memory that we're going to allocate will be freeable
         * or/and it will be released eventually, so to avoid totally wasting several
         * GB of ram we must reserve some of the lower zone memory (otherwise we risk
         * to run OOM on the lower zones despite there's tons of freeable ram
         * on the higher zones).
         */
        zone_watermarks_t       watermarks[MAX_NR_ZONES];

The comment shows that this exists to keep some low memory in reserve. It also brings in another new data type:

    typedef struct zone_watermarks_s {
        unsigned long min, low, high;
    } zone_watermarks_t;

The zone structure then continues:

        /*
         * The below fields are protected by different locks (or by
         * no lock at all like need_balance), so they're longs to
         * provide an atomic granularity against each other on
         * all architectures.
         */
        unsigned long           need_balance;
        /* protected by the pagemap_lru_lock */
        unsigned long           nr_active_pages, nr_inactive_pages;
        /* protected by the pagecache_lock */
        unsigned long           nr_cache_pages;

        /*
         * free areas of different sizes
         */
        free_area_t             free_area[MAX_ORDER];

This introduces free_area_t:

    typedef struct free_area_struct {
        struct list_head        free_list;
        unsigned long           *map;
    } free_area_t;

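Here free_area[MAX_ORDER] is a set of queues, one per block size ("free areas of different sizes", as the comment puts it), from which blocks of free pages are allocated. Each queue is implemented through the member struct list_head free_list of free_area_t; see list.h.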
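The excerpt does not show how the queues are indexed. A common reading, and the assumption made in this sketch, is the buddy scheme: queue number `order` holds free blocks of 2^order contiguous pages. Under that assumption, the two hypothetical helpers below compute which queue a request would use and where a block's "buddy" lies. The value 10 for MAX_ORDER is also an assumption.

```c
#include <assert.h>

#define MAX_ORDER 10    /* assumption: not given in the excerpt */

/* Smallest order whose block (2^order pages) can hold nr_pages pages. */
static unsigned int order_for_pages(unsigned long nr_pages)
{
    unsigned int order = 0;

    assert(nr_pages > 0);
    while ((1UL << order) < nr_pages)
        order++;
    assert(order < MAX_ORDER);
    return order;
}

/* Index of the neighbouring block of the same size (the "buddy"). */
static unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}
```

A request for 3 pages would be served from the order-2 queue (blocks of 4 pages), and the order-2 block starting at page 4 has the block starting at page 0 as its buddy.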

        /*
         * wait_table           -- the array holding the hash table
         * wait_table_size      -- the size of the hash table array
         * wait_table_shift     -- wait_table_size
         *                              == BITS_PER_LONG (1 << wait_table_bits)
         *
         * The purpose of all these is to keep track of the people
         * waiting for a page to become available and make them
         * runnable again when possible. The trouble is that this
         * consumes a lot of space, especially when so few things
         * wait on pages at a given time. So instead of using
         * per-page waitqueues, we use a waitqueue hash table.
         *
         * The bucket discipline is to sleep on the same queue when
         * colliding and wake all in that wait queue when removing.
         * When something wakes, it must check to be sure its page is
         * truly available, a la thundering herd. The cost of a
         * collision is great, but given the expected load of the
         * table, they should be so rare as to be outweighed by the
         * benefits from the saved space.
         *
         * __wait_on_page() and unlock_page() in mm/filemap.c, are the
         * primary users of these fields, and in mm/page_alloc.c
         * free_area_init_core() performs the initialization of them.
         */
        wait_queue_head_t       * wait_table;
        unsigned long           wait_table_size;
        unsigned long           wait_table_shift;

Some information about the zone itself follows:

    109      /*
    110       * Discontig memory support fields.
    111       */
    112      struct pglist_data      *zone_pgdat;
    113      struct page             *zone_mem_map;
    114      unsigned long           zone_start_paddr;
    115      unsigned long           zone_start_mapnr;

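Line 112 is the memory node the zone belongs to; line 113 is evidently a memory map; line 114 is the physical start address of the zone; and line 115 is the zone's starting index in mem_map. All of this can be read directly from the variable names.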

    117      /*
    118       * rarely used fields:
    119       */
    120      char                    *name;
    121      unsigned long           size;
    122      unsigned long           realsize;
    123  } zone_t;

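Line 120 is the name of the zone, line 121 its size, and line 122 its actually usable size.

With the introduction of multiple CPUs came NUMA (Non-Uniform Memory Access), a non-uniform memory architecture. Each CPU has its own local physical memory module, and there is also a shared memory module. Sometimes a CPU cannot get the memory block it requests from the module it manages itself, and it should not reach too far into modules managed by other CPUs, so the request goes to the shared module. The physical page management mechanism was revised accordingly.

Under NUMA, a contiguous range of physical pages local to a CPU is called a node. mem_map is then no longer a global variable but belongs to a particular node; zones no longer sit at the top either but are owned by nodes, and each memory node has at least two zones. This is why the pglist_data structure sits above zone_struct. It is defined in /include/linux/mmzone.h: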

    142  /*
    143   * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
    144   * (mostly NUMA machines?) to denote a higher-level memory zone than the
    145   * zone_struct denotes.
    146   *
    147   * On NUMA machines, each NUMA node would have a pg_data_t to describe
    148   * it's memory layout.
    149   *
    150   * XXX: we need to move the global memory statistics (active_list, ...)
    151   *      into the pg_data_t to properly support NUMA.
    152   */
    153  struct bootmem_data;
    154  typedef struct pglist_data {
    155      zone_t node_zones[MAX_NR_ZONES];
    156      zonelist_t node_zonelists[GFP_ZONEMASK+1];
    157      int nr_zones;
    158      struct page *node_mem_map;
    159      unsigned long *valid_addr_bitmap;
    160      struct bootmem_data *bdata;
    161      unsigned long node_start_paddr;
    162      unsigned long node_start_mapnr;
    163      unsigned long node_size;
    164      int node_id;
    165      struct pglist_data *node_next;
    166  } pg_data_t;

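Look first at line 158, struct page *node_mem_map. Each node owns a range of pages, and node_mem_map is the array of page structures that describes them. Then the first member, line 155: zone_t node_zones[MAX_NR_ZONES] holds the zones owned by this node. Correspondingly, zone_struct has the member struct pglist_data *zone_pgdat, which points back to the pglist_data of the node the zone belongs to.

--------------cut-line -- the data structures above are used for managing physical pages -- evening of 2009-04-20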

————————————–cut-line

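(continued) Data structures and functions

We now turn to the data structures and functions used for virtual memory management.

The virtual memory a process uses usually consists of a number of separate intervals. The data structure for one interval is defined in /include/linux/mm.h: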

    38  /*
    39   * This struct defines a memory VMM memory area. There is one of these
    40   * per VM-area/task.  A VM area is any part of the process virtual memory
    41   * space that has a special rule for the page-fault handlers (ie a shared
    42   * library, the executable area etc).
    43   */
    44  struct vm_area_struct {
    45      struct mm_struct * vm_mm;       /* The address space we belong to. */
    46      unsigned long vm_start;         /* Our start address within vm_mm. */
    47      unsigned long vm_end;           /* The first byte after our end address
    48                                         within vm_mm. */
    49
    50      /* linked list of VM areas per task, sorted by address */
    51      struct vm_area_struct *vm_next;
    52
    53      pgprot_t vm_page_prot;          /* Access permissions of this VMA. */
    54      unsigned long vm_flags;         /* Flags, listed below. */
    55
    56      rb_node_t vm_rb;
    57
    58      /*
    59       * For areas with an address space and backing store,
    60       * one of the address_space->i_mmap{,shared} lists,
    61       * for shm areas, the list of attaches, otherwise unused.
    62       */
    63      struct vm_area_struct *vm_next_share;
    64      struct vm_area_struct **vm_pprev_share;
    65
    66      /* Function pointers to deal with this struct. */
    67      struct vm_operations_struct * vm_ops;
    68
    69      /* Information about our backing store: */
    70      unsigned long vm_pgoff;         /* Offset (within vm_file) in PAGE_SIZE
    71                                         units, *not* PAGE_CACHE_SIZE */
    72      struct file * vm_file;          /* File we map to (can be NULL). */
    73      unsigned long vm_raend;         /* XXX: put full readahead info here. */
    74      void * vm_private_data;         /* was vm_pte (shared mem) */
    75  };

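Line 45 defines a pointer to an mm_struct, a structure we will look at shortly. vm_start and vm_end are the start and end of this vm_area; note that vm_end is the first address after the vm_area and does not belong to it.

Line 51 defines vm_next, a pointer to a vm_area_struct. Because the intervals a process uses are scattered, they are linked into a list to keep them together, and vm_next points to the next vm_area. The list is sorted by address.

Line 53, pgprot_t vm_page_prot, is evidently the protection information of this vm_area; pgprot_t was discussed earlier.

Line 54, vm_flags, holds the flags of this vm_area, as follows: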

     77  /*
     78   * vm_flags..
     79   */
     80  #define VM_READ         0x00000001      /* currently active flags */
     81  #define VM_WRITE        0x00000002
     82  #define VM_EXEC         0x00000004
     83  #define VM_SHARED       0x00000008
     84
     85  #define VM_MAYREAD      0x00000010      /* limits for mprotect() etc */
     86  #define VM_MAYWRITE     0x00000020
     87  #define VM_MAYEXEC      0x00000040
     88  #define VM_MAYSHARE     0x00000080
     89
     90  #define VM_GROWSDOWN    0x00000100      /* general info on the segment */
     91  #define VM_GROWSUP      0x00000200
     92  #define VM_SHM          0x00000400      /* shared memory area, don't swap out */
     93  #define VM_DENYWRITE    0x00000800      /* ETXTBSY on write attempts.. */
     94
     95  #define VM_EXECUTABLE   0x00001000
     96  #define VM_LOCKED       0x00002000
     97  #define VM_IO           0x00004000      /* Memory mapped I/O or similar */
     98
     99                                          /* Used by sys_madvise() */
    100  #define VM_SEQ_READ     0x00008000      /* App will access data sequentially */
    101  #define VM_RAND_READ    0x00010000      /* App will not benefit from clustered reads */
    102
    103  #define VM_DONTCOPY     0x00020000      /* Do not copy this vma on fork */
    104  #define VM_DONTEXPAND   0x00040000      /* Cannot expand with mremap() */
    105  #define VM_RESERVED     0x00080000      /* Don't unmap it from swap_out */
    106
    107  #ifndef VM_STACK_FLAGS
    108  #define VM_STACK_FLAGS  0x00000177
    109  #endif

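Lines 80 to 83 say whether the pages can be read, written, executed and shared.

Lines 85 to 88 say whether the corresponding flags on lines 80 to 83 may be set (the limits for mprotect() and the like).

Line 95 means the pages contain executable code.

Line 96 means the pages are locked.

The other flags are explained by their comments.

A common question at this point: a vm_area may contain many pages, so why is there only one vm_page_prot and one vm_flags? Because all pages of one vm_area must share the same protection information and status flags.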
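The flag values above place each VM_MAY* bit exactly four bit positions above the access bit it governs, so a permission check reduces to a shift and a mask. The sketch below shows that check; `may_set_access` is a hypothetical helper and is not taken from the kernel source.

```c
#include <assert.h>

#define VM_READ         0x00000001UL
#define VM_WRITE        0x00000002UL
#define VM_EXEC         0x00000004UL
#define VM_MAYREAD      0x00000010UL
#define VM_MAYWRITE     0x00000020UL
#define VM_MAYEXEC      0x00000040UL

/*
 * Returns 1 if every access bit in `wanted` is allowed by the matching
 * VM_MAY* bit in vm_flags, 0 otherwise.
 */
static int may_set_access(unsigned long vm_flags, unsigned long wanted)
{
    assert((wanted & ~(VM_READ | VM_WRITE | VM_EXEC)) == 0);
    /* Shifting right by 4 lines each VM_MAY* bit up with its access bit. */
    return (wanted & ~(vm_flags >> 4)) == 0;
}
```

An area with VM_MAYREAD | VM_MAYWRITE could thus be made writable but never executable.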

Now back to vm_area_struct.

Line 56 is rb_node_t vm_rb; rb_node_t is the node type of a red-black tree. The node structure is:

    typedef struct rb_node_s
    {
        struct rb_node_s * rb_parent;
        int rb_color;
    #define RB_RED          0
    #define RB_BLACK        1
        struct rb_node_s * rb_right;
        struct rb_node_s * rb_left;
    }
    rb_node_t;

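A red-black tree is used because searching a linked list has to start from the head every time, which hurts performance.

Lines 63 and 64 are the next and previous areas in shared memory: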

    58      /*
    59       * For areas with an address space and backing store,
    60       * one of the address_space->i_mmap{,shared} lists,
    61       * for shm areas, the list of attaches, otherwise unused.
    62       */
    63      struct vm_area_struct *vm_next_share;
    64      struct vm_area_struct **vm_pprev_share;

Line 67 defines vm_ops, which points to a vm_operations_struct. That structure is defined in /include/linux/mm.h:

    /*
     * These are the virtual MM functions - opening of an area, closing and
     * unmapping it (needed to keep files on disk up-to-date etc), pointer
     * to the functions called when a no-page or a wp-page exception occurs.
     */
    struct vm_operations_struct {
        void (*open)(struct vm_area_struct * area);
        void (*close)(struct vm_area_struct * area);
        struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);
    };

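So vm_ops is a pointer through which operation functions can be applied to the vm_area. open and close are used to open and close the virtual memory area, and nopage is called when a requested page is not in memory.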
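The pattern here is a table of function pointers attached to each area, so that different kinds of areas can supply their own handlers. The following is a stripped-down user-space sketch of that idea. All types and names in it (`struct vm_area`, `struct vm_ops`, `demo_nopage`, `handle_missing_page`) are hypothetical and much simpler than the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct page { int id; };
struct vm_area;

/* Reduced operations table: only the nopage callback is modelled. */
struct vm_ops {
    struct page *(*nopage)(struct vm_area *area, unsigned long address);
};

struct vm_area {
    unsigned long vm_start;
    unsigned long vm_end;           /* first address past the end */
    const struct vm_ops *ops;       /* may be NULL */
};

static struct page demo_page = { 42 };

/* Example handler: hands back one fixed page for any address in the area. */
static struct page *demo_nopage(struct vm_area *area, unsigned long address)
{
    assert(address >= area->vm_start && address < area->vm_end);
    return &demo_page;
}

static const struct vm_ops demo_ops = { demo_nopage };

/* Dispatch a "page not present" event to the area's own handler, if any. */
static struct page *handle_missing_page(struct vm_area *area, unsigned long address)
{
    if (area->ops == NULL || area->ops->nopage == NULL)
        return NULL;
    return area->ops->nopage(area, address);
}
```

The caller does not need to know what kind of area it is dealing with; it only checks whether a handler exists and calls it.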

The remaining members of vm_area_struct are explained by their comments.

————————————–cut-line

When we started on vm_area_struct, we mentioned mm_struct.

    206  struct mm_struct {
    207      struct vm_area_struct * mmap;           /* list of VMAs */
    208      rb_root_t mm_rb;
    209      struct vm_area_struct * mmap_cache;     /* last find_vma result */
    210      pgd_t * pgd;
    211      atomic_t mm_users;                      /* How many users with user space? */
    212      atomic_t mm_count;                      /* How many references to "struct mm_struct" (users count as 1) */
    213      int map_count;                          /* number of VMAs */
    214      struct rw_semaphore mmap_sem;
    215      spinlock_t page_table_lock;             /* Protects task page tables and mm->rss */
    216
    217      struct list_head mmlist;                /* List of all active mm's.  These are globally strung
    218                                               * together off init_mm.mmlist, and are protected
    219                                               * by mmlist_lock
    220                                               */
    221
    222      unsigned long start_code, end_code, start_data, end_data;
    223      unsigned long start_brk, brk, start_stack;
    224      unsigned long arg_start, arg_end, env_start, env_end;
    225      unsigned long rss, total_vm, locked_vm;
    226      unsigned long def_flags;
    227      unsigned long cpu_vm_mask;
    228      unsigned long swap_address;
    229
    230      unsigned dumpable:1;
    231
    232      /* Architecture-specific MM context */
    233      mm_context_t context;
    234  };

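mmap on line 207 points to the list of virtual memory areas.

Line 208 is the root of the red-black tree.

mmap_cache on line 209 points to the most recently used virtual memory area. An area contains a number of pages, so the next page requested is quite likely to be in the same area.

pgd on line 210 is evidently the page directory of the process. When the kernel schedules a process to run, it converts this pointer to a physical address and writes it into control register CR3.

mm_users on line 211 counts the users of the user space, while mm_count on line 212 is the number of references to the mm_struct itself.

map_count on line 213 is the number of vm_areas.

Lines 214 and 215 are locks: mmap_sem is a read/write semaphore, and page_table_lock is a spinlock that, as its comment says, protects the task's page tables and mm->rss.

Line 217 links the mm_struct into the list of all mm_structs.

The purpose of the remaining members is fairly obvious.

From mm_users and mm_count we can tell that one mm_struct may be referenced by several processes, but one process can use only one mm_struct.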
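The comment "users count as 1" on mm_count suggests a two-level reference count: all users of the address space together hold a single reference on the structure. The sketch below models that reading with plain integers. It is an assumption-laden simplification, not the kernel's implementation: `struct mm_demo` and its functions are hypothetical, there is no atomicity, and "freeing" is just a flag.

```c
#include <assert.h>

struct mm_demo {
    int mm_users;            /* users of the user address space */
    int mm_count;            /* references to the structure; all users count as 1 */
    int mappings_torn_down;  /* set when the last user goes away */
    int freed;               /* set when the last reference goes away */
};

static void mm_demo_init(struct mm_demo *mm)
{
    mm->mm_users = 1;
    mm->mm_count = 1;        /* the single reference held on behalf of all users */
    mm->mappings_torn_down = 0;
    mm->freed = 0;
}

/* Drop one reference to the structure itself. */
static void mm_demo_drop(struct mm_demo *mm)
{
    assert(mm->mm_count > 0);
    if (--mm->mm_count == 0)
        mm->freed = 1;
}

/* One user of the address space goes away. */
static void mm_demo_put(struct mm_demo *mm)
{
    assert(mm->mm_users > 0);
    if (--mm->mm_users == 0) {
        mm->mappings_torn_down = 1;
        mm_demo_drop(mm);    /* release the reference the users shared */
    }
}
```

With two users and one extra structure reference, the mappings go away when the second user leaves, but the structure survives until the extra reference is dropped as well.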

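So far we have learned the following.

1. Virtual memory is handled by vm_area_struct and mm_struct. A 32-bit computer has a 4 GB virtual address space, of which the range from 3 GB to 4 GB is kernel space and the rest is user space. mm_struct is the abstraction of the user space and sits at the top level of virtual memory management; vm_area_struct is subordinate to mm_struct. A process may have many vmas, and these areas form both a linked list and a red-black tree: the list suits in-order traversal, and the tree suits lookup by address once there are many vmas. mmap in mm_struct points to the vma list, and map_count says how many vmas there are. When a process starts running, the pgd (page directory) of its mm_struct is written to control register CR3, so CR3, the starting point of the paging mechanism, now has its content.

2. Once CR3 is set, paging can proceed. The memory management unit, which translates virtual addresses into physical addresses, reads CR3 and completes the mapping through the pgd and the tables below it.

In addition, to go from a virtual address of a process to the area it belongs to and the corresponding vma structure, use find_vma: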

    /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
    struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr)
    {
        struct vm_area_struct *vma = NULL;

        if (mm) {
            /* Check the cache first. */
            /* (Cache hit rate is typically around 35%.) */
            vma = mm->mmap_cache;
            if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
                rb_node_t * rb_node;

                rb_node = mm->mm_rb.rb_node;
                vma = NULL;

                while (rb_node) {
                    struct vm_area_struct * vma_tmp;

                    vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);

                    if (vma_tmp->vm_end > addr) {
                        vma = vma_tmp;
                        if (vma_tmp->vm_start <= addr)
                            break;
                        rb_node = rb_node->rb_left;
                    } else
                        rb_node = rb_node->rb_right;
                }
                if (vma)
                    mm->mmap_cache = vma;
            }
        }
        return vma;
    }

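The function first checks mmap_cache; on a miss it searches the red-black tree. It returns the first vma whose vm_end is above the address, so the caller still has to compare the address with vm_start to know whether it really lies inside that vma. A return value of NULL means no such vma exists, and a new virtual memory area structure may then have to be created.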
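The contract "first area with addr < vm_end" is easy to misread, so here is a minimal sketch of the same contract over a sorted singly linked list instead of a red-black tree. It is not the kernel's algorithm, only an illustration of what the caller gets back; `struct vma`, `find_vma_list` and `addr_is_mapped` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct vma {
    unsigned long vm_start;
    unsigned long vm_end;      /* first address past the end */
    struct vma *vm_next;       /* next area, sorted by address */
};

/* First area with addr < vm_end, or NULL if there is none. */
static struct vma *find_vma_list(struct vma *head, unsigned long addr)
{
    struct vma *v;

    for (v = head; v != NULL; v = v->vm_next) {
        assert(v->vm_start < v->vm_end);
        if (addr < v->vm_end)
            return v;
    }
    return NULL;
}

/* The returned area contains addr only if addr is not below vm_start. */
static int addr_is_mapped(struct vma *head, unsigned long addr)
{
    struct vma *v = find_vma_list(head, addr);

    return v != NULL && v->vm_start <= addr;
}
```

With areas [0x1000, 0x2000) and [0x5000, 0x6000), looking up 0x3000 returns the second area even though the address sits in the hole between the two, which is exactly the case the page fault handler has to examine further.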

————————————–cut-line

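3. Out-of-bounds access

Paging translates virtual addresses into physical addresses, but not every translation succeeds. It fails in the following cases:

1) An entry such as the pgd or pte is found empty during translation: the mapping has not been set up

2) The physical page is not in memory

3) The access violates the permissions

There is a handler for these errors, do_page_fault() in /arch/i386/mm/fault.c: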

    /*
     * This routine handles page faults.  It determines the address,
     * and the problem, and then passes it off to one of the appropriate
     * routines.
     *
     * error_code:
     *      bit 0 == 0 means no page found, 1 means protection fault
     *      bit 1 == 0 means read, 1 means write
     *      bit 2 == 0 means kernel, 1 means user-mode
     */
    asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
    {

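According to the comment, bit 0 of the error code is 0 when the page was not found and 1 on a protection violation; bit 1 is 0 when the fault was caused by a read and 1 when caused by a write; bit 2 is 0 when the fault occurred in kernel mode and 1 in user mode.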
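Decoding the three bits is a matter of masking. The sketch below does just that; `struct fault_info` and `decode_error_code` are hypothetical names, and only the three bits documented in the comment above are modelled.

```c
#include <assert.h>

struct fault_info {
    int protection;   /* bit 0: 0 = page not found, 1 = protection fault */
    int write;        /* bit 1: 0 = read access,    1 = write access */
    int user;         /* bit 2: 0 = kernel mode,    1 = user mode */
};

static struct fault_info decode_error_code(unsigned long error_code)
{
    struct fault_info f;

    /* Only the three documented bits are handled in this sketch. */
    assert((error_code & ~7UL) == 0);
    f.protection = (error_code & 1) != 0;
    f.write      = (error_code & 2) != 0;
    f.user       = (error_code & 4) != 0;
    return f;
}
```

An error code of 6, for example, describes a write from user mode to a page that was not found.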

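The page fault handler takes two parameters: regs, which points to the register state saved just before the fault, and error_code as described above.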

        /* get the address */
        __asm__("movl %%cr2,%0":"=r" (address));

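These two lines fetch the linear address that caused the translation to fail. It is stored in CR2 and is read with inline assembly.

The handler then deals first with faults in kernel space that are not protection violations: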

        /*
         * We fault-in kernel-space virtual memory on-demand. The
         * 'reference' page table is init_mm.pgd.
         *
         * NOTE! We MUST NOT take any locks for this case. We may
         * be in an interrupt or a critical region, and should
         * only copy the information from the master page table,
         * nothing more.
         *
         * This verifies that the fault happens in kernel space
         * (error_code & 4) == 0, and that the fault was not a
         * protection error (error_code & 1) == 0.
         */
        if (address >= TASK_SIZE && !(error_code & 5))
            goto vmalloc_fault;

        mm = tsk->mm;
        info.si_code = SEGV_MAPERR;

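According to the comment, the if condition ensures that the fault occurred in kernel space and is not a protection violation. Such a fault is passed to vmalloc_fault, a label defined further down in the same function.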
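The mask 5 combines bit 0 (protection) and bit 2 (user mode), so `!(error_code & 5)` is true only when both are clear. A small sketch of the condition, under the assumption that TASK_SIZE is 0xC0000000 (the excerpt does not give its value); `is_vmalloc_fault` is a hypothetical name.

```c
#include <assert.h>

#define TASK_SIZE 0xC0000000UL   /* assumption: same value as PAGE_OFFSET */

/* Same condition as the if statement quoted above. */
static int is_vmalloc_fault(unsigned long address, unsigned long error_code)
{
    assert((error_code & ~7UL) == 0);
    /* 5 == 1 | 4: neither a protection fault nor a user-mode access. */
    return address >= TASK_SIZE && !(error_code & 5);
}
```

A kernel-mode write (error code 2) to an unmapped address above TASK_SIZE qualifies; the same access from user mode (error code 6) does not.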

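Next comes the case where the fault happens inside an interrupt, or where there is no user context (the process has no mm):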

        /*
         * If we're in an interrupt or have no user
         * context, we must not take the fault..
         */
        if (in_interrupt() || !mm)
            goto no_context;

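Below this code is the handling of stack overruns. Once a process has used up its stack space, a further push goes beyond it. Because the stack grows downward, the data is normally written at (%esp-4); for an operation that pushes 32 bytes at once, it is (%esp-32).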

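        vma = find_vma(mm, address);

This looks up the virtual memory area.

If none is found:

        if (!vma)
            goto bad_area;

control goes to bad_area.

If one is found and the address is not below the start address of the vma (so the address lies inside it and this is not a stack case), control goes to good_area:

        if (vma->vm_start <= address)
            goto good_area;

If the area is a stack, its VM_GROWSDOWN flag is set. On a downward overrun, an address more than 32 bytes below %esp leads to bad_area; otherwise the stack is extended by calling expand_stack():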

        if (!(vma->vm_flags & VM_GROWSDOWN))
            goto bad_area;
        if (error_code & 4) {
            /*
             * accessing the stack below %esp is always a bug.
             * The "+ 32" is there due to some instructions (like
             * pusha) doing post-decrement on the stack and that
             * doesn't show up until later..
             */
            if (address + 32 < regs->esp)
                goto bad_area;
        }
        if (expand_stack(vma, address))
            goto bad_area;

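The stack cannot be extended without limit, though. Every process has a limit, and exceeding it leads to bad_area. If the extension is allowed, control goes to good_area, which goes on to map the newly added page to physical memory.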
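The chain of checks from find_vma to expand_stack can be condensed into one decision function. The sketch below reproduces only the branching quoted above: it assumes a vma was found, it leaves out the size limit that expand_stack itself enforces, and `classify_fault` and the enum are hypothetical names.

```c
#include <assert.h>

#define VM_GROWSDOWN 0x00000100UL

enum fault_path { GOOD_AREA, BAD_AREA, EXPAND_STACK };

/*
 * Decide where the handler would go for a faulting address, given the
 * area returned by the lookup (vm_start, vm_flags), the saved %esp and
 * whether the fault came from user mode.
 */
static enum fault_path classify_fault(unsigned long vm_start,
                                      unsigned long vm_flags,
                                      unsigned long address,
                                      unsigned long esp,
                                      int user_mode)
{
    assert(user_mode == 0 || user_mode == 1);
    if (vm_start <= address)
        return GOOD_AREA;           /* address lies inside the area */
    if (!(vm_flags & VM_GROWSDOWN))
        return BAD_AREA;            /* a hole below a non-stack area */
    if (user_mode && address + 32 < esp)
        return BAD_AREA;            /* too far below the stack pointer */
    return EXPAND_STACK;            /* try to grow the stack downward */
}
```

A push just below the lowest stack page is treated as legitimate stack growth, whereas an access a full page below %esp is rejected.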

For the details of the handling, see /arch/i386/mm/fault.c.

Last updated: 2017-04-02 04:01:43
