Memory network (MemNN) & End to End memory network (MemN2N) & Dynamic memory network
Memory network (MemNN)
Virtual assistants do reasonably well at answering a single question, but they struggle in multi-turn conversations. The following exchange illustrates the challenge:
- Me: Can you find me some restaurants?
- Assistant: I found a few places within 0.25 mile. The first one is Caffé Opera on Lincoln Street. The second one is …
- Me: Can you make a reservation at the first restaurant?
- Assistant: Ok. Let’s make a reservation for the Sushi Tom restaurant on First Street.
Why can't the assistant book Caffé Opera as I asked? Because the assistant does not remember our conversation: it simply answers the current question without considering the context of the previous exchange. All it can do is find a restaurant related to the word “First” (a restaurant on First Street). Memory networks solve this problem by memorizing the information that has already been processed.
The description of memory networks (MemNN) below is based on Memory Networks, Jason Weston et al.
Consider the following statements and the question “Where is the milk now?”:
- Joe went to the kitchen.
- Fred went to the kitchen.
- Joe picked up the milk.
- Joe traveled to the office.
- Joe left the milk.
- Joe went to the bathroom.
Storing information in memory
First, we store the sentences in memory \(m\):
Memory slot \(m_i\) | Sentence |
---|---|
1 | Joe went to the kitchen. |
2 | Fred went to the kitchen. |
3 | Joe picked up the milk. |
4 | Joe traveled to the office. |
5 | Joe left the milk. |
6 | Joe went to the bathroom. |
Answering a query
To answer a query \(q\), we first compute, with a scoring function \(s_0\), the sentence \(m_{01}\) that is most relevant to \(q\). We then combine \(m_{01}\) with \(q\) to form a new query \([q, m_{01}]\) and locate the next highest-scoring sentence \(m_{02}\). This gives a final query \([q, m_{01}, m_{02}]\). Instead of using it to retrieve yet another sentence, we use a second scoring function to locate a word \(w\). Let us walk through this process with the example above.
To answer the query \(q\) “where is the milk now?”, we compute the first inference as: $$o_1 = \mathop{\arg\max}_{i=1,...,N}s_0(q, m_i)$$ where \(s_0\) is a function that scores the match between the query \(q\) and \(m_i\), and \(o_1\) is the index of the best match in memory \(m\). Here \(m_{01}\) is the best matching sentence of the first inference: “Joe left the milk.”.
Then, based on \([q: "where is the milk now", m_{01}: "Joe left the milk."]\), we make the second inference: $$o_2 = \mathop{\arg\max}_{i=1,...,N}s_0([q, m_{01}], m_i)$$ where \(m_{02}\) is “Joe traveled to the office.”.
We denote the combination of the query and the inference results as \(o\): $$o = [q, m_{01}, m_{02}] = ["where is the milk now", "Joe left the milk.", "Joe travelled to the office."]$$ The final response \(r\) is generated as: $$r = \mathop{\arg\max}_{w \in W}s_r([q, m_{01}, m_{02}], w)$$ where \(W\) is the set of all words in the dictionary and \(s_r\) is another function that scores the match between \([q, m_{01}, m_{02}]\) and a word \(w\). In our example, the final response \(r\) is “office”.
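The two-hop retrieval plus final word selection above can be sketched as follows. This is a minimal illustration only: it assumes trained scoring functions are passed in as `s_0` and `s_r`, and the function and argument names are placeholders, not from the paper's code.

```python
import numpy as np

def answer_query(q, memory, vocab, s_0, s_r):
    """Two-hop MemNN inference: pick two supporting memories, then a word.

    q      -- encoded query
    memory -- list of encoded sentences m_i
    vocab  -- list of candidate answer words
    s_0    -- scoring function s_0(x, m_i) for memory retrieval (x may be a list)
    s_r    -- scoring function s_r(x, w) for the final word
    """
    # First hop: the memory that best matches the query q.
    o1 = int(np.argmax([s_0(q, m) for m in memory]))
    # Second hop: the memory that best matches [q, m_o1].
    o2 = int(np.argmax([s_0([q, memory[o1]], m) for m in memory]))
    # Final response: the word that best matches [q, m_o1, m_o2].
    r = int(np.argmax([s_r([q, memory[o1], memory[o2]], w) for w in vocab]))
    return vocab[r]
```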
Encoding the input
We represent the input text with bags of words. First, we start with a vocabulary of size \(\left|W\right|\).
Encode the query “where is the milk now” with a bag of words:
Vocabulary | ... | is | Joe | left | milk | now | office | the | to | travelled | where | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|
where is the milk now | ... | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | ... |
"where is the milk now"=(...,1,0,0,1,1,0,1,0,0,1,...)
To achieve better performance, we encode \(q\), \(m_{01}\) and \(m_{02}\) with three separate sets of words, i.e., the word “Joe” in \(q\) is encoded as “Joe_1”, while the same word in \(m_{01}\) is encoded as “Joe_2”:
\(q\): Where is the milk now?
\(m_{01}\): Joe left the milk.
\(m_{02}\): Joe travelled to the office.
Encoding the \(q, m_{01}, m_{02}\) above:
... | Joe_1 | milk_1 | ... | Joe_2 | milk_2 | ... | Joe_3 | milk_3 | ... |
---|---|---|---|---|---|---|---|---|---|
... | 0 | 1 | ... | 1 | 1 | ... | 1 | 0 | ... |
Hence, each input is transformed into an encoding of size \(3\left|W\right|\).
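A small sketch of this \(3\left|W\right|\) encoding with a bag-of-words helper. This is a toy illustration: the `vocab` list and the helper names are assumptions made for this example only.

```python
import numpy as np

def bag_of_words(sentence, vocab):
    """Standard |W| bag-of-words vector."""
    v = np.zeros(len(vocab))
    for word in sentence.lower().replace("?", "").replace(".", "").split():
        if word in vocab:
            v[vocab.index(word)] = 1.0
    return v

def encode_3w(q, m_o1, m_o2, vocab):
    """Concatenate three |W| bags so 'Joe' in q, m_o1 and m_o2 occupy distinct slots."""
    return np.concatenate([bag_of_words(q, vocab),
                           bag_of_words(m_o1, vocab),
                           bag_of_words(m_o2, vocab)])

vocab = ["is", "joe", "left", "milk", "now", "office", "the", "to", "travelled", "where"]
x = encode_3w("Where is the milk now?",
              "Joe left the milk.",
              "Joe travelled to the office.", vocab)
print(x.shape)   # (30,) == 3 * |W|
```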
Compute the scoring function
We use a word-embedding matrix \(U\) to convert the \(3\left|W\right|\) bag-of-words encoding of a sentence into a word embedding of size \(n\). The scoring functions \(s_0\) and \(s_r\) are computed as: $$s_0(x, y) = \Phi_x(x)^TU_0^TU_0\Phi_y(y)$$ $$s_r(x, y) = \Phi_x(x)^TU_r^TU_r\Phi_y(y)$$ where \(U_0\) and \(U_r\) are trained with a margin loss function, and \(\Phi_y(m_i)\) converts the sentence \(m_i\) into its bag-of-words representation.
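A sketch of this bilinear score under the formula above, assuming a randomly initialized \(U_0\) (in practice \(U_0\) is learned; the sizes here are illustrative only):

```python
import numpy as np

n, W_size = 20, 10                           # embedding size n and vocabulary size |W|
U_0 = np.random.randn(n, 3 * W_size) * 0.1   # learned in practice, random here

def s_0(phi_x, phi_y):
    """s_0(x, y) = Phi(x)^T U_0^T U_0 Phi(y), i.e. a dot product in embedding space."""
    return float((U_0 @ phi_x) @ (U_0 @ phi_y))
```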
Margin loss function
We train the parameters in \(U_0\) and \(U_r\) with a margin loss function:
$$\sum\limits_{\overline{f}\not=m_{01}}\max(0, \gamma - s_0(x, m_{01}) + s_0(x, \overline{f})) + $$ $$\sum\limits_{\overline{f^\prime}\not=m_{02}}\max(0, \gamma - s_0(\left[x, m_{01}\right], m_{02}) + s_0(\left[x, m_{01}\right], \overline{f^\prime})) + $$ $$\sum\limits_{\overline{r}\not=r}\max(0, \gamma - s_r(\left[x, m_{01}, m_{02}\right], r) + s_r(\left[x, m_{01}, m_{02}\right], \overline{r}))$$ where \(\overline{f}\), \(\overline{f^\prime}\) and \(\overline{r}\) are possible predictions other than the true labels. That is, a margin loss is incurred whenever the score of a wrong answer is greater than the score of the correct answer minus \(\gamma\).
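The first term of this loss can be sketched as below; this is a toy version in which `s_0` is any scoring function (such as the one sketched earlier) and `gamma` corresponds to \(\gamma\). The second and third terms follow exactly the same pattern.

```python
def margin_loss_first_term(q_enc, memory_encs, true_idx, s_0, gamma=0.1):
    """Sum of max(0, gamma - s_0(q, m_true) + s_0(q, m_wrong)) over all wrong memories."""
    true_score = s_0(q_enc, memory_encs[true_idx])
    loss = 0.0
    for i, m in enumerate(memory_encs):
        if i == true_idx:
            continue                    # skip the true supporting memory
        loss += max(0.0, gamma - true_score + s_0(q_enc, m))
    return loss
```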
Huge memory networks
For systems with a huge memory, computing the score of every memory entry is expensive. An alternative is, after computing the word embeddings with \(U_0\), to apply K-means clustering to split the embedding space into K clusters. Each input \(x\) is then mapped to its corresponding cluster, and inference is performed within that cluster rather than over the entire memory.
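A rough sketch of this idea using scikit-learn's KMeans. The paper describes several hashing variants; this only illustrates restricting the candidate set to the query's cluster, and all sizes below are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# memory_embeddings: each row stands for U_0 @ Phi(m_i), precomputed once.
memory_embeddings = np.random.randn(10000, 20)

kmeans = KMeans(n_clusters=50, n_init=10).fit(memory_embeddings)

def candidate_memories(query_embedding):
    """Return indices of memories in the query's cluster; only these get scored."""
    cluster = kmeans.predict(query_embedding.reshape(1, -1))[0]
    return np.where(kmeans.labels_ == cluster)[0]
```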
End to End memory network (MemN2N)
The description of the end-to-end memory network (MemN2N), as well as the diagrams, is based on End-To-End Memory Networks, Sainbayar Sukhbaatar et al.
We start with the query “where is the milk now?”, represented as a bag of words of size \(V\) (where \(V\) is the size of the vocabulary). In the simplest case, we use an embedding matrix \(B\) (\(d\times V\)) to convert this vector into a \(d\)-dimensional word embedding. $$u = embedding_B(q)$$ For each memory entry \(x_i\), we use another embedding matrix \(A\) (\(d\times V\)) to convert it into a \(d\)-dimensional vector \(m_i\). $$m_i = embedding_A(x_i)$$
The match between \(u\) and each memory \(m_i\) is computed by taking their inner product followed by a softmax: $$p_i = softmax(u^Tm_i)$$
We encode each \(x_i\) as \(c_i\) with a third embedding matrix \(C\): $$c_i = embedding_C(x_i)$$ The output is computed as: $$o = \sum\limits_{i}p_ic_i$$
We multiply the sum of \(o\) and \(u\) by a matrix \(W\) (\(V\times d\)). The result is passed to a softmax function to predict the final answer. $$\hat{a} = softmax(W(o + u))$$
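Putting the single-hop MemN2N equations together, a minimal numpy forward pass might look like this. The matrices \(A\), \(B\), \(C\), \(W\) are randomly initialized here purely for illustration; in the paper they are trained end-to-end with cross-entropy.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

V, d, N = 10, 20, 6               # vocabulary size, embedding size, number of memories
A = np.random.randn(d, V) * 0.1   # memory input embedding
B = np.random.randn(d, V) * 0.1   # query embedding
C = np.random.randn(d, V) * 0.1   # memory output embedding
W = np.random.randn(V, d) * 0.1   # final prediction matrix

def memn2n_forward(q_bow, x_bows):
    """q_bow: (V,) bag-of-words query; x_bows: (N, V) bag-of-words memories."""
    u = B @ q_bow                  # u = embedding_B(q)
    m = x_bows @ A.T               # m_i = embedding_A(x_i), shape (N, d)
    p = softmax(m @ u)             # p_i = softmax(u^T m_i)
    c = x_bows @ C.T               # c_i = embedding_C(x_i)
    o = p @ c                      # o = sum_i p_i c_i
    return softmax(W @ (o + u))    # a_hat = softmax(W(o + u))

q_bow = np.random.randint(0, 2, size=V).astype(float)
x_bows = np.random.randint(0, 2, size=(N, V)).astype(float)
print(memn2n_forward(q_bow, x_bows).shape)   # (V,) distribution over answer words
```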
Here, all of the steps are summarized in one diagram:
Multiple layers
Language model
Dynamic memory network
Last updated: 2017-10-22 22:03:38