DeBERTa#

DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION

Published as a conference paper at ICLR 2021 (Paper PDF)

Pengcheng He, Microsoft Dynamics 365 AI
Xiaodong Liu, Microsoft Research
Jianfeng Gao, Microsoft Research
Weizhu Chen, Microsoft Dynamics 365 AI
{penhe,xiaodl,jfgao,wzchen}@microsoft.com

microsoft์—์„œ deberta๋ฅผ ๋งŒ๋“ค์—ˆ๋‹ค๋ณด๋‹ˆ ์ง‘ํ•„์ง„์ด ์ „๋ถ€ microsoft์ด๋‹ค.

Problem setting: Introduction#

The Transformer has established itself as the most effective network architecture in NLP. Unlike RNNs, which process text in order, Transformers apply self-attention to every word of the input text and compute attention weights, and these attention weights quantify how much each word influences every other word. Because the Transformer processes the sequence this way (in parallel), it supports large-scale training far better than sequential models such as RNNs.

1.1 Parallelization#

Both RNNs and Transformers are sequence modeling architectures, but the Transformer computes over every position of the sequence at the same time, which makes it much easier to parallelize. Parallelization means splitting one job into several smaller tasks and processing them simultaneously, so that the job as a whole runs faster and more efficiently. Since deep learning keeps moving toward more compute and more data, this capability is essential. Whether by using multiple GPUs at once, splitting the data, or splitting the operations, carving the work into pieces and running them concurrently is what we broadly call parallelization.

The three properties that enable parallelization are as follows (a minimal sketch contrasting sequential and parallel computation follows the list).

  1. Attention mechanism: the Transformer uses self-attention, which processes every element of the input sequence at once. The computations are independent of one another, so they are well suited to parallel processing.

  2. Layer-level parallelism: the Transformer's encoder and decoder consist of multiple layers, and these layers can be distributed and computed in parallel.

  3. Masking: during training, causal masking lets the model compute every output position in a single parallel pass (teacher forcing); at inference time, however, outputs are still generated autoregressively, one token at a time.
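A minimal NumPy sketch (illustrative values only) of why attention parallelizes where an RNN cannot: the RNN update below must loop step by step, while all pairwise attention scores come from a single matrix product.

```python
import numpy as np

N, d = 6, 8                       # sequence length, hidden size (arbitrary example values)
X = np.random.randn(N, d)         # token representations

# RNN-style update: inherently sequential, step t depends on step t-1
W_h, W_x = np.random.randn(d, d), np.random.randn(d, d)
h = np.zeros(d)
for t in range(N):                # this loop cannot be parallelized across t
    h = np.tanh(h @ W_h + X[t] @ W_x)

# Self-attention: every pairwise score in one parallel matrix product
scores = X @ X.T / np.sqrt(d)                                      # (N, N)
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # row-wise softmax
out = weights @ X                 # every output position computed at the same time
```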

1.2 Disentangled attention#

Use relative position embeddings, but split the representation into two vectors…

๊ธฐ์กด์˜ self-attention์˜ ์œ„์น˜์ •๋„ encoding ๋ฌธ์ œ
์›๋ž˜(transformers)์˜ self-attention mechanism์€ ์ฃผ์–ด์ง„ input sequence์—์„œ ๋ชจ๋“  ์œ„์น˜์˜ ์ •๋ณด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ output์„ ์ƒ์„ฑํ•œ๋‹ค. input sequence์˜ ๋ชจ๋“  ์œ„์น˜ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋†’์€ ์ˆ˜์ค€์˜ ์ƒํ˜ธ์˜์กด์„ฑ์„ ๊ฐ€์ง„ ์ „์—ฐ๊ฒฐ ๊ทธ๋ž˜ํ”„๋กœ ๋ชจ๋ธ๋ง์„ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, sequence์˜ ๊ธธ์ด๊ฐ€ ๊ธธ์–ด์งˆ์ˆ˜๋ก ์—ฐ์‚ฐ๋Ÿ‰์ด ์ฆ๊ฐ€ํ•˜๊ณ , ์žฅ๊ธฐ์˜์กด์„ฑ์„ ํ•™์Šตํ•˜๋Š”๋ฐ ์–ด๋ ค์›€์ด ์ปค์ง„๋‹ค.

positional bias
For this reason, the common approach has been to add a positional bias to the word embedding so that a single vector carries both the content and the position information.

  • absolute position embedding : considers only the token's position within the sentence

  • relative position embedding : considers only the relative positions between words

This is how the paper describes the original BERT: each word in the input layer is represented using **a vector** which is the sum of its word (content) embedding and (absolute) position embedding. Absolute position information is clearly important, but it is exploited later in a different way (see below); here, relative position embeddings are used to feed positional information into the input, and the paper notes that many prior studies have shown the importance of relative positions.
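As a minimal sketch of the entangled input layer described above (illustrative NumPy code; the sizes and names are made up for the example), each token's input vector is the sum of its content embedding and its absolute position embedding:

```python
import numpy as np

V, N, d = 1000, 6, 8                          # vocab size, sequence length, hidden size
word_emb = np.random.randn(V, d)              # content (word) embeddings
pos_emb = np.random.randn(N, d)               # absolute position embeddings
token_ids = np.array([12, 7, 55, 3, 912, 0])  # an example input sequence

# BERT-style input: one vector per token that entangles content and position
X = word_emb[token_ids] + pos_emb[np.arange(N)]   # shape (N, d)
```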

โ€ฆ

How a word is represented in the input layer:

| Model | Word representation in the input layer |
| --- | --- |
| Transformer | the word itself |
| BERT | a single vector: word (content) embedding + absolute position embedding |
| DeBERTa | two vectors: a word (content) embedding and a relative position embedding (absolute position is added only later, in the enhanced mask decoder) |

Disentangled attention represents each word with these two separate embeddings and then computes the attention weights using disentangled matrices: two vectors give 2 × 2 = 4 matrices. The motivation is that the relationship between two words depends not only on their meanings but also on their relative positions. The example in the paper is that the dependency between the words 'deep' and 'learning' is much stronger when they appear right next to each other than when they occur in different sentences.

1.3 Enhanced mask decoder#

Adding absolute position embeddings

MLM (masked language modeling) was introduced with BERT, and DeBERTa also uses it for its basic pre-training. MLM is a fill-in-the-blank task: the model infers the original word behind each blank (masked word) from the surrounding words. DeBERTa's difference is that it performs MLM with the disentangled attention described above, i.e., using the two separate vectors. The problem is that while the position embeddings capture relative positions, they carry no information about the absolute positions of words, and absolute position is a very important piece of information for syntactic understanding.

a new store opened beside the new mall

์ด๋Ÿฌํ•œ ์˜ˆ์‹œ์—์„œ store, mall์ด ๋‘˜ ๋‹ค masking๋˜์—ˆ์„๋•Œ, ๋‘˜์˜ context์ ์ธ ์˜๋ฏธ๋Š” โ€˜๊ฐ€๊ฒŒโ€™๋กœ ๋น„์Šทํ•˜์ง€๋งŒ ๊ตฌ๋ฌธ๋ก ์ ์œผ๋กœ ์•ž์˜ store๋Š” ์ฃผ์–ด์˜ ์—ญํ• ์„ ํ•˜๊ณ  ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ตฌ๋ฌธ๋ก ์ ์ธ ์ •๋ณด๋ฅผ ์ฃผ๋Š” ๊ฒƒ์€ context, content๊ฐ€ ์•„๋‹ˆ๋ผ absolute position์ด๋‹ค.

So DeBERTa feeds the absolute position embeddings in just before the softmax layer, which is exactly the point where the model decodes the masked words based on the combination of content and position embeddings described above. Because it is injected almost at the very end, it appears to act more as a 'reference' than as a primary signal.
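A rough, hypothetical sketch of this idea (a simplification, not the actual enhanced-mask-decoder implementation; the helper name emd_predict and the shapes are invented for illustration): the absolute position embeddings only touch the hidden states right before the vocabulary softmax.

```python
import numpy as np

def emd_predict(H, abs_pos_emb, W_vocab):
    """H: (N, d) hidden states produced with disentangled (relative-position) attention.
    abs_pos_emb: (N, d) absolute position embeddings.
    W_vocab: (d, V) output projection onto the vocabulary."""
    H = H + abs_pos_emb                           # absolute positions injected only here,
                                                  # right before decoding the masked tokens
    logits = H @ W_vocab                          # (N, V)
    logits -= logits.max(-1, keepdims=True)       # numerical stability
    return np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax over the vocabulary
```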

1.4 Virtual adversarial training#

์ด๊ฑด ์ž์„ธํ•˜๊ฒŒ ์„ค๋ช…์ด ์•ž์— ์•ˆ๋‚˜์˜ค๋Š”๋ฐ model generalization์— ์ข‹๋‹ค๊ณ  ํ•œ๋‹ค.

1.5 MLM#

Masked Language Model

๊ธฐ์กด์˜ large-scale transformer-based PLM(pretrained language model)๋“ค์€ ๋ณดํ†ต ๋งŽ์€ ์–‘์˜ ํ…์ŠคํŠธ๋ฅผ ๊ฐ€์ง€๊ณ  ๋ฌธ๋งฅ์ ์ธ ๋‹จ์–ด์˜ ํ‘œํ˜„(contextual word representation)์„ ๋ฐฐ์šฐ๊ธฐ ์œ„ํ•ด์„œ ํ•™์Šต ๊ณผ์ •์„ ๊ฑฐ์ณค๋‹ค. ์ด ํ•™์Šต ๊ณผ์ •์€ โ€˜self-supervision objectiveโ€™ ๊ทธ๋Ÿฌ๋‹ˆ๊น ์ž๊ธฐ์ง€๋„ ํ•™์Šต์ธ๋ฐ, ๋ฐฉ๋ฒ•๋ก  ์ ์œผ๋กœ๋Š” MLM์„ ๊ฐ€๋ฆฌํ‚จ๋‹ค. ์ด์ œ ์ˆ˜์‹์ ์œผ๋กœ ๋“ค์–ด๊ฐ€๋ณด์ž

DeBERTa architecture#

3.1 Disentangled attention#

a two-vector approach to content and position embedding

\(i\) : position of a token in a sequence
\(H_i\) : content of token i (its hidden state, the output of the encoder)
\(P_{i|j}\) : relative position of token i with respect to token j, derived from the distance between the tokens

Cross attention score between token i and token j:

\[\begin{split} \begin{matrix} A_{i,j} &=& \{H_i,P_{i|j}\} \times \{H_j,P_{j|i}\}^{\intercal}\\ &=& H_iH_j^{\intercal} + H_iP_{j|i}^{\intercal} + P_{i|j}H_j^{\intercal}+P_{i|j}P_{j|i}^{\intercal}\\ \end{matrix} \end{split}\]

To compute the attention weight for a single word pair, four attention scores have to be computed. The disentangled matrices used in this process cover the two components, content and position, yielding the four combinations content-to-content, content-to-position, position-to-content, and position-to-position.

  1. \(H_i H_j^{\intercal}\): the interaction between the content of token i and the content of token j, measuring the content-based similarity between the two tokens. (content-to-content)

  2. \(H_i P_{j|i}^{\intercal}\): the interaction between the content of token i and the relative position of token j with respect to token i, measuring how the content of token i attends to the position of token j. (content-to-position)

  3. \(P_{i|j} H_j^{\intercal}\): the interaction between the relative position of token i with respect to token j and the content of token j, measuring how the content of token j attends to the position of token i. (position-to-content)

  4. \(P_{i|j} P_{j|i}^{\intercal}\): the interaction between the two relative positions, measuring the positional similarity between the tokens. (position-to-position; this is the part that gets removed when relative position embeddings are used)

These four terms are what the paper calls disentangled attention.
Earlier relative position encodings added a relative position bias on top of the existing attention weights, which amounts to using only terms 1 and 2 above. DeBERTa keeps all of them except term 4: since terms 1, 2, and 3 were judged to already capture the relative positions of a token pair sufficiently, term 4 is dropped from the formulation.

3.1.1 standard self-attention operation#

\(R^{N \times d}\) : a real-valued matrix with N rows (sentence length, i.e., number of tokens) and d columns (dimension of each token's hidden vector)
\(H \in R^{N \times d}\) : input hidden vectors
\(H_o \in R^{N \times d}\) : output of self-attention
\(W_q,W_k,W_v \in R^{d \times d}\) : projection matrices
\(A \in R^{N \times N}\) : attention matrix

\[\begin{split} \begin{matrix} Q &=& HW_q,\\ K &=& HW_k,\\ V &=& HW_v,\\ A &=& \frac{QK^T}{\sqrt{d}}\\ H_o &=& \text{softmax}(A)V\\ \end{matrix} \end{split}\]

Here Q, K, and V are the query, key, and value, and d is the dimension of the key vectors. In this standard setup, each token's position information is added to its hidden state through absolute positional encoding, and the position-augmented hidden state is then used as the query, key, and value.
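A minimal NumPy sketch of the standard single-head self-attention above (illustrative only, no masking or multi-head handling):

```python
import numpy as np

def self_attention(H, W_q, W_k, W_v):
    """H: (N, d) input hidden vectors; W_q, W_k, W_v: (d, d) projection matrices."""
    N, d = H.shape
    Q, K, V = H @ W_q, H @ W_k, H @ W_v           # (N, d) each
    A = Q @ K.T / np.sqrt(d)                      # (N, N) attention matrix
    A = A - A.max(-1, keepdims=True)              # stabilize the softmax
    P = np.exp(A) / np.exp(A).sum(-1, keepdims=True)
    return P @ V                                  # (N, d) output H_o

N, d = 6, 8
H = np.random.randn(N, d)
W_q, W_k, W_v = [np.random.randn(d, d) for _ in range(3)]
H_o = self_attention(H, W_q, W_k, W_v)            # shape (6, 8)
```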

3.1.2 relative distance#

\(k\) : maximum relative distance
\(\delta(i,j)\in[0,2k)\) : the relative distance from token i to token j

\[\begin{split} \delta(i,j) = \begin{cases} 0 & \text{for} & i-j\leqslant-k\\ 2k-1 & \text{for} & i-j\geqslant k\\ i-j+k &\text{others.} &\\ \end{cases} \end{split}\]

Here '[' means inclusive and ')' means exclusive, so the relative distance satisfies \(0 \leqslant \delta(i,j) < 2k\), i.e., it is an integer between 0 and 2k-1.
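A direct implementation of the relative-distance function above (a small illustrative helper):

```python
def delta(i: int, j: int, k: int) -> int:
    """Relative distance from token i to token j, clipped into [0, 2k - 1]."""
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k

# e.g. with k = 3: delta(0, 5, 3) == 0, delta(5, 0, 3) == 5, delta(2, 1, 3) == 4
```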

3.1.3 disentangled self-attention#

Disentangled self-attention with relative position bias (equation 4 in the paper):

\(Q_c,K_c,V_c\) : projected content vectors from projection matrices
\(W_{q,c},W_{k,c},W_{v,c} \in R^{d\times d}\) : projection matrices
\(P \in R^{2k\times d}\) : relative position embedding vectors, shared across all layers
\(Q_r, K_r\) : projected relative position vectors
\(W_{q,r},W_{k,r} \in R^{d\times d}\) : projection matrices

\[\begin{split} \begin{matrix} Q_c &=& HW_{q,c},\\ K_c &=& HW_{k,c},\\ V_c &=& HW_{v,c},\\ Q_r &=& PW_{q,r},\\ K_r &=& PW_{k,r}\\ \end{matrix} \end{split}\]

\(\tilde{A}_{i,j}\) : an element of \(\tilde{A}\); the attention score from token i to token j
\(Q_i^c\) : i-th row of \(Q_c\)
\(Q^r_{\delta(j,i)}\) : the row of \(Q_r\) indexed by the relative distance \(\delta(j,i)\)

\[ \tilde{A}_{i,j} = Q_i^cK_j^{c\intercal} + Q_i^c{K_{\delta(i,j)}^r}^\intercal + K_j^c{Q_{\delta(j,i)}^r}^\intercal \]
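A minimal NumPy sketch of the disentangled attention score above (single head, illustrative only, not the official DeBERTa implementation; the delta helper is the relative-distance function from 3.1.2):

```python
import numpy as np

def delta(i, j, k):                              # relative distance from 3.1.2
    return 0 if i - j <= -k else (2 * k - 1 if i - j >= k else i - j + k)

def disentangled_scores(H, P, W_qc, W_kc, W_qr, W_kr, k):
    """H: (N, d) content hidden vectors; P: (2k, d) relative position embeddings."""
    N, d = H.shape
    Qc, Kc = H @ W_qc, H @ W_kc                  # projected content vectors, (N, d)
    Qr, Kr = P @ W_qr, P @ W_kr                  # projected relative position vectors, (2k, d)

    A = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            c2c = Qc[i] @ Kc[j]                  # content-to-content
            c2p = Qc[i] @ Kr[delta(i, j, k)]     # content-to-position
            p2c = Kc[j] @ Qr[delta(j, i, k)]     # position-to-content
            A[i, j] = c2c + c2p + p2c
    # (In the paper these scores are further scaled by 1/sqrt(3d) before the softmax.)
    return A
```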

projection matrix#

A matrix that maps a vector, or an entire space, into another space.
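A tiny illustrative example (arbitrary sizes):

```python
import numpy as np

W = np.random.randn(8, 3)   # projection matrix from an 8-dimensional space to a 3-dimensional one
x = np.random.randn(8)      # a vector in the original space
y = x @ W                   # its image in the new space, shape (3,)
```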