DeBERTa V3#

DEBERTAV3: IMPROVING DEBERTA USING ELECTRA-STYLE PRE-TRAINING WITH GRADIENT-DISENTANGLED EMBEDDING SHARING

Published as a conference paper at ICLR 2023 Paper PDF

Problem Setting#

Scaling up PLMs (Pre-trained Language Models) to hundreds of millions or billions of parameters has brought clear performance gains and has been the dominant approach so far, but the authors argue that reducing the number of parameters and the computation cost matters even more.

Improving Efficiency

  1. Incorporating disentangled attention (an improved relative-position encoding mechanism)

    • By scaling up to 1.5B parameters, DeBERTa surpassed human performance on SuperGLUE for the first time.

  2. Replaced Token Detection (RTD) vs. Masked Language Modeling (MLM)

    • proposed by ELECTRA(2020)

    • Unlike BERT (MLM), which uses a transformer encoder to predict the corrupted tokens,

    • RTD uses a generator and a discriminator. The generator produces plausible (confusing) corruptions, and the discriminator tries to distinguish the generator's corrupted tokens from the original inputs. This looks a lot like a GAN (Generative Adversarial Network), so it is worth being clear about the differences.

์—ฌ๊ธฐ์„œ ์ด์ „์˜ DeBERTa์—์„œ V3๋กœ ๋‚˜์•„๊ฐ€๋ฉด์„œ ๋ฐ”๋€ ์ ์œผ๋กœ ๋‘ ๊ฐ€์ง€๋กœ ๊ผฝ๋Š”๋‹ค. ํ•˜๋‚˜๋Š” ์œ„์—์„œ ๋งํ•œ BERT์˜ MLM์„ ELECTRA ์Šคํƒ€์ผ์˜ RTD(where the model is trained as a discriminator to predict whethre a token in the corrupt input is either original or replaced by a generator)๋กœ ๋ฐ”๊พธ๋Š” ๊ฒƒ์ด๊ณ , ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” new embedding sharing method์ด๋‹ค. ELECTRA์—์„œ generator discriminator๋Š” ๊ฐ™์€ token embedding์„ ๊ณต์œ ํ•œ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ์—ฐ๊ตฌ์ž๋“ค์€ ๋ณธ์ธ๋“ค์˜ ์—ฐ๊ตฌ์— ๋”ฐ๋ฅด๋ฉด ์ด๊ฒƒ์€ ํ•™์Šต ํšจ์œจ์„ฑ ๋ฉด์ด๋‚˜ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ๋ฉด์—์„œ ๋ถ€์ •์ ์ธ ์˜ํ–ฅ์„ ์ค€๋‹ค๊ณ  ๋งํ•œ๋‹ค. training losses of the discriminator and the generator pull token embeddings into opposite directions ์ฆ‰ ์ƒ์„ฑ๊ธฐ์™€ ๋ถ„๋ฅ˜๊ธฐ์˜ ํ•™์Šต์˜ ๋ฐฉํ–ฅ์„ฑ์ด ๋ฐ˜๋Œ€์ด๊ธฐ ๋•Œ๋ฌธ์— ํ•™์Šต loss๊ฐ€ ๊ฐˆํŒก์งˆํŒกํ•  ์ˆ˜ ๋ฐ–์— ์—†๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋Ÿผ ์ƒ๊ฐํ•ด๋ณผ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด ๋’ค์— ๋‘๊ฐœ์˜ loss๋ฅผ ๋งŒ๋“ค์–ด์„œ ํ•™์Šต ๋ฐฉํ–ฅ์„ฑ์„ ๋ฐ˜๋Œ€๋กœ ๊ฐˆ ์žˆ๋„๋ก ํ–ˆ์„๊นŒ ํ•˜๋Š” ์ ์„ ์ƒ๊ฐํ•ด ๋ณผ ์ˆ˜ ์žˆ๊ฒ ๋‹ค. token embedding์„ ๋‹ฌ๋ฆฌ ์ค€๋‹ค๋Š” ๊ฒƒ์ด ์–ด๋–ค ์˜๋ฏธ์ธ์ง€ ๋’ค์—์„œ ํ™•์‹คํ•˜๊ฒŒ ๋‚˜์™€์•ผ ํ•  ๊ฒƒ์ด๋‹ค. MLM์€ generator๋ฅผ token ์ค‘์—์„œ ์„œ๋กœ ๊ด€๋ จ์ด ์žˆ์–ด๋ณด์ด๋Š” ๊ฐ€๊นŒ์šด ๊ฒƒ๋“ค์„ ์„œ๋กœ ์žก์•„๋‹น๊ธฐ๋ฉด์„œ ํ•™-์Šต์„ ์ง„ํ–‰ํ•˜๊ฒŒ ๋œ๋‹ค. ํ•˜์ง€๋งŒ ๋ฐ˜๋ฉด์— RTD์˜ discriminator๋Š” ์˜๋ฏธ์ ์œผ๋กœ ๊ฐ€๊นŒ์šด token์˜ ์‚ฌ์ด๋ฅผ ์ตœ๋Œ€ํ•œ ๋ฉ€๋ฆฌํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ด์ง„๋ถ„๋ฅ˜ ์ตœ์ ํ™”(๋งž๋‹ค ์•„๋‹ˆ๋‹ค)๋ฅผ ํ•˜๊ณ  pull their embedding์„ ํ•˜๊ฒŒ ๋จ์œผ๋กœ์จ ๋”์šฑ ๊ตฌ๋ถ„์„ ์ž˜ ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋Ÿฌํ•œ โ€˜์ค„๋‹ค๋ฆฌ๊ธฐ tug-of-warโ€™์™€ ๊ฐ™์€ ์—ญํ•™์ด ํ•™์Šต์„ ๋ง์น˜๊ณ , ๋ชจ๋ธ ์„ฑ๋Šฅ์„ ๋–จ์–ดํŠธ๋ฆฌ๋Š” ๊ฒƒ์ด๋ผ๊ณ  ๋งํ•œ๋‹ค. ๊ทธ๋ ‡๋‹ค๊ณ  ๋ฌด์กฐ๊ฑด์ ์œผ๋กœ seperated embedding์„ ํ•  ์ˆ˜๋Š” ์—†๋Š” ๊ฒƒ์ด generator์˜ embedding์„ discriminator์˜ ๋‹ค์šด์ŠคํŠธ๋ฆผ taskํ•™์Šต์— ํฌํ•จ์‹œํ‚ค๋Š” ๊ฒƒ์ด ๋„์›€์ด ๋œ๋‹ค๊ณ  ๋งํ•˜๋Š” ๋…ผ๋ฌธ๋„ ์žˆ์—ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

๊ทธ๋ž˜์„œ ๊ทธ๋“ค์ด ์ œ์•ˆํ•˜๋Š” ๊ฒƒ์€ new gradient-disentangled embedding sharing(GDES) method์ด๋‹ค. the generator shares its embeddings with the discriminator but stops the gradients from the discriminator to the generator embeddings. embedding sharing์˜ ์žฅ์ ๋งŒ์„ ์ทจํ•˜๋˜, ์ค„๋‹ค๋ฆฌ๊ธฐ ์—ญํ•™์€ ํ”ผํ•  ์ˆ˜ ์žˆ๋„๋ก gradient์˜ ํ๋ฆ„์ด discriminator์—์„œ generator๋กœ ํ๋ฅด์ง€๋Š” ์•Š๋„๋ก ํ•˜๋Š” ๋ฐฉ์‹์ธ ๊ฒƒ์ด๋‹ค.

Model Table#

| Model | Vocabulary (K) | Backbone Parameters (M) | Hidden Size | Layers | Note |
|---|---|---|---|---|---|
| V2-XXLarge¹ | 128 | 1320 | 1536 | 48 | 128K new SPM vocab |
| V2-XLarge | 128 | 710 | 1536 | 24 | 128K new SPM vocab |
| XLarge | 50 | 700 | 1024 | 48 | Same vocab as RoBERTa |
| Large | 50 | 350 | 1024 | 24 | Same vocab as RoBERTa |
| Base | 50 | 100 | 768 | 12 | Same vocab as RoBERTa |
| DeBERTa-V3-Large² | 128 | 304 | 1024 | 24 | 128K new SPM vocab |
| DeBERTa-V3-Base² | 128 | 86 | 768 | 12 | 128K new SPM vocab |
| DeBERTa-V3-Small² | 128 | 44 | 768 | 6 | 128K new SPM vocab |
| DeBERTa-V3-XSmall² | 128 | 22 | 384 | 12 | 128K new SPM vocab |
| mDeBERTa-V3-Base² | 250 | 86 | 768 | 12 | 250K new SPM vocab, multilingual model with 102 languages |

Notes

  1. This is the model (89.9) that surpassed T5 11B (89.3) and human performance (89.8) on SuperGLUE for the first time. 128K new SPM vocab.

  2. These V3 DeBERTa models are DeBERTa models pre-trained with the ELECTRA-style objective plus gradient-disentangled embedding sharing, which significantly improves model efficiency.

Background#

1. Transformer#

Transformer-based language models consist of \(L\) stacked transformer blocks. Each block contains a multi-head self-attention layer followed by a fully connected position-wise feed-forward network. The standard self-attention mechanism by itself cannot encode word position information, so existing approaches add a positional bias to each input word embedding, so that every token is represented by a vector whose value depends on both its content and its position. This positional bias can be an absolute position embedding or a relative position embedding, and relative position embeddings have recently tended to show better results.

2. DeBERTa#

DeBERTa introduces two improvements over BERT: DA (Disentangled Attention) and an enhanced mask decoder. Unlike earlier approaches that represent each input word's content and position with a single vector, DA uses two separate vectors, one for the content and one for the position. The attention weights between words are then computed with disentangled matrices, so that content and relative position each contribute through their own matrices.
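
For reference, the original DeBERTa paper decomposes the attention score between tokens \(i\) and \(j\) roughly as below, where \(H\) denotes content vectors and \(P\) relative-position vectors (this equation is quoted from the DeBERTa paper as background, not from DeBERTaV3; DeBERTa keeps only the first three terms):

\[ A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^{\top} = H_i H_j^{\top} + H_i P_{j|i}^{\top} + P_{i|j} H_j^{\top} + P_{i|j} P_{j|i}^{\top} \]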

MLM itself is used the same way as in BERT. DA already accounts for content and relative position, but crucially it does not account for absolute position, which is quite an important factor in prediction. To compensate for this, DeBERTa uses an enhanced mask decoder to complement MLM: absolute position information of the context words is injected in the MLM decoding layer.

3. ELECTRA#

2.3.1 Masked Language Model(MLM)#

Large-scale Transformer-based PLMs are usually pre-trained on huge amounts of text with a self-supervision objective, MLM; in other words, the model learns to understand context by reconstructing masked text.

\(X = \{x_i\}\) is a sequence and \(\tilde{X}\) is the same sequence with 15% of its tokens corrupted (masked). The goal is to reconstruct \(X\): a language model parameterized by \(\theta\) is trained by predicting the masked tokens \(\tilde{x}\) conditioned on \(\tilde{X}\).

\[ \max_{\theta}\log p_{\theta}(X|\tilde{X}) = \max_{\theta}\sum_{i\in C} \log p_{\theta}(\tilde{x}_i = x_i|\tilde{X}) \]

\(C\) : index set of the masked tokens in the sequence
In BERT, 10% of the masked tokens are kept unchanged, another 10% are replaced with randomly picked tokens, and the remaining 80% are replaced with the \([MASK]\) token.

2.3.2 Replaced Token Detection(RTD)#

BERT๋Š” ํ•˜๋‚˜์˜ transformer encoder๋ฅผ ์‚ฌ์šฉํ–ˆ๊ณ , MLM์œผ๋กœ ํ›ˆ๋ จ๋˜์—ˆ๋‹ค. ์ด์™€ ๋‹ค๋ฅด๊ฒŒ ELECTRA๋Š” ๋‘ ๊ฐœ์˜ transformer encoders๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด์„œ GAN์ฒ˜๋Ÿผ ํ›ˆ๋ จํ–ˆ๋‹ค. Generator encoder๋Š” MLM์œผ๋กœ ํ›ˆ๋ จ๋˜์—ˆ๊ณ , discriminator encoder๋Š” token-level binary classifier๋กœ ํ›ˆ๋ จ๋˜์—ˆ๋‹ค. generator๋Š” input sequence์—์„œ ๋งˆ์Šคํ‚น๋œ token์„ ๋Œ€์ฒดํ•  ambiguousํ•œ ํ† ํฐ์„ ์ƒ์„ฑํ–ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋ ‡๊ฒŒ ๋งŒ๋“ค์–ด์ง„ sequence๋Š” dicriminator๋กœ ๋“ค์–ด๊ฐ€์„œ ํ•ด๋‹น ํ† ํฐ์ด original ํ† ํฐ์ด ๋งž๋Š”์ง€ ์•„๋‹ˆ๋ฉด generator๊ฐ€ ๋งŒ๋“  ํ† ํฐ์ธ์ง€๋ฅผ ์ด์ง„ ๋ถ„๋ฅ˜ํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด ์ด์ง„๋ถ„๋ฅ˜ํ•˜๋Š” ๊ฒƒ์ด RTD์ด๋‹ค. ์—ฌ๊ธฐ์„œ parameterized ๋˜๋Š” ๋ถ€๋ถ„์ด \(\theta_{G}\)๋Š” generator์˜ ํŒŒ๋ผ๋ฏธํ„ฐ์ด๊ณ , \(\theta_{D}\)๋Š” discriminator์˜ ํŒŒ๋ผ๋ฏธํ„ฐ์ด๋‹ค. Loss function of the generator๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[ L_{MLM} = \mathbb{E} \Big(-\sum_{i \in C}\log p_{\theta_G} \big(\tilde{x}_{i,G} = x_i|\tilde{X}_G \big) \Big) \]
  • \(p_{\theta_G} \big(\tilde{x}_{i,G} = x_i|\tilde{X}_G \big)\) : the probability that G reconstructs position \(i\) of \(\tilde{X}_G\) as the original token \(x_i\).

  • In particular, the masked tokens in \(\tilde{X}_G\) are the 15% of the original sequence that were randomly masked.

The input sequence for the discriminator is formed by filling the masked positions with new tokens sampled according to the generator's output probabilities. So if \(i\) is not in \(C\), the token stays \(x_i\) (it is the original, never replaced), and only the positions whose index is in \(C\) are filled with sampled tokens.

\[\begin{split} \tilde{x}_{i,D} = \begin{cases} \tilde{x}_i \sim p_{\theta_G} \big(\tilde{x}_{i,G} = x_i|\tilde{X}_G\big), & i\in C \\ x_i, & i\notin C \end{cases} \end{split}\]
  • \(\sim\) : the 'sim' symbol reads as 'distributed as' or 'has the distribution of'; it means the variable follows the given probability distribution. Here it means that \(\tilde{x}_i\) is distributed according to the given distribution \(p_{\theta_G}\).

  • ๋”ฐ๋ผ์„œ, \(\tilde{x}_i \sim p{\theta_G} (\tilde{x}_{i,G} = x_i|\tilde{X}_G)\)๋Š” ๋ณ€์ˆ˜ \(\tilde{x}_i\)๊ฐ€ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ  ๋ถ„ํฌ \(p{\theta_G} (\tilde{x}_{i,G} = x_i|\tilde{X}_G)\)๋ฅผ ๋”ฐ๋ฅธ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ถ„ํฌ๋Š” \(p_{\theta_G}\) ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐ€์ง€๋ฉฐ, \(\tilde{X}_G\)๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ \(\tilde{x}_{i,G} = x_i\)์ธ ์กฐ๊ฑด๋ถ€ ๋ถ„ํฌ๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์œ„ ์ˆ˜์‹์˜ ์ „์ฒด์ ์ธ ์˜๋ฏธ๋Š”, ์ธ๋ฑ์Šค i๊ฐ€ ์ง‘ํ•ฉ C์— ํฌํ•จ๋˜์–ด ์žˆ์œผ๋ฉด, \(\tilde{x}_i\)๋Š” ์กฐ๊ฑด๋ถ€ ๋ถ„ํฌ \(p_{\theta_G} (\tilde{x}_{i,G} = x_i|\tilde{X}_G)\)๋ฅผ ๋”ฐ๋ฅด๊ณ , ๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด \(\tilde{x}_i\)๋Š” \(x_i\)์™€ ๋™์ผํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

\[ L_{RTD} = \mathbb{E}\Big( -\sum_{i}\log p_{\theta_D} \big(\mathbb{I}(\tilde{x}_{i,D} = x_i)|\tilde{X}_D,i \big) \Big) \]
  • \(\mathbb{I}\) : the indicator function. It returns 1 if \(\tilde{x}_{i,D} = x_i\) holds and 0 otherwise, giving the discriminator a hard binary label for each position.

  • The input to the discriminator's loss function is \(\tilde{X}_D\), which is the result of the case equation above. In other words, the sequence whose masked positions have been filled with generator samples is what goes into the RTD loss.

์ด์ฒด์ ์ธ ELECTRA์˜ loss function์€ ์•„๋ž˜์™€ ๊ฐ™์ด ์ •๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค. $\( L = L_{MLM} + \lambda L_{RTD} \)$

  • \(\lambda\) : the weight on the discriminator's loss, used to control its importance during training. Here \(\lambda = 50\), i.e. the RTD loss is weighted 50 times relative to the MLM loss, which puts a very strong emphasis on RTD.
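
To make the formulas above concrete, here is a minimal sketch of one ELECTRA-style training step; the `generator`/`discriminator` call signatures and the `mask_tokens` helper from the MLM sketch above are illustrative assumptions, not the official implementation:

```python
import torch
import torch.nn.functional as F

LAMBDA = 50.0  # weight on the RTD loss, as reported in the paper

def electra_rtd_step(generator, discriminator, input_ids, mask_token_id, vocab_size):
    # 1) corrupt the input for the generator (80/10/10 MLM masking, see the earlier sketch)
    masked_ids, mlm_labels = mask_tokens(input_ids, mask_token_id, vocab_size)

    # 2) generator MLM loss (L_MLM)
    gen_logits = generator(masked_ids)                       # (batch, seq, vocab)
    loss_mlm = F.cross_entropy(gen_logits.view(-1, vocab_size),
                               mlm_labels.view(-1), ignore_index=-100)

    # 3) build the discriminator input: sample replacements at the masked positions
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    is_masked = mlm_labels != -100
    disc_input = torch.where(is_masked, sampled, input_ids)

    # 4) RTD labels: the indicator I(x~_{i,D} = x_i) from the formula above
    rtd_labels = (disc_input == input_ids).float()

    # 5) discriminator token-level binary classification loss (L_RTD)
    disc_logits = discriminator(disc_input)                  # (batch, seq)
    loss_rtd = F.binary_cross_entropy_with_logits(disc_logits, rtd_labels)

    return loss_mlm + LAMBDA * loss_rtd                      # L = L_MLM + lambda * L_RTD
```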

DeBERTaV3#

DeBERTa + RTD training loss + new weight-sharing method

3.1 DeBERTa + RTD#

ELECTRA์—์„œ ๊ฐ€์ ธ์˜จ RTD, ๊ทธ๋ฆฌ๊ณ  DeBERTa disentangled attention mechanism์˜ ํ•ฉ์€ ํ”„๋ฆฌํŠธ๋ ˆ์ด๋‹ ๊ณผ์ •์—์„œ ๊ฐ„๋‹จํ•˜๊ณ  ํšจ๊ณผ์ ์ธ ๊ฒƒ์œผ๋กœ ํŒ๋ณ„๋˜์—ˆ๋‹ค. ์ด์ „ DeBERTa์—์„œ ์‚ฌ์šฉ๋˜์—ˆ๋˜ MLM objective๋ฅผ RTD objective๋กœ ๋ฐ”๊ฟˆ์œผ๋กœ์จ ๋”์šฑ disentangled attention mechainsm์„ ๋”์šฑ ๊ฐ•ํ™”ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

training ๋ฐ์ดํ„ฐ๋กœ๋Š” Wikipedia, bookcorpus์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. generator๋Š” discriminator์™€ ๊ฐ™์€ width๋ฅผ ๊ฐ€์ง€๋˜ depth๋Š” ์ ˆ๋ฐ˜๋งŒ ๊ฐ€์ ธ๊ฐ„๋‹ค. batch size๋Š” 2048์ด๋ฉฐ 125,000 step์ด ํ›ˆ๋ จ๋˜์—ˆ๋‹ค. learning_rate = 5e-4, warmup_steps = 10,000, ๊ทธ๋ฆฌ๊ณ  ์œ„์—์„œ ๋งํ–ˆ๋“ฏ์ด RTD loss function์— ๊ฐ€์ค‘์น˜๋ฅผ 50์„ ์คŒ์œผ๋กœ์„œ optimization hyperparameter๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

๊ฒ€์ฆ ๋ฐ์ดํ„ฐ๋กœ๋Š” MNLI, SQuAD v2.0์„ ์‚ฌ์šฉํ•˜์˜€๊ณ , ์ด ๋ฐ์ดํ„ฐ๋“ค์— ๋Œ€ํ•œ ์ •๋ฆฌ๋„ ํ•„์š”ํ•  ๊ฒƒ์ด๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ DeBERTa๋ฅผ ์••๋„ํ•˜์ง€๋งŒ ๋”์šฑ๋” improved๋  ์žˆ๋Š” ํฌ์ธํŠธ๋ฅผ ๋งํ•˜๋Š” ์ง€์ ์ด ์žˆ๋‹ค. token Embedding Sharing(ES) used for RTD(๊ธฐ์กด์— ์‚ฌ์šฉ๋˜์—ˆ๋˜)๋ฅผ new Gradient-Disentangled Embedding Sharing(GDES) method๋กœ ๋ฐ”๊ฟˆ์œผ๋กœ์จ ๋”์šฑ ๋ฐœ์ „๋  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ๋‹ค๊ณ  ๋งํ•œ๋‹ค.

3.2 Token Embedding Sharing (in ELECTRA)#

ELECTRA์—์„œ๋Š” generator์™€ discriminator๊ฐ€ token embedding์„ ๊ณต์œ ํ•œ๋‹ค. ์ด๊ฒƒ์ด Embedding Sharing(ES)์ด๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ generator๊ฐ€ discriminator์— input์œผ๋กœ ๋“ค์–ด๊ฐˆ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•จ์œผ๋กœ์จ ํ•™์Šต์— ํ•„์š”ํ•œ parameter๋ฅผ ์ค„์—ฌ์ฃผ๋Š” ์—ญํ• ์„ ํ•˜๊ณ  ํ•™์Šต์„ ์šฉ์ดํ•˜๊ฒŒ ํ•ด์ค€๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์•ž์—์„œ ๋งํ–ˆ๋“ฏ์ด ๋‘ ๊ธฐ์ œ์˜ ๋ชฉ์  ๋ฐฉํ–ฅ์„ฑ์ด ๋ฐ˜๋Œ€์ด๊ธฐ ๋•Œ๋ฌธ์— ์„œ๋กœ๋ฅผ ๋ฐฉํ•ดํ•˜๊ณ , ํ•™์Šต ์ˆ˜๋ ด์„ ์ €ํ•ดํ•  ๊ฐ€๋Šฅ์„ฑ์ด ํฌ๋‹ค.

  • \(E\) : token embeddings

  • \(g_E\) : gradients = \(\frac{\partial L_{MLM}}{\partial E} + \lambda\frac{\partial L_{RTD}}{\partial E}\)

์œ„์˜ equation์€ token embeddings(E)๊ฐ€ ๋‘ ๊ฐœ์˜ ์ผ์—์„œ์˜ gradient๋ฅผ ํ•œ ๋ฒˆ์— ์กฐ์ •ํ•˜๋ฉด์„œ update๋œ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค. ์œ„์—์„œ ๋งํ–ˆ๋“ฏ์ด ์ด๊ฒƒ์€ ์ค„๋‹ค๋ฆฌ๊ธฐ ์ด๋‹ค. ์•„์ฃผ ์กฐ์‹ฌ์Šค๋Ÿฝ๊ฒŒ update ์†๋„๋ฅผ ์กฐ์ ˆํ•˜๋ฉด์„œ(small learning_rate, gradient clipping) ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ฉด ๊ฒฐ๋ก ์ ์œผ๋กœ๋Š” ์ˆ˜๋ ด์„ ํ•˜๊ธฐ๋Š” ํ•œ๋‹ค๊ณ  ๋งํ•œ๋‹ค. ํ•˜์ง€๋งŒ ๋‘ ๊ฐœ์˜ task๊ฐ€ ์ •๋ฐ˜๋Œ€์˜ ๋ชฉ์ ์„ ๊ฐ€์ง„๋‹ค๋ฉด ์ด๊ฒƒ์€ ๊ต‰์žฅํžˆ ๋น„ํšจ์œจ์ ์ด๋ฉฐ, ํ•ด๋‹น ์ƒํ™ฉ(MLM,RTD)์€ ์ •ํ™•ํžˆ ๊ทธ๋Ÿฐ ์ƒํ™ฉ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋‘ ๊ฐœ์˜ task๊ฐ€ token embedding์„ ์—…๋ฐ์ดํŠธ ํ•˜๋ฉด์„œ ๋ฐ”๋ผ๋Š” ๊ฒƒ์ด ํ•˜๋‚˜๋Š” ์œ ์‚ฌ์„ฑ์— ๋”ฐ๋ผ์„œ ๊ฐ€๊น๊ฒŒ ํ•˜๊ณ , ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” ์œ ์‚ฌ์„ฑ์— ๋”ฐ๋ผ์„œ ๋ฉ€๊ฒŒํ•˜์—ฌ ๋ถ„๋ฅ˜ํ•˜๋Š” ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

์ด๊ฒƒ์„ ์‹ค์ œ๋กœ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด์„œ ์—ฌ๋Ÿฌ ๋‹ค์–‘ํ•œ ELECTRA๋ฅผ ๊ตฌํ˜„ํ•˜๋˜, ํ•ด๋‹น ELECTRA๋“ค์€ token embedding์„ ๊ณต์œ ํ•˜์ง€ ์•Š๋„๋ก ๊ตฌํ˜„ํ–ˆ๋‹ค๊ณ ํ•œ๋‹ค. ๊ทธ๋ ‡๊ฒŒ ๊ตฌํ˜„์„ ํ•˜๋ฉด No Embedding Sharing(NES)๊ฐ€ ๋˜๋Š” ๊ฒƒ์ด๋‹ค. ์–˜๋„ค๋Š” gradient update๊ฐ€ ๊ฐ๊ฐ ๋œ๋‹ค. ์šฐ์„ ์€ (1) generator์˜ parameter(token embedding with \(E_G\))๊ฐ€ MLM loss๋ฅผ back-propํ•˜๋ฉด์„œ ์—…๋ฐ์ดํŠธ๋˜๊ณ , (2) ์ดํ›„์— discriminator๊ฐ€ generator output์„ input์œผ๋กœ ๋ฐ›๋Š”๋‹ค (3) ๋งˆ์ง€๋ง‰์œผ๋กœ discriminator parameter(token embeddings with \(E_D\))๋ฅผ RTD loss๋ฅผ back-propํ•˜๋ฉด์„œ updateํ•œ๋‹ค.

์ด๋“ค์€ 3๊ฐ€์ง€๋กœ ES vs NES๋ฅผ ๋น„๊ตํ–ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

  1. Convergence speed : NES wins, as expected, since it avoids the gradient conflict.

  2. Quality of token embeddings : they compare average cosine similarity scores. \(E_G\) comes out very well, while \(E_D\) appears to have learned very little. 'Coming out well' here means that \(E_G\) is semantically coherent, with a very high average cosine similarity.

  3. Performance on downstream NLP tasks : NES also fails to perform well on downstream tests.

This shows that ES has the advantage that the discriminator is helped by learning from the generator's embeddings.
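
For reference, a small sketch of how the average cosine similarity of an embedding table could be computed, the metric used in the comparison above (my own helper, not the paper's code):

```python
import torch
import torch.nn.functional as F

def avg_cosine_similarity(embeddings: torch.Tensor) -> float:
    """Mean pairwise cosine similarity over all distinct pairs of embedding rows."""
    normed = F.normalize(embeddings, dim=-1)        # unit-length rows
    sims = normed @ normed.T                        # (V, V) cosine similarity matrix
    v = sims.size(0)
    off_diag = sims.sum() - sims.diagonal().sum()   # drop the self-similarity terms
    return (off_diag / (v * (v - 1))).item()

# e.g. avg_cosine_similarity(model.embeddings.word_embeddings.weight) for E_G vs E_D
```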

??? average cosine similarity of word embeddings of the G vs D#

Is a higher average cosine similarity necessarily better? I don't think I have fully understood what it means.

3.3 Gradient-Disentangled Embedding Sharing(GDES)#

ES, NES์˜ ๋‹จ์ ์„ ๊ฝค ๋šซ๊ธฐ ์œ„ํ•ด ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ ์ค‘์š”ํ•˜๊ฒŒ ๋งํ•˜๋Š” ์ง€์ ์ด๋‹ค. ๋‘ ๊ฐœ์˜ ์žฅ ๋‹จ์ ์ด ๋ถ„๋ช…ํ•˜๊ฒŒ ์กด์žฌํ•˜๋ฉด์„œ ๋‘ ๊ฐœ๋ฅผ ๋ชจ๋‘ ์ฑ™๊ธธ ๋ฐฉ๋ฒ•์œผ๋กœ ๋‚˜์˜จ ๊ฒƒ์ด๋‹ค. ํ•œ ๋ฒˆ ์ •๋ฆฌ๋ฅผ ํ•˜์ž๋ฉด ES๋Š” ํ•™์Šต์€ ๋Š๋ฆฌ์ง€๋งŒ generator output : token embedding๋ฅผ discriminator๊ฐ€ ์ฐธ์กฐํ•˜๋ฉด์„œ ํ•™์Šต parameter reducing์— ๋„์›€์„ ๋ฐ›๋Š” ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ํ•˜์ง€๋งŒ ๋‹จ์ ์€ generator discriminator token embeddings๊ฐ€ ๋‘˜ ๋‹ค ์ผ๊ด€์„ฑ์ด ์—†์–ด์ง„๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

๋ฐ˜๋ฉด์— NES๋Š” ํ•™์Šต์ด ๊ต‰์žฅํžˆ ๋นจ๋ผ์ง„๋‹ค. G,D์˜ ๋ฐฉํ–ฅ์„ฑ์˜ ์ •๋ฐ˜๋Œ€์˜ ์„ฑ์งˆ์„ ํ•ด๊ฒฐํ•ด ์คŒ์œผ๋กœ์จ ํ•™์Šต์ด ์šฉ์ดํ•˜๊ฒŒ ๋œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ง€๋งŒ ๊ฒฐ๋ก ์ ์œผ๋กœ๋Š” ํ•™์Šต์„ ์˜คํžˆ๋ ค ES๋ณด๋‹ค ๋ชปํ•œ ๊ผด์ด ๋‚œ๋‹ค. ๊ทธ๋ž˜๋„ ์žฅ์ ์€ G์˜ token embedding์ด ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„๊ฐ€ ๋†’์€ ์ผ๊ด€์„ฑ ์žˆ๋Š” embedding์„ ๋งŒ๋“ค์–ด์ฃผ๋Š” ๊ฒฝํ–ฅ์„ฑ์„ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅํ•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

์ด ๋ชจ๋“  ๋‹จ์ ์„ ์ปค๋ฒ„ํ•˜๊ณ  ๋„๋Œ€์ฒด ์–ด๋–ป๊ฒŒ ์žฅ์ ๋งŒ ๋‚จ๊ธด๋‹ค๋Š” ๊ฒƒ์ธ๊ฐ€? ์žฅ์ ๋งŒ ๋‚จ๊ธด๋‹ค๋ฉด ํ•™์Šต์˜ ์†๋„๋„ ๋นจ๋ผ์ง€๋ฉด์„œ ๋™์‹œ์— G,D์˜ token embedding์ด ์œ ์‚ฌ์„ฑ์„ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์ด ๋  ๊ฒƒ์ด๋‹ค. ํ›„์ž๋ฅผ ๋…ผ๋ฌธ์—์„œ๋Š” โ€˜learn from same vocabulary and leverage the rich semantic information encoded in the embeddingsโ€™๋ผ๊ณ  ๋งํ•œ๋‹ค.

์ด๊ฒƒ์„ GDES๋Š” ์˜ค์ง generator embeddings๋ฅผ MLL loss๋งŒ ๊ฐ€์ง€๊ณ  ์—…๋ฐ์ดํŠธํ•จ์œผ๋กœ์จ, output์˜ ์ผ๊ด€์„ฑ๊ณผ ํ†ต์ผ์„ฑ์„ ์œ ์ง€ํ•œ๋‹ค. ๊ทธ๊ฒƒ์„ ์ˆ˜์‹์ ์œผ๋กœ ๋ณด๋ฉด

\[ E_D = sg(E_G) + E_{\Delta} \]

์›๋ž˜๋Š” NES์—์„œ๋Š” \(E_D\)๋กœ ๋ฐ”๋กœ backpropํ•˜๋˜ ๊ฒƒ์„ re-parameterizeํ•˜์—ฌ discriminator embedding๋ฅผ ์ƒˆ๋กœ ์ •์˜ํ•œ๋‹ค. \(sg\)d์˜ ์—ญํ• ์€ \(E_G\)์—์„œ ๋‚˜์˜จ gradient๊ฐ€ ๊ณ„์† ํ˜๋Ÿฌ๋“ค์–ด๊ฐ€๋Š” ๊ฒƒ์„ ๋ง‰๊ณ , residual embeddings \(E_{\Delta}\) ๋งŒ์„ ์—…๋ฐ์ดํŠธ ํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด๊ฒƒ์€ residual learning์—์„œ์˜ ์•„์ด๋””์–ด์™€ ๊ต‰์žฅํžˆ ์œ ์‚ฌํ•œ ๊ฒƒ ๊ฐ™์€๋ฐ!!!.

  1. The generator runs forward and produces the input for the discriminator, using \(E_G\).

  2. \(E_G\) is updated with the MLM loss (since \(E_D = sg(E_G) + E_{\Delta}\), \(E_D\) shifts along with it).

  3. The discriminator runs on the generator's output.

  4. \(E_D\) is updated with the RTD loss, but only through \(E_{\Delta}\).

  5. After training, \(E_D = E_G + E_{\Delta}\).
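
A minimal sketch of the GDES re-parameterization using PyTorch's detach() as the stop-gradient operator (toy shapes and stand-in losses; the names E_G and E_delta are mine, not from the released implementation):

```python
import torch

vocab_size, hidden = 10, 4
E_G = torch.nn.Parameter(torch.randn(vocab_size, hidden))        # generator embeddings
E_delta = torch.nn.Parameter(torch.zeros(vocab_size, hidden))     # residual embeddings

def discriminator_embeddings():
    # E_D = sg(E_G) + E_delta: detach() blocks RTD gradients from reaching E_G
    return E_G.detach() + E_delta

# generator step: the MLM loss updates E_G (stand-in loss for illustration)
loss_mlm = E_G.pow(2).mean()
loss_mlm.backward()
assert E_G.grad is not None and E_delta.grad is None

# discriminator step: the RTD loss updates only E_delta
E_G.grad = None
loss_rtd = discriminator_embeddings().pow(2).mean()
loss_rtd.backward()
assert E_G.grad is None and E_delta.grad is not None

# after training, the effective discriminator table is E_G + E_delta
E_D = (E_G + E_delta).detach()
```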

The three variants differ only in how they share embeddings; there is no difference in computation cost. Getting a performance boost from the idea alone, with no extra computation cost, is also reminiscent of ResNet.

์ฝ”์‚ฌ์ธ ํ‰๊ท  ์œ ์‚ฌ๋„์—์„œ๋„ ์ฐจ์ด๊ฐ€ NES๋ณด๋‹ค๋Š” ๋œํ•œ๋ฐ ์ด๋Š” โ€˜preserves more semantic information in the discriminator embeddings through the partial weight shargingโ€™์ด๋ผ๊ณ  ๋งํ•œ๋‹ค. ์—ฌ๊ธฐ์„œ ๋ณด์ด๋Š” partial weight sharing์ด \(E_{\Delta}\)์ด๋ฉฐ ์ด๊ฒƒ์ด embedding์˜ ์ž”์ฐจ๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ง„ํ–‰๋จ์œผ๋กœ์จ ํ•™์Šต์„ ์šฉ์ดํ•˜๊ฒŒ ๊ฐ€์ ธ๊ฐ”๋‹คโ€ฆ ์ •๋„๋กœ ๋ณด์ธ๋‹ค.

Conclusion#

  • pre-training paradigm for language models based on the combination of DeBERTa and ELECTRA, two state-of-the-art models that use relative position encoding and replaced token detection (RTD) respectively

  • interference issue between the generator and the discriminator in the RTD framework which is well known as the โ€œtug-of-warโ€ dynamics.

  • GDES : allows the discriminator to leverage the semantic information encoded in the generator's embedding layer without interfering with the generator's gradients, and thus improves the pre-training efficiency

  • a new way of sharing information between the generator and the discriminator in the RTD framework, which can be easily applied to other RTD-based language models

  • DeBERTaV3-Large : +1.37% over DeBERTa on the GLUE average score

  • ๋ชฉ์  : parameter-ef๏ฌcient pre-trained language models

Reference#

  1. kpmg notion - Deberta Review

  2. HF - lighthouse/mdeberta-v3-base-kor-further

  3. github - microsoft/DeBERTa