Normalization#

morpheme

์šฐ๋ฆฌ๊ฐ€ ์ตํžˆ ์•Œ๊ณ  ์žˆ๋Š” ์ •๊ทœํ™”๋Š” curr - min / max - min์ด๋‹ค. ์–ธ์–ดํ•™๊ณผ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ morpheme(ํ˜•ํƒœ์†Œ : ์˜๋ฏธ๋ฅผ ๊ฐ€์ง„ ๊ฐ€์žฅ ์ž‘์€ ๋‹จ์œ„)๋ผ๋Š” ๊ฒƒ์ด ๋‹จ์–ด์˜ ๊ธฐ๋ฐ˜์œผ๋กœ ์ •์˜๋œ๋‹ค. ๋ณดํ†ต token์ด๋ผ๋Š” ๊ฒƒ์€ ๋‘ ๊ฐœ ์ด์ƒ์˜ morpheme์˜ ํ•ฉ์œผ๋กœ ๊ตฌ์„ฑ๋œ๋‹ค. ์ ‘๋‘์‚ฌ(prefix) ์ ‘๋ฏธ์‚ฌ(suffix)๊ฐ€ ๋‹จ์–ด์˜ ํ˜•ํƒœ์— ๋”ฐ๋ผ ๋ณ€ํ˜•๋˜์–ด ๋ถ™๋Š”๋‹ค.

Normalization is the process of 'converting a token into its base form': inflected form → base form. It helps by:

  1. reducing the # of unique tokens present in the text

  2. removing the variations of a word in the text

  3. removing redundant information

3.1 Stemming#

laughing, laughed, laughs, laugh → laugh. Drawback: a stemmer can sometimes produce strings that are not real dictionary words, so it is not always good normalization. For instance, the actual stem of winning is win, but a stemmer may output winn.

from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize

# set up the normalizers
lemmatizer = WordNetLemmatizer()
porter = PorterStemmer()
lancaster = LancasterStemmer()

Stemming is usually faster than lemmatization.
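As a quick comparison, here is a minimal sketch that runs both stemmers on the example words above (an illustration only; each stemmer applies its own suffix rules, so outputs can differ between them and across NLTK versions):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()        # conservative, rule-based suffix stripping
lancaster = LancasterStemmer()  # more aggressive stemmer

for word in ["laughing", "laughed", "laughs", "laugh", "winning"]:
    # neither stemmer guarantees that the result is a real dictionary word
    print(word, porter.stem(word), lancaster.stem(word))
```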

3.2 Lemmatization#

A lemma is roughly 'the basic dictionary form of a word'. Lemmatization fundamentally requires POS (part-of-speech) information. Morphological parsing splits a word into its stem and affixes, breaking its meaning down into units that cannot be divided any further. For this process, taking the part of speech into account when extracting a word's base form helps produce semantically accurate and consistent results.
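A minimal sketch of why the POS tag matters, using NLTK's WordNetLemmatizer (this assumes the WordNet data has been downloaded); without a POS hint it treats every word as a noun by default:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")              # lexical database the lemmatizer looks up
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("laughing"))           # no POS hint: treated as a noun
print(lemmatizer.lemmatize("laughing", pos="v"))  # verb hint -> base verb form
print(lemmatizer.lemmatize("better", pos="a"))    # adjective hint -> adjective lemma
```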

3.3 POS (part-of-speech) tags#

KoNLPy, a Korean-language processing package

A part of speech defines a word's context, function, and usage within a sentence; it is determined by the word's relationships with the other words in the sentence. Machine-learning models and rule-based models alike need to know each word's part of speech before they can be applied.
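A minimal POS-tagging sketch with NLTK (resource names can differ slightly between NLTK versions); for Korean, KoNLPy taggers such as Okt expose a similar pos() method:

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # default English POS tagger

tokens = word_tokenize("She was laughing at the winning team")
print(nltk.pos_tag(tokens))                  # list of (token, tag) pairs

# With KoNLPy the pattern is similar, e.g.:
# from konlpy.tag import Okt
# print(Okt().pos("์•„๋ฒ„์ง€๊ฐ€ ๋ฐฉ์— ๋“ค์–ด๊ฐ€์‹ ๋‹ค"))  # [(morpheme, tag), ...]
```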

4. Grammar#

4.1 Constituency Grammar#

4.2 Dependency Grammar#

words of a sentence are dependent upon other words of the sentence
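A minimal sketch of inspecting those dependencies with spaCy (assumes the small English model has been installed with python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # pipeline that includes a dependency parser
doc = nlp("The cat sat on the mat")

for token in doc:
    # each word depends on a head word via a labeled relation
    print(token.text, token.dep_, "->", token.head.text)
```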

5. Subword Tokenizer Algorithms#

Birthplace = Birth + Place

์•„๋ฌด๋ฆฌ ๋งŽ์ด ํ•™์Šต์‹œ์ผœ๋„ ๋ชจ๋ฅด๋Š” ๋‹จ์–ด๊ฐ€ ํ•™์Šต๊ณผ์ •์—์„œ ๋‚˜์˜ค๊ธฐ ๋งˆ๋ จ์ด๋‹ค. ์ด๋ฅผ OOV(out-of-vocabulary), UNK(unknown token)์ด๋ผ๊ณ  ๋งํ•œ๋‹ค. OOV๋ฌธ์ œ๋ฅผ ์ผ์œผํ‚ค๋Š” ๋‹จ์–ด๋Š” ๋ณดํ†ต ๋” ์ž‘์€ ๋‹จ์œ„๋กœ ์ชผ๊ฐœ๋ฉด ์•„๋Š” 2๊ฐœ๋กœ ํ•ด์„๋  ์ˆ˜ ์žˆ๋Š” ํ•ฉ์„ฑ์–ด, ํฌ๊ท€ ๋‹จ์–ด ํ˜น์€ ์‹ ์กฐ์–ด ์ธ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๊ธฐ ๋•Œ๋ฌธ์— ๋‹จ์–ด๋ฅผ ๋” ์ชผ๊ฐœ๊ฐฐ๋‹ค๋Š” ์˜๋ฏธ์ด๋‹ค.

5.1 Byte-Pair Encoding(BPE) 1994#

huggingface_DOC

aaabdaaabac โ†’ (aa โ†’ Z) ZabdZabac โ†’ (ab โ†’ Y) ZYdZYac โ†’ (ZY โ†’ X) XdXac

๋ฐ์ดํ„ฐ ์••์ถ• ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค. ๊ธฐ๋ณธ์ ์œผ๋กœ ์—ฐ์†์ ์œผ๋กœ ๊ฐ€์žฅ ๋งŽ์ด ๋“ฑ์žฅํ•œ ๊ธ€์ž์˜ ์Œ์„ ์ฐพ์•„์„œ ํ•˜๋‚˜์˜ ๊ธ€์ž๋กœ ๋ณ‘ํ•ฉํ•˜๋Š” ๋ฐฉ์‹.

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ BPE๋Š” ๊ณง subword segmentation ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋งํ•œ๋‹ค. ๊ธ€์ž ๋‹จ์œ„์—์„œ ์ ์ฐจ์ ์œผ๋กœ ๋‹จ์–ด ์ง‘ํ•ฉ์„ ๋งŒ๋“ค์–ด๋‚ด๋Š” bottom up ๋ฐฉ์‹์˜ ์ ‘๊ทผ์„ ํ•œ๋‹ค. ์šฐ์„  ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์— ์žˆ๋Š” ๋‹จ์–ด๋“ค์„ ๋ชจ๋“  ๊ธ€์ž ๋˜๋Š” ์œ ๋‹ˆ์ฝ”๋“œ ๋‹จ์œ„๋กœ ๋‹จ์–ด ์ง‘ํ•ฉ์„ ๋งŒ๋“ค๊ณ , ๊ฐ€์žฅ ๋งŽ์ด ๋“ฑ์žฅํ•˜๋Š” ์œ ๋‹ˆ๊ทธ๋žจ์„ ํ•˜๋‚˜์˜ ์œ ๋‹ˆ๊ทธ๋žจ์œผ๋กœ ํ†ตํ•ฉํ•œ๋‹ค. ์ •๋ฆฌํ•˜์ž๋ฉด ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•˜๊ฒŒ ๋“ฑ์žฅํ•˜๋Š” ๋ฌธ์ž ํ˜น์€ subword์˜ ์Œ์„ ์žฌ๊ท€์ ์œผ๋กœ ํ•ฉ์น˜๋ฉฐ, ํ…์ŠคํŠธ๋ฅผ ๋ถ„ํ• ํ•˜๋Š” ๋ฐฉ์‹์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋ฌธ์ œ๋Š” letter ์ˆ˜์ค€์—์„œ ์ž‘๋™ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹จ์–ด ๋‚ด์˜ ๋‚ด๋ถ€ ๊ตฌ์กฐ๋‚˜ ์˜๋ฏธ๋ฅผ ๊ณ ๋ คํ•˜์ง„ ์•Š๋Š”๋‹ค.

5.2 WordPiece Tokenizer → for BERT#

bpe๊ฐ€ ๋นˆ๋„์ˆ˜์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ๊ฐ€์žฅ ๋งŽ์ด ๋“ฑ์žฅํ•œ ์Œ์„ ๋ณ‘ํ•ฉํ•˜๋Š” ๊ฒƒ ๊ณผ๋Š” ๋‹ฌ๋ฆฌ, ๋ณ‘ํ•ฉ๋˜์—ˆ์„ ๋•Œ ์ฝ”ํผ์Šค์˜ ์šฐ๋„ likelihood๋ฅผ ๊ฐ€์žฅ ๋†’์ด๋Š” ์Œ์œผ๋กœ ๊ตฌ์„ฑ. ๋ชจ๋“  ๋‹จ์–ด ์•ž์— ๋ฅผ ๋ถ™์ด๊ณ  ๊ธฐ์กด์— ์—†๋˜ ๋„์–ด์“ฐ๊ธฐ ์ถ”๊ฐ€. ๋˜๋Œ๋ฆด๋•Œ๋Š” ๋„์–ด์“ฐ๊ธฐ ๋ชจ๋‘ ์—†์• ๊ณ  ๋ฅผ ๋„์–ด์“ฐ๊ธฐํ™” ํ•˜๋ฉด ๋œ๋‹ค. BPE๋ž‘ ๋น„์Šทํ•˜๊ธด ํ•œ๋ฐ, ๋‹จ์–ด ์ˆ˜์ค€์—์„œ ์ž‘๋™ํ•œ๋‹ค๋Š”๊ฒŒ ๋‹ค๋ฅด๋‹ค. BPE๊ฐ€ letter โ†’ token(subword) bottom up ๋ฐฉ์‹์ด์—ˆ๋‹ค๋ฉด, wordpiece๋Š” word โ†’ token(subword) topdown ๋ฐฉ์‹์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ์ฐจ์ด์ ์€ ๋‹จ์–ด ๊ฒฝ๊ณ„๋ฅผ ํ‘œ์‹œํ•˜๋Š” ํŠน์ˆ˜ ํ† ํฐ[CLS],[SEP]๋“ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ์„œ๋ธŒ์›Œ๋“œ๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

5.3 SentencePiece#

์‚ฌ์ „ ํ† ํฐํ™” ์ž‘์—…(pretokenization) ์—†์ด ์ „์ฒ˜๋ฆฌ๋ฅผ ํ•˜์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ(raw data)์— ๋ฐ”๋กœ ๋‹จ์–ด ๋ถ„๋ฆฌ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ. ๋‹จ์–ด ๋ถ„๋ฆฌ ํ† ํฐํ™” ์ˆ˜ํ–‰ โ†’ ์–ด๋–ค ์–ธ์–ด์—๋„ ๋ฐ”๋กœ ์ ์šฉ ๊ฐ€๋Šฅ.

6. Libraries#

  1. spacy

  • C-based and fast

  • just use this one (see the sketch after this list)

  2. nltk

  • offers pointlessly many choices