[NLP] Transformer paper summary

Suda_777 2022. 11. 29. 20:49

0. About the paper
1. Abstract
2. Introduction
3. Background
4. Model Architecture
- 4.1. Encoder and Decoder Stacks
- 4.2. Attention
5. Conclusion

0. About the paper

Paper link
[Attention Is All You Need](https://arxiv.org/abs/1706.03762)

1. Abstract

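The dominant sequence transduction models so far are complex recurrent or convolutional networks in an encoder-decoder configuration
The paper proposes a new architecture, the Transformer
Simpler, parallelizable, better quality, less training time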

2. Introduction

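Recurrent models
- Cannot be parallelized within a training example, which becomes a limitation at longer sequence lengths
- Factorization tricks and conditional computation have improved computational efficiency
- The fundamental constraint of sequential computation remains
Attention mechanisms
- Can model dependencies regardless of their distance in the input or output sequences
- In most cases, however, they are used together with a recurrent network
Transformer
- Uses an attention mechanism to draw global dependencies between input and output
- Allows significantly more parallelization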
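
The contrast between sequential computation and attention can be made concrete with a toy sketch. This is only an illustration under simplifying assumptions: the inputs are scalars, and the function names and the weights `w_h` and `w_x` are hypothetical, not taken from the paper.

```python
import math

def recurrent_states(xs, w_h=0.5, w_x=1.0):
    """Toy scalar recurrence h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h, states = 0.0, []
    for x in xs:
        # Step t needs h from step t-1, so the loop cannot be parallelized.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def attention_style_states(xs):
    """Each output is a softmax-weighted average over all inputs."""
    outputs = []
    for q in xs:
        # Each iteration depends only on xs, so iterations are independent
        # of each other and could run in parallel.
        weights = [math.exp(q * k) for k in xs]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, xs)) / total)
    return outputs

xs = [0.1, -0.3, 0.7, 0.2]
rnn_out = recurrent_states(xs)
attn_out = attention_style_states(xs)
```

In the first function every state waits for the previous one. In the second, every position looks at all inputs directly, which is the property that lets the Transformer parallelize across positions.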

3. Background

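The goal of reducing sequential computation
- Extended Neural GPU, ByteNet, and ConvS2S use CNNs as the basic building block to reduce sequential computation.
- In these models the number of operations needed to relate two positions grows with the distance between them, which makes it harder to learn dependencies between distant positions.
- The Transformer reduces this to a constant number of operations; Multi-Head Attention counteracts the reduced effective resolution that results from averaging attention-weighted positions.
Self-attention
- Has been used successfully in a variety of tasks
End-to-end memory networks
- Based on a recurrent attention mechanism rather than sequence-aligned recurrence
- Perform well on simple-language question answering and language modeling tasks
Transformer
- Relies entirely on self-attention, without sequence-aligned RNNs or convolution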

4. Model Architecture

4.1. Encoder and Decoder Stacks

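Encoder
- A stack of 6 identical layers
- Each layer has two sub-layers
  - First sub-layer: multi-head self-attention mechanism
  - Second sub-layer: position-wise fully connected feed-forward network
- A residual connection is applied around each of the two sub-layers, followed by layer normalization
  - LayerNorm(x + Sublayer(x))
- All sub-layers and the embedding layers produce outputs of dimension 512
Decoder
- A stack of 6 identical layers
- In addition to the two sub-layers of each encoder layer, a third sub-layer is inserted
- This third sub-layer performs multi-head attention over the output of the encoder stack
- Residual connections are used around each sub-layer, followed by layer normalization, as in the encoder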
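
The residual step LayerNorm(x + Sublayer(x)) can be sketched in a few lines. This is a minimal illustration, assuming a single position vector and a layer normalization without the learned gain and bias; `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (almost) unit variance.
    The learned gain and bias of a real layer norm are omitted here."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position vector x."""
    y = sublayer(x)
    # The residual addition only works if the sub-layer keeps the dimension.
    assert len(y) == len(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sub-layer: any function that maps a vector to one of equal size.
def toy_sublayer(x):
    return [0.5 * v for v in x]

out = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

The dimension check shows why every sub-layer output has the same size: x and Sublayer(x) are added element by element, so their shapes have to match.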

4.2. Attention

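An attention function maps a query and a set of key-value pairs to an output
The query, keys, values, and output are all vectors
Q, K, V (taken from another post)
- Q: query (the word A that is being influenced)
- K: key (the word B that exerts the influence)
- V: value (the content of word B that is passed on, weighted by that influence)
The output is a weighted sum of the values, where each weight comes from a compatibility function of the query with the corresponding key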

4.2.1. Scaled Dot-Product Attention

Input: queries and keys of dimension $d_{k}$, and values

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$

Additive attention
- Computes the compatibility function with a feed-forward network
Dot-product attention
- Identical to the attention used here, except for the scaling factor $\frac{1}{\sqrt{d_{k}}}$
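
Scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, can be sketched in plain Python as follows. This is an illustration rather than a reference implementation: matrices are lists of rows, there is no masking or batching, and the example values of `Q`, `K`, and `V` are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v. Returns n_q x d_v."""
    d_k = len(K[0])
    d_v = len(V[0])
    output = []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return output

Q = [[1.0, 0.0]]                 # one query
K = [[1.0, 0.0], [0.0, 1.0]]     # two keys
V = [[10.0, 0.0], [0.0, 10.0]]   # one value per key
result = scaled_dot_product_attention(Q, K, V)
```

The query matches the first key more closely, so the first value receives the larger weight in the output. Dropping the division by `math.sqrt(d_k)` gives plain dot-product attention.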

4.2.2. Multi-Head Attention

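Rather than using a single attention function, it is more beneficial to linearly project the queries, keys, and values several times with different projections, so that multiple attention functions each receive different inputs
The attention functions run in parallel, each producing its own output values
These outputs are concatenated and projected once more to give the final result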

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h}) W^{O}$

$\text{where } \mathrm{head}_{i} = \mathrm{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})$
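
The project, attend, concatenate, and project-again steps can be sketched as below. This is a rough illustration under stated assumptions: the sizes `d_model = 8` and `h = 2` are toy values chosen for readability (the paper uses 512 and 8), `d_k = d_v = d_model / h` follows the paper's choice, the weights are random rather than learned, and all helper names are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V hold one projection matrix per head; W_O is the output projection."""
    heads = [attention(matmul(Q, wq), matmul(K, wk), matmul(V, wv))
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concat: join the head outputs along the feature axis, position by position.
    concat = [sum((head[i] for head in heads), []) for i in range(len(Q))]
    return matmul(concat, W_O)

def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, h = 8, 2            # toy sizes, not the paper's 512 and 8
d_k = d_v = d_model // h
W_Q = [random_matrix(d_model, d_k, rng) for _ in range(h)]
W_K = [random_matrix(d_model, d_k, rng) for _ in range(h)]
W_V = [random_matrix(d_model, d_v, rng) for _ in range(h)]
W_O = random_matrix(h * d_v, d_model, rng)

X = random_matrix(3, d_model, rng)   # 3 positions, self-attention: Q = K = V = X
out = multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O)
```

Each head works in a smaller `d_k`-dimensional space, and the final projection `W_O` brings the concatenated result back to `d_model`, so the output has the same shape as the input.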

4.2.3. Applications of Attention in our Model

Omitted

5. Conclusion

Omitted

License: Attribution, NonCommercial, NoDerivatives
