[LLM] Model Training with HuggingFace (DeepSpeed) (7)

Suda_777 2025. 2. 20. 04:10

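1. What is DeepSpeed?

DeepSpeed is a PyTorch optimization library that makes distributed training memory-efficient and fast. Its core is the Zero Redundancy Optimizer (ZeRO), which makes it possible to train large models at scale.

Source: DeepSpeed (huggingface.co)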

Installation

pip install deepspeed

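2. Concepts

2.1. ZeRO (Zero Redundancy Optimizer)

DeepSpeed is built around the ZeRO (Zero Redundancy Optimizer) method, which comes in three cumulative stages:

ZeRO-1: partitions the optimizer states across GPUs
ZeRO-2: additionally partitions the gradients across GPUs
ZeRO-3: additionally partitions the model parameters across GPUs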
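
To get a feel for what each stage saves, here is a rough back-of-the-envelope sketch. It assumes mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer states, which is the accounting commonly used for ZeRO. Activations and temporary buffers are ignored, and the function name is made up for this illustration.

```python
def zero_memory_per_gpu_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state memory per GPU (in GB) for mixed-precision Adam.

    Assumed bytes per parameter: 2 (fp16 weights) + 2 (fp16 gradients)
    + 12 (fp32 weights, momentum, variance). Activations are ignored.
    """
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1: optimizer states are partitioned
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2: gradients are partitioned as well
    if stage >= 3:
        params /= num_gpus  # ZeRO-3: parameters are partitioned as well
    return num_params * (params + grads + optim) / 1e9


# Example: a 7.5B-parameter model on 64 GPUs
for stage in range(4):
    print(f"stage {stage}: {zero_memory_per_gpu_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

Real usage will be higher because of activations and communication buffers, so treat the numbers as a lower bound when picking a stage.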

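2.2. CPU offload

CPU offload: when GPU memory is short, it can be hard to keep all model parameters and everything else the computation needs on the GPU alone. Offloading keeps part of that data in CPU memory to reduce GPU memory usage and brings it back to the GPU only at the moment it is needed for computation.

If you have enough GPU memory, it is recommended to disable CPU/NVMe offload.

Conclusion

Find the balance between speed and memory usage that fits your setup.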

Rank	Speed	Memory efficiency
Best	ZeRO-1	ZeRO-3 + offload
	ZeRO-2	ZeRO-3
	ZeRO-2 + offload	ZeRO-2 + offload
	ZeRO-3	ZeRO-2
Worst	ZeRO-3 + offload	ZeRO-1

3. Using it from Python

DeepSpeed parameters are usually defined in a JSON file, which is then passed to training (Trainer).

TrainingArguments(..., deepspeed="path/to/deepspeed_config.json")

A dictionary works as well:

ds_config_dict = {...}  # configuration contents

ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
args = TrainingArguments(..., deepspeed=ds_config_dict)
trainer = Trainer(model, args, ...)
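
A minimal sketch of both routes, assuming a small ZeRO config is enough for a first test. The helper names (build_ds_config, save_ds_config, make_training_args) are made up for this example; only the deepspeed argument of TrainingArguments comes from the text above. The transformers import is kept inside a function so the rest runs without the library installed.

```python
import json
from pathlib import Path


def build_ds_config(stage: int = 2, offload_optimizer: bool = False) -> dict:
    """Assemble a small DeepSpeed config as a plain dict (hypothetical helper)."""
    zero = {"stage": stage}
    if offload_optimizer:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    return {
        "zero_optimization": zero,
        # "auto" lets the Trainer fill these in from TrainingArguments
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


def save_ds_config(config: dict, path: str) -> Path:
    """Write the config to a JSON file so it can be passed by path."""
    out = Path(path)
    out.write_text(json.dumps(config, indent=4))
    return out


def make_training_args(output_dir: str, ds_config):
    """ds_config may be the dict itself or the path to the JSON file.

    Needs transformers and deepspeed installed, hence the lazy import.
    """
    from transformers import TrainingArguments
    return TrainingArguments(output_dir=output_dir, deepspeed=ds_config)


config = build_ds_config(stage=2, offload_optimizer=True)
print(json.dumps(config, indent=4))
```

Keeping the config in a dict is handy when you want to switch stages from a script, while the JSON file is easier to share and version.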

3. How to write the DeepSpeed JSON file

The JSON file is written as shown below. Keep a list of the available parameters and pull in the ones you need.

Example

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": true,
            "buffer_count": 4,
            "fast_init": false
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": true,
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        "aio": {
            "block_size": 262144,
            "queue_depth": 32,
            "thread_count": 1,
            "single_submit": false,
            "overlap_events": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

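3.1. Precision

DeepSpeed supports fp32, fp16, and bf16 mixed precision.

Using fp32

The standard single-precision (float32) floating-point format
High numerical accuracy
Slower and more memory-hungry, since every value takes 32 bits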

{
    "fp16": {
        "enabled": false
    }
}

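Enabling fp16

Half-precision (float16) floating-point format

Parameters

loss_scale: the loss scaling value; 0 selects dynamic loss scaling, any other value is used as a fixed scale
loss_scale_window: the number of steps in the window over which the dynamic loss scale is raised or lowered
initial_scale_power: the exponent of the initial dynamic loss scale; the starting value is 2^initial_scale_power (2^16 = 65536 here)
hysteresis: the delay shift in dynamic loss scaling (2 here); roughly, how many overflows are tolerated before the scale is lowered
min_loss_scale: the minimum value the dynamic loss scale may take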

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}
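
To make the loss-scale parameters concrete, here is a simplified sketch of dynamic loss scaling. It is an illustration only, not DeepSpeed's implementation: it assumes the scale is halved on overflow and doubled after a full window of overflow-free steps, and it leaves out hysteresis. The class name is made up.

```python
class SimpleLossScaler:
    """Simplified dynamic loss scaler (illustrative, not DeepSpeed's code)."""

    def __init__(self, initial_scale_power=16, loss_scale_window=1000, min_loss_scale=1.0):
        self.scale = 2.0 ** initial_scale_power
        self.window = loss_scale_window
        self.min_scale = min_loss_scale
        self.good_steps = 0

    def update(self, overflow: bool) -> None:
        if overflow:
            # Overflow: halve the scale (never below the minimum), restart the count.
            self.scale = max(self.scale / 2.0, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.window:
                # A full window without overflow: try a larger scale again.
                self.scale *= 2.0
                self.good_steps = 0


scaler = SimpleLossScaler(initial_scale_power=16, loss_scale_window=3)
history = []
for overflow in [False, True, False, False, False, True]:
    scaler.update(overflow)
    history.append(scaler.scale)
print(history)
```

A short window (3) is used here only so the doubling shows up in a handful of steps; the config above uses 1000.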

bf16

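A floating-point format developed by Google to optimize performance on TPUs
Use it to gain speed and save memory while keeping the numeric range of fp32; it needs hardware support (e.g. TPUs, or NVIDIA A100 and newer GPUs)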

{
    "bf16": {
        "enabled": "auto"
    }
}
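
The range argument for bf16 can be checked with plain arithmetic. The sketch below computes the largest finite value of each format from its exponent and mantissa bit counts (fp32: 8/23, fp16: 5/10, bf16: 8/7). The function name is made up for this example.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent is reserved for inf/NaN
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


formats = {"fp32": (8, 23), "fp16": (5, 10), "bf16": (8, 7)}
for name, (e_bits, m_bits) in formats.items():
    # 2^-mantissa_bits is the relative spacing between neighbouring values
    print(f"{name}: max ~ {max_finite(e_bits, m_bits):.4e}, relative step ~ {2.0 ** -m_bits:.1e}")
```

bf16 reaches almost the same maximum as fp32 but with a much coarser step, while fp16 is finer but tops out at 65504, which is why fp16 needs loss scaling and bf16 usually does not.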

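3.2. Optimizer

params: defines the parameters of the optimizer (AdamW in the example below)
lr: learning rate
betas: in the example below, the beta coefficients of AdamW, which play the roles of momentum (beta1) and RMSProp-style second-moment averaging (beta2)

Parameters differ from optimizer to optimizer, so look them up as needed.

The optimizer name goes into the type option:

Adam: the standard Adam optimizer.
AdamW: a variant of Adam improved so that weight decay is applied effectively.
SGD: stochastic gradient descent.
Lamb: an optimizer tuned for large-batch training.
FusedAdam: an Adam version optimized for GPU acceleration (e.g. as used with NVIDIA Apex).
OneBitAdam: an optimizer used in distributed training to cut communication cost.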

{
   "optimizer": {
       "type": "AdamW",
       "params": {
         "lr": "auto",
         "betas": "auto",
         "eps": "auto",
         "weight_decay": "auto"
       }
   }
}

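3.3. Scheduler

The same goes for schedulers: the following are available, so look up the one you need.

WarmupLR: applies only a warmup phase that gradually raises the learning rate from its initial value.
WarmupDecayLR: warms up for a set number of steps, then gradually lowers the learning rate.
ConstantLR: keeps the learning rate constant.
CosineAnnealingLR: lowers the learning rate along a cosine curve.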

{
   "scheduler": {
         "type": "WarmupDecayLR",
         "params": {
             "total_num_steps": "auto",
             "warmup_min_lr": "auto",
             "warmup_max_lr": "auto",
             "warmup_num_steps": "auto"
         }
     }
}
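
As a rough picture of what the WarmupDecayLR parameters control, here is a small sketch. It assumes a linear warmup from warmup_min_lr to warmup_max_lr followed by a linear decay to 0 at total_num_steps; DeepSpeed's own warmup curve may differ, so read this as the shape of the schedule rather than its exact values.

```python
def warmup_decay_lr(step, total_num_steps, warmup_num_steps, warmup_min_lr, warmup_max_lr):
    """Linear warmup to warmup_max_lr, then linear decay to 0 (assumed shape)."""
    if step < warmup_num_steps:
        frac = step / warmup_num_steps
        return warmup_min_lr + frac * (warmup_max_lr - warmup_min_lr)
    remaining = max(total_num_steps - step, 0)
    return warmup_max_lr * remaining / max(total_num_steps - warmup_num_steps, 1)


# 10 training steps, the first 4 of them warmup
lrs = [warmup_decay_lr(s, 10, 4, 0.0, 1e-3) for s in range(11)]
for step, lr in enumerate(lrs):
    print(step, f"{lr:.2e}")
```

With "auto", the Trainer fills these four values from its own learning rate, warmup steps, and total step count.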

3.4. ZeRO

ZeRO-1

{
    "zero_optimization": {
        "stage": 1
    }
}

ZeRO-2

offload_optimizer
- Offloads the optimizer states (e.g. momentum and variance) to CPU memory
- pin_memory: uses pinned (page-locked) CPU memory

allgather_partitions: use an allgather collective to gather all the partitions
allgather_bucket_size: the bucket size used for allgather operations
overlap_comm: overlap communication with computation by running them at the same time
reduce_scatter: an efficient communication pattern in which each GPU ends up with only the reduced partition it needs, which suits gradient aggregation and distribution
reduce_bucket_size: the bucket size used for reduce-scatter operations
contiguous_gradients: store all gradients in one contiguous memory block
round_robin_gradients: assigns gradients to the GPUs evenly to help load balancing, so that no single GPU is overloaded

{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true
    }
}
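
Hand-edited JSON is easy to break; one missing comma between two keys is enough. A quick sanity check before launching a run is to parse the file with Python's json module. The helper name below is made up, and the two strings are trimmed-down stand-ins for a real config file.

```python
import json


def check_ds_config(text: str):
    """Return (config, None) if text is valid JSON, else (None, error message)."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        return None, f"line {err.lineno}, column {err.colno}: {err.msg}"


good = '{"zero_optimization": {"stage": 2, "contiguous_gradients": true, "round_robin_gradients": true}}'
# Same content, but the comma after "contiguous_gradients": true is missing
bad = '{"zero_optimization": {"stage": 2, "contiguous_gradients": true "round_robin_gradients": true}}'

print(check_ds_config(good))
print(check_ds_config(bad))
```

For a file on disk, pass the result of open(path).read() instead of the inline string.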

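ZeRO-3

offload_param: offloads the model parameters (weights, etc.) to CPU memory
sub_group_size: the maximum sub-group size used when ZeRO Stage 3 splits parameters into groups
stage3_prefetch_bucket_size: the bucket size used when prefetching (loading ahead of time) parameters in Stage 3
stage3_param_persistence_threshold: the size threshold that decides whether a parameter stays resident on the GPU in Stage 3 instead of being partitioned or offloaded
stage3_max_live_parameters: the maximum number of parameters kept live on the GPU at one time
stage3_max_reuse_distance: the maximum reuse distance; a parameter that will be used again within this distance is kept rather than released
stage3_gather_16bit_weights_on_model_save: when saving the model, gathers the partitioned 16-bit (half-precision) weights into one and saves them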

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

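3.5. Batch size

train_micro_batch_size_per_gpu: the number of samples each GPU processes in one forward/backward pass; "auto" lets the Trainer fill it in
train_batch_size: the total training batch size (the sum of samples processed across all GPUs per optimizer step); "auto" lets the Trainer fill it in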

{
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto"
}
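
The two values are tied together: the total batch size is the per-GPU micro batch size times the gradient accumulation steps times the number of GPUs. A tiny sketch of that arithmetic (the function name is made up) helps when setting the values by hand instead of using "auto":

```python
def effective_batch_size(micro_batch_per_gpu: int, gradient_accumulation_steps: int, num_gpus: int) -> int:
    """train_batch_size = micro batch per GPU x accumulation steps x number of GPUs."""
    return micro_batch_per_gpu * gradient_accumulation_steps * num_gpus


# 4 samples per GPU, 8 accumulation steps, 2 GPUs
print(effective_batch_size(4, 8, 2))  # 64
```

If the three numbers in the config do not multiply out to train_batch_size, DeepSpeed rejects the config, which is one reason "auto" is convenient.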

3.6. Gradients

Gradient accumulation

{
    "gradient_accumulation_steps": "auto"
}

Gradient clipping

Enables automatic clipping of gradients to prevent them from growing too large during training (gradient explosion)
Gradients that exceed a preset threshold are scaled down by a common ratio

{
    "gradient_clipping": "auto"
}
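
The "scale down by a common ratio" idea can be sketched in a few lines. This assumes clipping by the global L2 norm, the same idea as PyTorch's clip_grad_norm_, with gradients represented as a flat list of floats for simplicity. The function name is made up.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one ratio if their global L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # norm before clipping
print(clipped)  # same direction, norm reduced to max_norm
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.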

3.7. Communication data type

{
    "communication_data_type": "fp32"
}

4. Saving model weights

4.1. fp16

A model trained with ZeRO-2 saves the pytorch_model.bin weights in fp16. With ZeRO-3 the model weights are partitioned across multiple GPUs, so to save them in fp16 set "stage3_gather_16bit_weights_on_model_save": true.

{
    "zero_optimization": {
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

4.2. fp32

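4.2.1. During or right after training

Load the most recently saved checkpoint directly and restore the fp32 weights from it.
Caution: loading this way strips the DeepSpeed-specific optimizations (the "magic") from the model, so it is best used only once training is completely finished.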

from transformers.trainer_utils import get_last_checkpoint
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

# Get the last checkpoint directory
checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
# Load the fp32 weights from the partitioned (ZeRO) checkpoint into the model
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)

4.2.2. After training

DeepSpeed provides a standalone script called zero_to_fp32.py.
It merges a partitioned checkpoint (weights stored across several files) into a single file (e.g. pytorch_model.bin).

For example, if the checkpoint folder looks like this:

$ ls -l output_dir/checkpoint-1/
-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
-rw-rw-r-- 1 stas stas   12 Mar 27 13:16 latest
-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
-rw-rw-r-- 1 stas stas  623 Mar 27 20:42 scheduler.pt
-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
-rw-rw-r-- 1 stas stas  339 Mar 27 20:42 trainer_state.json
-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*

run the following command:

python zero_to_fp32.py . pytorch_model.bin

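5. Model deployment with DeepSpeed

Depending on the DeepSpeed optimizations in use (e.g. partitioned checkpoints with ZeRO Stage 3, fp16 saving), a model trained with DeepSpeed may be saved differently from an ordinary checkpoint.

5.1. The common case

In many cases the trained model can be loaded with Hugging Face's from_pretrained method and run without any elaborate deployment procedure.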

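5.2. With special settings such as ZeRO Stage 3

Before deploying (or running inference), you may need to convert the checkpoint into a single consolidated checkpoint with the checkpoint merging (conversion) script that DeepSpeed provides, such as the zero_to_fp32.py script from section 4.2.2. Note that the deepspeed launcher treats its first positional argument as a script to run, so merge_checkpoint in the command below should be read as a placeholder for a merge script rather than a built-in subcommand.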

deepspeed --num_gpus=1 merge_checkpoint \
    --input_dir /path/to/zero_stage3_checkpoint/ \
    --output_dir /path/to/merged_checkpoint/ \
    --zero-stage 3

--num_gpus=1: run the merge on a single GPU
--input_dir: the directory holding the partitioned checkpoint saved with ZeRO Stage 3
--output_dir: where to save the merged checkpoint
--zero-stage 3: states that this is a ZeRO Stage 3 checkpoint

5.3. Using DeepSpeed Inference

The DeepSpeed Inference API can optimize GPU memory usage and improve speed at inference time. A Python example follows.
ZeRO Inference shares the same configuration file as ZeRO-3; ZeRO-2 and ZeRO-1 configurations bring no benefit for inference.
Use the deepspeed.init_inference() method.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import deepspeed

# Load the model from the merged checkpoint (using fp16)
model = AutoModelForSequenceClassification.from_pretrained(
    "/path/to/merged_checkpoint",
    torch_dtype=torch.float16
)

# Initialize DeepSpeed Inference: modules are swapped for optimized kernels internally
model = deepspeed.init_inference(
    model,
    mp_size=1,  # model-parallel degree; 1 means a single GPU
    dtype=torch.half,  # run in fp16
    replace_method='auto',  # replace with optimized kernels automatically
    replace_with_kernel_inject=True  # enable kernel-injection optimization
)

tokenizer = AutoTokenizer.from_pretrained("/path/to/merged_checkpoint")

# Example inference run
inputs = tokenizer("DeepSpeed Inference example", return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}  # move the inputs to the GPU
with torch.no_grad():  # no gradients are needed for inference
    outputs = model(**inputs)
print(outputs)

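deepspeed.init_inference(): converts an existing model into DeepSpeed's inference-optimized version.
mp_size: the model-parallel (MP) degree, i.e. how many GPUs the model is split across. Set it to 1 in a single-GPU environment.
replace_method and replace_with_kernel_inject: replace certain operations in the model with the fast kernels DeepSpeed provides, to maximize inference performance.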

License: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)
