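Notes and Troubleshooting for the t5-base Model

Problem 1: Unexpected behavior when loading the model from a local copy

If you download the model yourself and load it from a local directory, the tokenizer's model_max_length can come out wrong. Running the code below reports a tokenizer maximum length of 1000000000000000019884624838656, an absurd value, while n_positions in the model config is correctly reported as 512.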


import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = r"D:\model\t5-base"  # local directory holding the downloaded model files
tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
print("Tokenizer max length:", tokenizer.model_max_length)
print(model.config)

Output:

Tokenizer max length: 1000000000000000019884624838656
...
"n_positions": 512,
...
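
The number printed above is not random: it equals int(1e30), which appears to be the placeholder transformers reports when no maximum length is found alongside the tokenizer files. A minimal sketch of guarding against it follows. The helper name resolve_max_length and the choice to fall back to n_positions are assumptions made for illustration, not part of the library.

```python
# Placeholder value seen in the output above; int(1e30) evaluates to exactly this number.
UNSET_MAX_LENGTH = int(1e30)


def resolve_max_length(tokenizer_max_length: int, n_positions: int) -> int:
    """Return a usable maximum length (hypothetical helper, not a transformers API).

    If the tokenizer reports the placeholder, fall back to the model config's
    n_positions; otherwise trust the tokenizer's own value.
    """
    if tokenizer_max_length >= UNSET_MAX_LENGTH:
        return n_positions
    return tokenizer_max_length
```

With the objects from the snippet above, this would be called as resolve_max_length(tokenizer.model_max_length, model.config.n_positions).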

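Fix: add the missing tokenizer.json file to the model directory. This file supplies the tokenizer's full configuration, so the maximum length is set correctly.

After adding the file, rerunning the code prints the expected values:

Tokenizer max length: 512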
...
"n_positions": 512,
...
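
Since the cause here was a file missing from the local copy, it can help to check the directory before loading anything. The sketch below is only an illustration: the helper missing_model_files is hypothetical, and the file list is an assumption based on the files mentioned in this post plus the usual config.json and spiece.model, so adjust it to match the model's download page.

```python
from pathlib import Path

# Assumed file list for illustration; compare against the model's download page.
EXPECTED_FILES = ("config.json", "spiece.model", "tokenizer.json")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir (hypothetical helper)."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling missing_model_files(r"D:\model\t5-base") before from_pretrained would list anything that still needs to be downloaded. If the file cannot be obtained, passing model_max_length=512 to from_pretrained should also override the reported value.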

Problem 2: Limited language support in t5-base

The t5-base model is geared toward English tasks, and its support for other languages such as Chinese is very limited.

English test (works)


import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = r"D:\model\t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_path, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(model_path).to(device)

# The "summarize: " prefix selects T5's summarization task.
input_text = "summarize: Machine learning is a field of artificial intelligence that involves training algorithms to make predictions or decisions based on data."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

# Beam search with 5 beams, capped at 50 generated tokens.
outputs = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Result: a summary is generated as expected

machine learning involves training algorithms to make predictions based on data.

Chinese test (fails)

Even after segmenting the Chinese input with jieba, most tokens are mapped to <unk>. The reason is that the SentencePiece tokenizer's vocabulary contains almost no Chinese characters.


import torch
import jieba
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = r"D:\model\t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_path, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(model_path).to(device)

chinese_text = "这是一篇关于机器学习的长文章。"
# Split the sentence into words with jieba and join them with spaces.
segmented = " ".join(jieba.cut(chinese_text))
input_text = f"summarize: {segmented}"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

outputs = model.generate(input_ids, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Result: a meaningless English sentence is generated

cnn ireport: tell us what you think in the comments section below.
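
The <unk> behavior described in this test can be checked directly by counting how many input ids equal the tokenizer's unknown-token id. A minimal sketch, assuming a hypothetical helper named unk_ratio; the commented usage relies on the tokenizer and input_text variables from the test code and is not run here.

```python
def unk_ratio(token_ids, unk_id):
    """Return the fraction of token ids that equal the unknown-token id."""
    if not token_ids:
        return 0.0
    unknown = sum(1 for token_id in token_ids if token_id == unk_id)
    return unknown / len(token_ids)


# Example use with the tokenizer loaded in the test above (not run here):
#   ids = tokenizer(input_text).input_ids
#   print(unk_ratio(ids, tokenizer.unk_token_id))
```

A ratio close to 1 for the Chinese input, next to a ratio near 0 for the English input, would confirm that the model never sees the Chinese content at all.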

Suggested alternative

For Chinese text processing, it is better to use a T5 variant optimized for Chinese, such as mengzi-t5-base.

Tags: t5-base transformers tokenizer Hugging Face model loading

Posted June 3 at 01:11

괴물 클럽