NLP Experiment - Transformers in PyTorch

edwin99
2024-02-09 02:01

Drawbacks of RNN/LSTM: every input word influences the output equally; long sequences are handled poorly (the encoder-decoder structure tends to forget the beginning of the sequence); performance on sequence tasks such as machine translation suffers

Solution: introduce the attention mechanism + the Transformer

 

Attention mechanism: each input word is given a dynamic weight α_{t,i}, strengthening the influence of key words on the output; when decoding, the output y_t is produced by combining all encoder hidden states h_i, weighted by these attention weights

Process: encoder - shortcut connections - decoder, using the intermediate states of the input sequence rather than relying on a single final hidden state

 

Attention matrix: each element α_{i,j} = how strongly the model attends to the j-th input word when generating the i-th output word (shade of color = attention weight)
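A minimal sketch of that weighted combination; all shapes and values below are made up for illustration, not taken from the original notes:

import torch
import torch.nn.functional as F

h = torch.randn(5, 16)            # hypothetical encoder hidden states h_i, one per input word
scores = torch.randn(5)           # alignment scores for the current decoder step t
alpha = F.softmax(scores, dim=0)  # attention weights α_{t,i}, non-negative and summing to 1
context = (alpha.unsqueeze(1) * h).sum(dim=0)  # weighted sum of hidden states used to produce y_t
print(alpha, context.shape)       # weights over the 5 input words, context vector of size 16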

 

Attention improves performance on NLP tasks (translation, generation), dynamically captures long-range dependencies, and replaces the RNN's fixed state passing

 

RNNs cannot be parallelized, training is inefficient, and the parameter count is large (adding attention increases it further, making them hard to scale)

 

The Transformer relies on self-attention + positional encoding

Self-attention: replaces the RNN and lets the sequence be processed in parallel

Positional encoding: injects each token's position in the sequence (a sketch follows below)

(GPT is likewise built on the Transformer)
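A rough sketch of the sinusoidal positional encoding from the original Transformer paper, for illustration only (BERT itself uses learned position embeddings; the sizes below are made up):

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe  # added to the token embeddings to give the model position information

pe = sinusoidal_positional_encoding(128, 768)  # e.g. 128 positions, 768-dim embeddings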

 

 

Transformer model: positional encoding + self-attention + parallel computation + context window

Parallel computation: the fully connected structure lets the whole sequence be processed at once; each input position is mapped to the output independently, which makes large models feasible

Context window: the model attends to every word within the current window, overcoming the RNN's long-range dependency limit
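For illustration, PyTorch's built-in encoder layer makes this parallelism concrete: a single call processes every position of the batch at once (the hyperparameters below are illustrative, not from the notes):

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

x = torch.randn(8, 128, 768)   # (batch, sequence length, embedding dim)
out = encoder(x)               # all 128 positions are processed in parallel
print(out.shape)               # torch.Size([8, 128, 768])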

 

 

Animated diagram of the architecture (figure not shown):

 

Multi-head attention: several attention heads learn different kinds of relationships between words (syntactic, semantic)
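A small example with PyTorch's nn.MultiheadAttention (dimensions are illustrative); each of the 8 heads attends to the sequence independently:

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(4, 20, 512)      # (batch, seq_len, embed_dim)
out, weights = mha(x, x, x)      # self-attention: query = key = value = x
print(out.shape, weights.shape)  # [4, 20, 512] and [4, 20, 20] (weights averaged over heads)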

 

BERT architecture: base (12-layer Transformer encoder), large (24 layers)
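The two sizes can be checked from their configs with the transformers library (this only downloads the small config files):

import transformers

for name in ['bert-base-uncased', 'bert-large-uncased']:
    cfg = transformers.BertConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, 'layers,', cfg.hidden_size, 'hidden units')
# bert-base-uncased: 12 layers, 768 hidden units
# bert-large-uncased: 24 layers, 1024 hidden units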

Pretraining task: masked language modeling (MLM), predicting words that have been masked out of a sentence

Training data: Wikipedia + books, unsupervised

After pretraining, fine-tuning adapts the model to downstream tasks (classification, question answering)
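The MLM objective is easy to see with the fill-mask pipeline; this is just a quick illustration, not part of the classification experiment below:

import transformers

fill = transformers.pipeline('fill-mask', model='bert-base-uncased')
print(fill('Paris is the [MASK] of France.')[0]['token_str'])  # top prediction should be something like 'capital'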

 

 

Transformer architecture variants: BERT, DistilBERT, BigBird, GPT-3 and others can all be fine-tuned; Hugging Face's library provides PyTorch implementations of these models

 

BERT text classification: import the HuggingFace Transformers library and load the AG News dataset; then preprocess/tokenize, build the model, fine-tune, and evaluate/predict (accuracy, classify new samples with the trained model)

Preprocessing: process text with the BERT tokenizer BertTokenizer (adds special tokens), truncating or padding to a fixed length (up to 512 tokens)

Model: load the pretrained BERT model BertForSequenceClassification

Fine-tuning: define the optimizer and learning-rate schedule, then run the training loop (compute loss, backpropagate, update parameters)

 

import torch
import torchtext
from torchnlp import *
import transformers

train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_len = len(vocab)

 

Loading the tokenizer: use BertTokenizer.from_pretrained('bert-base-uncased') to get the tokenizer that matches the pretrained BERT model

Using bert-base-uncased automatically downloads the pretrained model + its files

Loading a custom model: the local directory must contain the tokenizer configuration (vocab.txt) + the model configuration file (config.json) + the model weights file (pytorch_model.bin)

 

Code:

# To load the model from Internet repository using model name. 

# Use this if you are running from your own copy of the notebooks

bert_model = 'bert-base-uncased'

 

# To load the model from the directory on disk. Use this for Microsoft Learn module, because we have

# prepared all required files for you.

bert_model = './bert'

 

tokenizer = transformers.BertTokenizer.from_pretrained(bert_model)

 

MAX_SEQ_LEN = 128
PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)

UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)

 

The tokenizer object provides an encode() function for encoding text directly:

tokenizer.encode('PyTorch is a great framework for NLP')

Output: [101, 1052, 22123, 2953, 2818, 2003, 1037, 2307, 7705, 2005, 17953, 2361, 102]
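The 101 and 102 at the ends are the special tokens added automatically; mapping the ids back to tokens makes this visible:

toks = tokenizer.convert_ids_to_tokens(tokenizer.encode('PyTorch is a great framework for NLP'))
print(toks[0], toks[-1])  # '[CLS]' '[SEP]' - the special tokens with ids 101 and 102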

 

Create iterators to access the data during training, and define a padding collate function:

def pad_bert(b):
    # b is the list of tuples of length batch_size
    #  - first element of a tuple = label,
    #  - second = feature (text sequence)
    # build vectorized sequence
    v = [tokenizer.encode(x[1]) for x in b]
    # compute max length of a sequence in this minibatch
    l = max(map(len, v))
    return ( # tuple of two tensors - labels and features
        torch.LongTensor([t[0] for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t), (0, l - len(t)), mode='constant', value=0) for t in v])
    )

 

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, collate_fn=pad_bert, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=8, collate_fn=pad_bert)

Load the pretrained model with BertForSequenceClassification and bert-base-uncased (BERT for short), specifying 4 classes

Model = BERT base model + a classifier on top; the classifier weights are randomly initialized and must be updated by fine-tuning

Fine-tuning trains these newly initialized parameters (the classification-layer weights); the warning about the uninitialized classification layer can safely be ignored

model = transformers.BertForSequenceClassification.from_pretrained(bert_model,num_labels=4).to(device)

 

BERT fine-tuning: set a small learning rate (2e-5 ~ 5e-5, to avoid destroying the pretrained weights) -> training loop (forward pass, backward pass, optimizer step, compute accuracy) -> continuous monitoring (print the average loss + accuracy every report_freq steps)

Cap the number of steps (e.g. max_steps=1000; the code below uses iterations=500) to control the training time
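The learning-rate scheduling mentioned above is not actually used in the loop below; a sketch with transformers' linear warmup schedule (the warmup and step counts here are made up) might look like:

# Optional, not in the original loop: linear warmup + decay of the learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
scheduler = transformers.get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000)
# inside the training loop, call scheduler.step() right after optimizer.step()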

 

Gradient accumulation: accumulate gradients over several small batches to simulate training with a larger batch (sketch below)

Mixed-precision training with torch.cuda.amp speeds training up

Use a GPU; for large models, use distributed training
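A sketch combining both ideas, assuming the model, device, train_loader and the optimizer defined in the training code below (accum_steps is illustrative):

accum_steps = 4                                  # simulate a batch 4x larger than batch_size
scaler = torch.cuda.amp.GradScaler()             # handles loss scaling for mixed precision

for step, (labels, texts) in enumerate(train_loader):
    labels, texts = labels.to(device)-1, texts.to(device)
    with torch.cuda.amp.autocast():              # forward pass in mixed precision
        loss, out = model(texts, labels=labels)[:2]
    scaler.scale(loss / accum_steps).backward()  # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                   # one optimizer update per accum_steps minibatches
        scaler.update()
        optimizer.zero_grad()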

 

Code:

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

 

report_freq = 50
iterations = 500 # make this larger to train for longer time!

 

model.train()

 

i, c = 0, 0
acc_loss = 0
acc_acc = 0

 

for labels, texts in train_loader:
    labels = labels.to(device)-1 # get labels in the range 0-3
    texts = texts.to(device)
    loss, out = model(texts, labels=labels)[:2]
    labs = out.argmax(dim=1)
    acc = torch.mean((labs==labels).type(torch.float32))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    acc_loss += loss
    acc_acc += acc
    i += 1
    c += 1
    if i % report_freq == 0:
        print(f"Loss = {acc_loss.item()/c}, Accuracy = {acc_acc.item()/c}")
        c = 0
        acc_loss = 0
        acc_acc = 0
    iterations -= 1
    if not iterations:
        break

 

Output:

Loss = 1.1254194641113282, Accuracy = 0.585

Loss = 0.6194715118408203, Accuracy = 0.83

Loss = 0.46665248870849607, Accuracy = 0.8475

Loss = 0.4309701919555664, Accuracy = 0.8575

Loss = 0.35427074432373046, Accuracy = 0.8825

Loss = 0.3306886291503906, Accuracy = 0.8975

Loss = 0.30340143203735354, Accuracy = 0.8975

Loss = 0.26139299392700194, Accuracy = 0.915

Loss = 0.26708646774291994, Accuracy = 0.9225

Loss = 0.3667240524291992, Accuracy = 0.8675

 

 

 

BERT's classification accuracy is high because the pretrained model already captures language structure well, so only the classification layer needs fine-tuning

The base bert-base has about 110M parameters -> moderate compute requirements

The large bert-large has about 340M -> high compute requirements

 

To extend this: use a larger model (RoBERTa, ALBERT) for better performance, or use mixed-precision training (AMP) to reduce GPU memory usage
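Swapping in one of those models is mostly a one-line change with the Auto classes (sketch only, not run in these notes):

tok = transformers.AutoTokenizer.from_pretrained('roberta-base')
mdl = transformers.AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=4).to(device)
# then fine-tune exactly as above, just with this tokenizer/model pair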

 

 

Model evaluation:

Switch to evaluation mode: disables the randomness of Dropout + BatchNorm

Disable gradient computation: reduces memory consumption and speeds up computation

Iterate over the test set: process the data batch by batch, computing loss + accuracy

Output the results

 

Code:

model.eval()
iterations = 100
acc = 0
i = 0
with torch.no_grad():  # disable gradient computation during evaluation
    for labels, texts in test_loader:
        labels = labels.to(device)-1
        texts = texts.to(device)
        _, out = model(texts, labels=labels)[:2]
        labs = out.argmax(dim=1)
        acc += torch.mean((labs==labels).type(torch.float32))
        i += 1
        if i > iterations: break
print(f"Final accuracy: {acc.item()/i}")

 

Output:

Final accuracy: 0.9047029702970297

 

Further reading:

  1. Transformer TensorFlow
