NLP Experiment: CBoW in Practice with PyTorch

edwin99
2024-02-08 20:12

Dataset: AG News (news articles in 4 classes: World, Sports, Business, Sci/Tech)

import torch
import torchtext
import os
import collections
import builtins
import random
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Load the dataset, define the tokenizer and the vocabulary, with vocab_size=5000:

def load_dataset(ngrams = 1, min_freq = 1, vocab_size = 5000, lines_cnt = 500):
    tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
    print("Loading dataset...")
    # AG_NEWS returns the train split first, then the test split
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    classes = ['World', 'Sports', 'Business', 'Sci/Tech']
    print('Building vocab...')
    counter = collections.Counter()
    for i, (_, line) in enumerate(train_dataset):
        counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line), ngrams=ngrams))
        if i == lines_cnt:
            break
    vocab = torchtext.vocab.Vocab(collections.Counter(dict(counter.most_common(vocab_size))), min_freq=min_freq)
    return train_dataset, test_dataset, classes, vocab, tokenizer

train_dataset, test_dataset, _, vocab, tokenizer = load_dataset()

 

def encode(x, vocabulary, tokenizer = tokenizer):
    return [vocabulary[s] for s in tokenizer(x)]

 

 

 

CBoW model: predict a target word from its context words ("I like to train networks" generates training pairs (like, I), (I, like), (to, like), ...)

The model has an embedding layer + a linear layer:

Embedding layer: input word -> low-dimensional dense vector (embedding_size=30 here; 300 is more common in practice)

Linear layer: embedding vector -> prediction of the target word (output dimension = vocab_size)

Training setup: the input is a context word given directly as a word index, the output is the target word index (no one-hot encoding needed), and the loss function is CrossEntropyLoss.
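
A minimal illustration (not from the original post) of why no one-hot encoding is needed: CrossEntropyLoss takes raw logits of shape (batch, vocab_size) plus integer class indices as targets:

import torch

loss_fn = torch.nn.CrossEntropyLoss()
logits = torch.randn(4, 5000)              # unnormalized scores over a 5000-word vocabulary, one row per context word
targets = torch.tensor([12, 7, 404, 3])    # target words given directly as indices
print(loss_fn(logits, targets).item())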

 

Code:

vocab_size = len(vocab)

embedder = torch.nn.Embedding(num_embeddings = vocab_size, embedding_dim = 30)
model = torch.nn.Sequential(
    embedder,
    torch.nn.Linear(in_features = 30, out_features = vocab_size),
)

 

print(model)

Output:

Sequential(
  (0): Embedding(5002, 30)
  (1): Linear(in_features=30, out_features=5002, bias=True)
)

 

 

Preparing the training data:

We need a function that generates CBoW word pairs (input word -> output word). Parameter: window size; output: a collection of pairs (ip_word, op_word).

 

def to_cbow(sent, window_size=2):
    res = []
    for i, x in enumerate(sent):
        for j in range(max(0, i - window_size), min(i + window_size + 1, len(sent))):
            if i != j:
                res.append([sent[j], x])
    return res

 

print(to_cbow(['I','like','to','train','networks']))

print(to_cbow(encode('I like to train networks', vocab)))

Word pairs:

[['like', 'I'], ['to', 'I'], ['I', 'like'], ['to', 'like'], ['train', 'like'], ['I', 'to'], ['like', 'to'], ['train', 'to'], ['networks', 'to'], ['like', 'train'], ['to', 'train'], ['networks', 'train'], ['to', 'networks'], ['train', 'networks']]
[[232, 172], [5, 172], [172, 232], [5, 232], [0, 232], [172, 5], [232, 5], [0, 5], [1202, 5], [232, 0], [5, 0], [1202, 0], [5, 1202], [0, 1202]]

 

Steps:

Iterate over the dataset, call to_cbow to generate the word pairs, and store them into X and Y.

To save time, process only the first part of the dataset and check the results; if they look good, process the whole dataset.

X = []
Y = []
for i, x in zip(range(10000), train_dataset):
    for w1, w2 in to_cbow(encode(x[1], vocab), window_size = 5):
        X.append(w1)
        Y.append(w2)

X = torch.tensor(X)
Y = torch.tensor(Y)

 

Convert the data and create a data loader:

class SimpleIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, X, Y):
        super(SimpleIterableDataset, self).__init__()
        self.data = []
        for i in range(len(X)):
            self.data.append((Y[i], X[i]))
        random.shuffle(self.data)

    def __iter__(self):
        return iter(self.data)

ds = SimpleIterableDataset(X, Y)
dl = torch.utils.data.DataLoader(ds, batch_size = 256)
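
A quick sanity check (not in the original post) to confirm the loader yields (labels, features) batches of the expected shape:

labels, features = next(iter(dl))
print(labels.shape, features.shape)   # expected: torch.Size([256]) torch.Size([256])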

 

 

Training and optimization:

Optimizer: SGD with a fairly high learning rate of 0.1 (it can be swapped for Adam or another optimizer for a comparison experiment; a sketch of such a run appears after the training output below). Number of epochs: start with 10, and keep training to drive the loss down.

Procedure: for each batch, input -> forward pass -> compute the cross-entropy loss -> backward pass -> update the parameters (adjust the number of epochs and the optimizer settings according to the loss). Print the average loss each epoch and watch for convergence; if the loss does not converge, add epochs or tune the learning rate, and compare different optimizers to speed up convergence.

def train_epoch(net, dataloader, lr = 0.01, optimizer = None, loss_fn = torch.nn.CrossEntropyLoss(), epochs = 1, report_freq = 1):
    optimizer = optimizer or torch.optim.Adam(net.parameters(), lr = lr)
    loss_fn = loss_fn.to(device)
    net = net.to(device)   # make sure the model is on the same device as the batches
    net.train()

    for i in range(epochs):
        total_loss, j = 0, 0
        for labels, features in dataloader:
            optimizer.zero_grad()
            features, labels = features.to(device), labels.to(device)
            out = net(features)
            loss = loss_fn(out, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()   # accumulate a plain float so the computation graph is not kept
            j += 1
        if i % report_freq == 0:
            print(f"Epoch: {i+1}: loss={total_loss/j}")

    return total_loss / j

 

train_epoch(net = model, dataloader = dl, optimizer = torch.optim.SGD(model.parameters(), lr = 0.1), loss_fn = torch.nn.CrossEntropyLoss(), epochs = 10)

 

Output:

Epoch: 1: loss=5.664632366860172

Epoch: 2: loss=5.632101973960962

Epoch: 3: loss=5.610399051405015

Epoch: 4: loss=5.594621561080262

Epoch: 5: loss=5.582538017415446

Epoch: 6: loss=5.572900234519603

Epoch: 7: loss=5.564951676341915

Epoch: 8: loss=5.558288112064614

Epoch: 9: loss=5.552576955031129

Epoch: 10: loss=5.547634165194347

 

Return value: 5.547634165194347
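
As suggested above, a comparison run with Adam only requires passing a different optimizer to train_epoch; this is a sketch of the experiment, and lr = 0.001 is an assumed starting value rather than one from the original run:

train_epoch(net = model, dataloader = dl,
            optimizer = torch.optim.Adam(model.parameters(), lr = 0.001),
            loss_fn = torch.nn.CrossEntropyLoss(), epochs = 10)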

 

 

 

Using the trained embeddings as Word2Vec: extract the embedding vector for every word in the vocabulary:

# look up every vocabulary word's embedding (index tensors created on the same device as the model)
vectors = torch.stack([embedder(torch.tensor(vocab[s], device = device)) for s in vocab.itos], 0)

Encode "paris" as a vector:

paris_vec = embedder(torch.tensor(vocab['paris'], device = device))

print(paris_vec)

Output:

tensor([-0.0915, 2.1224, -0.0281, -0.6819, 1.1219, 0.6458, -1.3704, -1.3314,

-1.1437, 0.4496, 0.2301, -0.3515, -0.8485, 1.0481, 0.4386, -0.8949,

0.5644, 1.0939, -2.5096, 3.2949, -0.2601, -0.8640, 0.1421, -0.0804,

-0.5083, -1.0560, 0.9753, -0.5949, -1.6046, 0.5774],

grad_fn=<EmbeddingBackward>)

 

 

 

Finding synonyms:

Take the target word's vector v -> compute distances (the Euclidean distance to every word wi in the vocabulary) -> sort the indices (argsort the distances to get the nearest-neighbor indices) -> take the words corresponding to the first n indices.

Euclidean distance: |wi - v|

 

Code:

def close_words(x, n = 5):
    vec = embedder(torch.tensor(vocab[x], device = device))
    top5 = np.linalg.norm(vectors.detach().cpu().numpy() - vec.detach().cpu().numpy(), axis = 1).argsort()[:n]
    return [vocab.itos[x] for x in top5]

 

close_words('microsoft')

Output: ['microsoft', 'quoted', 'lp', 'rate', 'top']

 

close_words('basketball')

Output: ['basketball', 'lot', 'sinai', 'states', 'healthdaynews']

 

close_words('funds')

Output: ['funds', 'travel', 'sydney', 'japan', 'business']
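
A common variation (not part of the original post) is to rank neighbors by cosine similarity instead of Euclidean distance, which ignores differences in vector length; a minimal sketch reusing vectors, embedder and vocab from above:

def close_words_cosine(x, n = 5):
    # normalize every vocabulary vector and the query vector, then rank by dot product (cosine similarity)
    mat = vectors.detach().cpu().numpy()
    mat = mat / np.linalg.norm(mat, axis = 1, keepdims = True)
    vec = embedder(torch.tensor(vocab[x], device = device)).detach().cpu().numpy()
    vec = vec / np.linalg.norm(vec)
    top = (mat @ vec).argsort()[::-1][:n]   # most similar first
    return [vocab.itos[i] for i in top]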

 

Further reading:

  1. CBoW in practice with TensorFlow

  2. Training a Skip-Gram model

 
