NLP Experiment - Named Entity Recognition (NER): A TensorFlow Lab Guide


NER = Named Entity Recognition

 

Dataset: Kaggle's Annotated Corpus for Named Entity Recognition

import pandas as pd
import numpy as np
from tensorflow import keras

 

Load the dataset with Pandas:

df = pd.read_csv('ner_dataset.csv', encoding='unicode-escape')
df.head()

Output: the first rows of the dataframe (columns: Sentence #, Word, POS, Tag).

Get the unique tags and build lookup dictionaries that map each tag to a class id. The tags follow the BIO scheme: B- marks the first token of an entity, I- a continuation, and O a non-entity token; the suffixes (geo, gpe, per, org, tim, art, eve, nat) are the entity types.

tags = df.Tag.unique()
tags

 

Output:

array(['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim',
       'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve',
       'I-eve', 'I-nat'], dtype=object)

 

id2tag = dict(enumerate(tags))
tag2id = {v: k for k, v in id2tag.items()}

 

id2tag[0]

Output: 'O'
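As a quick check (illustrative, not from the original notebook), the two dictionaries should invert each other:

# Round-trip: every tag id maps back to itself
assert all(tag2id[id2tag[i]] == i for i in range(len(tags)))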

 

Next, build the vocabulary. Here we keep every word with no cap on vocabulary size (in a real project you would use a Keras vectorizer and limit the vocabulary; a sketch follows the code below):

vocab = set(df['Word'].apply(lambda x: x.lower()))
id2word = {i + 1: v for i, v in enumerate(vocab)}  # ids start at 1; 0 is reserved
id2word[0] = '<UNK>'  # id 0 stands for unknown words
vocab.add('<UNK>')
word2id = {v: k for k, v in id2word.items()}
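For the capped-vocabulary approach mentioned above, a minimal sketch with Keras' TextVectorization layer could look like this (assuming TensorFlow 2.6+; max_tokens=20000 is an arbitrary illustration, not a value from the experiment):

from tensorflow.keras.layers import TextVectorization

# Keep only the 20,000 most frequent words; rarer words map to the OOV token
vectorizer = TextVectorization(max_tokens=20000, output_mode='int')
vectorizer.adapt(df['Word'].astype(str).values)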

 

 

Create a dataset of sentences to train on: iterate over the raw dataframe and collect each sentence separately into X (lists of words) and Y (lists of tags). An equivalent pandas groupby version is sketched after the loop.

X, Y = [], []
s, t = [], []
for i, row in df[['Sentence #', 'Word', 'Tag']].iterrows():
    if pd.isna(row['Sentence #']):
        s.append(row['Word'])
        t.append(row['Tag'])
    else:
        if len(s) > 0:
            X.append(s)
            Y.append(t)
        s, t = [row['Word']], [row['Tag']]
X.append(s)
Y.append(t)
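Equivalently (a sketch assuming the same dataframe layout), pandas can do the grouping after forward-filling the sentence ids:

df2 = df.copy()
df2['Sentence #'] = df2['Sentence #'].ffill()  # propagate each sentence id downward
grouped = df2.groupby('Sentence #', sort=False)
X = grouped['Word'].apply(list).tolist()  # one word list per sentence
Y = grouped['Tag'].apply(list).tolist()   # one tag list per sentence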

Vectorize all the words and tags:

def vectorize(seq):
    return [word2id[x.lower()] for x in seq]

def tagify(seq):
    return [tag2id[x] for x in seq]

 

Xv = list(map(vectorize, X))
Yv = list(map(tagify, Y))

 

Xv[0], Yv[0]

Output:

([10386, 23515, 4134, 29620, 7954, 13583, 21193, 12222, 27322, 18258, 5815,
  15880, 5355, 25242, 31327, 18258, 27067, 23515, 26444, 14412, 358, 26551,
  5011, 30558],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0])

 

Pad every sentence with zeros up to the maximum length (in practice you would use a smarter strategy; this experiment keeps it simple):

X_data = keras.preprocessing.sequence.pad_sequences(Xv, padding='post')
Y_data = keras.preprocessing.sequence.pad_sequences(Yv, padding='post')
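pad_sequences can also cap the length rather than pad everything out to the longest sentence (illustrative; maxlen=50 is an arbitrary choice):

# Truncate sentences longer than 50 tokens and pad shorter ones, both at the end
X_capped = keras.preprocessing.sequence.pad_sequences(Xv, padding='post', truncating='post', maxlen=50)
Y_capped = keras.preprocessing.sequence.pad_sequences(Yv, padding='post', truncating='post', maxlen=50)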

 

 

Define the token classification network: a two-layer bidirectional LSTM. The input is a sequence of word embeddings; the per-time-step LSTM outputs are fed through one shared dense classifier via a TimeDistributed layer.

return_sequences=True: the LSTM returns an output at every time step instead of only the last one.
TimeDistributed: applies the same dense layer independently to each time step's features.

 

Code:

maxlen = X_data.shape[1]
vocab_size = len(vocab)
num_tags = len(tags)

model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size, 300, input_length=maxlen),
    keras.layers.Bidirectional(keras.layers.LSTM(units=100, activation='tanh', return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(units=100, activation='tanh', return_sequences=True)),
    keras.layers.TimeDistributed(keras.layers.Dense(num_tags, activation='softmax'))
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

Output:

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                          Output Shape        Param #
=================================================================
 embedding_4 (Embedding)               (None, 104, 300)    9545400
 bidirectional_6 (Bidirectional)       (None, 104, 200)    320800
 bidirectional_7 (Bidirectional)       (None, 104, 200)    240800
 time_distributed_3 (TimeDistributed)  (None, 104, 17)     3417
=================================================================
Total params: 10,110,417
Trainable params: 10,110,417
Non-trainable params: 0
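As a sanity check on these numbers: the embedding layer contributes vocab_size × 300 = 31,818 × 300 = 9,545,400 parameters; each direction of the first LSTM has 4 × (300 + 100 + 1) × 100 = 160,400, doubled to 320,800 by the bidirectional wrapper; the second layer sees 200-dimensional inputs, giving 2 × 4 × (200 + 100 + 1) × 100 = 240,800; and the dense head has (200 + 1) × 17 = 3,417. Together: 10,110,417.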

 

Fixed sequence length: specifying maxlen gives a uniform output length, but the model cannot handle variable-length input and redundant padding may be introduced.
Handling variable-length sequences: use a Masking layer or dynamic per-batch padding, and adjust the network to support dynamic input, as sketched below.
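A minimal sketch of the masking variant (with one assumption beyond the original setup: id 0 must be reserved exclusively for padding, whereas above it doubles as '<UNK>'):

model_masked = keras.models.Sequential([
    # mask_zero=True: time steps whose input id is 0 are skipped by the LSTMs
    # and excluded from the loss
    keras.layers.Embedding(vocab_size, 300, mask_zero=True),
    keras.layers.Bidirectional(keras.layers.LSTM(100, return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(100, return_sequences=True)),
    keras.layers.TimeDistributed(keras.layers.Dense(num_tags, activation='softmax')),
])
model_masked.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])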

 

model.fit(X_data, Y_data)

Output: <keras.callbacks.History at 0x16f0bb2a310>
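Note that fit defaults to a single epoch. In practice (illustrative values, not the original run) you would train longer and hold out validation data:

model.fit(X_data, Y_data, epochs=5, batch_size=32, validation_split=0.1)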

 

 

Results: try the trained model on a sample sentence.

sent = 'John Smith went to Paris to attend a conference in cancer development institute'
words = sent.lower().split()
v = keras.preprocessing.sequence.pad_sequences([[word2id[x] for x in words]], padding='post', maxlen=maxlen)
res = model(v)[0]

r = np.argmax(res.numpy(), axis=1)
for i, w in zip(r, words):
    print(f"{w} -> {id2tag[i]}")

Output:

john -> B-per
smith -> I-per
went -> O
to -> O
paris -> B-geo
to -> O
attend -> O
a -> O
conference -> O
in -> O
cancer -> B-org
development -> I-org
institute -> I-org
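One caveat (a hedged tweak, not in the original code): word2id[x] raises a KeyError for any word outside the training vocabulary, so for arbitrary text you would fall back to the '<UNK>' id:

unk = word2id['<UNK>']  # id 0, reserved above for unknown words
ids = [word2id.get(x, unk) for x in sent.lower().split()]
v = keras.preprocessing.sequence.pad_sequences([ids], padding='post', maxlen=maxlen)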

 

 

 

Further study:

  1. An NER model experiment on medical terminology
