RNN Models and NLP Applications 2: Text Processing and Word Embedding

0. Goal: Sentiment Analysis (classify IMDB movie reviews as positive or negative)

1. Text to Sequence

1.1 Tokenization

Doing tokenization in real-world work is far from easy!
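A toy example of why (my own illustration, not from the lecture): a plain whitespace split leaves punctuation glued to words, and even a hand-written regex quickly runs into contractions, abbreviations, emoji and so on, which is why real pipelines use dedicated tokenizers (NLTK, spaCy, subword tokenizers, ...).

import re

text = "I didn't like it... but the ending, honestly, isn't bad!"

# Naive whitespace split: punctuation stays attached to the words.
print(text.split())
# ['I', "didn't", 'like', 'it...', 'but', 'the', 'ending,', 'honestly,', "isn't", 'bad!']

# A slightly better regex split; still far from what a real tokenizer does
# (no handling of contractions, abbreviations, emoji, multi-word names, ...).
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))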

1.2 Build Dictionary

1.3 One-hot Encoding

1.4 Align Sequence
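To make steps 1.1 through 1.4 concrete, here is a minimal sketch of the whole text-to-sequence pipeline on a toy corpus (my own illustration, not the lecture's code). It uses a naive whitespace tokenizer and Keras's pad_sequences for alignment:

from tensorflow.keras.preprocessing import sequence

corpus = ["this movie is great", "this movie is not great at all"]

# 1.1 Tokenization: naive whitespace split
tokenized = [s.lower().split() for s in corpus]

# 1.2 Build dictionary: word -> positive integer (0 is reserved for padding)
vocab = sorted({w for sent in tokenized for w in sent})
word_to_index = {w: i + 1 for i, w in enumerate(vocab)}

# 1.3 Encoding: each integer conceptually stands for a one-hot vector of length len(vocab) + 1
sequences = [[word_to_index[w] for w in sent] for sent in tokenized]

# 1.4 Align sequences: pad or truncate every sequence to the same length
aligned = sequence.pad_sequences(sequences, maxlen=5)
print(word_to_index)
print(aligned)  # shape (2, 5); shorter sequences are left-padded with zeros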

2. Word Embedding: Word to Vector

Sentence → words → positive integers → high-dimensional (one-hot) vectors → low-dimensional vectors
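In other words, each positive integer stands for a one-hot vector of dimension vocab_size, and the embedding layer multiplies it by a trainable parameter matrix of shape (vocab_size, embedding_dim), which amounts to looking up one row. A small sketch of this equivalence (my own illustration, with made-up sizes):

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 10, 4

# The embedding layer holds a trainable matrix P of shape (vocab_size, embedding_dim).
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
word_ids = tf.constant([[3, 7, 1]])  # one sentence of three word indices
vectors = embedding(word_ids)        # shape (1, 3, embedding_dim)

# Equivalent view: the one-hot vector for word 3 times P selects row 3 of P.
P = embedding.get_weights()[0]
one_hot = np.eye(vocab_size)[3]
print(np.allclose(one_hot @ P, vectors[0, 0].numpy()))  # True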

3. Logistic Regression for Binary Classification

  • Keras ships a preprocessed version of the IMDB dataset. Each sample is a vector of integers, where each integer stands for one word and its value is that word's position in the vocabulary. Naturally, the vectors have different lengths.

  • The training set contains 25,000 reviews, 12,500 positive and 12,500 negative, and the test set is organized the same way. Each review only becomes a fixed-length vector (for example 80 words, or word_num = 20 in the code below) after padding and truncation.

  • For convenience, we fix the length of every review to maxlen.

For details on the parameters and return values of imdb.load_data(), see the following link:

https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py

Personally, I find the data preprocessing to be the most time-consuming part.
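As a quick sanity check before the training script below, one review can be decoded back into words. This is my own snippet rather than part of the original post; it uses imdb.get_word_index() and relies on the default index_from=3 offset of load_data(), which reserves 0, 1 and 2 for the padding, start-of-sequence and unknown tokens:

import tensorflow as tf

imdb = tf.keras.datasets.imdb
(x_train, y_train), _ = imdb.load_data(num_words=10000)

# load_data() shifts every real word index by 3 (default index_from=3),
# so we shift get_word_index() the same way before reversing it.
word_index = imdb.get_word_index()
index_to_word = {idx + 3: word for word, idx in word_index.items()}
index_to_word.update({0: '<pad>', 1: '<start>', 2: '<unk>'})

print(' '.join(index_to_word.get(i, '<unk>') for i in x_train[0][:20]))
print('label:', y_train[0])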

import numpy as np
from matplotlib import pyplot as plt

import tensorflow as tf
from tensorflow.keras.preprocessing import sequence

np.set_printoptions(threshold=np.inf)

# Sentiment classification
epochs = 50
batchsz = 32  # batch size

vocabulary = 10000  # vocabulary size
embedding_dim = 8  # dimensionality of each word vector
word_num = 20  # maximum review length; longer reviews are truncated, shorter ones are padded

# Load the IMDB dataset; reviews are integer-encoded, one integer per word
imdb = tf.keras.datasets.imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocabulary)
# Fix every review to the same length (maxlen = word_num)
x_train = sequence.pad_sequences(x_train, maxlen=word_num)
x_test = sequence.pad_sequences(x_test, maxlen=word_num)
print(x_train.shape)

# Build the network
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(vocabulary, embedding_dim, input_length=word_num))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model.summary()

model.compile(
    optimizer=tf.optimizers.RMSprop(learning_rate=0.0001),
    loss='binary_crossentropy',
    metrics=['acc'],
)

history = model.fit(
    x_train, y_train, batch_size=batchsz, epochs=epochs, validation_split=0.2
)

# Plot the accuracy and loss curves for the training and validation sets
acc = history.history['acc']
val_acc = history.history['val_acc']

loss = history.history['loss']
val_loss = history.history['val_loss']

plt.subplot(1, 2, 1)
plt.plot(acc, label='Training Accuracy')
plt.plot(val_acc, label='Validation Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.title('Training and Validation Accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.title('Training and Validation Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()

# Loss and accuracy on the test set
loss_and_acc = model.evaluate(x_test, y_test)
print('on test dataset, loss = ' + str(loss_and_acc[0]))
print('on test dataset, acc = ' + str(loss_and_acc[1]))

4. Summary

5. Word Prediction with an n-gram Model

Below is the PyTorch code:

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# Build (context, target) pairs: the two preceding words predict the third.
trigram = [((test_sentence[i], test_sentence[i + 1]), test_sentence[i + 2])
           for i in range(len(test_sentence) - 2)]
vocb = set(test_sentence)
word_to_idx = {word: i for i, word in enumerate(vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}

print(word_to_idx)


# Ready? Let's build a neural network using pytorch!
import torch
from torch import nn


# n_gram model: embed the context words, concatenate their vectors,
# and classify over the whole vocabulary.
class n_gram(nn.Module):
    def __init__(self, vocab_size, context_size=CONTEXT_SIZE, n_dim=EMBEDDING_DIM):
        super(n_gram, self).__init__()

        self.embed = nn.Embedding(vocab_size, n_dim)
        self.classify = nn.Sequential(
            nn.Linear(context_size * n_dim, 64),
            nn.ReLU(),
            nn.Linear(64, vocab_size)
        )

    def forward(self, x):
        voc_embed = self.embed(x)          # (context_size, n_dim)
        voc_embed = voc_embed.view(1, -1)  # (1, context_size * n_dim)
        out = self.classify(voc_embed)     # (1, vocab_size) logits
        return out


net = n_gram(len(word_to_idx))

# Inspect the first training example
word = trigram[0][0]
label = trigram[0][1]
print(word)
print(label)

word = torch.LongTensor([word_to_idx[w] for w in word])
print(word)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, weight_decay=1e-5)

for e in range(100):
    train_loss = 0
    for word, label in trigram:
        word = torch.LongTensor([word_to_idx[w] for w in word])
        label = torch.LongTensor([word_to_idx[label]])
        # forward propagation
        out = net(word)
        loss = criterion(out, label)
        train_loss += loss.item()
        # backward propagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (e + 1) % 10 == 0:
        print('epoch: {}, Loss: {:.6f}'.format(e + 1, train_loss / len(trigram)))

# Evaluate on the training trigrams themselves
net = net.eval()
count = 0
with torch.no_grad():
    for context, label in trigram:
        word = torch.LongTensor([word_to_idx[w] for w in context])
        out = net(word)
        pred_label_idx = out.argmax(1)
        predict_word = idx_to_word[pred_label_idx.item()]
        print(f'real word is {label}, predicted word is {predict_word}.')
        if label == predict_word:
            count += 1
        # if you want to see which words the network predicted incorrectly,
        # uncomment the `else` branch below.
        # else:
        #     input()

print(f'Accuracy is {count / len(trigram)}')