Text Classification with Pretrained Word Embeddings

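    In this example we use Paddle 2.1 to train and test a classifier on the IMDB dataset (a binary sentiment classification dataset of movie reviews). The dataset is loaded directly through Paddle 2.1's built-in API, and the task relies on pretrained word vectors (GloVe embeddings).

    This tutorial was written for Paddle 2.1. If your environment uses a different version, install Paddle 2.1 first by following the instructions on the official website.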

    import paddle
    from paddle.io import Dataset
    import numpy as np
    import paddle.text as text
    import random
    print(paddle.__version__)

    2.1.0

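    Because Paddle 2.1 ships a preprocessed version of the IMDB dataset, the required data can be loaded directly, with no preprocessing of your own. The built-in datasets in Paddle 2.1 currently include Conll05st, Imdb, Imikolov, Movielens, UCIHousing, WMT14 and WMT16, and interfaces for more commonly used datasets are planned.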

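    The code below loads the IMDB training and test sets. The cutoff argument sets the frequency threshold used when building the dictionary: words whose frequency in the dataset falls below cutoff are ignored. The mode argument selects which split is returned (test: test set, train: training set).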

    imdb_train = text.Imdb(mode='train', cutoff=150)
    imdb_test = text.Imdb(mode='test', cutoff=150)

    Cache file /home/aistudio/.cache/paddle/dataset/imdb/imdb%2FaclImdb_v1.tar.gz not found, downloading https://dataset.bj.bcebos.com/imdb%2FaclImdb_v1.tar.gz
    Begin to download
    Download finished
    print("Training samples: %d; test samples: %d" % (len(imdb_train), len(imdb_test)))
    print(f"Labels: {set(imdb_train.labels)}")
    print(f"Dictionary entries: {list(imdb_train.word_idx.items())[:10]}")
    print(f"One sample: {imdb_train.docs[0]}")
    print(f"Shortest sample length: {min([len(x) for x in imdb_train.docs])}; longest sample length: {max([len(x) for x in imdb_train.docs])}")

    Training samples: 25000; test samples: 25000
    Labels: {0, 1}
    Dictionary entries: [(b'the', 0), (b'and', 1), (b'a', 2), (b'of', 3), (b'to', 4), (b'is', 5), (b'in', 6), (b'it', 7), (b'i', 8), (b'this', 9)]
    One sample: [5146, 43, 71, 6, 1092, 14, 0, 878, 130, 151, 5146, 18, 281, 747, 0, 5146, 3, 5146, 2165, 37, 5146, 46, 5, 71, 4089, 377, 162, 46, 5, 32, 1287, 300, 35, 203, 2136, 565, 14, 2, 253, 26, 146, 61, 372, 1, 615, 5146, 5, 30, 0, 50, 3290, 6, 2148, 14, 0, 5146, 11, 17, 451, 24, 4, 127, 10, 0, 878, 130, 43, 2, 50, 5146, 751, 5146, 5, 2, 221, 3727, 6, 9, 1167, 373, 9, 5, 5146, 7, 5, 1343, 13, 2, 5146, 1, 250, 7, 98, 4270, 56, 2316, 0, 928, 11, 11, 9, 16, 5, 5146, 5146, 6, 50, 69, 27, 280, 27, 108, 1045, 0, 2633, 4177, 3180, 17, 1675, 1, 2571]
    Shortest sample length: 10; longest sample length: 2469
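
    To make the role of cutoff more concrete, here is a minimal sketch of how a frequency-thresholded dictionary could be built on a toy corpus. It is not Paddle's implementation: the helper names build_word_idx and encode are hypothetical, and the boundary handling (keeping only words seen strictly more than cutoff times) and the alphabetical tie-break are assumptions. The id 5146 that recurs in the sample above is consistent with this kind of shared out-of-vocabulary id.

```python
from collections import Counter


def build_word_idx(docs, cutoff):
    """Build a word -> id dictionary, keeping words seen more than `cutoff` times."""
    freq = Counter(word for doc in docs for word in doc)
    kept = [(w, c) for w, c in freq.items() if c > cutoff]
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    word_idx = {w: i for i, (w, _) in enumerate(kept)}
    word_idx["<unk>"] = len(word_idx)  # one shared id for every dropped word
    return word_idx


def encode(doc, word_idx):
    """Replace each word by its id, falling back to the '<unk>' id."""
    unk = word_idx["<unk>"]
    return [word_idx.get(w, unk) for w in doc]


toy_docs = [
    "the film is good".split(),
    "the film is bad".split(),
    "the plot is thin".split(),
]
toy_word_idx = build_word_idx(toy_docs, cutoff=1)
toy_encoded = encode(toy_docs[0], toy_word_idx)
print(toy_word_idx)  # {'is': 0, 'the': 1, 'film': 2, '<unk>': 3}
print(toy_encoded)   # [1, 2, 0, 3]
```

    Raising cutoff shrinks the dictionary and pushes more tokens onto the shared '<unk>' id, which trades vocabulary coverage for a smaller embedding table.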

    For the training set, shuffle the order of the samples so that the classifier trains better.
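
    The training code later in this tutorial refers to train_x, train_y, test_x and test_y, but the step that builds them is not shown here. The following is a minimal sketch of one way to do it, assuming those names hold the shuffled training documents and labels and the unshuffled test documents and labels. The helper shuffle_together is hypothetical, and toy lists stand in for imdb_train.docs and imdb_train.labels so that the snippet runs on its own.

```python
import random


def shuffle_together(docs, labels, seed=None):
    """Return docs and labels reordered with the same random permutation."""
    index = list(range(len(docs)))
    random.Random(seed).shuffle(index)
    return [docs[i] for i in index], [labels[i] for i in index]


# Toy stand-ins for imdb_train.docs and imdb_train.labels.
toy_docs = [[1, 2], [3], [4, 5, 6], [7]]
toy_labels = [0, 1, 0, 1]

train_x, train_y = shuffle_together(toy_docs, toy_labels, seed=0)
print(train_x, train_y)

# With the real dataset the calls would look like this (assumed, kept as comments):
# train_x, train_y = shuffle_together(imdb_train.docs, imdb_train.labels)
# test_x, test_y = imdb_test.docs, imdb_test.labels
```

    Shuffling an index list rather than the documents themselves keeps each review paired with its label.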

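    The sample lengths above show that samples differ in length. During training, however, every sample must have the same length so that samples can be stacked into matrices for batched computation. All samples therefore need to be padded or truncated to a common length first.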

    def vectorizer(input, label=None, length=2000):
        # Pad each sample with zeros, then truncate it, so every sample has exactly `length` ids
        if label is not None:
            for x, y in zip(input, label):
                yield np.array((x + [0]*length)[:length]).astype('int64'), np.array([y]).astype('int64')
        else:
            for x in input:
                yield np.array((x + [0]*length)[:length]).astype('int64')
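
    The pad-then-slice expression used in vectorizer can be tried on its own with plain lists. This sketch uses a hypothetical helper pad_or_truncate with a pad_id parameter that the tutorial's function does not have. It only illustrates the idiom on short inputs.

```python
def pad_or_truncate(ids, length, pad_id=0):
    """Right-pad with pad_id, then cut to exactly `length` items."""
    return (ids + [pad_id] * length)[:length]


short_ids = pad_or_truncate([7, 8, 9], length=5)
long_ids = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], length=5)
print(short_ids)  # [7, 8, 9, 0, 0]
print(long_ids)   # [1, 2, 3, 4, 5]
```

    Note that in the dictionary printed earlier, id 0 is also the id of b'the', so zero padding cannot be told apart from that word once a sample has been vectorized.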

    3.2 Loading the pretrained vectors

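    The file used below is small enough to be loaded into memory in full. Pretrained vector files that are too large to load at once can instead be matched against the vocabulary by loading them in batches and processing the batches in parallel.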

    # !wget http://nlp.stanford.edu/data/glove.6B.zip
    # !unzip -q glove.6B.zip
    glove_path = "./glove.6B.100d.txt"
    embeddings = {}

    Take a look at one line of the GloVe pretrained vector file above:

    # Decode the file as UTF-8
    with open(glove_path, encoding='utf-8') as gf:
        line = gf.readline()
        print("One line of GloVe data: '%s'" % line)

    One line of GloVe data: 'the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062
    '
    with open(glove_path, encoding='utf-8') as gf:
        for glove in gf:
            # The first field is the word; the rest of the line is its vector
            word, embedding = glove.split(maxsplit=1)
            embedding = [float(s) for s in embedding.split(' ')]
            embeddings[word] = embedding
    print("Total number of pretrained word vectors: %d" % len(embeddings))
    print(f"Vector for the word 'the': {embeddings['the']}")

    Total number of pretrained word vectors: 400000
    Vector for the word 'the': [-0.038194, -0.24487, 0.72812, -0.39961, 0.083172, 0.043953, -0.39141, 0.3344, -0.57545, 0.087459, 0.28787, -0.06731, 0.30906, -0.26384, -0.13231, -0.20757, 0.33395, -0.33848, -0.31743, -0.48336, 0.1464, -0.37304, 0.34577, 0.052041, 0.44946, -0.46971, 0.02628, -0.54155, -0.15518, -0.14107, -0.039722, 0.28277, 0.14393, 0.23464, -0.31021, 0.086173, 0.20397, 0.52624, 0.17164, -0.082378, -0.71787, -0.41531, 0.20335, -0.12763, 0.41367, 0.55187, 0.57908, -0.33477, -0.36559, -0.54857, -0.062892, 0.26584, 0.30205, 0.99775, -0.80481, -3.0243, 0.01254, -0.36942, 2.2167, 0.72201, -0.24978, 0.92136, 0.034514, 0.46745, 1.1079, -0.19358, -0.074575, 0.23353, -0.052062, -0.22044, 0.057162, -0.15806, -0.30798, -0.41625, 0.37972, 0.15006, -0.53212, -0.2055, -1.2526, 0.071624, 0.70565, 0.49744, -0.42063, 0.26148, -1.538, -0.30223, -0.073438, -0.28312, 0.37104, -0.25217, 0.016215, -0.017099, -0.38984, 0.87424, -0.72569, -0.51058, -0.52028, -0.1459, 0.8278, 0.27062]
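
    For vector files too large to hold in memory, this section mentions batched, parallel loading. The sketch below shows a simpler, related option rather than that approach: stream the file one line at a time and keep only the vectors for words that are actually needed. The helper load_needed_vectors is hypothetical, and an in-memory io.StringIO with made-up 3-dimensional vectors stands in for a real GloVe file.

```python
import io


def load_needed_vectors(lines, needed_words):
    """Stream over GloVe-style lines and keep only vectors for needed_words."""
    found = {}
    for line in lines:
        word, rest = line.split(maxsplit=1)
        if word in needed_words:
            found[word] = [float(s) for s in rest.split()]
    return found


# A tiny in-memory stand-in for a GloVe text file (made-up values).
fake_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "film 0.4 0.5 0.6\n"
    "zebra 0.7 0.8 0.9\n"
)
vectors = load_needed_vectors(fake_file, needed_words={"the", "film"})
print(sorted(vectors))  # ['film', 'the']
```

    Because only one line is held at a time, peak memory depends on the size of the dataset vocabulary rather than on the size of the vector file.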

    Next, extract the dataset's vocabulary. Note that word ids are assigned in order of word frequency: the more frequent a word, the smaller its id.

    word_idx = imdb_train.word_idx
    vocab = [w for w in word_idx.keys()]
    print(f"First 5 words of the vocabulary: {vocab[:5]}")
    print(f"Last 5 words of the vocabulary: {vocab[-5:]}")

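    Looking at the last 5 words of the vocabulary, the final entry is '<unk>', a symbol that stands for every word outside the vocabulary. Also, an entry of the form b'the' is the bytes-encoded form of the string 'the'; use b'the'.decode() to convert it when needed ('<unk>' is not bytes-encoded, so take care to distinguish the two).

    Next, match each word in the vocabulary with its word vector. The pretrained vectors may not cover every word in the dataset's vocabulary; a word that is missing gets the zero vector.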

    # Dimension of the word vectors; it must match the pretrained vectors
    dim = 100
    vocab_embeddings = np.zeros((len(vocab), dim))
    for ind, word in enumerate(vocab):
        # Keys are bytes except for '<unk>', which is already a str
        if isinstance(word, bytes):
            word = word.decode()
        embedding = embeddings.get(word, np.zeros((dim,)))
        vocab_embeddings[ind, :] = embedding
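
    Since uncovered words silently become zero vectors, it can be useful to check how much of the vocabulary the pretrained vectors cover. The following is a minimal sketch with plain lists and toy data. The helper build_matrix and all values are hypothetical, and it mirrors the matching loop above rather than replacing it.

```python
def build_matrix(vocab, embeddings, dim):
    """Return (matrix, missing): one row per vocab entry, zeros when no vector exists."""
    matrix, missing = [], []
    for word in vocab:
        # Vocabulary keys may be bytes (b'the') or str ('<unk>').
        key = word.decode() if isinstance(word, bytes) else word
        vector = embeddings.get(key)
        if vector is None:
            missing.append(key)
            vector = [0.0] * dim
        matrix.append(list(vector))
    return matrix, missing


toy_vocab = [b"the", b"film", b"zzyzx", "<unk>"]
toy_embeddings = {"the": [0.1, 0.2], "film": [0.3, 0.4]}
matrix, missing = build_matrix(toy_vocab, toy_embeddings, dim=2)
coverage = 1 - len(missing) / len(toy_vocab)
print(missing)   # ['zzyzx', '<unk>']
print(coverage)  # 0.5
```

    A low coverage value would suggest that many words contribute nothing to the frozen embedding, which may be a reason to fine-tune it or to choose a different set of vectors.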

    4.1 Building an Embedding layer from the pretrained vectors

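    The parameters of an Embedding built from pretrained vectors are usually expected to stay fixed, so set trainable=False. To continue training the parameters from this starting point, set trainable=True instead.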

    pretrained_attr = paddle.ParamAttr(name='embedding',
                                       initializer=paddle.nn.initializer.Assign(vocab_embeddings),
                                       trainable=False)
    embedding_layer = paddle.nn.Embedding(num_embeddings=len(vocab),
                                          embedding_dim=dim,  # required; must match the pretrained vectors
                                          padding_idx=word_idx['<unk>'],
                                          weight_attr=pretrained_attr)

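    Here we build a simple classifier based on one-dimensional convolution, with the structure Embedding -> Conv1D -> Pool1D -> Linear. Defining the Linear layer requires the dimension of its input vector, which can be computed with the formula given in the official documentation. A function that performs the computation is: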

    def cal_output_shape(input_shape, out_channels, kernel_size, stride, padding=0, dilation=1):
        return out_channels, int((input_shape + 2*padding - (dilation*(kernel_size - 1) + 1)) / stride) + 1

    # Length of each sample
    length = 2000
    # Convolution layer parameters
    kernel_size = 5
    out_channels = 10
    stride = 2
    padding = 0
    # Output shape after the convolution, then after the pooling layer
    output_shape = cal_output_shape(length, out_channels, kernel_size, stride, padding)
    output_shape = cal_output_shape(output_shape[1], output_shape[0], 2, 2, 0)
    sim_model = paddle.nn.Sequential(embedding_layer,
                                     paddle.nn.Conv1D(in_channels=dim, out_channels=out_channels, kernel_size=kernel_size,
                                                      stride=stride, padding=padding, data_format='NLC', bias_attr=True),
                                     paddle.nn.ReLU(),
                                     paddle.nn.MaxPool1D(kernel_size=2, stride=2),
                                     paddle.nn.Flatten(),
                                     paddle.nn.Linear(in_features=np.prod(output_shape), out_features=2, bias_attr=True),
                                     paddle.nn.Softmax())
    paddle.summary(sim_model, input_size=(-1, length), dtypes='int64')

    ---------------------------------------------------------------------------
     Layer (type)       Input Shape          Output Shape         Param #
    ===========================================================================
     Embedding-1        [[1, 2000]]          [1, 2000, 100]       514,700
     Conv1D-1           [[1, 2000, 100]]     [1, 998, 10]         5,010
     ReLU-1             [[1, 998, 10]]       [1, 998, 10]         0
     MaxPool1D-1        [[1, 998, 10]]       [1, 998, 5]          0
     Flatten-1          [[1, 998, 5]]        [1, 4990]            0
     Linear-1           [[1, 4990]]          [1, 2]               9,982
     Softmax-1          [[1, 2]]             [1, 2]               0
    ===========================================================================
    Total params: 529,692
    Trainable params: 14,992
    Non-trainable params: 514,700
    ---------------------------------------------------------------------------
    Input size (MB): 0.01
    Forward/backward pass size (MB): 1.75
    Params size (MB): 2.02
    Estimated Total Size (MB): 3.78
    ---------------------------------------------------------------------------
    {'total_params': 529692, 'trainable_params': 14992}
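
    The shapes and parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic in plain Python. The helper conv_output_length is a hypothetical integer-division variant of the formula used by cal_output_shape, and the vocabulary size of 5147 is inferred from the summary (514,700 embedding parameters divided by dim = 100) rather than stated directly in the text.

```python
def conv_output_length(length, kernel_size, stride, padding=0, dilation=1):
    """Output length of a 1-D convolution or pooling window (floor division)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


conv_len = conv_output_length(2000, kernel_size=5, stride=2)      # after Conv1D
pool_len = conv_output_length(conv_len, kernel_size=2, stride=2)  # after pooling
linear_in = 10 * pool_len                                         # out_channels * length

conv_params = 100 * 10 * 5 + 10    # in_channels * out_channels * kernel_size + bias
linear_params = linear_in * 2 + 2  # in_features * out_features + bias
embedding_params = 5147 * 100      # inferred vocabulary size * dim

print(conv_len, pool_len, linear_in)                 # 998 499 4990
print(conv_params, linear_params, embedding_params)  # 5010 9982 514700
```

    The trainable total, 5,010 + 9,982 = 14,992, matches the summary, and adding the frozen embedding gives 529,692. The summary reports the pooled shape as [1, 998, 5] rather than 499 positions of 10 channels; the element count, 4990, is the same either way.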

    4.3 Reading the data and training

    class DataReader(Dataset):
        def __init__(self, input, label, length):
            self.data = list(vectorizer(input, label, length=length))

        def __getitem__(self, idx):
            return self.data[idx]

        def __len__(self):
            return len(self.data)

    # Define the input formats
    input_form = paddle.static.InputSpec(shape=[None, length], dtype='int64', name='input')
    label_form = paddle.static.InputSpec(shape=[None, 1], dtype='int64', name='label')
    # Wrap the network with the high-level Model API
    model = paddle.Model(sim_model, input_form, label_form)
    model.prepare(optimizer=paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
                  loss=paddle.nn.loss.CrossEntropyLoss(),
                  metrics=paddle.metric.Accuracy())
    # Split the shuffled training data (train_x, train_y) into training and validation sets
    eval_length = int(len(train_x) * 1/4)
    model.fit(train_data=DataReader(train_x[:-eval_length], train_y[:-eval_length], length),
              eval_data=DataReader(train_x[-eval_length:], train_y[-eval_length:], length),
              batch_size=32, epochs=10, verbose=1)

    The loss value printed in the log is the current step, and the metric is the average value of previous steps.
    Epoch 1/10
    step 586/586 [==============================] - loss: 0.4641 - acc: 0.6480 - 415ms/step
    Eval begin...
    step 196/196 [==============================] - loss: 0.3703 - acc: 0.7694 - 209ms/step
    Eval samples: 6250
    Epoch 2/10
    step 586/586 [==============================] - loss: 0.5839 - acc: 0.7744 - 416ms/step
    Eval begin...
    step 196/196 [==============================] - loss: 0.3651 - acc: 0.7939 - 206ms/step
    Eval samples: 6250
    Epoch 3/10
    step 586/586 [==============================] - loss: 0.3980 - acc: 0.7953 - 419ms/step
    Eval begin...
    step 196/196 [==============================] - loss: 0.3801 - acc: 0.7982 - 214ms/step
    Eval samples: 6250
    Epoch 4/10
    step 586/586 [==============================] - loss: 0.4552 - acc: 0.8184 - 415ms/step
    Eval begin...
    step 196/196 [==============================] - loss: 0.3370 - acc: 0.8077 - 210ms/step
    Eval samples: 6250
    Epoch 5/10
    step 586/586 [==============================] - loss: 0.4108 - acc: 0.8361 - 421ms/step
    Eval begin...
    step 196/196 [==============================] - loss: 0.3369 - acc: 0.8179 - 210ms/step
    Eval samples: 6250
    Epoch 6/10
    step 586/586 [==============================] - loss: 0.4215 - acc: 0.8486 - 415ms/step
    Eval begin...
    step 196/196 [==============================] - loss: 0.3419 - acc: 0.8062 - 213ms/step
    Eval samples: 6250
    Epoch 7/10
    step 586/586 [==============================] - loss: 0.4092 - acc: 0.8586 - 424ms/step
    Eval begin...
    step 196/196 [==============================] - loss: 0.3312 - acc: 0.8200 - 208ms/step
    Eval samples: 6250
    Epoch 8/10
    step 586/586 [==============================] - loss: 0.4488 - acc: 0.8694 - 419ms/step
    Eval begin...
    step 196/196 [==============================] - loss: 0.3328 - acc: 0.8186 - 205ms/step
    Eval samples: 6250
    Epoch 9/10
    step 586/586 [==============================] - loss: 0.5302 - acc: 0.8770 - 412ms/step
    Eval begin...
    step 196/196 [==============================] - loss: 0.3961 - acc: 0.8152 - 201ms/step
    Eval samples: 6250
    Epoch 10/10
    step 586/586 [==============================] - loss: 0.3728 - acc: 0.8807 - 420ms/step
    Eval begin...
    step 196/196 [==============================] - loss: 0.3353 - acc: 0.8210 - 202ms/step
    Eval samples: 6250
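
    The step counts in the log follow from the split and the batch size. The short sketch below reproduces them, assuming the final partial batch of each pass is kept (rounding up), which is consistent with the numbers logged above. The variable names are illustrative only.

```python
import math

num_train_samples = 25000  # training set size printed earlier
batch_size = 32

# The last quarter of the shuffled training data is held out for validation.
eval_length = int(num_train_samples * 1 / 4)
fit_length = num_train_samples - eval_length

train_steps = math.ceil(fit_length / batch_size)
eval_steps = math.ceil(eval_length / batch_size)
print(eval_length, fit_length, train_steps, eval_steps)  # 6250 18750 586 196
```

    These values line up with "step 586/586", "step 196/196" and "Eval samples: 6250" in the log.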

    # Evaluate on the test set
    model.evaluate(eval_data=DataReader(test_x, test_y, length), batch_size=32, verbose=1)
    # Predict on ten test samples
    true_y = test_y[100:105] + test_y[-110:-105]
    pred_y = model.predict(DataReader(test_x[100:105] + test_x[-110:-105], None, length), batch_size=1)
    test_x_doc = test_x[100:105] + test_x[-110:-105]
    # Map label ids to text
    label_id2text = {0: 'positive', 1: 'negative'}
    for index, y in enumerate(pred_y[0]):
        print("Predicted label: %s, true label: %s" % (label_id2text[np.argmax(y)], label_id2text[true_y[index]]))
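
    The last loop turns each row of class probabilities into a label by taking the index of the largest value. A minimal sketch of that decoding step in plain Python follows. The probabilities and true labels are made up for illustration and are not model outputs, and the argmax helper is a hypothetical stand-in for np.argmax.

```python
label_id2text = {0: "positive", 1: "negative"}


def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up softmax outputs for three reviews, with made-up true labels.
fake_probs = [[0.91, 0.09], [0.20, 0.80], [0.55, 0.45]]
fake_true = [0, 1, 1]

predicted = [argmax(p) for p in fake_probs]
correct = sum(int(p == t) for p, t in zip(predicted, fake_true))
for p, t in zip(predicted, fake_true):
    print("predicted: %s, true: %s" % (label_id2text[p], label_id2text[t]))
print("accuracy: %.2f" % (correct / len(fake_true)))
```

    The third made-up review is close to a coin flip (0.55 against 0.45), which shows that argmax alone hides how confident a prediction is.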