[Python人工智能] 三十三.Bert模型 (2)keras-bert库构建Bert模型实现文本分类

从本专栏开始,作者正式研究Python深度学习、神经网络及人工智能相关知识。前一篇文章开启了新的内容——Bert,首先介绍Keras-bert库安装及基础用法,这将为后续文本分类、命名实体识别提供帮助。这篇文章将通过keras-bert库构建Bert模型,并实现文本分类工作。基础性文章,希望对您有所帮助!

这篇文章主要参考“山阴少年”大佬的博客,并结合自己的经验,对其代码进行了详细的复现和理解。希望对您有所帮助,尤其是初学者,也强烈推荐大家关注这位老师的文章。

本专栏主要结合作者之前的博客、AI经验和相关视频及论文介绍,后面随着深入会讲解更多的Python人工智能案例及应用。基础性文章,希望对您有所帮助,如果文章中存在错误或不足之处,还请海涵!作者作为人工智能的菜鸟,希望大家能与我在这一笔一划的博客中成长起来。写了这么多年博客,尝试第一个付费专栏,为小宝赚点奶粉钱,但更多博客尤其基础性文章,还是会继续免费分享,该专栏也会用心撰写,望对得起读者。如果有问题随时私聊我,只望您能从这个系列中学到知识,一起加油喔~

前文赏析:


一.Bert模型引入

Bert模型的原理知识将在后面的文章介绍,主要结合结合谷歌论文和模型优势讲解。

在这里插入图片描述

BERT(Bidirectional Encoder Representation from Transformers)是一个预训练的语言表征模型,是由谷歌AI团队在2018年提出。该模型在机器阅读理解顶级水平测试SQuAD1.1中表现出惊人的成绩,并且在11种不同NLP测试中创出最佳成绩,包括将GLUE基准推至80.4%(绝对改进7.6%),MultiNLI准确度达到86.7% (绝对改进率5.6%)等。可以预见的是,BERT将为NLP带来里程碑式的改变,也是NLP领域近期最重要的进展。

Bert强调了不再像以往一样采用传统的单向语言模型或者把两个单向语言模型进行浅层拼接的方法进行预训练,而是采用新的masked language model(MLM),以致能生成深度的双向语言表征。其模型框架图如下所示,后面的文章再详细介绍,这里仅作引入,推荐读者阅读原文。

在这里插入图片描述

在这里插入图片描述


二.keras-bert安装感想

首先,我们通过github可以看到开源代码的结构,如下图。

├── chinese_L-12_H-768_A-12(BERT中文预训练模型)
│   ├── bert_config.json
│   ├── bert_model.ckpt.data-00000-of-00001
│   ├── bert_model.ckpt.index
│   ├── bert_model.ckpt.meta
│   └── vocab.txt
├── data(数据集)
│   └── sougou_mini
│       ├── test.csv
│       └── train.csv
├── label.json(类别词典,生成文件)
├── model_evaluate.py(模型评估脚本)
├── model_predict.py(模型预测脚本)
├── model_train.py(模型训练脚本)
└── requirements.txt

然而,在安装过程中我们需要注意各个版本之间的兼容性,尤其是一些冷门的扩展包,否则会提示各种错误。

在这里插入图片描述

keras-bert库最适合的版本如下,在requirements.txt文件中可以看到。

  • pandas==0.23.4
  • Keras==2.3.1
  • keras_bert==0.83.0
  • numpy==1.16.4

安装过程建议使用:

  • pip install keras-bert==0.83.0

在这里插入图片描述

但是,有时候我们配置的扩展包会已经和其他环境适配,比如keras。这里我们需要去PYPI查看Keras-bert可用的版本问题。

在这里插入图片描述

作者的Kears版本为2.2.4,之前安装就会遇到各种问题,最后将keras-bert安装为“0.81.1”才成功运行,这是我们需要注意的。

在这里插入图片描述

个人感觉Python环境兼容问题一直困扰大家,也不知道有没有好的解决方法?后面我准备学习下docker来进行环境镜像,也可以通过下面命令快速安装环境。

  • pip freeze >requirements.txt

在这里插入图片描述

错误修改
如果提示“ImportError: cannot import name ‘Adam’ from ‘keras.optimizers’”,我们可以进行如下修改:

  • from keras.optimizers import adam_v2
  • opt = adam_v2.Adam(learning_rate=lr, decay=lr/epochs)

其原因是keras库更新后无法按照原方式导入包,打开optimizers.py源码发现两句关键代码更改。但最终修改方法还是版本兼容问题。

  • from keras.optimizers import Adam
  • opt = Adam(lr=lr, decay=lr/epochs)

    在这里插入图片描述


三.Bert模型构建分类器

1.数据集

在文本挖掘中有两个比较常用的数据集,分别为:

  • sougou小分类数据集
    – 5个类别:体育、健康、军事、教育、汽车。
    – 训练集和测试集:训练集每个分类800条样本,测试集每个分类100条样本。
  • THUCNews数据集
    – 10个类别:体育, 财经, 房产, 家居, 教育, 科技, 时尚, 时政, 游戏, 娱乐。
    – 训练集和测试集:训练集 5000x10,测试集 1000x10。

本文采用sougou数据集,如下图所示包括label和content。

在这里插入图片描述

类别如下:
在这里插入图片描述

在这里插入图片描述


2.模型构建

(1) 数据预处理

在keras-bert里面,我们使用Tokenizer类来将文本拆分成字并生成相应的id,通过定义字典存放token和id的映射。

  • TOKEN_CLS:the token represents classification
  • TOKEN_SEP:the token represents separator
  • TOKEN_UNK:the token represents unknown token
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 24 00:09:48 2021
@author: xiuzhang
引用:https://github.com/percent4/keras_bert_text_classification
"""
import json
import codecs
import pandas as pd
import numpy as np
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.layers import *
from keras.models import Model
from keras.optimizers import Adam
#from keras.optimizers import adam_v2 

maxlen = 300
BATCH_SIZE = 8
config_path = 'chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'chinese_L-12_H-768_A-12/vocab.txt'

#读取vocab词典
token_dict = {}
with codecs.open(dict_path, 'r', 'utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

#------------------------------------------类函数定义--------------------------------------
#词典中添加否则Unknown
class OurTokenizer(Tokenizer):
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            else:
                R.append('[UNK]')   #剩余的字符是[UNK]
        return R
tokenizer = OurTokenizer(token_dict)

#数据填充
def seq_padding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])

class DataGenerator:

    def __init__(self, data, batch_size=BATCH_SIZE):
        self.data = data
        self.batch_size = batch_size
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))
            np.random.shuffle(idxs)
            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append(y)
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seq_padding(X1)
                    X2 = seq_padding(X2)
                    Y = seq_padding(Y)
                    yield [X1, X2], Y
                    [X1, X2, Y] = [], [], []

#构建模型
def create_cls_model(num_labels):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

    for layer in bert_model.layers:
        layer.trainable = True

    x1_in = Input(shape=(None,))
    x2_in = Input(shape=(None,))

    x = bert_model([x1_in, x2_in])
    cls_layer = Lambda(lambda x: x[:, 0])(x)                #取出[CLS]对应的向量用来做分类
    p = Dense(num_labels, activation='softmax')(cls_layer)  #多分类

    model = Model([x1_in, x2_in], p)
    model.compile(
        loss='categorical_crossentropy',
        optimizer=Adam(1e-5),
        #optimizer=adam_v2.Adam(1e-5),
        metrics=['accuracy']
    )
    model.summary()

    return model

#------------------------------------------主函数-----------------------------------------
if __name__ == '__main__':

    #数据预处理
    train_df = pd.read_csv("data/sougou_mini/train.csv").fillna(value="")
    test_df = pd.read_csv("data/sougou_mini/test.csv").fillna(value="")
    print("begin data processing...")

    labels = train_df["label"].unique()
    with open("label.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(dict(zip(range(len(labels)), labels)), ensure_ascii=False, indent=2))

    train_data = []
    test_data = []
    for i in range(train_df.shape[0]):
        label, content = train_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            if _ == label:
                label_id[j] = 1
        train_data.append((content, label_id))
    print(train_data[0])

    for i in range(test_df.shape[0]):
        label, content = test_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            if _ == label:
                label_id[j] = 1
        test_data.append((content, label_id))
    print(len(train_data),len(test_data))
    print("finish data processing!")

预处理输出结果如下图所示,[1,0,0,0,0]对应第一类体育。

在这里插入图片描述

同时在本地保存类别文件,如下图所示:

在这里插入图片描述


(2) 模型训练

模型训练代码如下,这里给出主函数的核心代码:

#------------------------------------------主函数-----------------------------------------
if __name__ == '__main__':

    #数据预处理
    train_df = pd.read_csv("data/sougou_mini/train.csv").fillna(value="")
    test_df = pd.read_csv("data/sougou_mini/test.csv").fillna(value="")
    print("begin data processing...")

    labels = train_df["label"].unique()
    with open("label.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(dict(zip(range(len(labels)), labels)), ensure_ascii=False, indent=2))

    train_data = []
    test_data = []
    for i in range(train_df.shape[0]):
        label, content = train_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            if _ == label:
                label_id[j] = 1
        train_data.append((content, label_id))
    print(train_data[0])

    for i in range(test_df.shape[0]):
        label, content = test_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            if _ == label:
                label_id[j] = 1
        test_data.append((content, label_id))
    print(len(train_data),len(test_data))
    print("finish data processing!n")
    
    #模型训练
    model = create_cls_model(len(labels))
    train_D = DataGenerator(train_data)
    test_D = DataGenerator(test_data)
    print("begin model training...")
    
    model.fit_generator(
        train_D.__iter__(),
        steps_per_epoch=len(train_D),
        epochs=3,
        validation_data=test_D.__iter__(),
        validation_steps=len(test_D)
    )
    print("finish model training!")

    #模型保存
    model.save('cls_cnews.h5')
    print("Model saved!")
    result = model.evaluate_generator(test_D.__iter__(), steps=len(test_D))
    print("模型评估结果:", result)

其中模型结构如下,我们提取 [CLS] 对应的向量,然后构建模型和全连接层,激活函数使用Softmax,最终完成分类任务。核心代码:

  • model = create_cls_model(len(labels))
  • model.fit_generator()
  • result = model.evaluate_generator(test_D.iter(), steps=len(test_D))

在这里插入图片描述

训练过程输出结果如下:(花了3小时,还是要使用GPU才行)

Skipping registering GPU devices...
begin model training...
Epoch 1/3
C:UsersPCDesktopbert-classifierblog33-kerasbert-01-classifier.py:144: UserWarning: `Model.fit_generator` is deprecated and will be removed in a future version. Please use `Model.fit`, which supports generators.
  model.fit_generator(
500/500 [==============================] - 3386s 7s/step - loss: 0.1906 - accuracy: 0.9370 - val_loss: 0.0768 - val_accuracy: 0.9737
Epoch 2/3
500/500 [==============================] - 3356s 7s/step - loss: 0.0379 - accuracy: 0.9872 - val_loss: 0.0687 - val_accuracy: 0.9758
Epoch 3/3
500/500 [==============================] - 3356s 7s/step - loss: 0.0180 - accuracy: 0.9952 - val_loss: 0.0953 - val_accuracy: 0.9657
finish model training!
Model saved!
模型评估结果: [0.08730510622262955, 0.9696969985961914]

同时将模型存储至本地。

在这里插入图片描述


(3) GPU版模型训练

GPU配置的核心代码如下:

import os
import tensorflow as tf
os.environ["CUDA_DEVICES_ORDER"] = "PCI_BUS_IS"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

#指定了每个GPU进程中使用显存的上限,0.9表示可以使用GPU 90%的资源进行训练
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

代码成功运行,如下图所示:

在这里插入图片描述

运行时间从3小时提升到600秒。

在这里插入图片描述

在这里插入图片描述


3.模型评估

模型评估核心代码如下:

# -*- coding: utf-8 -*-
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 25 00:09:02 2021
@author: xiuzhang
引用:https://github.com/percent4/keras_bert_text_classification
"""
import json
import numpy as np
import pandas as pd
from keras.models import load_model
from keras_bert import get_custom_objects
from sklearn.metrics import classification_report
from blog33_kerasbert_01_train import token_dict, OurTokenizer

maxlen = 300

#加载训练好的模型
model = load_model("cls_cnews.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)
with open("label.json", "r", encoding="utf-8") as f:
    label_dict = json.loads(f.read())

#单句预测
def predict_single_text(text):
    text = text[:maxlen]
    x1, x2 = tokenizer.encode(first=text)  #BERT Tokenize
    X1 = x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1
    X2 = x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen else x2
    #print(X1,X2)

    #模型预测
    predicted = model.predict([[X1], [X2]])
    y = np.argmax(predicted[0])
    return label_dict[str(y)]

#模型评估
def evaluate():
    test_df = pd.read_csv("data/sougou_mini/test.csv").fillna(value="")
    true_y_list, pred_y_list = [], []
    for i in range(test_df.shape[0]):
        print("predict %d samples" % (i+1))
        true_y, content = test_df.iloc[i, :]
        pred_y = predict_single_text(content)
        print(true_y,pred_y)
        true_y_list.append(true_y)
        pred_y_list.append(pred_y)
    return classification_report(true_y_list, pred_y_list, digits=4)

#------------------------------------模型评估---------------------------------
output_data = evaluate()
print("model evaluate result:n")
print(output_data)

输出结果如下图所示:

predict 490 samples
汽车 汽车
predict 491 samples
汽车 汽车
predict 492 samples
汽车 汽车
predict 493 samples
汽车 汽车
predict 494 samples
汽车 汽车
predict 495 samples
汽车 汽车
model evaluate result:

             precision    recall  f1-score   support

         体育     0.9706    1.0000    0.9851        99
         健康     0.9694    0.9596    0.9645        99
         军事     1.0000    1.0000    1.0000        99
         教育     0.9694    0.9596    0.9645        99
         汽车     0.9796    0.9697    0.9746        99

avg / total     0.9778    0.9778    0.9777       495

4.模型预测

核心代码如下:

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 25 00:10:06 2021
@author: xiuzhang
引用:https://github.com/percent4/keras_bert_text_classification
"""
import time
import json
import numpy as np

from blog33_kerasbert_01_train import token_dict, OurTokenizer
from keras.models import load_model
from keras_bert import get_custom_objects

maxlen = 256
s_time = time.time()

#加载训练好的模型
model = load_model("cls_cnews.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)
with open("label.json", "r", encoding="utf-8") as f:
    label_dict = json.loads(f.read())

#预测示例语句
text = "说到硬派越野SUV,你会想起哪些车型?是被称为“霸道”的丰田 普拉多 (配置 | 询价) ,还是被叫做“山猫”的帕杰罗,亦或者是“渣男专车”奔驰大G、" 
       "“沙漠王子”途乐。总之,随着世界各国越来越重视对环境的保护,那些大排量的越野SUV在不久的将来也会渐渐消失在我们的视线之中,所以与其错过," 
       "不如趁着还年轻,在有生之年里赶紧去入手一台能让你心仪的硬派越野SUV。而今天我想要来跟你们聊的,正是全球公认的十大硬派越野SUV," 
       "越野迷们看完之后也不妨思考一下,到底哪款才是你的菜,下面话不多说,赶紧开始吧。"

#Tokenize
text = text[:maxlen]
x1, x2 = tokenizer.encode(first=text)
X1 = x1 + [0] * (maxlen-len(x1)) if len(x1) < maxlen else x1
X2 = x2 + [0] * (maxlen-len(x2)) if len(x2) < maxlen else x2

#模型预测
predicted = model.predict([[X1], [X2]])
y = np.argmax(predicted[0])
e_time = time.time()
print("原文: %s" % text)
print("预测标签: %s" % label_dict[str(y)])
print("Cost time:", e_time-s_time)

输出结果如下图所示,可以对不同的语料进行类别预测。

在这里插入图片描述


四.文本分类实验评估

1.机器学习模型对比分析

Bert实验结果如下:

             precision    recall  f1-score   support

         体育     0.9706    1.0000    0.9851        99
         健康     0.9694    0.9596    0.9645        99
         军事     1.0000    1.0000    1.0000        99
         教育     0.9694    0.9596    0.9645        99
         汽车     0.9796    0.9697    0.9746        99

avg / total     0.9778    0.9778    0.9777       495

机器学习代码如下:

# -*- coding: utf-8 -*-
"""
Created on Mon Sep 27 22:21:53 2021
@author: xiuzhang
"""
import jieba
import pandas as pd
import numpy as np
from collections import Counter
from scipy.sparse import coo_matrix
from sklearn import feature_extraction  
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier

#-----------------------------------------------------------------------------
#读取数据
train_path = 'data/sougou_mini/train.csv'
test_path = 'data/sougou_mini/test.csv'
types = {0: '体育', 1: '健康', 2: '军事', 3:'教育', 4:'汽车'}
pd_train = pd.read_csv(train_path)
pd_test = pd.read_csv(test_path)
print('训练集数目(总体):%d' % pd_train.shape[0])
print('测试集数目(总体):%d' % pd_test.shape[0])

#中文分词
train_words = []
test_words = []
train_labels = []
test_labels = []
stopwords = ["[", "]", ")", "(", ")", "(", "【", "】", "!", ",", "$",
             "·", "?", ".", "、", "-", "—", ":", ":", "《", "》", "=",
             "。", "…", "“", "?", "”", "~", " ", "-", "+", "\", "‘",
             "~", ";", "’", "...", "..", "&", "#",  "....", ",", 
             "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"
             "的", "和", "之", "了", "哦", "那", "一个",  ]

for line in range(len(pd_train)):
    dict_label = pd_train['label'][line]
    dict_content = str(pd_train['content'][line]) #float=>str
    #print(dict_label,dict_content)
    cut_words = ""
    data = dict_content.strip("n")
    data = data.replace(",", "")    #一定要过滤符号 ","否则多列
    seg_list = jieba.cut(data, cut_all=False)
    for seg in seg_list:
        if seg not in stopwords:
            cut_words += seg + " "
    #print(cut_words)
    train_labels.append(dict_label)
    train_words.append(cut_words)
print(len(train_labels),len(train_words)) #4000 4000

for line in range(len(pd_test)):
    dict_label = pd_test['label'][line]
    dict_content = str(pd_test['content'][line])
    cut_words = ""
    data = dict_content.strip("n")
    data = data.replace(",", "")
    seg_list = jieba.cut(data, cut_all=False)
    for seg in seg_list:
        if seg not in stopwords:
            cut_words += seg + " "
    test_labels.append(dict_label)
    test_words.append(cut_words)
print(len(test_labels),len(test_words)) #495 495

#-----------------------------------------------------------------------------
#TFIDF计算
#将文本中的词语转换为词频矩阵 矩阵元素a[i][j] 表示j词在i类文本下的词频
vectorizer = CountVectorizer(min_df=10)   #MemoryError控制参数

#该类会统计每个词语的tf-idf权值
transformer = TfidfTransformer()

#第一个fit_transform是计算tf-idf 第二个fit_transform是将文本转为词频矩阵
tfidf = transformer.fit_transform(vectorizer.fit_transform(train_words+test_words))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

#获取词袋模型中的所有词语  
word = vectorizer.get_feature_names()
for n in word[:10]:
    print(n)
print("单词数量:", len(word))

#将tf-idf矩阵抽取 元素w[i][j]表示j词在i类文本中的tf-idf权重
X = coo_matrix(tfidf, dtype=np.float32).toarray()  #稀疏矩阵
print(X.shape)
print(X[:10])

X_train = X[:len(train_labels)]
X_test = X[len(train_labels):]
y_train = train_labels
y_test = test_labels
print(len(X_train),len(X_test),len(y_train),len(y_test))

#-----------------------------------------------------------------------------
#分类模型
clf = MultinomialNB()
#clf = svm.LinearSVC()
#clf = LogisticRegression(solver='liblinear')
#clf = RandomForestClassifier(n_estimators=10)
#clf = neighbors.KNeighborsClassifier(n_neighbors=7)
#clf = AdaBoostClassifier()
clf.fit(X_train, y_train)
print('模型的准确度:{}'.format(clf.score(X_test, y_test)))
pre = clf.predict(X_test)
print("分类")
print(len(pre), len(y_test))
print(classification_report(y_test, pre, digits=4))
print("n")

输出结果如下图所示,读者可以对自定义数据集进行深入的对比实验。

训练集数目(总体):4000
测试集数目(总体):495
Building prefix dict from the default dictionary ...
Loading model from cache C:UsersxiuzhangAppDataLocalTempjieba.cache
Loading model cost 0.906 seconds.
Prefix dict has been built succesfully.
4000 4000
495 495
...
单词数量: 14083
(4495, 14083)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
4000 495 4000 495

模型的准确度:0.9797979797979798
分类
495 495
              precision    recall  f1-score   support

          体育     1.0000    0.9798    0.9898        99
          健康     0.9789    0.9394    0.9588        99
          军事     1.0000    1.0000    1.0000        99
          教育     0.9417    0.9798    0.9604        99
          汽车     0.9604    0.9798    0.9700        99

    accuracy                         0.9758       495
   macro avg     0.9762    0.9758    0.9758       495
weighted avg     0.9762    0.9758    0.9758       495

2.完整代码

(1) blog33_kerasbert_01_train.py

# -*- coding: utf-8 -*-
"""
Created on Wed Nov 24 00:09:48 2021
@author: xiuzhang
"""
import json
import codecs
import pandas as pd
import numpy as np
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.layers import *
from keras.models import Model
from keras.optimizers import Adam

import os
import tensorflow as tf
os.environ["CUDA_DEVICES_ORDER"] = "PCI_BUS_IS"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

#指定了每个GPU进程中使用显存的上限,0.9表示可以使用GPU 90%的资源进行训练
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

maxlen = 300
BATCH_SIZE = 8
config_path = 'chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'chinese_L-12_H-768_A-12/vocab.txt'

#读取vocab词典
token_dict = {}
with codecs.open(dict_path, 'r', 'utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

#------------------------------------------类函数定义--------------------------------------
#词典中添加否则Unknown
class OurTokenizer(Tokenizer):
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            else:
                R.append('[UNK]')   #剩余的字符是[UNK]
        return R
tokenizer = OurTokenizer(token_dict)

#数据填充
def seq_padding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])

class DataGenerator:

    def __init__(self, data, batch_size=BATCH_SIZE):
        self.data = data
        self.batch_size = batch_size
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))
            np.random.shuffle(idxs)
            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append(y)
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seq_padding(X1)
                    X2 = seq_padding(X2)
                    Y = seq_padding(Y)
                    yield [X1, X2], Y
                    [X1, X2, Y] = [], [], []

#构建模型
def create_cls_model(num_labels):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

    for layer in bert_model.layers:
        layer.trainable = True

    x1_in = Input(shape=(None,))
    x2_in = Input(shape=(None,))

    x = bert_model([x1_in, x2_in])
    cls_layer = Lambda(lambda x: x[:, 0])(x)                #取出[CLS]对应的向量用来做分类
    p = Dense(num_labels, activation='softmax')(cls_layer)  #多分类

    model = Model([x1_in, x2_in], p)
    model.compile(
        loss='categorical_crossentropy',
        optimizer=Adam(1e-5),
        metrics=['accuracy']
    )
    model.summary()

    return model

#------------------------------------------主函数-----------------------------------------
if __name__ == '__main__':

    #数据预处理
    train_df = pd.read_csv("data/sougou_mini/train.csv").fillna(value="")
    test_df = pd.read_csv("data/sougou_mini/test.csv").fillna(value="")
    print("begin data processing...")

    labels = train_df["label"].unique()
    with open("label.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(dict(zip(range(len(labels)), labels)), ensure_ascii=False, indent=2))

    train_data = []
    test_data = []
    for i in range(train_df.shape[0]):
        label, content = train_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            if _ == label:
                label_id[j] = 1
        train_data.append((content, label_id))
    print(train_data[0])

    for i in range(test_df.shape[0]):
        label, content = test_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            if _ == label:
                label_id[j] = 1
        test_data.append((content, label_id))
    print(len(train_data),len(test_data))
    print("finish data processing!n")
    
    #模型训练
    model = create_cls_model(len(labels))
    train_D = DataGenerator(train_data)
    test_D = DataGenerator(test_data)
    print("begin model training...")
    
    model.fit_generator(
        train_D.__iter__(),
        steps_per_epoch=len(train_D),
        epochs=3,
        validation_data=test_D.__iter__(),
        validation_steps=len(test_D)
    )
    print("finish model training!")

    #模型保存
    model.save('cls_cnews.h5')
    print("Model saved!")
    result = model.evaluate_generator(test_D.__iter__(), steps=len(test_D))
    print("模型评估结果:", result)

(2) blog33_kerasbert_02_evaluate.py

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 25 00:09:02 2021
@author: xiuzhang
引用:https://github.com/percent4/keras_bert_text_classification
"""
import json
import numpy as np
import pandas as pd
from keras.models import load_model
from keras_bert import get_custom_objects
from sklearn.metrics import classification_report
from blog33_kerasbert_01_train import token_dict, OurTokenizer

maxlen = 300

#加载训练好的模型
model = load_model("cls_cnews.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)
with open("label.json", "r", encoding="utf-8") as f:
    label_dict = json.loads(f.read())

#单句预测
def predict_single_text(text):
    text = text[:maxlen]
    x1, x2 = tokenizer.encode(first=text)  #BERT Tokenize
    X1 = x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1
    X2 = x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen else x2
    #print(X1,X2)

    #模型预测
    predicted = model.predict([[X1], [X2]])
    y = np.argmax(predicted[0])
    return label_dict[str(y)]

#模型评估
def evaluate():
    test_df = pd.read_csv("data/sougou_mini/test.csv").fillna(value="")
    true_y_list, pred_y_list = [], []
    for i in range(test_df.shape[0]):
        print("predict %d samples" % (i+1))
        true_y, content = test_df.iloc[i, :]
        pred_y = predict_single_text(content)
        print(true_y,pred_y)
        true_y_list.append(true_y)
        pred_y_list.append(pred_y)
    return classification_report(true_y_list, pred_y_list, digits=4)

#------------------------------------模型评估---------------------------------
output_data = evaluate()
print("model evaluate result:n")
print(output_data)

(3) blog33_kerasbert_03_predict.py

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 25 00:10:06 2021
@author: xiuzhang
引用:https://github.com/percent4/keras_bert_text_classification
"""
import time
import json
import numpy as np

from blog33_kerasbert_01_train import token_dict, OurTokenizer
from keras.models import load_model
from keras_bert import get_custom_objects

maxlen = 256
s_time = time.time()

#加载训练好的模型
model = load_model("cls_cnews.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)
with open("label.json", "r", encoding="utf-8") as f:
    label_dict = json.loads(f.read())

#预测示例语句
text = "说到硬派越野SUV,你会想起哪些车型?是被称为“霸道”的丰田 普拉多 (配置 | 询价) ,还是被叫做“山猫”的帕杰罗,亦或者是“渣男专车”奔驰大G、" 
       "“沙漠王子”途乐。总之,随着世界各国越来越重视对环境的保护,那些大排量的越野SUV在不久的将来也会渐渐消失在我们的视线之中,所以与其错过," 
       "不如趁着还年轻,在有生之年里赶紧去入手一台能让你心仪的硬派越野SUV。而今天我想要来跟你们聊的,正是全球公认的十大硬派越野SUV," 
       "越野迷们看完之后也不妨思考一下,到底哪款才是你的菜,下面话不多说,赶紧开始吧。"

#Tokenize
text = text[:maxlen]
x1, x2 = tokenizer.encode(first=text)
X1 = x1 + [0] * (maxlen-len(x1)) if len(x1) < maxlen else x1
X2 = x2 + [0] * (maxlen-len(x2)) if len(x2) < maxlen else x2

#模型预测
predicted = model.predict([[X1], [X2]])
y = np.argmax(predicted[0])
e_time = time.time()
print("原文: %s" % text)
print("预测标签: %s" % label_dict[str(y)])
print("Cost time:", e_time-s_time)

五.总结

写到这里,这篇文章就介绍结束了,后面还会持续分享,包括Bert实现文本分类、命名实体识别及原理知识。真心希望这篇文章对您有所帮助,加油~

  • 一.Bert模型引入
  • 二.keras-bert安装感想
  • 三.Bert模型构建分类器
    – 1.数据集
    – 2.模型构建
    (1) 数据预处理
    (2) 模型训练
    (3) GPU版模型训练
    – 3.模型评估
    – 4.模型预测
  • 四.文本分类实验评估
    – 1.机器学习模型对比分析
    – 2.完整代码
    (1) blog33_kerasbert_01_train.py
    (2) blog33_kerasbert_02_evaluate.py
    (3) blog33_kerasbert_03_predict.py

下载地址:

在这里插入图片描述

(By:Eastmount 2021-11-26 夜于武汉 http://blog.csdn.net/eastmount/ )


参考文献:

本图文内容来源于网友网络收集整理提供,作为学习参考使用,版权属于原作者。
THE END
分享
二维码
< <上一篇

)">
下一篇>>