Chatbot based on tensorflow
This project is a chatbot built on TensorFlow using deep learning, specifically the seq2seq+Attention model. Chinese sentences are first segmented into words with the jieba word segmentation library, and each word is encoded as an ID number; the model is then trained on a large corpus of question-and-answer text. The resulting dialogue robot answers a user's question by generating a reply from the patterns it learned from that corpus.
Design purpose
A chatbot is a special kind of automatic question answering system that imitates human language habits and finds answers to questions through pattern matching. In real conversation, questions take many forms: sentences mix various parts of speech, and similar sentences can carry different meanings. As a result, a chatbot can answer easy questions accurately but tends to guess or change the topic when a question is ambiguous, and so fails to give a correct answer. Traditional chatbots also have a low word recognition rate and often cannot respond correctly to the dialogue. This project therefore studies the dialogue generation mechanism of chatbots. It first experiments with the mainstream deep learning framework for generative chatbots (seq2seq) and then adds an attention mechanism to improve the generated replies, with the aim of avoiding large numbers of meaningless replies, keeping the dialogue natural and fluent, and making chatbot conversations feel more real.
The dialogue generation mechanism designed here, based on seq2seq and the Attention model, is studied from four aspects: semantic matching, dialogue keyword expansion, word vector encoding, and semantic similarity calculation. The experimental comparison shows that this mechanism has a higher word recognition rate than the traditional mechanism and can correctly identify the dialogue content and give accurate answers. The end result is a chatbot that carries out human-machine dialogue in a human-like way.
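To make the idea of attention concrete, here is a minimal sketch of how a decoder step could weight the encoder states. It uses simple dot-product scoring in plain Python as an illustration only; it is a simplified variant and is not claimed to be the scoring function that TensorFlow's embedding_attention_seq2seq uses internally. The function names softmax and attend and the toy vectors are assumptions made for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step."""
    # Score each encoder state by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    # The context vector is the weighted average of the encoder states.
    dim = len(decoder_state)
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy data: three encoder states and one decoder state (hypothetical values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
print(weights)
print(context)
```

The encoder states that align best with the current decoder state receive the largest weights, so the decoder focuses on the most relevant input words instead of relying on a single fixed summary of the whole sentence.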
Detailed description
1. Chatbots have gone through three generations of technology. Study the literature to understand this development, focusing on the third generation, which is based on deep learning and uses the seq2seq+Attention model;
2. Become familiar with the embedding_attention_seq2seq interface and understand the meaning and use of its parameters;
3. Training model:
(1) Segment Chinese words with jieba;
(2) Define the function that builds the training set;
(3) Define the function to obtain the sentence id;
(4) Define the prediction function;
(5) Optimize parameters.
4. Use Python programming to implement a complete chat robot and perform system debugging.
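Steps (2) and (3) of the training task turn each question/answer pair into fixed-length id sequences. The following is a minimal sketch of that padding for a single sample. It assumes sequence lengths of 5 and the reserved ids PAD=0, GO=1, EOS=2 that the script later in this article uses; the function name pad_pair is hypothetical, and the sketch assumes sentences short enough that nothing needs to be truncated.

```python
PAD_ID, GO_ID, EOS_ID = 0, 1, 2   # reserved ids, matching the script in this article
INPUT_LEN, OUTPUT_LEN = 5, 5      # fixed sequence lengths (assumed for this sketch)


def pad_pair(question_ids, answer_ids):
    """Pad one (question, answer) id pair to fixed lengths."""
    # Questions are padded on the left so the real words come last.
    encoder = [PAD_ID] * (INPUT_LEN - len(question_ids)) + question_ids
    # Answers get GO in front and EOS behind, then padding on the right.
    body = answer_ids + [EOS_ID]
    decoder = [GO_ID] + body + [PAD_ID] * (OUTPUT_LEN - len(body) - 1)
    # Weight 0.0 masks padded positions and the final step in the loss.
    weights = [0.0 if i == OUTPUT_LEN - 1 or decoder[i] == PAD_ID else 1.0
               for i in range(OUTPUT_LEN)]
    return encoder, decoder, weights


encoder, decoder, weights = pad_pair([7, 9], [12, 4])
print(encoder)  # [0, 0, 0, 7, 9]
print(decoder)  # [1, 12, 4, 2, 0]
print(weights)  # [1.0, 1.0, 1.0, 1.0, 0.0]
```

Fixed lengths are needed because the model is built from a fixed number of input placeholders; the weights tell the loss function which positions to ignore.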
Environmental construction and key technologies
2.1 Environment construction
This project uses the Python 3.6 interpreter, TensorFlow 1.14 and its embedding_attention_seq2seq interface, an LSTM neural network, the AdamOptimizer optimizer (the code listing later in this article uses GradientDescentOptimizer), and jieba for Chinese word segmentation. The installation and setup of each item is as follows:
1 Install Anaconda (package manager and environment manager):
Anaconda ships with conda, Python, and over 150 commonly used scientific packages and their dependencies, so you can start processing data immediately. Data analysis relies on many third-party packages, and conda (the package manager) helps you install, uninstall, and update them on your computer.
(1) Anaconda installation and configuration steps and instructions:
a. Enter the official website, click Download;
b. Select the appropriate version of your computer to download (choose the 64-bit version of 2020.11);
c. Find the installer according to your own download path, and click the installer to install;
d. Follow the prompts to install according to the default options, and finally click finish to complete the installation;
e. Open "System Properties > Advanced > Environment Variables > User variables", select Path, click Edit, and append the path of your own Anaconda installation to the variable value;
f. Click OK to complete the configuration of the environment.
g. Click Start, open the Anaconda Prompt (Anaconda3), and switch to a domestic mirror source by entering the following commands on the command line:
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
h. Show the channel URLs when searching:
conda config --set show_channel_urls yes
The Anaconda installation is now complete.
1. Use Anaconda3 to install the python3.6 interpreter:
(1) Open the cmd interface of the Windows system and enter conda --version to check the Anaconda version. The result is shown in Figure 2.1:
Figure 2.1 Anaconda version display
(2) Check which environments currently exist with conda info --envs. Output like that in Figure 2.2 indicates success:
Figure 2.2 Querying the existing environments in Anaconda
(3) Install Python and create the tensorflow environment:
(4) Use the command conda search --full-name python to check which Python versions Anaconda supports. The query results are shown in Figure 2.3:
Figure 2.3 The Python versions found
(5) Install the Python interpreter and create the tensorflow0 environment: conda create --name tensorflow0 python=3.6;
(6) Enter the tensorflow0 environment with the command activate tensorflow0. The result is shown in Figure 2.4:
Figure 2.4 The tensorflow0 environment after activation
2. Install tensorflow1.14 in the tensorflow0 environment:
(1) The command is pip install --upgrade --ignore-installed tensorflow==1.14, which installs TensorFlow version 1.14;
(2) To verify that TensorFlow was installed successfully, enter the following code:
import tensorflow as tf
hello = tf.constant('hello,tf')
sess = tf.Session()
print(sess.run(hello))
The running result is shown in Figure 2.5:
Figure 2.5 The test code runs successfully
The test succeeds, so TensorFlow 1.14 is installed correctly.
3. Install PyCharm, the IDE used to run the code, and configure the tensorflow0 environment in it
(1) Download and install PyCharm: go to the PyCharm official website, download the Community edition installer, and install it with the default options.
(2) Add the tensorflow0 environment to PyCharm
a. Open PyCharm and click Settings under the "File" menu, as shown in Figure 2.6:
Figure 2.6 Enter Settings
b. Click "Project" > "Project Interpreter", then click the "Add" button under the small triangle on the right and add the Python interpreter of the tensorflow0 environment, as shown in Figure 2.7:
Figure 2.7 The tensorflow0 environment successfully added to PyCharm
c. To verify that tensorflow0 was added successfully, run the following code in PyCharm:
import tensorflow as tf
hello = tf.constant('hello,tf')
sess = tf.Session()
print(sess.run(hello))
The result is shown in Figure 2.8:
Figure 2.8 PyCharm successfully runs the code
Key technology
When configuring the environment variables after installing Anaconda3, make sure the path is set correctly. When selecting the Python version, be sure to choose Python 3.6, because the latest version cannot install TensorFlow 1.14. Choosing the right TensorFlow version is the key to getting the program to run: this project mainly follows Google's TensorFlow-based machine learning code, and most of the algorithm frameworks and interfaces it uses are incompatible with TensorFlow 2. I ran into many difficulties when selecting and downloading TensorFlow; the environment must be based on Python 3.6 and TensorFlow 1.14. The other required libraries (jieba for Chinese word segmentation, seq2seq, and numpy) are comparatively simple to install. Enter the following commands in the Terminal:
The command to install the numpy library: pip install numpy
The command to install the jieba library: pip install jieba
The command to install the seq2seq library: pip install seq2seq
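Because the code depends on exactly Python 3.6 and TensorFlow 1.14, it can help to fail fast when the environment does not match. The following is a small sketch of such a check; the function name environment_ok is hypothetical and not part of the original project. It takes the versions as arguments so that it can be tried without TensorFlow installed; in a real script you would pass sys.version_info and tf.__version__.

```python
def environment_ok(python_version, tensorflow_version):
    """Return True when the versions match the ones this article relies on."""
    # python_version is a tuple such as sys.version_info;
    # tensorflow_version is a string such as tf.__version__.
    tf_major_minor = tuple(int(part) for part in tensorflow_version.split(".")[:2])
    return tuple(python_version[:2]) == (3, 6) and tf_major_minor == (1, 14)


print(environment_ok((3, 6), "1.14.0"))  # True
print(environment_ok((3, 9), "2.4.1"))   # False
```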
Code implementation
1. Word numbering:
# coding:utf-8
import sys
import jieba
class WordToken(object):
    def __init__(self):
        # The minimum starting id; smaller ids are reserved for special tags
        self.START_ID = 4
        self.word2id_dict = {}
        self.id2word_dict = {}
    def load_file_list(self, file_list, min_freq):
        """
        Load the list of sample files, segment every line with jieba, count the word
        frequencies, sort them from high to low, and number the words in that order.
        The mappings are saved to self.word2id_dict and self.id2word_dict.
        """
        words_count = {}
        for file in file_list:
            with open(file, 'r', encoding='utf-8') as file_object:
                for line in file_object.readlines():
                    line = line.strip()
                    seg_list = jieba.cut(line)
                    for word in seg_list:
                        if word in words_count:
                            words_count[word] = words_count[word] + 1
                        else:
                            words_count[word] = 1
        sorted_list = [[v[1], v[0]] for v in words_count.items()]
        sorted_list.sort(reverse=True)
        index = 0  # keeps the return value defined when no words were loaded
        for index, item in enumerate(sorted_list):
            word = item[1]
            # Stop at the first word that is rarer than min_freq
            if item[0] < min_freq:
                break
            self.word2id_dict[word] = self.START_ID + index
            self.id2word_dict[self.START_ID + index] = word
        return index
    def word2id(self, word):
        # In Python 3 every str is Unicode, so check against str
        if not isinstance(word, str):
            print("Exception: error word not unicode")
            sys.exit(1)
        if word in self.word2id_dict:
            return self.word2id_dict[word]
        else:
            return None
    def id2word(self, id):
        id = int(id)
        if id in self.id2word_dict:
            return self.id2word_dict[id]
        else:
            return None
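To see what the frequency-based numbering produces, here is a minimal standard-library sketch of the same idea on a tiny example. It assumes the sentences have already been segmented (the nested lists stand in for the output of jieba.cut), and the function name build_vocab is hypothetical. Ties between equally frequent words may be ordered differently than in the WordToken class.

```python
from collections import Counter

START_ID = 4  # ids 0-3 stay reserved for special tags, as in WordToken


def build_vocab(token_lists, min_freq):
    """Map words seen at least min_freq times to ids, most frequent first."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    word2id = {}
    for offset, (word, freq) in enumerate(counts.most_common()):
        if freq < min_freq:
            break  # most_common() is sorted, so every later word is rarer
        word2id[word] = START_ID + offset
    id2word = {idx: word for word, idx in word2id.items()}
    return word2id, id2word


# Pre-segmented sentences stand in for the output of jieba.cut().
sentences = [["你", "好"], ["你", "是", "谁"], ["你", "好", "吗"]]
word2id, id2word = build_vocab(sentences, min_freq=2)
print(word2id)  # {'你': 4, '好': 5}
```

Words below the frequency threshold get no id at all, which is why a sentence made up only of rare words produces an empty id list.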
2. Training:
Training uses the WordToken class from step 1 together with the train() function defined in the script in step 3. To train the model, enable the train() call and disable the predict() call in the __main__ block at the end of that script.
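During training, the train() function in the script below lowers the learning rate when the loss stops improving: if a new loss is higher than every one of the last five recorded losses, the rate is multiplied by 0.9. The following sketch replays that rule on a made-up loss history without TensorFlow; the function name decayed_rates and the loss values are assumptions for illustration.

```python
def decayed_rates(losses, init_rate=1.0, factor=0.9, window=5):
    """Replay a loss history and return the learning rate after each loss."""
    rate = init_rate
    previous = []
    rates = []
    for loss in losses:
        # Decay only when the new loss is worse than all of the last `window` losses.
        if len(previous) > window and loss > max(previous[-window:]):
            rate *= factor
        previous.append(loss)
        rates.append(rate)
    return rates


# Hypothetical loss history: the jump to 4.5 triggers one decay step.
history = [5.0, 4.0, 3.0, 2.5, 2.0, 1.8, 4.5, 1.7]
rates = decayed_rates(history)
print(rates)
```

Starting from a large rate and shrinking it only when progress stalls lets training move quickly at first and settle more carefully later.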
3. Test:
# coding:utf-8
import sys
import numpy as np
import tensorflow as tf
from tensorflow.contrib.legacy_seq2seq.python.ops import seq2seq
import word_token
import jieba
import random
# input sequence length
input_seq_len = 5
# output sequence length
output_seq_len = 5
# padding value for empty positions
PAD_ID = 0
# output sequence start marker
GO_ID = 1
# end tag
EOS_ID = 2
# LSTM neuron size
size = 8
# initial learning rate
init_learning_rate = 1
# A word enters the vocabulary only if it occurs at least this many times in the samples
min_freq = 10
wordToken = word_token.WordToken()
# Loaded at module level so that num_encoder_symbols and num_decoder_symbols can be computed dynamically
max_token_id = wordToken.load_file_list(['./samples/question', './samples/answer'], min_freq)
num_encoder_symbols = max_token_id + 5
num_decoder_symbols = max_token_id + 5
def get_id_list_from(sentence):
    """Segment a sentence with jieba and return the ids of its known words."""
    sentence_id_list = []
    seg_list = jieba.cut(sentence)
    for word in seg_list:
        word_id = wordToken.word2id(word)
        if word_id:
            sentence_id_list.append(word_id)
    return sentence_id_list
def get_train_set():
    global num_encoder_symbols, num_decoder_symbols
    train_set = []
    with open('samples/question', 'r', encoding='utf-8') as question_file:
        with open('samples/answer', 'r', encoding='utf-8') as answer_file:
            while True:
                question = question_file.readline()
                answer = answer_file.readline()
                if question and answer:
                    question = question.strip()
                    answer = answer.strip()
                    question_id_list = get_id_list_from(question)
                    answer_id_list = get_id_list_from(answer)
                    if len(question_id_list) > 0 and len(answer_id_list) > 0:
                        answer_id_list.append(EOS_ID)
                        train_set.append([question_id_list, answer_id_list])
                else:
                    break
    return train_set
def get_samples(train_set, batch_num):
    raw_encoder_input = []
    raw_decoder_input = []
    if batch_num >= len(train_set):
        batch_train_set = train_set
    else:
        random_start = random.randint(0, len(train_set)-batch_num)
        batch_train_set = train_set[random_start:random_start+batch_num]
    for sample in batch_train_set:
        # Questions are padded on the left; answers get GO in front and padding on the right
        raw_encoder_input.append([PAD_ID] * (input_seq_len - len(sample[0])) + sample[0])
        raw_decoder_input.append([GO_ID] + sample[1] + [PAD_ID] * (output_seq_len - len(sample[1]) - 1))
    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    # Regroup the batch by time step: one array per position in the sequence
    for length_idx in range(input_seq_len):
        encoder_inputs.append(np.array([encoder_input[length_idx] for encoder_input in raw_encoder_input], dtype=np.int32))
    for length_idx in range(output_seq_len):
        decoder_inputs.append(np.array([decoder_input[length_idx] for decoder_input in raw_decoder_input], dtype=np.int32))
        target_weights.append(np.array([
            0.0 if length_idx == output_seq_len - 1 or decoder_input[length_idx] == PAD_ID else 1.0 for decoder_input in raw_decoder_input
        ], dtype=np.float32))
    return encoder_inputs, decoder_inputs, target_weights
def seq_to_encoder(input_seq):
    """Convert a space-separated string of word ids into the encoder inputs, decoder inputs, and target weights used for prediction
    """
    input_seq_array = [int(v) for v in input_seq.split()]
    encoder_input = [PAD_ID] * (input_seq_len - len(input_seq_array)) + input_seq_array
    decoder_input = [GO_ID] + [PAD_ID] * (output_seq_len - 1)
    encoder_inputs = [np.array([v], dtype=np.int32) for v in encoder_input]
    decoder_inputs = [np.array([v], dtype=np.int32) for v in decoder_input]
    target_weights = [np.array([1.0], dtype=np.float32)] * output_seq_len
    return encoder_inputs, decoder_inputs, target_weights
def get_model(feed_previous=False):
    """Construct the model
    """
    learning_rate = tf.Variable(float(init_learning_rate), trainable=False, dtype=tf.float32)
    learning_rate_decay_op = learning_rate.assign(learning_rate * 0.9)
    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for i in range(input_seq_len):
        encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="encoder{0}".format(i)))
    for i in range(output_seq_len + 1):
        decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="decoder{0}".format(i)))
    for i in range(output_seq_len):
        target_weights.append(tf.placeholder(tf.float32, shape=[None], name="weight{0}".format(i)))
    # The targets are decoder_inputs shifted left by one time step
    targets = [decoder_inputs[i + 1] for i in range(output_seq_len)]
    cell = tf.contrib.rnn.BasicLSTMCell(size)
    # The state output is not needed here
    outputs, _ = seq2seq.embedding_attention_seq2seq(
        encoder_inputs,
        decoder_inputs[:output_seq_len],
        cell,
        num_encoder_symbols=num_encoder_symbols,
        num_decoder_symbols=num_decoder_symbols,
        embedding_size=size,
        output_projection=None,
        feed_previous=feed_previous,
        dtype=tf.float32)
    # Compute the weighted cross-entropy loss
    loss = seq2seq.sequence_loss(outputs, targets, target_weights)
    # gradient descent optimizer
    opt = tf.train.GradientDescentOptimizer(learning_rate)
    # Optimization goal: minimize loss
    update = opt.apply_gradients(opt.compute_gradients(loss))
    # model persistence
    saver = tf.train.Saver(tf.global_variables())
    return encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, learning_rate_decay_op, learning_rate
def train():
    """
    Training process
    """
    train_set = get_train_set()
    with tf.Session() as sess:
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, learning_rate_decay_op, learning_rate = get_model()
        # Initialize all variables
        sess.run(tf.global_variables_initializer())
        # Training takes many iterations; the loss is printed every 10 steps, and you can stop at any point with Ctrl+C
        previous_losses = []
        for step in range(50000):
            sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = get_samples(train_set, 1000)
            input_feed = {}
            for l in range(input_seq_len):
                input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
            for l in range(output_seq_len):
                input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
                input_feed[target_weights[l].name] = sample_target_weights[l]
            input_feed[decoder_inputs[output_seq_len].name] = np.zeros([len(sample_decoder_inputs[0])], dtype=np.int32)
            [loss_ret, _] = sess.run([loss, update], input_feed)
            if step % 10 == 0:
                print('step=', step, 'loss=', loss_ret, 'learning_rate=', learning_rate.eval())
                # Decay the learning rate when the loss is higher than all of the last five recorded losses
                if len(previous_losses) > 5 and loss_ret > max(previous_losses[-5:]):
                    sess.run(learning_rate_decay_op)
                previous_losses.append(loss_ret)
                # model persistence
                saver.save(sess, './model/demo')
def predict():
    """
    Prediction process
    """
    with tf.Session() as sess:
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, learning_rate_decay_op, learning_rate = get_model(feed_previous=True)
        saver.restore(sess, './model/demo')
        sys.stdout.write("> ")
        sys.stdout.flush()
        input_seq = sys.stdin.readline()
        while input_seq:
            input_seq = input_seq.strip()
            input_id_list = get_id_list_from(input_seq)
            if input_id_list:
                sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = seq_to_encoder(' '.join([str(v) for v in input_id_list]))
                input_feed = {}
                for l in range(input_seq_len):
                    input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
                for l in range(output_seq_len):
                    input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
                    input_feed[target_weights[l].name] = sample_target_weights[l]
                # The batch holds a single sentence, so the extra target slot has length 1
                input_feed[decoder_inputs[output_seq_len].name] = np.zeros([1], dtype=np.int32)
                # predict output
                outputs_seq = sess.run(outputs, input_feed)
                # Each output has num_decoder_symbols dimensions; argmax picks the position of the largest value, which is the predicted id
                outputs_seq = [int(np.argmax(logit[0], axis=0)) for logit in outputs_seq]
                # Cut the output at the end tag; nothing after it is printed
                if EOS_ID in outputs_seq:
                    outputs_seq = outputs_seq[:outputs_seq.index(EOS_ID)]
                outputs_seq = [wordToken.id2word(v) for v in outputs_seq]
                # Drop ids that have no word in the vocabulary, since join() cannot handle None
                outputs_seq = [w for w in outputs_seq if w is not None]
                print(" ".join(outputs_seq))
            else:
                print("What do you think")
            sys.stdout.write("> ")
            sys.stdout.flush()
            input_seq = sys.stdin.readline()
if __name__ == "__main__":
    # Train (uncomment to train the model first):
    # train()
    # Test dialogue:
    predict()
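The last part of predict() turns the raw model outputs into words: it takes the highest-scoring id at each step, cuts the sequence at the end tag, and maps the remaining ids back to words. The following sketch shows that decoding step on hand-written scores, without TensorFlow or numpy. The function name greedy_decode, the toy vocabulary, and the score values are assumptions for illustration.

```python
EOS_ID = 2  # same end tag as in the script above


def greedy_decode(step_logits, id2word):
    """Pick the best id per step, stop at EOS, and map ids back to words."""
    # For each step, take the index of the largest score (a plain-Python argmax).
    ids = [max(range(len(logits)), key=logits.__getitem__) for logits in step_logits]
    if EOS_ID in ids:
        ids = ids[:ids.index(EOS_ID)]
    # Skip ids with no vocabulary entry (for example reserved tags).
    return [id2word[i] for i in ids if i in id2word]


id2word = {4: "你", 5: "好"}  # hypothetical two-word vocabulary
logits = [
    [0.1, 0.0, 0.2, 0.0, 2.5, 0.3],  # highest score at id 4
    [0.0, 0.1, 0.3, 0.0, 0.2, 1.9],  # highest score at id 5
    [0.0, 0.0, 3.0, 0.1, 0.2, 0.1],  # highest score at id 2, the end tag
    [0.0, 0.0, 0.1, 0.0, 4.0, 0.1],  # ignored because it comes after the end tag
]
print(" ".join(greedy_decode(logits, id2word)))  # 你 好
```

Picking the single best id at every step is the simplest decoding strategy; it matches the argmax call in predict() but does not explore alternative replies.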
Appendix
Source code:
Source address
Extraction code: yn2m