Use Analytics Zoo to Inject AI Into Customer Service Platforms on Microsoft Azure: Part 1

This series of articles describes how the Microsoft Azure China team built a customer support platform with artificial intelligence (AI) on Microsoft Azure* using Analytics Zoo.

This is the first article in the series. In it, we share our process and experience in building a text classifier module for a customer support platform that provides real-time chat functionality to customers (also known as a chatbot).

Background

Customer support and service platforms are widely used to provide technical or business support both before and after a sale. Examples include the telephone support platforms of banks and agencies, the online customer support platforms of retailers on Taobao.com*, and so on. The traditional customer support platform is just a communication tool between support staff and customers. Lately, more intelligent customer support platforms have been equipped with advanced AI modules and automation tools. Such platforms can reduce human effort as well as improve the user experience to some extent.

We have an experimental text-based customer support platform, where customers raise questions in the UI, and the back-end system retrieves responses from support documents and predefined frequently asked questions. If the customer finds the provided answer unhelpful, she has the option to be redirected to human support staff, whom the back-end system connects with the customer. The primary system provides answers based on pre-edited dialogs as well as information retrieval (IR)-based document search, indexing, and weighting. As dialogs with customers accumulate, we intend to leverage AI technologies to improve the system so that it can learn from real data and evolve over time. Specifically, we can improve the accuracy of answers by using advanced natural language processing (NLP) technologies such as intention recognition and question answering. We can also further analyze the content of the dialogs as well as customer behavior and properties to subtly improve the user experience; for example, using sentiment analysis to detect a negative mood and respond accordingly, classifying dialogs to redirect customers to the corresponding support teams more efficiently, and choosing answers based on customer profiles.

Our initial attempt was to build two new intelligent modules, both implemented using Analytics Zoo, into the basic system: the text classifier module and the QA ranker module. The text classifier module classifies the service type of a dialog before it is redirected to support staff, so that the corresponding support team can be selected for dispatching. Such a text classification module can later be modified to do sentiment analysis. The QA ranker module ranks the candidate answers selected by the search engine.

Figure 1. Overview of customer service platform (basic modules in blue, intelligent modules in green).

By now we have finished some initial pilots, and the results look promising. Further pilots and deployments will follow. In this series of articles, we share, step by step, the process and experience of building this customer support platform with AI. In this article, we mainly introduce how to add the text classifier module, based on Analytics Zoo version 0.2.0.

Why Analytics Zoo

Analytics Zoo is an open source big data analytics plus AI platform developed by Intel. The platform supports both Scala* and Python*, providing a series of convenient packages and tools including pipeline APIs, predefined models, models pretrained on public datasets, and reference use cases. It makes it easy to develop AI and deep learning applications on Apache Spark* and Intel BigDL, the open source distributed deep learning library for Apache Spark.

The number of dialogs increases over time, and putting the data on an Apache Hadoop* cluster is a scalable solution for data management and sharing. It is convenient to use Analytics Zoo to process the data on an Apache Hadoop or Apache Spark cluster. Using the Analytics Zoo Scala API to train and predict does not require any modification of an existing Spark cluster, as long as it is a standard one. For prediction, the Plain Old Java* Object (POJO)-like prediction API (which runs locally, without an Apache Spark cluster) can be used when low latency is required, while the standard prediction API (which runs on an Apache Spark cluster) is more suitable for high-throughput demands. Both sets of APIs can be easily integrated into Java-based services.

Text Classification Overview

Text classification is a common type of natural language processing task, with the purpose of classifying input text corpus into one or more categories. For example, spam email detection classifies the content of an email into spam or non-spam categories. In our case, we classify the texts of a dialog into a service type.

In general, training a text classification model involves the following steps:

1. Collect and prepare a training dataset and a validation dataset.
2. Clean and preprocess the data.
3. Train the model.
4. Validate and evaluate the model.
5. Tune the model (which includes, but is not limited to, adding data, adjusting hyperparameters, and adjusting the model).

There are several predefined text classifiers in Analytics Zoo that can be used out of the box, namely convolutional neural network (CNN), long short-term memory (LSTM), and gated recurrent unit (GRU) models. We chose CNN as a starting point, and we use the Python* API in the following sections to illustrate the training process.

from zoo.models.textclassification import TextClassifier

text_classifier = TextClassifier(class_num, token_length, sequence_length=500,
                                 encoder="cnn", encoder_output_dim=256)

In the above API, class_num is the number of categories in the problem, token_length is the size of each word embedding, sequence_length is the number of words each text record contains, encoder is the type of word encoder (which can be cnn, lstm, or gru), and encoder_output_dim is the output dimension of the encoder. The model accepts a sequence of word embeddings as input and outputs a label.

If you're interested in the topology of the neural network in this model, you can refer to the source code.

Data Collection and Preprocessing

Each record in the training dataset contains two fields, a dialog and a label. We collected thousands of such records, and labeled them both manually and semi-automatically. Then we performed data cleaning on the original texts, removing meaningless tags and garbled parts, and converted them into a text resilient distributed dataset (RDD) with each record in the format of a (text, label) pair. Next, we preprocessed the text RDD to produce the exact form that our model accepts. One reminder: keep data cleaning and preprocessing identical for training and prediction.

Table 1. Table of text RDD records after data cleaning (each record is a pair of text and label).

(How to get invoice …, 1)
(Can you send invoice to me…,1)
(Remote service connection failure…,2)
(How to buy…, 3)

The process of cleaning is omitted here. We only introduce the major steps of preprocessing below.
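
For reference, the text RDD that the following steps consume is simply an RDD of (text, label) pairs. A minimal sketch of building it is shown below; the input path, the "label<TAB>text" record format, and the clean_text helper are placeholders for illustration, not our actual cleaning logic:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def clean_text(text):
    # Placeholder for our actual cleaning: remove meaningless tags, garbled parts, and so on.
    return text.strip()

# Each raw record is assumed to be "label<TAB>text"; the path is a placeholder.
raw_rdd = sc.textFile("hdfs:///path/to/dialog_records.txt")
texts_rdd = raw_rdd.map(lambda line: line.split("\t", 1)) \
                   .map(lambda parts: (clean_text(parts[1]), int(parts[0])))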

Tokenization

Our dialogs are written in Chinese, so the tokenization process differs from common practice in English. Unlike English, a Chinese sentence consists of consecutive characters without spaces between words or phrases, so a dictionary is usually used to detect which consecutive characters semantically compose a word. In our application, we used Jieba, a well-known Python module for Chinese word segmentation, to break the sentences into words. After segmentation, each input text was converted into an array of tokens (words).

import jieba

# Each record is a (text, label) pair; segment the text into an array of tokens.
tokens_rdd = texts_rdd.map(lambda record:
                           ([w for w in jieba.cut(record[0])], record[1]))

Stopwords removal

Stopwords are words that appear frequently in text but do not help semantic understanding, such as the, of, and so on. We created our own version of a stopwords dictionary and used it to remove the useless words from the array of tokens generated in the tokenization step.
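
The stopwords dictionary itself is just a set of words. A minimal sketch of loading it from a plain-text file (the file name is a placeholder) could be:

# Load our custom stopwords dictionary, one word per line; the file name is a placeholder.
with open("stopwords_zh.txt", encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f if line.strip())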

# Remove stopwords from each token array.
filtered_tokens_rdd = tokens_rdd.map(lambda record:
                                     ([w for w in record[0] if w not in stopwords], record[1]))

Sequence aligning

Different texts may generate token arrays of different sizes, but a text classification model needs inputs of the same size for all records. Thus, we have to align the token arrays to the same size (specified by the sequence_length parameter of the text classifier). If a token array was larger than the required size, we stripped words from the beginning or the end; otherwise, we padded meaningless words (for example, "##") to the end of the array.

# Truncate or pad each token array to sequence_length; pad is a small helper (sketched below).
padded_tokens_rdd = filtered_tokens_rdd.map(lambda record:
                                            (pad(record[0], "##", sequence_length), record[1]))
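
The pad function above is a small helper we wrote; a minimal sketch, assuming we truncate from the end and pad with the "##" token, could be:

# A sketch of the pad helper: truncate or pad a token list to a fixed length.
# Truncating from the end is an assumption; stripping from the beginning also works.
def pad(tokens, padding_token, sequence_length):
    if len(tokens) >= sequence_length:
        return tokens[:sequence_length]
    return tokens + [padding_token] * (sequence_length - len(tokens))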

Word2Vec

After the token array size is aligned, we need to convert each token (word) into a vector (embedding). We used pretrained word embeddings for Chinese from the open source project fastText.

In these embeddings, each word vector has 300 dimensions. In the word embedding space, the distance between two word vectors represents the semantic relationship between the two words; usually, vectors that are closer together indicate words with more similar meanings. For words not found in fastText, we use a 300-dimensional zero vector to represent them.

from pyfasttext import FastText

w2v = FastText(fasttext_bin_path)

# Map each token to its 300-dimensional fastText vector; unknown words get a zero vector.
vectors_rdd = padded_tokens_rdd.map(lambda record:
                                    ([w2v[w] if w in w2v.words
                                      else [0.] * 300 for w in record[0]], record[1]))

Conversion to sample

After all the above steps, each text became a tensor with shape (sequence_length, 300). Then, from each record, we constructed one BigDL sample.

The sample had the generated tensor as feature and the label integer as label. Finally, we obtained an RDD[Sample] that could be used directly for training.

# Convert each (vectors, label) pair into a BigDL Sample; to_sample is a small helper (sketched below).
sample_rdd = vectors_rdd.map(lambda record: to_sample(record[0], record[1]))
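
Here, to_sample is another small helper we wrote. A minimal sketch using BigDL's Sample.from_ndarray could look like the following; the exact label convention (0-based or 1-based) is an assumption and must match the loss function used for training:

import numpy as np
from bigdl.util.common import Sample

def to_sample(vectors, label):
    # vectors has shape (sequence_length, 300); label is an integer class index.
    features = np.array(vectors, dtype="float32")
    return Sample.from_ndarray(features, np.array([float(label)]))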

Model Training, Testing, Evaluating and Tuning

After we prepared the training dataset (train_rdd) and the validation dataset (val_rdd) in the same way as above, we instantiated a new TextClassifier model (text_classifier) and then created an Optimizer to train the model in a distributed fashion. We used sparse categorical cross entropy as the loss function.

from bigdl.optim.optimizer import *
from zoo.pipeline.api.keras.objectives import SparseCategoricalCrossEntropy

optimizer = Optimizer(model=text_classifier,
                      training_rdd=train_rdd,
                      criterion=SparseCategoricalCrossEntropy(),
                      end_trigger=MaxEpoch(epochs),
                      batch_size=batch_size,
                      optim_method=Adagrad(learningrate=lr,
                                           learningrate_decay=decay))
optimizer.optimize()

The tunable parameters of the Optimizer include the number of epochs, batch size, learning rate, and so on. You can specify validation options to output metrics such as Top1Accuracy on the validation set as training progresses, which helps detect overfitting or underfitting (see the sketch below). BigDL also supports taking snapshots during training and resuming training from a snapshot later. Analytics Zoo supports both BigDL versions 0.5 and 0.6. For more detailed parameters and usage, please refer to the BigDL documentation.
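
As an illustration, validation and snapshots can be configured on the Optimizer before calling optimizer.optimize(); the trigger and the checkpoint path below are assumptions for illustration:

from bigdl.optim.optimizer import EveryEpoch, Top1Accuracy  # also covered by the wildcard import above

# Evaluate Top1Accuracy on the validation set at the end of every epoch,
# and save a snapshot of the model at the same time (the path is a placeholder).
optimizer.set_validation(batch_size=batch_size,
                         val_rdd=val_rdd,
                         trigger=EveryEpoch(),
                         val_method=[Top1Accuracy()])
optimizer.set_checkpoint(EveryEpoch(), "/tmp/text_classifier_checkpoints")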

If you don't want to do validation during training, you can also calculate the metrics yourself on the validation set after training is finished. The series of predict APIs returns probability distributions or predicted classes.

results = text_classifier.predict(val_rdd)
result_classes = text_classifier.predict_classes(val_rdd)
# Compare the results with the val_rdd ground truths in your own way (one sketch is shown below).
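
For example, one simple way to compute accuracy is to collect the predicted classes and compare them with the labels kept from the preprocessing step. This is only a sketch: it assumes predict_classes returns an RDD of predicted labels in the same order as the input, and val_vectors_rdd stands for the validation RDD before the conversion to samples.

# Collect the predicted classes and the ground-truth labels kept from before the Sample conversion.
pred_labels = result_classes.collect()
true_labels = val_vectors_rdd.map(lambda record: record[1]).collect()

correct = sum(1 for p, t in zip(pred_labels, true_labels) if int(p) == int(t))
accuracy = correct / float(len(true_labels))
print("Validation accuracy: %.4f" % accuracy)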

If the result on the validation dataset is not good, we have to tune the model. This is generally an iterative process of adjusting the hyperparameters, the data, or the model, then retraining and validating, until the result is good enough. We improved our accuracy score markedly after we tuned the learning rate, added new data, and augmented the stopwords dictionary.

All the above training processes can be performed on either a single machine or a cluster. For how-to details, refer to the documentation.

In addition, the Analytics Zoo documentation provides a text classification guide and full reference examples.

Integration of Prediction API and the Service

After we obtained the trained model, the next step was to use it for prediction in our service. Since our service is implemented in Java (see the example below), and low latency matters, we chose the POJO-like Java inference API to implement the prediction code (the Java API can load a model trained using the Python code).

Also, Analytics Zoo provides a full example of text classification as a web service.

import java.util.ArrayList;  
import java.util.List;  
import com.intel.analytics.zoo.pipeline.inference.AbstractInferenceModel;  
import com.intel.analytics.zoo.pipeline.inference.JTensor;  
  
public class TextClassificationModel extends AbstractInferenceModel {  
   public JTensor preProcess(String text) {  
        // We re-implemented the preprocessing using the Java API; details are omitted here.
    }  
}

TextClassificationModel model = new TextClassificationModel();
// path is the path to the trained model
model.load(path);
String sampleText = "text content"; // new input
JTensor input = model.preProcess(sampleText);  // preprocessing
List<JTensor> inputList = new ArrayList<>();
inputList.add(input);
List<List<JTensor>> result = model.predict(inputList);  // predict

Model Publishing and Continuous Update

Data accumulates over time, so in practice it is often necessary to periodically retrain the model, fully or incrementally, and publish the updated model to the AI service. To achieve this, you just periodically rerun the training program on the new data to obtain an updated model; on the service side, you reload the model using the model.load API, as demonstrated above. In addition, our service uses Kubernetes* for continuous integration and deployment (CI/CD), and Analytics Zoo provides a Docker* image for download and use.

Conclusion

That's it for now. We believe that after reading this article you have a rough idea of how to do text classification and how to add it to your own applications. We will introduce other aspects of our practice in the following articles of this series.

For more information, please visit the project homepage of Analytics Zoo on GitHub*.

You can also download and try the image preinstalled with Analytics Zoo and BigDL on the Azure Marketplace in English and in Chinese.

A Chinese version of this article is available.

For more complete information about compiler optimizations, see our Optimization Notice.