Published: 06/08/2017, Last Updated: 03/14/2018

*Simplicity is the ultimate sophistication.*

*—Leonardo Da Vinci*

Machine learning algorithms are becoming increasingly complex over time. In most cases, this increases accuracy at the expense of longer training times. Fast-training algorithms that deliver decent accuracy are also available, and they are generally based on simple mathematical concepts and principles. Today, we’ll look at one such machine learning classification algorithm, naive Bayes. It is an extremely simple probabilistic classification algorithm that, astonishingly, achieves decent accuracy in many scenarios.

In machine learning, naive Bayes classifiers are simple probabilistic classifiers that use Bayes’ Theorem. Naive Bayes makes strong (naive) independence assumptions between features. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a ball may be considered a soccer ball if it is hard, round, and about seven inches in diameter. Even if these features depend on each other or on the existence of the other features, naive Bayes treats all of these properties as contributing independently to the probability that this ball is a soccer ball. This is why it is known as *naive*.

Naive Bayes models are easy to build. They are also very useful for very large datasets. Although naive Bayes models are simple, they can be competitive with, and sometimes outperform, far more sophisticated classification models. Because they also require a relatively short training time, they make a good alternative for use in classification problems.

Bayes’ Theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). Consider the following equation:

P(c|x) = P(x|c) x P(c) / P(x)

Here,

- P(c|x): the posterior probability of the class (c, target) given the predictor (x, attributes). This represents the probability of c being true, provided x is true.
- P(c): the prior probability of the class. This is the observed probability of the class out of all the observations.
- P(x|c): the likelihood, which is the probability of the predictor given the class. This represents the probability of x being true, provided c is true.
- P(x): the prior probability of the predictor. This is the observed probability of the predictor out of all the observations.

Let’s better understand this with the help of a simple example. Consider a well-shuffled deck of playing cards. A card is picked from that deck at random. The objective is to find the probability of a King card, given that the card picked is red in color.

Here,

P(King | Red Card) = ?

We’ll use,

P(King | Red Card) = P(Red Card | King) x P(King) / P(Red Card)

So,

P(Red Card | King) = Probability of getting a red card given that the card chosen is a King = 2 Red Kings / 4 Total Kings = 1/2

P(King) = Probability that the chosen card is a King = 4 Kings / 52 Total Cards = 1/13

P(Red Card) = Probability that the chosen card is red = 26 Red Cards / 52 Total Cards = 1/2

Hence, the posterior probability of randomly choosing a King given a red card is:

P(King | Red Card) = (1/2) x (1/13) / (1/2) = 1/13 or **0.077**
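The same calculation can be checked by enumerating a deck in code. The following is a minimal sketch of our own; the deck representation and the helper names (`prob`, `is_king`, `is_red`) are assumptions made for illustration. It applies Bayes’ Theorem and then compares the result with a direct count over the red cards.

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))
red_suits = {"hearts", "diamonds"}


def prob(event, cards):
    """Fraction of the given cards for which event(card) is True."""
    cards = list(cards)
    return Fraction(sum(1 for card in cards if event(card)), len(cards))


def is_king(card):
    return card[0] == "K"


def is_red(card):
    return card[1] in red_suits


p_king = prob(is_king, deck)                                      # P(King)
p_red = prob(is_red, deck)                                        # P(Red Card)
p_red_given_king = prob(is_red, [c for c in deck if is_king(c)])  # P(Red Card | King)

# Bayes' Theorem: P(King | Red Card) = P(Red Card | King) x P(King) / P(Red Card)
posterior = p_red_given_king * p_king / p_red

# Direct count over the red cards only, for comparison.
direct = prob(is_king, [c for c in deck if is_red(c)])

print(posterior, direct)
```

Both routes give 1/13, which is what we would expect: two of the 26 red cards are Kings.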

Let’s understand naive Bayes with one more example—to predict the weather based on three predictors: humidity, temperature and wind speed. The training data is the following:

Humidity | Temperature | Wind Speed | Weather |
---|---|---|---|
Humid | Hot | Fast | Sunny |
Humid | Hot | Fast | Sunny |
Humid | Hot | Slow | Sunny |
Not Humid | Cold | Fast | Sunny |
Not Humid | Hot | Slow | Rainy |
Not Humid | Cold | Fast | Rainy |
Humid | Hot | Slow | Rainy |
Humid | Cold | Slow | Rainy |

We’ll use naive Bayes to predict the weather for the following test observation:

Humidity | Temperature | Wind Speed | Weather |
---|---|---|---|
Humid | Cold | Fast | ? |

We have to determine which posterior is greater, sunny or rainy. For the classification Sunny, the posterior is given by:

```
Posterior(Sunny) = (P(Sunny) x P(Humid | Sunny) x P(Cold | Sunny) x P(Fast | Sunny)) / evidence
```

Similarly, for the classification Rainy, the posterior is given by:

`Posterior(Rainy) = (P(Rainy) x P(Humid | Rainy) x P(Cold | Rainy) x P(Fast | Rainy)) / evidence`

Where,

`evidence = [P(Sunny) x P(Humid | Sunny) x P(Cold | Sunny) x P(Fast | Sunny)] + [P(Rainy) x P(Humid | Rainy) x P(Cold | Rainy) x P(Fast | Rainy)]`

Here,

```
P(Sunny) = 0.5
P(Rainy) = 0.5
P(Humid | Sunny) = 0.75
P(Cold | Sunny) = 0.25
P(Fast | Sunny) = 0.75
P(Humid | Rainy) = 0.5
P(Cold | Rainy) = 0.5
P(Fast | Rainy) = 0.25
```

Therefore, evidence = 0.0703 + 0.0313 = **0.1016**.

```
Posterior(Sunny) = 0.692
Posterior(Rainy) = 0.308
```

Since the posterior numerator is greater in the Sunny case, we predict the sample is **Sunny**.
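To make the counting explicit, here is a minimal sketch that recomputes the likelihoods directly from the training table and normalizes by the evidence. The function name `naive_bayes_posteriors` and the tuple layout are our own choices for illustration. It uses raw frequency estimates with no smoothing, so it assumes every class has at least one matching row for each feature value.

```python
from collections import Counter

# Training rows from the table: (humidity, temperature, wind speed, weather)
train = [
    ("Humid", "Hot", "Fast", "Sunny"),
    ("Humid", "Hot", "Fast", "Sunny"),
    ("Humid", "Hot", "Slow", "Sunny"),
    ("Not Humid", "Cold", "Fast", "Sunny"),
    ("Not Humid", "Hot", "Slow", "Rainy"),
    ("Not Humid", "Cold", "Fast", "Rainy"),
    ("Humid", "Hot", "Slow", "Rainy"),
    ("Humid", "Cold", "Slow", "Rainy"),
]


def naive_bayes_posteriors(rows, observation):
    """Return {class: posterior} from raw frequency counts (no smoothing)."""
    class_counts = Counter(row[-1] for row in rows)
    numerators = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(class)
        for i, value in enumerate(observation):
            matches = sum(1 for row in rows if row[-1] == label and row[i] == value)
            score *= matches / n_label  # likelihood P(value | class)
        numerators[label] = score
    evidence = sum(numerators.values())
    return {label: score / evidence for label, score in numerators.items()}


posteriors = naive_bayes_posteriors(train, ("Humid", "Cold", "Fast"))
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

Because the evidence is the same for every class, comparing the numerators alone is enough to pick the prediction; dividing by the evidence only turns the scores into probabilities that sum to 1.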

Naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.

- **Recommendation System:** Naive Bayes classifiers are used in various inferencing systems to make recommendations to users from a list of possible options.
- **Real-Time Prediction:** Naive Bayes is a fast algorithm, which makes it an ideal fit for making predictions in real time.
- **Multiclass Prediction:** This algorithm is also well known for its multiclass prediction feature. Here, we can predict the probability of multiple classes of the target variable.
- **Sentiment Analysis:** Naive Bayes is used in sentiment analysis on social networking datasets, such as Twitter and Facebook, to identify positive and negative customer sentiments.
- **Text Classification:** Naive Bayes classifiers are frequently used in text classification and often achieve a high success rate compared with other algorithms.
- **Spam Filtering:** Naive Bayes is widely used in spam filtering for identifying spam email.

An interesting point about naive Bayes is that even when the independence assumption is violated and there are clear, known relationships between attributes, it works decently anyway. There are two reasons that make naive Bayes a very efficient algorithm for classification problems.

- **Performance:** The naive Bayes algorithm performs usefully even when the dataset contains correlated variables, despite its basic assumption of independence among features. The reason is that two attributes in a given dataset may depend on each other, but the dependence may be distributed evenly across the classes. In this case, the conditional independence assumption of naive Bayes is violated, but it is still the optimal classifier. Further, what eventually affects the classification is the combination of dependencies among all attributes. If we look at just two attributes, there may be a strong dependence between them that affects the classification. When the dependencies among all attributes work together, however, they may cancel each other out and no longer affect the classification. Therefore, it is the distribution of dependencies among all attributes over the classes that affects the classification of naive Bayes, not merely the dependencies themselves.
- **Speed:** The main reason naive Bayes trains quickly is that it converges toward its asymptotic accuracy at a different, and often faster, rate than other methods, such as logistic regression and support vector machines. Naive Bayes parameter estimates converge toward their asymptotic values in order of log(*n*) examples, where *n* is the number of dimensions. In contrast, logistic regression parameter estimates converge more slowly, requiring order *n* examples. It has also been observed that on several datasets logistic regression outperforms naive Bayes when training examples are abundant, but naive Bayes outperforms logistic regression when training data is scarce.
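One way to get a feel for the first point is to break the independence assumption on purpose. The sketch below is our own toy illustration, not a general proof: it takes the weather data from the earlier example, adds an exact copy of the humidity column (a feature that is perfectly correlated with another), and compares the naive Bayes posteriors before and after. The helper `nb_posteriors` and the variable names are assumptions made for this example.

```python
from collections import Counter


def nb_posteriors(rows, observation):
    """Naive Bayes posteriors from raw frequency counts; last column is the class."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(class)
        for i, value in enumerate(observation):
            matches = sum(1 for r in rows if r[-1] == label and r[i] == value)
            score *= matches / n_label  # likelihood P(value | class)
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


weather = [
    ("Humid", "Hot", "Fast", "Sunny"),
    ("Humid", "Hot", "Fast", "Sunny"),
    ("Humid", "Hot", "Slow", "Sunny"),
    ("Not Humid", "Cold", "Fast", "Sunny"),
    ("Not Humid", "Hot", "Slow", "Rainy"),
    ("Not Humid", "Cold", "Fast", "Rainy"),
    ("Humid", "Hot", "Slow", "Rainy"),
    ("Humid", "Cold", "Slow", "Rainy"),
]
obs = ("Humid", "Cold", "Fast")

# Prepend a copy of the humidity column, so two features are fully dependent.
weather_dup = [(row[0],) + row for row in weather]
obs_dup = (obs[0],) + obs

base = nb_posteriors(weather, obs)
dup = nb_posteriors(weather_dup, obs_dup)

print("original  :", base, "->", max(base, key=base.get))
print("duplicated:", dup, "->", max(dup, key=dup.get))
```

On this toy data, the duplicated feature is counted twice, so the probability estimate for Sunny shifts upward, but the predicted class stays the same. The classification only depends on which posterior is largest, which is one reason the violated assumption can matter less than it first appears.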

Let’s see a practical application of naive Bayes for classifying email as spam or ham (legitimate email). We will use sklearn.naive_bayes to train a spam classifier in Python.

```
import os
import io
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
```

The following example uses the MultinomialNB classifier.

Creating the readFiles function:

```
def readFiles(path):
    # Walk the directory tree and yield (file path, message body) pairs
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            path = os.path.join(root, filename)
            inBody = False
            lines = []
            f = io.open(path, 'r', encoding='latin1')
            for line in f:
                if inBody:
                    lines.append(line)
                elif line == '\n':
                    # The first blank line separates the headers from the body
                    inBody = True
            f.close()
            message = '\n'.join(lines)
            yield path, message
```
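The header-skipping logic in readFiles can be tried in isolation, without any files on disk. The following sketch uses a hypothetical helper, `extract_body`, and a made-up sample message; it applies the same rule as readFiles, where everything after the first blank line is treated as the body.

```python
import io


def extract_body(raw_email):
    """Return the text after the first blank line, using the same rule as readFiles."""
    in_body = False
    lines = []
    for line in io.StringIO(raw_email):
        if in_body:
            lines.append(line)
        elif line == '\n':
            # The first blank line ends the header section
            in_body = True
    return '\n'.join(lines)


# Made-up message: two header lines, a blank line, then a two-line body
sample = "From: alice@example.com\nSubject: Hello\n\nFirst line\nSecond line\n"
body = extract_body(sample)
print(repr(body))
```

Note that each line keeps its trailing newline, so joining with `'\n'` doubles the line breaks in the result. That is harmless here because the text is later split into words and counted.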

Creating a function to help us create a dataFrame:

```
from pandas import concat

def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0, so combine the frames with concat
data = concat([
    dataFrameFromDirectory('/…/SPAMORHAM/emails/spam/', 'spam'),
    dataFrameFromDirectory('/…/SPAMORHAM/emails/ham/', 'ham'),
])
```

Let's have a look at that dataFrame:

```
data.head()
```

```
                                                                 class message
/…/SPAMORHAM/emails/spam/00001.7848dde101aa985090474a91ec93fcf0 spam <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...
/…/SPAMORHAM/emails/spam/00002.d94f1b97e48ed3b553b3508d116e6a09 spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
/…/SPAMORHAM/emails/spam/00003.2ee33bc6eacdb11f38d052c44819ba6c spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
/…/SPAMORHAM/emails/spam/00004.eac8de8d759b7e74154f142194282724 spam ##############################################...
/…/SPAMORHAM/emails/spam/00005.57696a39d7d84318ce497886896bf90d spam I thought you might like these:\n\n1) Slim Dow...
```

Now we will use a CountVectorizer to split each message into its list of words and count them, and then pass those counts to a MultinomialNB classifier. First, call the vectorizer's fit_transform() method:

```
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
counts
```

`<3000x62964 sparse matrix of type '<type 'numpy.int64'>' with 429785 stored elements in Compressed Sparse Row format>`
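To see what the vectorizer produces, it helps to run it on something small enough to read. The following sketch uses two made-up messages and its own variable names (`toy_vectorizer`, `toy_counts`), so it does not touch the email data. With its default settings, CountVectorizer lowercases the text and keeps tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, small enough to inspect by hand
docs = ["free money now", "meet me now"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(docs)  # learn the vocabulary, then count

# vocabulary_ maps each word to its column index; list the words in column order
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
dense = toy_counts.toarray().tolist()

print(vocab)   # ['free', 'me', 'meet', 'money', 'now']
print(dense)   # [[1, 0, 0, 1, 1], [0, 1, 1, 0, 1]]

# transform() reuses the learned vocabulary; unseen words such as "lunch" are ignored
unseen = toy_vectorizer.transform(["free lunch"]).toarray().tolist()
print(unseen)  # [[1, 0, 0, 0, 0]]
```

Each row is one message and each column is one word in the vocabulary, which is why the real matrix above has one row per email and one column per distinct word. The matrix is stored in sparse form because most emails contain only a small fraction of the vocabulary.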

Now we create the MultinomialNB() classifier, take the class column as the target, and use the counts to fit the model:

```
classifierModel = MultinomialNB()

# 'class' is the target
targets = data['class'].values

# Use the counts to fit the model
classifierModel.fit(counts, targets)
```

`MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)`
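The `alpha=1.0` shown for the fitted model is the additive (Laplace) smoothing parameter: it adds a small count to every word so that a word never seen in a class does not force that class's probability to zero. The following is a minimal pure-Python sketch of the idea, not scikit-learn's actual implementation. The function names, the whitespace tokenization, and the four toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    log_prior, log_likelihood = {}, {}
    for label in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == label]
        log_prior[label] = math.log(len(class_docs) / len(docs))
        word_counts = Counter(word for doc in class_docs for word in doc.split())
        # Additive smoothing: add alpha to every word count in the vocabulary
        total = sum(word_counts.values()) + alpha * len(vocab)
        log_likelihood[label] = {
            word: math.log((word_counts[word] + alpha) / total) for word in vocab
        }
    return log_prior, log_likelihood


def predict(model, text):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood = model
    scores = {}
    for label in log_prior:
        # Sum logs instead of multiplying probabilities; skip out-of-vocabulary words
        scores[label] = log_prior[label] + sum(
            log_likelihood[label][word]
            for word in text.split()
            if word in log_likelihood[label]
        )
    return max(scores, key=scores.get)


toy_docs = ["free money now", "free prize money", "meet me now", "lunch with me"]
toy_labels = ["spam", "spam", "ham", "ham"]
model = train_multinomial_nb(toy_docs, toy_labels)

print(predict(model, "free money"))     # spam
print(predict(model, "lunch with me"))  # ham
```

Working with log probabilities avoids numerical underflow when a message contains many words, since the product of many small probabilities quickly becomes too small to represent.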

The classifierModel is ready. Now, let’s prepare sample email messages to see how the model works.

Email number 1 is *Free Viagra now!!!*, Email number 2 is *A quick brown fox is not ready,* and so on:

```
examples = ['Free Viagra now!!!',
"A quick brown fox is not ready",
"Could you bring me the black coffee as well?",
"Hi Bob, how about a game of golf tomorrow, are you FREE?",
"Dude , what are you saying",
"I am FREE now, you can come",
"FREE FREE FREE Sex, I am FREE",
"CENTRAL BANK OF NIGERIA has 100 Million for you",
"I am not available today, meet Sunday?"]
example_counts = vectorizer.transform(examples)
```

Now we are using the classifierModel to predict:

`predictions = classifierModel.predict(example_counts)`

Let’s check the prediction for each email:

`predictions`

`array(['spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'ham'], dtype='|S4')`

Therefore, the first email is spam, the second is ham, and so on.
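If you do not have the email dataset at hand, the same steps can be tried on a few in-memory messages. The following sketch uses six made-up training messages and its own variable names (prefixed with `toy_`); it is only meant to show the workflow end to end, and a real classifier needs far more data. It also calls predict_proba() to show the estimated probability of each class, rather than only the winning label.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages, three per class
toy_messages = [
    "free money now",
    "win a free prize",
    "claim your free money",
    "lunch at noon tomorrow",
    "see you at the meeting",
    "are you free for lunch tomorrow",
]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Same two steps as above: count the words, then fit the classifier
toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# New messages must go through the already fitted vectorizer
new_messages = ["free prize now", "lunch tomorrow?"]
new_counts = toy_vectorizer.transform(new_messages)

toy_predictions = toy_model.predict(new_counts)
toy_probabilities = toy_model.predict_proba(new_counts)

for message, label, probs in zip(new_messages, toy_predictions, toy_probabilities):
    # classes_ gives the column order of predict_proba
    print(message, "->", label, dict(zip(toy_model.classes_, probs.round(3))))
```

Here the first message is labeled spam and the second ham. Note that the word "free" also appears in one ham message, so the model weighs it against the other words instead of treating it as a spam marker on its own.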

We hope you have gained a clear understanding of the mathematical concepts and principles of naive Bayes from this guide. It is an extremely simple algorithm whose assumptions are at times oversimplified and might not hold in many real-world scenarios. In this article, we explained why naive Bayes often produces decent results despite this. We feel naive Bayes is a very good algorithm, and its performance, despite its simplicity, is astonishing.
