This case study compares the performance of Intel® Distribution for Python* with that of non-optimized Python, using breast cancer classification as the workload. The comparison uses machine learning algorithms from the scikit-learn* package in Python.
Cancer refers to cells that grow out of control and invade other tissues. This process can also result in a tumor, where there is more cell growth than cell death. There are various types of cancer including bladder cancer, kidney cancer, lung cancer, and breast cancer. Currently, breast cancer is one of the most prevalent types of cancer, especially in women. It occurs when the cells in the breast divide and grow uncontrollably. Early detection of breast cancer can save lives. Causes of cancer include inherited genes, hormones, and an individual’s lifestyle.
This article compares the performance of non-optimized Python* and the Intel® Distribution for Python using breast cancer classification as an example. The classifiers used for breast cancer classification were taken from the scikit-learn* package in Python. The execution time and accuracy of each classifier were measured for each distribution and compared.
The dataset for this study is the Breast Cancer Wisconsin (Diagnostic) Data Set, which is available in CSV format. Its features were computed from a digitized image of a fine needle aspirate of a breast mass and describe the characteristics of the cell nuclei present in the image. These values are the features used for classification. Using these features, a tumor can be classified into one of two classes: benign or malignant. A benign tumor is not cancerous, whereas a malignant tumor is. The dataset contains 357 benign and 212 malignant rows. The classification is based on the diagnosis field, which takes the value M (malignant) or B (benign). Hence, this is a binary classification problem.
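The class distribution described above can be checked with a short sketch. This is only an illustration, not the article's own loading code: it assumes the copy of the Breast Cancer Wisconsin (Diagnostic) data bundled with scikit-learn (`load_breast_cancer`) rather than the CSV file, and the helper name `class_counts` is hypothetical. In the bundled copy the labels are encoded as integers (0 for malignant, 1 for benign) instead of the M and B values of the diagnosis field.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer


def class_counts():
    """Return the number of rows per class in scikit-learn's bundled copy."""
    data = load_breast_cancer()
    # data.target holds integer labels; data.target_names maps them to
    # the strings "malignant" and "benign".
    counts = np.bincount(data.target)
    return {str(name): int(n) for name, n in zip(data.target_names, counts)}


if __name__ == "__main__":
    print(class_counts())
```

The counts should agree with the 357 benign and 212 malignant rows stated above.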
The experiment used Intel® architecture with the following hardware configuration:
|CPU op-mode(s)||32-bit, 64-bit|
|Byte order||Little Endian|
|On-line CPU(s) list||0-255|
|Thread(s) per core||4|
|Core(s) per socket||64|
|Model name||Intel® Xeon Phi™ processor 7210 @ 1.30 GHz|
|NUMA node0 CPU(s)||0-255|
The following software dependencies were used to perform this classification:
The goal was to identify the class (M or B) to which the tumor belonged. The following block diagram shows the classification steps, explained in the following section, for both the Intel Distribution for Python and non-optimized Python.
The scikit-learn Python library provides a wide variety of machine learning algorithms for classification. Ten classifiers from the package were used for the study: Decision Tree Classifier, Gaussian NB, SGD Classifier, SVC, KNeighbors Classifier, OneVsRest Classifier, Quadratic Discriminant Analysis (QDA), Random Forest Classifier, MLP Classifier, and AdaBoost Classifier.
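A minimal sketch of how the ten classifiers might be instantiated, trained, and scored is shown here. Several details are assumptions that the article does not specify: the data comes from scikit-learn's bundled copy of the dataset, the train/test split is 75/25, all hyperparameters are library defaults apart from the seed and the MLP iteration limit, and `OneVsRestClassifier` wraps a `LinearSVC` base estimator. The function names `build_classifiers` and `evaluate_classifiers` are hypothetical.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier


def build_classifiers(seed=0):
    """Return the ten classifiers named in the article, keyed by name."""
    return {
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=seed),
        "GaussianNB": GaussianNB(),
        "SGDClassifier": SGDClassifier(random_state=seed),
        "SVC": SVC(),
        "KNeighborsClassifier": KNeighborsClassifier(),
        # Assumption: the article does not name the base estimator.
        "OneVsRestClassifier": OneVsRestClassifier(LinearSVC(random_state=seed)),
        "QuadraticDiscriminantAnalysis": QuadraticDiscriminantAnalysis(),
        "RandomForestClassifier": RandomForestClassifier(random_state=seed),
        "MLPClassifier": MLPClassifier(random_state=seed, max_iter=500),
        "AdaBoostClassifier": AdaBoostClassifier(random_state=seed),
    }


def evaluate_classifiers(test_size=0.25, seed=0):
    """Fit each classifier; return {name: (accuracy, seconds)}."""
    X, y = load_breast_cancer(return_X_y=True)
    # Assumption: a stratified 75/25 split; the article does not state one.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    results = {}
    for name, clf in build_classifiers(seed).items():
        start = time.perf_counter()
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_test, y_test)  # fraction of correct predictions
        elapsed = time.perf_counter() - start
        results[name] = (accuracy, elapsed)
    return results


if __name__ == "__main__":
    for name, (accuracy, seconds) in evaluate_classifiers().items():
        print(f"{name}: accuracy={accuracy * 100:.2f}%, time={seconds:.3f}s")
```

Some of these estimators may emit convergence warnings on unscaled features. The sketch leaves the data unscaled because the article does not describe a scaling step.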
Create a Python file called classifier_ml.py. The following steps are implemented in this file:
time(cmd="python classifier_ml.py"; for i in $(seq 10); do $cmd; done)
Steps 1 through 6 provide the time (t_nop) and accuracy values for non-optimized Python. Repeat these steps with the Intel Distribution for Python to obtain the corresponding time (t_idp) and accuracy values.
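For readers who prefer to take the measurement from Python rather than the shell, the repeated-run timing can be approximated as follows. This is a sketch under stated assumptions: the helper `time_repeated_runs` is hypothetical, it measures wall-clock time only (the shell `time` keyword also reports user and system time), and the demonstration times a trivial stand-in command instead of classifier_ml.py.

```python
import subprocess
import sys
import time


def time_repeated_runs(cmd, repeats=10):
    """Run `cmd` `repeats` times in sequence; return total wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(cmd, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in command; replace with [sys.executable, "classifier_ml.py"]
    # to time the script described in the article.
    elapsed = time_repeated_runs([sys.executable, "-c", "pass"], repeats=10)
    print(f"Total time for 10 runs: {elapsed:.3f} s")
```

Running the same function under each Python distribution would give values that play the roles of t_nop and t_idp.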
To enable the Intel Distribution for Python, follow the steps given in Installing Intel® Distribution for Python* and Intel® Performance Libraries with Anaconda*.
The results are shown in Table 1.
The accuracy values for each classifier are the same for both non-optimized Python and the Intel Distribution for Python. Therefore, the accuracy values listed in Table 1 are common for both distributions.
Performance Gain percentage with respect to time is calculated by the given formula:
Performance Gain % = (t_nop - t_idp) / t_nop * 100
From the formula, a positive Performance Gain percentage indicates that the Intel Distribution for Python ran faster than non-optimized Python. The higher the value, the larger the advantage of the Intel Distribution for Python.
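The formula can be written as a small function. The timings in the example are made-up values chosen only to illustrate the arithmetic; they are not measurements from this study.

```python
def performance_gain(t_nop, t_idp):
    """Percentage gain of the Intel Distribution for Python over
    non-optimized Python: (t_nop - t_idp) / t_nop * 100."""
    if t_nop <= 0:
        raise ValueError("t_nop must be positive")
    return (t_nop - t_idp) / t_nop * 100


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    print(performance_gain(10.0, 8.0))   # IDP faster: positive gain
    print(performance_gain(10.0, 12.0))  # IDP slower: negative gain
```

With these illustrative numbers, a run that drops from 10 s to 8 s is a gain of about 20 percent, while a run that rises to 12 s is a gain of about -20 percent.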
Table 1 shows the percentage performance gain for the Intel Distribution for Python* over non-optimized Python.
Table 1: Percentage performance gain for the Intel Distribution for Python* over non-optimized Python
|Classifier||Accuracy (Percent)||Performance Gain Percentage|
The performance gain shows that the Intel Distribution for Python took less time to execute than non-optimized Python. As expected, the accuracy was the same for both distributions.