Analytics Speed with Ease: Visual Bag-Of-Words in Python* with Intel® Data Analytics Acceleration Library (Intel® DAAL) High Level API

By Preethi Venkatesh and Nathan G. Greeneltch

Published: 12/29/2017   Last Updated: 12/29/2017

This is a Companion to the Original Article: Visual Bag-Of-Words in Python: Speed Advantage of Intel® DAAL over Scikit-learn*


In the companion article, we concluded that Intel® Data Analytics Acceleration Library (DAAL) efficiently utilizes all the resources of your machine to perform faster analytics. Now we will show you how to take advantage of these faster analytics methods with simpler Python commands, namely with the Daal4py interface.

Daal4py is a high-level API for Intel® DAAL's powerful computation libraries, enabling quick implementation of machine learning models in a few lines of code. The easy-to-use nature of Daal4py lets Python practitioners quickly learn the library and build prototype models.

Intel® Data Analytics Acceleration Library (DAAL) is included in the free Intel® Distribution of Python. Installation instructions can be found here.

To learn more about Daal4py features and supported platforms, click here. For a detailed user guide, click here.

In this article, the image preprocessing procedure on a large dataset introduced in the previously mentioned companion article is redone using Daal4py.


The SIFT+VisualBagOfWords routine is a smart and straightforward approach to preprocessing images in the image recognition field. The SIFT* procedure captures the important features that describe an image and converts them into a usable numerical format. The VisualBagOfWords procedure commonly utilizes a clustering algorithm such as K-means to group the SIFT features.
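The two steps can be sketched end-to-end with plain NumPy on toy data: quantize each descriptor to its nearest centroid, then histogram the assignments into a bag-of-words vector. All names, dimensions, and values below are illustrative, not part of the article's pipeline:

```python
import numpy as np

def assign_to_centroids(descriptors, centroids):
    # Distance from every descriptor to every centroid; pick the nearest index
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def bow_histogram(descriptors, centroids):
    # Count how many descriptors fall into each visual word (cluster)
    labels = assign_to_centroids(descriptors, centroids)
    hist, _ = np.histogram(labels, bins=range(len(centroids) + 1))
    return hist

rng = np.random.default_rng(0)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])      # 2 visual words, 2-D toy descriptors
descriptors = np.vstack([rng.normal(0, 1, (5, 2)),    # 5 descriptors near word 0
                         rng.normal(10, 1, (3, 2))])  # 3 descriptors near word 1
print(bow_histogram(descriptors, centroids))          # → [5 3]
```

Real SIFT descriptors are 128-dimensional, and the centroids come from K-means rather than being hand-picked, but the quantize-then-count logic is the same.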

Refer to the parent topic here to learn more about the SIFT and VisualBagOfWords implementation with Intel® DAAL's K-means, computed in various processing modes (batch and distributed).

Daal4py Implementation

Note: This section is just an overview to highlight Daal4py's usage in the SIFT+VBoW process. See the Appendix (Sections A, B, and C) for the full implementation of SIFT+VBoW.

Batch K-means with Daal4py


import daal4py as d4p


# Compute Kmeans initial centroids using Daal4py
initCentroids = d4p.kmeans_init(nclusters, t_method="plusPlusDense")
# Compute Kmeans 'nclusters' centroids using the initial centroids
d4pkmeansResults = d4p.kmeans(nclusters, 300).compute(
                   all_img_sift_array, initCentroids.compute(all_img_sift_array))
centroids = d4pkmeansResults['centroids']


# Compute prediction results (0 iterations: assign points to the fixed centroids)
d4pkmeansPredRes = d4p.kmeans(nclusters, 0).compute(all_img_sift_dict[imagefname], centroids)
# Assign clusters obtained from prediction results
assignedClusters = d4pkmeansPredRes['assignments']

Distributed K-means with Daal4py

Import and Initialize

import daal4py as d4p
d4p.daalinit()  # initialize the distributed (MPI) engine

Fit and Terminate

# Compute Kmeans 'nclusters' initial centroids
initCentroids = d4p.kmeans_init(nclusters, t_method="randomDense", distributed=True)
# Compute Kmeans 'nclusters' centroids using the initial centroids
kmeansResults = d4p.kmeans(nclusters, 300, distributed=True).compute(
                all_img_sift_array, initCentroids.compute(all_img_sift_array))
centroids = kmeansResults['centroids']
d4p.daalfini()  # terminate the distributed engine



A) SIFT Feature Extraction from Images Using OpenCV*

import cv2
import numpy as np
from glob import glob
from os.path import join, basename
from numpy import zeros, resize, vstack, zeros_like

def get_imgfiles(path):
   all_files = []
   for fname in glob(path + "/*"):
         all_files.extend([join(path, basename(fname))])
   return all_files

def extractSift(img_files):
   img_Files_Sift_dict = {}
   for file in img_files:
      img = cv2.imread(file)
      gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
      sift = cv2.xfeatures2d.SIFT_create()
      kp, des = sift.detectAndCompute(gray, None)
      img_Files_Sift_dict[file] = des
   return img_Files_Sift_dict

def Siftdict2numpy(dict):
    nkeys = len(dict)
    array = zeros((nkeys * 1000, 128))
    pivot = 0
    for key in dict.keys():
        value = dict[key]
        try:
            nelements = value.shape[0]
        except AttributeError:
            print("## Image file with 0 SIFT descriptors - {}".format(key))
            value = np.zeros((1, 128))
            dict[key] = value
            nelements = value.shape[0]
        while pivot + nelements > array.shape[0]:
            padding = zeros_like(array)
            array = vstack((array, padding))
        array[pivot:pivot + nelements] = value
        pivot += nelements
    array = resize(array, (pivot, 128))
    return array
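The grow-and-trim bookkeeping in Siftdict2numpy (double the padded array whenever it runs out of room, then resize down to the filled row count) can be verified in isolation; the block shapes and values below are made up for illustration:

```python
from numpy import zeros, vstack, zeros_like, resize

chunks = [zeros((3, 2)) + 1, zeros((5, 2)) + 2]    # toy "descriptor" blocks
array = zeros((4, 2))                               # deliberately small preallocation
pivot = 0
for value in chunks:
    n = value.shape[0]
    while pivot + n > array.shape[0]:               # double capacity when full
        array = vstack((array, zeros_like(array)))
    array[pivot:pivot + n] = value
    pivot += n
array = resize(array, (pivot, 2))                   # trim trailing padding
print(array.shape)                                  # → (8, 2)
```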

def SiftNumpy2Norm(all_img_sift_array):
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    return scaler.fit_transform(all_img_sift_array)

path = '<location to your Image Set>'  # directory structure is assumed of the form "path\category\image_file"
label = 0
all_files_labels_dict = {}
all_img_sift_dict = {}
for category in glob(path + "/*"):
   img_files = get_imgfiles(category)
   for i in img_files: all_files_labels_dict[i] = label
   all_img_sift_dict.update(extractSift(img_files))
   label += 1
all_img_sift_array = Siftdict2numpy(all_img_sift_dict)
all_img_sift_array = SiftNumpy2Norm(all_img_sift_array)
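For reference, the StandardScaler transform used in SiftNumpy2Norm amounts to subtracting each column's mean and dividing by its (population) standard deviation. A quick NumPy check on toy data (the values are illustrative, not real SIFT features):

```python
import numpy as np

# Toy 2-column data standing in for the SIFT feature array
X = np.array([[0.0, 10.0],
              [2.0, 20.0],
              [4.0, 30.0]])

# Per-column standardization: zero mean, unit variance
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(Xs.mean(axis=0))   # ~[0, 0]
print(Xs.std(axis=0))    # [1, 1]
```

Standardizing the descriptors this way keeps any one feature dimension from dominating the Euclidean distances that K-means relies on.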

B) VisualBagOfWords with Daal4py (batch): K-means+binning

import daal4py as d4p
from numpy import histogram, append

def computeVisualBoW(all_img_sift_array, all_img_sift_dict, all_files_labels_dict, nclusters):
    def d4pClustering(all_img_sift_array, nclusters):
        # Compute Kmeans initial centroids using Daal4py
        initCentroids = d4p.kmeans_init(nclusters, t_method="plusPlusDense")
        # Compute Kmeans 'nclusters' centroids using the initial centroids
        d4pkmeansResults = d4p.kmeans(nclusters, 300).compute(all_img_sift_array,
                                                              initCentroids.compute(all_img_sift_array))
        return d4pkmeansResults

    def createBins(nclusters, assignedClusters):
        histogram_of_words, bin_edges = histogram(assignedClusters,
                                                  bins=range(nclusters + 1))
        return histogram_of_words

    d4pkmeansResults = d4pClustering(all_img_sift_array, nclusters)
    centroids = d4pkmeansResults['centroids']
    all_word_histgrams_dict = {}
    for imagefname in all_img_sift_dict:
        # Assign cluster labels for all SIFT features in an image
        d4pkmeansPredRes = d4p.kmeans(nclusters, 0).compute(all_img_sift_dict[imagefname], centroids)
        assignedClusters = d4pkmeansPredRes['assignments']
        word_histgram = createBins(nclusters, assignedClusters)
        word_histgram = append(word_histgram, all_files_labels_dict[imagefname])
        all_word_histgrams_dict[imagefname] = word_histgram
    return all_word_histgrams_dict

nclusters = 2  # 'nclusters' is the value of 'k' in Kmeans. This is just an example value
all_word_histgrams_dict = computeVisualBoW (all_img_sift_array, all_img_sift_dict, all_files_labels_dict, nclusters)
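What computeVisualBoW produces per image — a histogram over the nclusters visual words with the class label appended as the last element — can be reproduced standalone. The cluster assignments and label below are made up:

```python
import numpy as np
from numpy import histogram, append

nclusters = 3
assigned = np.array([0, 2, 2, 1, 2])      # toy cluster assignments for one image
hist, _ = histogram(assigned, bins=range(nclusters + 1))
row = append(hist, 1)                     # append the class label, as in computeVisualBoW
print(row)                                # → [1 1 3 1]

X, y = row[:-1], row[-1]                  # how Section C splits features from the label
print(X, y)                               # → [1 1 3] 1
```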

C) Tuning K-means with Grid Search and Scikit-learn Random Forest as Downstream Scorer

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
lclusters = [2,3,4,5,6,7,8,9,10,25,50,100,150,300,450,600]
for nclusters in lclusters:
   all_word_histgrams_dict = computeVisualBoW(all_img_sift_array, all_img_sift_dict, all_files_labels_dict, nclusters)
   inpData = np.array(list(all_word_histgrams_dict.values()))
   X = inpData[:, 0:inpData.shape[1]-1]
   Y = inpData[:, -1]
   X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
   RFC = RandomForestClassifier(class_weight='balanced').fit(X_train, y_train)
   print(nclusters, RFC.score(X_test, y_test))  # RF test accuracy used to score this value of k
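Once a test score is recorded for each k, picking the best cluster count is a one-liner over the collected results. The scores below are hypothetical placeholders, not measured values:

```python
# Hypothetical k -> Random Forest test accuracy (placeholder values only)
scores = {2: 0.61, 50: 0.88, 150: 0.87, 600: 0.79}
best_k = max(scores, key=scores.get)   # k with the highest downstream score
print(best_k)                          # → 50
```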

Clustering SIFT Features to Find the Best K Value
Figure 1. Random Forest (RF) score plotted against the K value used to fit the K-means algorithm. Datasets consisted of 13 classes (640K SIFT feature vectors) and 34 classes (4 million SIFT feature vectors), each with 128 features. The annotations mark K values of 50 and 150. See Hardware Notice 1.

D) Distributed K-means with Daal4py and Intel® MPI

Note: Run the code snippet from Section A (SIFT Feature Extraction) first to generate all_img_sift_array

import daal4py as d4p

nclusters = 2  # 'nclusters' is the value of 'k' in Kmeans. This is just an example value
# all_img_sift_array is the array generated from SIFT feature extraction (Section A)
d4p.daalinit()  # initialize the distributed (MPI) engine
# Compute Kmeans 'nclusters' initial centroids
initCentroids = d4p.kmeans_init(nclusters, t_method="randomDense", distributed=True)
# Compute Kmeans 'nclusters' centroids using the initial centroids
kmeansResults = d4p.kmeans(nclusters, 300, distributed=True).compute(all_img_sift_array, initCentroids.compute(all_img_sift_array))
centroids = kmeansResults['centroids']
d4p.daalfini()  # terminate the distributed engine

Linux bash commands to run

Note: No. of MPI processes = No. of data partitions
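Since the process count must match the partition count, one way to prepare the input is to split the SIFT array into near-equal row partitions, one per MPI rank; numpy.array_split handles uneven remainders. The data shape below is a toy stand-in (4 ranks, matching -n 4):

```python
import numpy as np

nranks = 4                                    # must equal the mpirun -n value
data = np.arange(10 * 128).reshape(10, 128)   # stand-in for all_img_sift_array
parts = np.array_split(data, nranks)          # near-equal row partitions, one per rank
print([p.shape[0] for p in parts])            # → [3, 3, 2, 2]
```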

export I_MPI_SHM_LMT=shm

mpirun -genv DIST_CNC=MPI -n 4 python <path-to-dist-program>

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804