I have been using K-Means clustering from scikit-learn with Intel Python (update 3), and I noticed that it seems to ignore the "n_init" option. According to the scikit-learn documentation, "n_init" is the "Number of time the k-means algorithm will be run with different centroid seeds." I was wondering what value of "n_init" is actually being used (the scikit-learn docs say the default is 10).
To test this I ran the following:
from sklearn.cluster import KMeans
import numpy as np
from sklearn.datasets import fetch_mldata
import time

data = fetch_mldata('MNIST original').data

start = time.time()
kmeans = KMeans(n_clusters=64, n_init=1).fit(data)
print("n_init=1   : " + str(time.time() - start))

start = time.time()
kmeans = KMeans(n_clusters=64, n_init=10).fit(data)
print("n_init=10  : " + str(time.time() - start))

start = time.time()
kmeans = KMeans(n_clusters=64, n_init=1000).fit(data)
print("n_init=1000: " + str(time.time() - start))
This is the result I get (HW: Intel Xeon Phi processor 7210):
n_init=1   : 6.0663318634
n_init=10  : 4.0244550705
n_init=1000: 4.10864305496
I am guessing the first run is slower because of some one-time initialization overhead. But the runs with "n_init"=10 and "n_init"=1000 take essentially the same time. I know that the run time of K-Means can vary depending on how "lucky" the initial guess is, but I don't think "n_init"=10 and "n_init"=1000 would end up with the same run time even then.
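One way I could double-check whether "n_init" changes anything at all is to compare the final inertia instead of just the run time: with a larger "n_init", the best-of-N result should have an equal or lower inertia_, and the fit time should grow roughly proportionally. A rough sketch of what I mean (using synthetic data as a stand-in for MNIST, not my actual setup):

from sklearn.cluster import KMeans
import numpy as np

# Synthetic stand-in for the MNIST data (assumption, just for illustration)
rng = np.random.RandomState(0)
X = rng.rand(10000, 50)

for n_init in (1, 10):
    km = KMeans(n_clusters=64, n_init=n_init, random_state=0).fit(X)
    # With a fixed random_state, a larger n_init should give an equal or
    # lower final inertia_ and take roughly proportionally longer to fit
    print("n_init=%d  inertia=%.2f  n_iter=%d" % (n_init, km.inertia_, km.n_iter_))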
I also tried passing an explicit "init" array built from the original dataset, and the timing I got with that makes me think the effective value is "n_init"=1. But I can't figure out what it actually is (partly because "verbose=1" does not seem to work either).
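For reference, this is roughly what I mean by creating an "init" array (a sketch, not my exact code; the centroids are just rows sampled from the data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_mldata

data = fetch_mldata('MNIST original').data

# Use 64 rows of the dataset as explicit initial centroids. As far as I
# understand, when init is an ndarray scikit-learn performs only a single
# run, so this should behave like an effective n_init=1.
rng = np.random.RandomState(0)
seeds = data[rng.choice(len(data), 64, replace=False)]

kmeans = KMeans(n_clusters=64, init=seeds, n_init=1).fit(data)
print(kmeans.inertia_)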