Accelerating scikit-learn algorithms with the Intel Extension for Scikit-learn

Scikit-learn is the most widely used package for data science and machine learning (ML), so achieving the best performance with it is important for many developers. Scikit-learn offers simple and efficient algorithms and tools for data mining and data analysis on tasks such as classification, regression, and clustering.

The Intel Extension for Scikit-learn provides optimized implementations of many scikit-learn algorithms. It uses the Intel oneAPI Data Analytics Library (oneDAL) to achieve this acceleration. oneDAL takes advantage of the latest vector instructions, such as Intel Advanced Vector Extensions 512 (Intel AVX-512). It also uses cache-friendly data blocking, fast BLAS operations from the Intel oneAPI Math Kernel Library (oneMKL), and scalable multithreading with Intel oneAPI Threading Building Blocks (oneTBB).

The Intel Extension for Scikit-learn offers a mechanism to dynamically patch scikit-learn estimators so that they use the optimizations from oneDAL. When you use an algorithm that is not supported by the extension, the package simply falls back to the original scikit-learn implementation, ensuring a seamless workflow.
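As a sketch of how patching and fallback fit together, the extension's documented API also supports patching only selected estimators and undoing the patch. The try/except guard below is only so the snippet also runs where the extension is not installed; the data and estimator choices are illustrative:

```python
import numpy as np

# Try to patch only KMeans; any other estimator falls back to stock scikit-learn.
try:
    from sklearnex import patch_sklearn, unpatch_sklearn
    patch_sklearn(["KMeans"])          # selective patching of one estimator
    patched = True
except ImportError:
    patched = False                    # extension not installed; nothing changes

# Import AFTER patching so the accelerated class is picked up.
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 3)
km = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X)
print(km.cluster_centers_.shape)       # 4 cluster centers in 3 dimensions

if patched:
    unpatch_sklearn()                  # restore the stock implementations
```

Calling `unpatch_sklearn()` at the end restores the original scikit-learn classes, which is useful when you want to benchmark both implementations in one session.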

You can install the package through pip or conda as shown below:

pip install scikit-learn-intelex
conda install scikit-learn-intelex -c conda-forge

To use the Intel Extension for Scikit-learn, all you need to do is add the two lines of code below to your Python script:

from sklearnex import patch_sklearn
patch_sklearn()
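Alternatively, the extension can be enabled from the command line without editing the script at all, using its module runner (`my_script.py` is a placeholder for your own file):

```shell
# Patch scikit-learn globally for this run, leaving the script unmodified
python -m sklearnex my_script.py
```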

Performance comparison:

For the performance comparison we will use the Color Quantization using K-Means sample from the scikit-learn website, shown below:

# Authors: Robert Layton <robertlayton@gmail.com>
#          Olivier Grisel <olivier.grisel@ensta.org>
#          Mathieu Blondel <mathieu@mblondel.org>
#
# License: BSD 3 clause

print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle
from time import time

n_colors = 64

# Load the Summer Palace photo
china = load_sample_image("china.jpg")

# Convert to floats instead of the default 8-bit integer coding. Dividing by
# 255 is important so that plt.imshow works well on float data (it needs to
# be in the range [0, 1])
china = np.array(china, dtype=np.float64) / 255

# Transform the image to a 2D numpy array.
w, h, d = original_shape = tuple(china.shape)
assert d == 3
image_array = np.reshape(china, (w * h, d))

print("Fitting model on a small sub-sample of the data")
t0 = time()
image_array_sample = shuffle(image_array, random_state=0, n_samples=1_000)
kmeans = KMeans(n_clusters=n_colors, random_state=0).fit(image_array_sample)
print(f"done in {time() - t0:0.3f}s.")

# Get labels for all points
print("Predicting color indices on the full image (k-means)")
t0 = time()
labels = kmeans.predict(image_array)
print(f"done in {time() - t0:0.3f}s.")


codebook_random = shuffle(image_array, random_state=0, n_samples=n_colors)
print("Predicting color indices on the full image (random)")
t0 = time()
labels_random = pairwise_distances_argmin(codebook_random,
                                          image_array,
                                          axis=0)
print(f"done in {time() - t0:0.3f}s.")


def recreate_image(codebook, labels, w, h):
    """Recreate the (compressed) image from the code book & labels"""
    return codebook[labels].reshape(w, h, -1)


# Display all results, alongside original image
plt.figure(1)
plt.clf()
plt.axis('off')
plt.title('Original image (96,615 colors)')
plt.imshow(china)

plt.figure(2)
plt.clf()
plt.axis('off')
plt.title(f'Quantized image ({n_colors} colors, K-Means)')
plt.imshow(recreate_image(kmeans.cluster_centers_, labels, w, h))

plt.figure(3)
plt.clf()
plt.axis('off')
plt.title(f'Quantized image ({n_colors} colors, Random)')
plt.imshow(recreate_image(codebook_random, labels_random, w, h))
plt.show()

We modify the above script by adding the scikit-learn patch discussed earlier in order to get the oneDAL (daal4py) optimizations. Note that the patch must be applied before the scikit-learn estimators are imported:

from sklearnex import patch_sklearn
patch_sklearn()

Our system info is displayed below:

You can run both scripts to compare the timing results for fitting the model on a small sub-sample of the data, predicting color indices on the full image (k-means), and predicting color indices on the full image (random).
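If you want a self-contained way to reproduce this kind of comparison on your own machine, a minimal timing harness along the following lines can be run once with stock scikit-learn and once with `patch_sklearn()` applied first. The synthetic data sizes here are illustrative, not the ones from the image sample:

```python
from time import time

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the flattened image array: 50,000 RGB-like rows.
rng = np.random.RandomState(0)
X = rng.rand(50_000, 3)

# Fit on a small sub-sample, as the color-quantization sample does.
t0 = time()
kmeans = KMeans(n_clusters=64, random_state=0, n_init=10).fit(X[:1_000])
fit_seconds = time() - t0

# Predict cluster indices for the full array.
t0 = time()
labels = kmeans.predict(X)
predict_seconds = time() - t0

print(f"fit: {fit_seconds:.3f}s  predict: {predict_seconds:.3f}s")
```

Actual speedups will vary with hardware, data size, and library versions, so treat the figures below as results from our specific machine rather than universal numbers.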

Below is a graph of the timing results using regular scikit-learn and the Intel Extension for Scikit-learn.

From the above results, we can see that the Intel Extension outperforms regular scikit-learn, with the following speedups:

- 10.44x on fitting the model on a small sub-sample of the data
- 4x on predicting color indices on the full image (k-means)
- 1.52x on predicting color indices on the full image (random)
