Accelerating scikit-learn algorithms with the Intel Extension for Scikit-learn

Scikit-learn is the most widely used package for data science and machine learning (ML), thus it is imperative that developers achieve the best performance with this package. Scikit-learn offers simple and efficient algorithms and tools for data mining and data analysis on various tasks such as classification, regression and clustering.

The Intel Extension for Scikit-learn provides optimized implementations of many scikit-learn uses the Intel oneAPI Data Analytics Library (oneDAL) to achieve its acceleration. This library enables all the latest vector instructions, such as the Intel Advanced Vector Extensions (Intel AVX-512). It also uses cache-friendly data blocking, fast BLAS operations with the Intel oneAPI Math Kernel Library (oneMKL), and scalable multithreading with the Intel oneAPI Threading Building Blocks (oneTBB).

The Intel Extension for Scikit-learn basically offers a mechanism to dynamically patch scikit-learn estimators to use the optimizations from Intel(R) oneAPI Data Analytics Library . In the event that you are using When you are using algorithms that are not supported by the extension, the package just falls back into the original scikit-learn thus ensuring a seamless workflow.

You can install the package through pip or conda as shown below:

pip install scikit-learn-intelex
conda install scikit-learn-intelex -c conda-forge

To use the Intel Extension for Scikit-learn, all you need to add is add the 2 lines of code below to your Python script

from sklearnex import patch_sklearn

Performance comparison:

For performance comparison we will use the Color Quantization using K-Means sample from the Scikit-learn website shown below

# Authors: Robert Layton <>
#          Olivier Grisel <>
#          Mathieu Blondel <>
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle
from time import time

n_colors = 64

# Load the Summer Palace photo
china = load_sample_image("china.jpg")

# Convert to floats instead of the default 8 bits integer coding. Dividing by
# 255 is important so that plt.imshow behaves works well on float data (need to
# be in the range [0-1])
china = np.array(china, dtype=np.float64) / 255

# Load Image and transform to a 2D numpy array.
w, h, d = original_shape = tuple(china.shape)
assert d == 3
image_array = np.reshape(china, (w * h, d))

print("Fitting model on a small sub-sample of the data")
t0 = time()
image_array_sample = shuffle(image_array, random_state=0, n_samples=1_000)
kmeans = KMeans(n_clusters=n_colors, random_state=0).fit(image_array_sample)
print(f"done in {time() - t0:0.3f}s.")

# Get labels for all points
print("Predicting color indices on the full image (k-means)")
t0 = time()
labels = kmeans.predict(image_array)
print(f"done in {time() - t0:0.3f}s.")

codebook_random = shuffle(image_array, random_state=0, n_samples=n_colors)
print("Predicting color indices on the full image (random)")
t0 = time()
labels_random = pairwise_distances_argmin(codebook_random,
print(f"done in {time() - t0:0.3f}s.")

def recreate_image(codebook, labels, w, h):
    """Recreate the (compressed) image from the code book & labels"""
    return codebook[labels].reshape(w, h, -1)

# Display all results, alongside original image
plt.title('Original image (96,615 colors)')

plt.title(f'Quantized image ({n_colors} colors, K-Means)')
plt.imshow(recreate_image(kmeans.cluster_centers_, labels, w, h))

plt.title(f'Quantized image ({n_colors} colors, Random)')
plt.imshow(recreate_image(codebook_random, labels_random, w, h))

We modify the above script by adding the Scikit-learn patch discussed earlier in order to get daal4py optimizations

from sklearnex import patch_sklearn

Our system info as is displayed below:

You can run the scripts to compare the timing results for Fitting model on a small sub-sample of the data, Predicting color indices on the full image (k-means) and Predicting color indices on the full image (random).

Below is a graphing of the timing results using regular Scikit-learn and the Intel Extension for Scikit-learn

From the above results, we can see that the Intel Extension outperforms regular Scikit-learn with the speedups below

10.44X on Fitting model on a small sub-sample of the data,

4X on Predicting color indices on the full image (k-means)

1.52X on Predicting color indices on the full image (random).

Accelerating your pandas workloads with Modin

pandas is a popular and widely used data analysis and manipulation tool, built on top of Python. It offers data structures and operations for manipulating numerical tables and time series. It is therefore widely used in the ETL (Extraction, Transform & Load) stages of your data analytics pipeline. Depending on the size of your dataset and capabilities of your compute platform, the ETL stages can be a huge bottleneck in your pipeline. It is for this reason that it is crucial that we accelerate this – enter Modin.

Modin enables you to accelerate your pandas workloads across multiple cores and multiple nodes. pandas is not designed to utilize the multiple cores available on your machine thus resulting in inefficient system utilization and impacting perfomance. This is not the case with Modin as illustrated below

pandas on a multi-core system

Modin on a multi-core system

So, how do you integrate Modin into your pandas workflow?

Well, we first need to install Modin

pip install modin

You can also explicitly install Modin to run on Ray/Dask as shown below

pip install modin[ray] # Install Modin dependencies and Ray to run on Ray

pip install modin[dask] # Install Modin dependencies and Dask to run on Dask

pip install modin[all] # Install all of the above

If you are using the Intel® oneAPI AI Analytics Toolkit (AI Kit), Modin should be available in the aikit-modin conda environment as shown below

The most crucial bit to note about Modin is how to integrate it into your pandas workflow. This is accomplished with a single line of code as shown below

import modin.pandas as pd

Note that if you are using Dask/Ray as a compute engine, you will need to initialize this first as shown below:

import os 
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray 
os.environ["MODIN_ENGINE"] = "dask"  # Modin will use Dask 
import modin.pandas as pd

With the setup done, let’s now get to the performance comparison for Modin vs pandas. For this tests, the CPU is the 24 core Intel® Xeon® Gold 6252 Processor as shown below

First things, first, import Modin

We will now generate a synthetic dataset using NumPy to use with Modin and save it to a CSV.

Now we will convert the ndarray into a Pandas dataframe and display the first five rows. For  pandas, the dataframe is being stored as pandas_df and for Modin, the same dataframe is being stored as modin_df.

With pandas

With Modin

In the above case, you notice that pandas took 11.7s while Modin took 2.92 second. Modin thus gives us a 4X speedup for this task!

Now let’s compare various function calls in pandas vs Modin

As you can see , Modin offers a significant perfomance boost compared to pandas and this will accelerate the ETL stage of your data analytics pipeline.

Memory Traffic Optimization to Improve Application Performance

Memory traffic optimization yields the greatest speedup compared to all the other optimization techniques to be deployed in optimizing an application. The other techniques being vectorization and multithreading. And if we want to really leverage the power of vectorization, then we have to optimize data re-use in caches. In addition to this it is important to understand that vector arithmetic in modern processors is cheap, it’s memory access that’s expensive. It’s therefore paramount that we optimize memory access for bandwidth bound applications.

But first before delving looking at locality of memory access in space and time, let’s have a quick refresher on vectorization and multithreading.

Vectorization is basically having a single instruction operating on multiple data elements. The speedup as a result of vectorizing in your application will depend on the instruction set on your hardware since the
number of registers differ across various platforms. Vectorization and multithreading can easily be implemented using OpenMP pragmas within your application. You can also perform automatic vectorization using compilers. However the caveat with automatic vectorization is that the compilers will only try to vectorize the innermost loops in cases where you have multiple nested loops. You can override this with a #pragma omp simd in your code. You also need to take care of vector dependence within your code to ensure that the code is vectorized correctly. An example of true vector dependence is as shown below:

for( int i=1;i<n; i++){



In the above snippet, the compiler does not have enough information due to the dependence of a[i] on  a[i-1], to enable it to implement vectorization. It is also important to understand that vectorization will only work if the compiler knows the number of iterations in a loop. This is the reason you should avoid using while loops because the number of iterations are not known during compilation, instead use the good ol’ for loops. If the compiler does not see any vectorization opportunities, you can provide this through a technique called strip mining. Strip mining is a programming technique that turns one loop into two nested loops. This technique not only presents compiler with vectorization opportunities but also achieves data locality in time.

If you really want to leverage vectorization, you will have to optimize data re-use in cache by achieving locality of memory access in both space and time. To achieve locality of memory address in space, ensure that your application has a unit stride! This is to enable optimal usage of the cache lines during data transfer and thus e.g if you implement unit stride access in a loop and you are accessing back to back memory elements, for one double precision memory access, you’ll end up getting 7 free accesses and for  single precision memory access, you end up with 15 free accesses. Also, another important aspect to unit stride access is if you have a 2D-container you should access it columns first, then move to the next rows and maintain this sequence to ensure that the cache line is loaded with back to back data neighbors. Spatial locality can also be improved by moving from an Array of Structures (AoS) to a Structure of Arrays (SoA) in your data containers.

The main aim of achieving data locality in time is to reduce the chances of cache misses. The most effective techniques to have an optimal cache hit rate within a loop is to by loop tiling. Loop tiling is quite complex but basically involves two steps; strip mining and permutation. If you have 2 nested loops and you detect that one of the data containers is being re-used but the the re-use in not optimal, then you strip mine the loop accessing the data container and permute the outer two loops.

The purpose of this blog post was to provide a quick primer into various performance optimization techniques and the procedures. I would recommend that you dig deeper into each of these areas: vectorization , multithreading and memory traffic optimization in order to get a better understanding and be able to apply these to your stencil code and applications.

CERN…Science will save us all!

I would like to begin this post with a quote from the Monstrous Regiment by Terry Pratchett- he says “The presence of those seeking the truth is infinitely to be preferred to the presence of those who think they’ve found it.”  Hopefully this will make sense once you read this post.

I had the pleasure of contributing to the DEEP NLP project for document analysis and classification at CERN in Switzerland/France. Yes, the CERN sits astride the Franco-Swiss border and here physicists and engineers are tasked with probing the fundamental structure of the universe.


The CERN is where the Higgs Boson alias the God particle was discovered. The science of particle discovery relies mainly on are purpose-built particle accelerators and detectors. Accelerators boost beams of particles to high energies before the beams are made to collide with each other or with stationary targets. Detectors observe and record the results of these collisions. According to CERN, the particles are so tiny that the task of making them collide is akin to firing two needles 10 kilometers apart with such precision that they meet halfway!!!!!

The LHC ( The Large Hadron Collider)  which is located at CERN  is the is the world’s largest and most powerful particle accelerator. The LHC consists of a 27-kilometre ring of superconducting magnets with a number of accelerating structures to boost the energy of the particles along the way.  The LHC has a number of experiments for particle detection, the most notable ones being the ATLAS and the CMS. The ATLAS was crucial in the discovery of the Higgs Boson and the interactions in the ATLAS detectors create an enormous flow of data. The ATLAS generates ~ 1 Petabyte of data/second which is approximately four times the internet’s output.  Below is a picture inside Continue reading

Twelve Ways to Fool the Masses when giving Perfomance results on Parallel Computers

I came across one of the most interesting and humorous research papers  while doing my nightly reads. The paper is titled Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers by David H. Bailey and published in 1991. You can download the full paper  here.
The title describes exactly what the paper is about and I’ll just share some interesting snippets from the document.

To quote in part the abstract:
Many of us in the field of highly parallel scientific computing recognize that it is often quite difficult to match the run time performance of the best conventional supercomputers.  But since lay persons usually don’t appreciate these difficulties and therefore don’t understand when we quote mediocre performance results, it is often necessary for us to adopt some advanced techniques in order to deflect attention from possibly unfavorable facts

Continue reading

Predicting user activity using an accelerometer on Intel based wearables

This project describes how to recognize certain types of human physical activities using acceleration data generated from the ADXL345 accelerometer connected to the Intel Edison board.

I have also published this project on the Intel Developer Zone site.

Human activity recognition has a wide range of applications especially in wearables. The data collected can be used monitoring patient activity, exercise patterns in athletes, fitness tracking and so on.

We will be using support vector machine learning algorithm to achieve this. The project is implemented using the LIBSVM library and done separately in both Python and node.js.

The set of activities to be recognized are running, walking, going up/down a flight of stairs and resting. We collect the accelerometer data over a set of intervals, extract the features which in this case are the acceleration values along the x,y and z axes. We then use this data for building a training model that will do the activity classification.

The project code can be downloaded from here

Hardware Implementation.

The diagram below shows the pin connection for the ADXL345 to the Intel® Edison board.


Part 1. Python Implementation.

Setting up LIBSVM

Download the LIBSVM library and transfer the  LibSVM  zipped folder to the Intel® Edison board root directory using WINSCP. Then extract it by running:

tar –xzf libsvm-3.21.tar.gz

Run  make in libsvm-3.21 directory

Run make in libsvm-3.21/python directory

Create the python script in the libsvm-3.21/python directory

Reading data from the ADXL345 accelerometer
import pyupm_adxl345 as adxl345
# Create an I2C accelerometer object
adxl = adxl345.Adxl345(0)
while True:
    adxl.update() # Update the data
    raw = adxl.getRawValues() # Read raw sensor data
    force = adxl.getAcceleration() # Read acceleration force (g)

Classifying the various activities

The training file contains instances of the various activities classified as;

2-up/down staircase

Below is the screenshot of a small section of the training file

training dataRunning a grid search to determine the best value for C:

The parameter C controls the tradeoff between errors of the SVM on training data and margin maximization. C is used during the training phase and says how much outliers are taken into account in calculating Support Vectors.

from svmutil import *
import numpy as nu

param = svm_parameter("-q -h 0")
y, x = svm_read_problem('activity.ds')
problem = svm_problem(y[:100], x[:100])
results = []
for c in range(-10,20):
  for g in range(-10,5):
    param.C, param.gamma = 2**c, 2**g
    m = svm_train(problem,param)
    p_lbl, p_acc, p_val = svm_predict(y[100:],x[100:],m)
    results.append([param.C, param.gamma, p_acc[0]])

bestIdx = nu.argmax(nu.array(results)[:,2])

print results[bestIdx]
grid-searchClassifying various user activities
#load the LIBSVM shared library:
from svmutil import *
#Construct an svm_problem instance where the activity.ds is the dataset with the training data
y, x = svm_read_problem('activity.ds')
#where y is the tuple of labels representing the various activities and x is a tuple of data instances representing acceleration values from x,y and z axes

m = svm_train(y[0:], x[0:], '-c 0.03125 -h 0 -b 1 -q')
#y[0:] and x[0:] means we train the model on the whole dataset array from the first instance[0] to the last
#-h disable shrinking heuristics
#-c : set the cost parameter C of C-SVC
#-q: quiet mode (no outputs)

values=[[float(forceX),float(forceY), float(forceZ)]]
#forceX,forceY and forceZ are data instances of the accelerometer values in x,y,z axes

 p_labels,p_acc,p_values = svm_predict([0]*len(values),values,m])
#pass the accelerometer values for prediction in the svm_predict() method which returns 
#p_labels: a list of predicted labels   p_acc: a tuple including accuracy and p_values: a list of decision values or probability estimates (if '-b 1' is specified)
print p_labels
#output the predicted labels to represent the predicted activity

classification python

Part 2. Node.Js Implementation

Create a new project directory in the board’s home root folder and install node-svm:

npm install node-svm

Copy the build and lib folder from the node-modules/node-svm to the project directory.

Next install the packages required by node-svm by running:

npm install <package-name>

The packages are Stringify-object, Mout, Graceful-fs, Optimist, Osenv, Numeric, Q and underscore.

Reading data from the ADXL345 accelerometer
Reading data from the ADXL345 accelerometer
var adxl345 = require('jsupm_adxl345');
var adxl = new adxl345.Adxl345(0);

    adxl.update(); // Update the data
    var raw = adxl.getRawValues(); // Read raw sensor data
    var force = adxl.getAcceleration(); // Read acceleration force (g)
    var rawvalues = raw.getitem(0) + " " + raw.getitem(1) + " " + raw.getitem(2);
    //console.log("Raw Values: " + rawvalues);
    var forceX=force.getitem(0).toFixed(2);
    var forceY=force.getitem(1).toFixed(2);
    var forceZ=force.getitem(2).toFixed(2);

}, 2000);

We can now write the code to classify user activity using the value of c we determined from running the grid search using Python in part 1.

var so = require('stringify-object');
var svm = require('./lib');
var numeric = require('numeric');
var fs = require('fs');
var fileName = './activity.ds';

//create the classifier
var clf = new svm.CSVC({
    c: 0.03125,
    normalize: false,
    reduce: false,

//Build the training model
    .then(function (dataset) {
        return clf.train(dataset)
            .progress(function (progress) {
                console.log('training progress: %d%', Math.round(progress * 100));
    .spread(function (model, report) {
        console.log('SVM trained. \nReport:\n%s', so(report));
    }).done(function () {

//Classify data instance at some interval
var prediction=clf.predictSync([forceX, forceY, forceZ]);
var probability=clf.predictProbabilitiesSync([forceX, forceY, forceZ]);

console output for node


The project covers how to determine the training parameters of the classifier using a grid search and running the classifier to be able to accurately predict various activities using LIBSVM in both Python and node.js.

Creating a context aware application for 2-in-1 devices

Applications running on 2-in-1 tablets can either be in laptop mode or desktop mode when running. Note that the laptop mode and desktop mode are different from landscape and portrait modes.Laptop mode refers to users interacting with the application via touch input and gestures whereas desktop mode refers to users interacting with application via the keyboard and mouse. Applications need to be contextually aware when the device mode has changed and switch between the modes accordingly. Continue reading