Accelerating your pandas workloads with Modin

pandas is a popular and widely used data analysis and manipulation tool, built on top of Python. It offers data structures and operations for manipulating numerical tables and time series. It is therefore widely used in the ETL (Extraction, Transform & Load) stages of your data analytics pipeline. Depending on the size of your dataset and capabilities of your compute platform, the ETL stages can be a huge bottleneck in your pipeline. It is for this reason that it is crucial that we accelerate this – enter Modin.

Modin enables you to accelerate your pandas workloads across multiple cores and multiple nodes. pandas is not designed to utilize the multiple cores available on your machine thus resulting in inefficient system utilization and impacting perfomance. This is not the case with Modin as illustrated below

pandas on a multi-core system

Modin on a multi-core system

So, how do you integrate Modin into your pandas workflow?

Well, we first need to install Modin

pip install modin

You can also explicitly install Modin to run on Ray/Dask as shown below

pip install modin[ray] # Install Modin dependencies and Ray to run on Ray

pip install modin[dask] # Install Modin dependencies and Dask to run on Dask

pip install modin[all] # Install all of the above

If you are using the Intel® oneAPI AI Analytics Toolkit (AI Kit), Modin should be available in the aikit-modin conda environment as shown below

The most crucial bit to note about Modin is how to integrate it into your pandas workflow. This is accomplished with a single line of code as shown below

import modin.pandas as pd

Note that if you are using Dask/Ray as a compute engine, you will need to initialize this first as shown below:

import os 
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray 
os.environ["MODIN_ENGINE"] = "dask"  # Modin will use Dask 
import modin.pandas as pd

With the setup done, let's now get to the performance comparison for Modin vs pandas. For this tests, the CPU is the 24 core Intel® Xeon® Gold 6252 Processor as shown below

First things, first, import Modin

We will now generate a synthetic dataset using NumPy to use with Modin and save it to a CSV.

Now we will convert the ndarray into a Pandas dataframe and display the first five rows. For  pandas, the dataframe is being stored as pandas_df and for Modin, the same dataframe is being stored as modin_df.

With pandas

With Modin

In the above case, you notice that pandas took 11.7s while Modin took 2.92 second. Modin thus gives us a 4X speedup for this task!

Now let’s compare various function calls in pandas vs Modin

As you can see , Modin offers a significant perfomance boost compared to pandas and this will accelerate the ETL stage of your data analytics pipeline.

Memory Traffic Optimization to Improve Application Performance

Memory traffic optimization yields the greatest speedup compared to all the other optimization techniques to be deployed in optimizing an application. The other techniques being vectorization and multithreading. And if we want to really leverage the power of vectorization, then we have to optimize data re-use in caches. In addition to this it is important to understand that vector arithmetic in modern processors is cheap, it’s memory access that’s expensive. It’s therefore paramount that we optimize memory access for bandwidth bound applications.

But first before delving looking at locality of memory access in space and time, let’s have a quick refresher on vectorization and multithreading.

Vectorization is basically having a single instruction operating on multiple data elements. The speedup as a result of vectorizing in your application will depend on the instruction set on your hardware since the
number of registers differ across various platforms. Vectorization and multithreading can easily be implemented using OpenMP pragmas within your application. You can also perform automatic vectorization using compilers. However the caveat with automatic vectorization is that the compilers will only try to vectorize the innermost loops in cases where you have multiple nested loops. You can override this with a #pragma omp simd in your code. You also need to take care of vector dependence within your code to ensure that the code is vectorized correctly. An example of true vector dependence is as shown below:

for( int i=1;i<n; i++){



In the above snippet, the compiler does not have enough information due to the dependence of a[i] on  a[i-1], to enable it to implement vectorization. It is also important to understand that vectorization will only work if the compiler knows the number of iterations in a loop. This is the reason you should avoid using while loops because the number of iterations are not known during compilation, instead use the good ol’ for loops. If the compiler does not see any vectorization opportunities, you can provide this through a technique called strip mining. Strip mining is a programming technique that turns one loop into two nested loops. This technique not only presents compiler with vectorization opportunities but also achieves data locality in time.

If you really want to leverage vectorization, you will have to optimize data re-use in cache by achieving locality of memory access in both space and time. To achieve locality of memory address in space, ensure that your application has a unit stride! This is to enable optimal usage of the cache lines during data transfer and thus e.g if you implement unit stride access in a loop and you are accessing back to back memory elements, for one double precision memory access, you’ll end up getting 7 free accesses and for  single precision memory access, you end up with 15 free accesses. Also, another important aspect to unit stride access is if you have a 2D-container you should access it columns first, then move to the next rows and maintain this sequence to ensure that the cache line is loaded with back to back data neighbors. Spatial locality can also be improved by moving from an Array of Structures (AoS) to a Structure of Arrays (SoA) in your data containers.

The main aim of achieving data locality in time is to reduce the chances of cache misses. The most effective techniques to have an optimal cache hit rate within a loop is to by loop tiling. Loop tiling is quite complex but basically involves two steps; strip mining and permutation. If you have 2 nested loops and you detect that one of the data containers is being re-used but the the re-use in not optimal, then you strip mine the loop accessing the data container and permute the outer two loops.

The purpose of this blog post was to provide a quick primer into various performance optimization techniques and the procedures. I would recommend that you dig deeper into each of these areas: vectorization , multithreading and memory traffic optimization in order to get a better understanding and be able to apply these to your stencil code and applications.

CERN…Science will save us all!

I would like to begin this post with a quote from the Monstrous Regiment by Terry Pratchett- he says “The presence of those seeking the truth is infinitely to be preferred to the presence of those who think they’ve found it.”  Hopefully this will make sense once you read this post.

I had the pleasure of contributing to the DEEP NLP project for document analysis and classification at CERN in Switzerland/France. Yes, the CERN sits astride the Franco-Swiss border and here physicists and engineers are tasked with probing the fundamental structure of the universe.


The CERN is where the Higgs Boson alias the God particle was discovered. The science of particle discovery relies mainly on are purpose-built particle accelerators and detectors. Accelerators boost beams of particles to high energies before the beams are made to collide with each other or with stationary targets. Detectors observe and record the results of these collisions. According to CERN, the particles are so tiny that the task of making them collide is akin to firing two needles 10 kilometers apart with such precision that they meet halfway!!!!!

The LHC ( The Large Hadron Collider)  which is located at CERN  is the is the world’s largest and most powerful particle accelerator. The LHC consists of a 27-kilometre ring of superconducting magnets with a number of accelerating structures to boost the energy of the particles along the way.  The LHC has a number of experiments for particle detection, the most notable ones being the ATLAS and the CMS. The ATLAS was crucial in the discovery of the Higgs Boson and the interactions in the ATLAS detectors create an enormous flow of data. The ATLAS generates ~ 1 Petabyte of data/second which is approximately four times the internet’s output.  Below is a picture inside Continue reading

Twelve Ways to Fool the Masses when giving Perfomance results on Parallel Computers

I came across one of the most interesting and humorous research papers  while doing my nightly reads. The paper is titled Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers by David H. Bailey and published in 1991. You can download the full paper  here.
The title describes exactly what the paper is about and I’ll just share some interesting snippets from the document.

To quote in part the abstract:
Many of us in the field of highly parallel scientific computing recognize that it is often quite difficult to match the run time performance of the best conventional supercomputers.  But since lay persons usually don’t appreciate these difficulties and therefore don’t understand when we quote mediocre performance results, it is often necessary for us to adopt some advanced techniques in order to deflect attention from possibly unfavorable facts

Continue reading

Predicting user activity using an accelerometer on Intel based wearables

This project describes how to recognize certain types of human physical activities using acceleration data generated from the ADXL345 accelerometer connected to the Intel Edison board.

I have also published this project on the Intel Developer Zone site.

Human activity recognition has a wide range of applications especially in wearables. The data collected can be used monitoring patient activity, exercise patterns in athletes, fitness tracking and so on.

We will be using support vector machine learning algorithm to achieve this. The project is implemented using the LIBSVM library and done separately in both Python and node.js.

The set of activities to be recognized are running, walking, going up/down a flight of stairs and resting. We collect the accelerometer data over a set of intervals, extract the features which in this case are the acceleration values along the x,y and z axes. We then use this data for building a training model that will do the activity classification.

The project code can be downloaded from here

Hardware Implementation.

The diagram below shows the pin connection for the ADXL345 to the Intel® Edison board.


Part 1. Python Implementation.

Setting up LIBSVM

Download the LIBSVM library and transfer the  LibSVM  zipped folder to the Intel® Edison board root directory using WINSCP. Then extract it by running:

tar –xzf libsvm-3.21.tar.gz

Run  make in libsvm-3.21 directory

Run make in libsvm-3.21/python directory

Create the python script in the libsvm-3.21/python directory

Reading data from the ADXL345 accelerometer
import pyupm_adxl345 as adxl345
# Create an I2C accelerometer object
adxl = adxl345.Adxl345(0)
while True:
    adxl.update() # Update the data
    raw = adxl.getRawValues() # Read raw sensor data
    force = adxl.getAcceleration() # Read acceleration force (g)

Classifying the various activities

The training file contains instances of the various activities classified as;

2-up/down staircase

Below is the screenshot of a small section of the training file

training dataRunning a grid search to determine the best value for C:

The parameter C controls the tradeoff between errors of the SVM on training data and margin maximization. C is used during the training phase and says how much outliers are taken into account in calculating Support Vectors.

from svmutil import *
import numpy as nu

param = svm_parameter("-q -h 0")
y, x = svm_read_problem('activity.ds')
problem = svm_problem(y[:100], x[:100])
results = []
for c in range(-10,20):
  for g in range(-10,5):
    param.C, param.gamma = 2**c, 2**g
    m = svm_train(problem,param)
    p_lbl, p_acc, p_val = svm_predict(y[100:],x[100:],m)
    results.append([param.C, param.gamma, p_acc[0]])

bestIdx = nu.argmax(nu.array(results)[:,2])

print results[bestIdx]
grid-searchClassifying various user activities
#load the LIBSVM shared library:
from svmutil import *
#Construct an svm_problem instance where the activity.ds is the dataset with the training data
y, x = svm_read_problem('activity.ds')
#where y is the tuple of labels representing the various activities and x is a tuple of data instances representing acceleration values from x,y and z axes

m = svm_train(y[0:], x[0:], '-c 0.03125 -h 0 -b 1 -q')
#y[0:] and x[0:] means we train the model on the whole dataset array from the first instance[0] to the last
#-h disable shrinking heuristics
#-c : set the cost parameter C of C-SVC
#-q: quiet mode (no outputs)

values=[[float(forceX),float(forceY), float(forceZ)]]
#forceX,forceY and forceZ are data instances of the accelerometer values in x,y,z axes

 p_labels,p_acc,p_values = svm_predict([0]*len(values),values,m])
#pass the accelerometer values for prediction in the svm_predict() method which returns 
#p_labels: a list of predicted labels   p_acc: a tuple including accuracy and p_values: a list of decision values or probability estimates (if '-b 1' is specified)
print p_labels
#output the predicted labels to represent the predicted activity

classification python

Part 2. Node.Js Implementation

Create a new project directory in the board’s home root folder and install node-svm:

npm install node-svm

Copy the build and lib folder from the node-modules/node-svm to the project directory.

Next install the packages required by node-svm by running:

npm install <package-name>

The packages are Stringify-object, Mout, Graceful-fs, Optimist, Osenv, Numeric, Q and underscore.

Reading data from the ADXL345 accelerometer
Reading data from the ADXL345 accelerometer
var adxl345 = require('jsupm_adxl345');
var adxl = new adxl345.Adxl345(0);

    adxl.update(); // Update the data
    var raw = adxl.getRawValues(); // Read raw sensor data
    var force = adxl.getAcceleration(); // Read acceleration force (g)
    var rawvalues = raw.getitem(0) + " " + raw.getitem(1) + " " + raw.getitem(2);
    //console.log("Raw Values: " + rawvalues);
    var forceX=force.getitem(0).toFixed(2);
    var forceY=force.getitem(1).toFixed(2);
    var forceZ=force.getitem(2).toFixed(2);

}, 2000);

We can now write the code to classify user activity using the value of c we determined from running the grid search using Python in part 1.

var so = require('stringify-object');
var svm = require('./lib');
var numeric = require('numeric');
var fs = require('fs');
var fileName = './activity.ds';

//create the classifier
var clf = new svm.CSVC({
    c: 0.03125,
    normalize: false,
    reduce: false,

//Build the training model
    .then(function (dataset) {
        return clf.train(dataset)
            .progress(function (progress) {
                console.log('training progress: %d%', Math.round(progress * 100));
    .spread(function (model, report) {
        console.log('SVM trained. \nReport:\n%s', so(report));
    }).done(function () {

//Classify data instance at some interval
var prediction=clf.predictSync([forceX, forceY, forceZ]);
var probability=clf.predictProbabilitiesSync([forceX, forceY, forceZ]);

console output for node


The project covers how to determine the training parameters of the classifier using a grid search and running the classifier to be able to accurately predict various activities using LIBSVM in both Python and node.js.

Creating a context aware application for 2-in-1 devices

Applications running on 2-in-1 tablets can either be in laptop mode or desktop mode when running. Note that the laptop mode and desktop mode are different from landscape and portrait modes.Laptop mode refers to users interacting with the application via touch input and gestures whereas desktop mode refers to users interacting with application via the keyboard and mouse. Applications need to be contextually aware when the device mode has changed and switch between the modes accordingly. Continue reading