Accelerating your pandas workloads with Modin

pandas is a popular, widely used data analysis and manipulation library for Python. It offers data structures and operations for manipulating numerical tables and time series, which makes it a common choice for the ETL (Extract, Transform, Load) stages of a data analytics pipeline. Depending on the size of your dataset and the capabilities of your compute platform, the ETL stages can become a major bottleneck in that pipeline. That makes them a natural target for acceleration, and this is where Modin comes in.

Modin lets you accelerate your pandas workloads across multiple cores and multiple nodes. pandas is not designed to use the multiple cores available on your machine, which leaves much of the system idle and limits performance. This is not the case with Modin, as illustrated below.

pandas on a multi-core system

Modin on a multi-core system

So, how do you integrate Modin into your pandas workflow?

Well, we first need to install Modin

pip install modin

You can also explicitly install Modin to run on Ray or Dask, as shown below. The quotes keep shells such as zsh from interpreting the square brackets.

pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray

pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask

pip install "modin[all]" # Install all of the above

If you are using the Intel® oneAPI AI Analytics Toolkit (AI Kit), Modin should be available in the aikit-modin conda environment as shown below

The most important thing to know about Modin is how to integrate it into your pandas workflow. It takes a single line of code: replace your pandas import with the one below.

import modin.pandas as pd
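To illustrate what "a single line" means in practice, here is a minimal sketch in which only the import differs and the rest of the code is ordinary pandas. The fallback to plain pandas when Modin is not installed is an addition for illustration, not something the post prescribes, and the tiny DataFrame is made up.

```python
# Use Modin when it is installed; otherwise fall back to plain pandas.
# (The fallback is an illustrative assumption, not part of the original post.)
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Everything below is unchanged pandas-style code.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
total_a = int(df["a"].sum())
print(total_a)
```

Because the alias stays `pd`, existing scripts usually need no other changes for the operations Modin supports.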

Note that if you want to choose Ray or Dask as the compute engine explicitly, set the MODIN_ENGINE environment variable before importing Modin, as shown below:

import os
# Set exactly one engine, and do it before importing modin.pandas
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
# os.environ["MODIN_ENGINE"] = "dask"  # Use this line instead to run on Dask
import modin.pandas as pd

With the setup done, let’s now compare the performance of Modin and pandas. For these tests, the CPU is the 24-core Intel® Xeon® Gold 6252 Processor, as shown below.

First things first: import Modin.

We will now generate a synthetic dataset using NumPy to use with Modin and save it to a CSV.
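A minimal sketch of this step follows. The array size, value range, column names, and file name are assumptions chosen to keep the example quick; the original post does not state the values it used.

```python
import os
import tempfile

import numpy as np

# Assumed dimensions and value range (not taken from the post).
N_ROWS, N_COLS = 10_000, 8

rng = np.random.default_rng(seed=0)
data = rng.integers(0, 100, size=(N_ROWS, N_COLS))  # 2-D ndarray of ints in [0, 100)

# Hypothetical output location in the system temp directory.
csv_path = os.path.join(tempfile.gettempdir(), "synthetic_modin_demo.csv")

# comments="" stops NumPy from prefixing the header row with "# ".
header = ",".join(f"col{i}" for i in range(N_COLS))
np.savetxt(csv_path, data, fmt="%d", delimiter=",", header=header, comments="")

print(data.shape, csv_path)
```

A larger `N_ROWS` makes the difference between single-core and multi-core execution easier to see, at the cost of a longer run.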

Now we will convert the ndarray into a pandas dataframe and display the first five rows. For pandas, the dataframe is stored as pandas_df; for Modin, the same dataframe is stored as modin_df.
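The conversion and its timing might look like the sketch below. The helper `timed_frame` is hypothetical and takes the dataframe library as a parameter, so the same code can be run with pandas or with `modin.pandas`. The array here is a small stand-in, so the timings will not match the figures quoted in this post.

```python
import time

import numpy as np
import pandas


def timed_frame(df_lib, array, columns):
    """Build a DataFrame with the given library and return (frame, seconds)."""
    start = time.perf_counter()
    frame = df_lib.DataFrame(array, columns=columns)
    elapsed = time.perf_counter() - start
    return frame, elapsed


# Small stand-in for the synthetic dataset (sizes are assumptions).
rng = np.random.default_rng(seed=0)
array = rng.integers(0, 100, size=(10_000, 8))
columns = [f"col{i}" for i in range(array.shape[1])]

pandas_df, pandas_seconds = timed_frame(pandas, array, columns)
print(pandas_df.head())  # first five rows
print(f"pandas: {pandas_seconds:.4f} s")

# With Modin installed, swap the module to get modin_df:
#   import modin.pandas as mpd
#   modin_df, modin_seconds = timed_frame(mpd, array, columns)
```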

With pandas

With Modin

In the above case, pandas took 11.7 s while Modin took 2.92 s. Modin thus gives us roughly a 4x speedup for this task!
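The speedup is simply the ratio of the two wall-clock times. As a quick check of the arithmetic, using the two timings quoted in this post:

```python
pandas_seconds = 11.7  # pandas timing reported in the post
modin_seconds = 2.92   # Modin timing reported in the post

# Speedup = baseline time / accelerated time.
speedup = pandas_seconds / modin_seconds
print(f"speedup: {speedup:.2f}x")  # prints: speedup: 4.01x
```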

Now let’s compare various function calls in pandas vs Modin
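One way to run such a comparison is to time the same set of operations against each dataframe. The sketch below is an assumption about how that could be done: the `benchmark` helper and the operations listed are illustrative, not necessarily the ones measured in this post.

```python
import time

import numpy as np
import pandas


def benchmark(frame, operations):
    """Return {operation name: elapsed seconds} for each callable applied to frame."""
    timings = {}
    for name, operation in operations.items():
        start = time.perf_counter()
        operation(frame)
        timings[name] = time.perf_counter() - start
    return timings


# Illustrative operations that exist in both pandas and Modin.
OPERATIONS = {
    "mean": lambda frame: frame.mean(),
    "sort_values": lambda frame: frame.sort_values("col0"),
    "groupby_sum": lambda frame: frame.groupby("col0").sum(),
    "isna": lambda frame: frame.isna(),
}

# Small stand-in dataset (sizes are assumptions).
rng = np.random.default_rng(seed=0)
array = rng.integers(0, 100, size=(50_000, 8))
columns = [f"col{i}" for i in range(array.shape[1])]
pandas_df = pandas.DataFrame(array, columns=columns)

pandas_timings = benchmark(pandas_df, OPERATIONS)
for name, seconds in pandas_timings.items():
    print(f"{name:12s} pandas: {seconds:.4f} s")

# With Modin installed, build modin_df the same way and reuse the helper:
#   import modin.pandas as mpd
#   modin_df = mpd.DataFrame(array, columns=columns)
#   modin_timings = benchmark(modin_df, OPERATIONS)
```

Some Modin engines may return from a call before the work has finished, so simple wall-clock timings like these can understate Modin's cost; Modin's own benchmarking guidance is worth checking before drawing conclusions. On small datasets the overhead of distributing work can also outweigh the gain.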

As you can see, Modin offers a significant performance boost over pandas, which will accelerate the ETL stage of your data analytics pipeline.

Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers

I came across one of the most interesting and humorous research papers while doing my nightly reads. The paper, Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, was written by David H. Bailey and published in 1991. You can download the full paper here.
The title describes exactly what the paper is about, so I’ll just share some interesting snippets from it.

To quote part of the abstract:
Many of us in the field of highly parallel scientific computing recognize that it is often quite difficult to match the run time performance of the best conventional supercomputers. But since lay persons usually don’t appreciate these difficulties and therefore don’t understand when we quote mediocre performance results, it is often necessary for us to adopt some advanced techniques in order to deflect attention from possibly unfavorable facts.
