pandas is a popular and widely used data analysis and manipulation tool, built on top of Python. It offers data structures and operations for manipulating numerical tables and time series, so it is widely used in the ETL (Extract, Transform, Load) stages of a data analytics pipeline. Depending on the size of your dataset and the capabilities of your compute platform, the ETL stages can be a huge bottleneck in your pipeline. That makes accelerating them crucial, which is where Modin comes in.
Modin enables you to accelerate your pandas workloads across multiple cores and multiple nodes. pandas is not designed to utilize the multiple cores available on your machine, which results in inefficient system utilization and hurts performance. This is not the case with Modin, as illustrated below.
pandas on a multi-core system
Modin on a multi-core system
So, how do you integrate Modin into your pandas workflow?
Well, we first need to install Modin:
pip install modin
You can also explicitly install Modin to run on Ray/Dask as shown below (the quotes keep shells such as zsh from interpreting the square brackets):
pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
pip install "modin[all]" # Install all of the above
If you are using the Intel® oneAPI AI Analytics Toolkit (AI Kit), Modin should be available in the aikit-modin conda environment as shown below
The most crucial bit to note about Modin is how to integrate it into your pandas workflow. This is accomplished with a single line of code as shown below
import modin.pandas as pd
Note that if you are using Dask/Ray as a compute engine, you will need to select it before importing Modin. Set one of the two values below, not both, since a second assignment would override the first:
import os
os.environ["MODIN_ENGINE"] = "ray" # Modin will use Ray
# os.environ["MODIN_ENGINE"] = "dask" # Alternatively, Modin will use Dask
import modin.pandas as pd
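Because the engine has to be chosen before Modin is imported, one way to make that ordering explicit is a small helper. The sketch below is only an illustration: `select_modin_engine` and `SUPPORTED_ENGINES` are hypothetical names, not part of Modin, and the helper only covers the two engines discussed in this article.

```python
import os

# Hypothetical helper, not part of Modin: only the two engines covered here.
SUPPORTED_ENGINES = {"ray", "dask"}


def select_modin_engine(name):
    """Set MODIN_ENGINE; call this before `import modin.pandas`."""
    engine = name.lower()
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"Unsupported engine: {name!r}")
    os.environ["MODIN_ENGINE"] = engine
    return engine


selected = select_modin_engine("dask")
print("MODIN_ENGINE =", os.environ["MODIN_ENGINE"])

# Only after the engine is selected would Modin be imported:
#   import modin.pandas as pd
```

Rejecting unknown names early means a typo surfaces as a clear error at this point and not later, when Modin first tries to use the engine.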
With the setup done, let’s now get to the performance comparison of Modin vs pandas. For these tests, the CPU is the 24-core Intel® Xeon® Gold 6252 Processor, as shown below.
First things first: import Modin.
We will now generate a synthetic dataset using NumPy to use with Modin and save it to a CSV.
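A minimal sketch of this step could look like the following. The array size, value range, seed, and file name are assumptions chosen so the example runs quickly; they are not the values used for the measurements in this article.

```python
import os
import tempfile

import numpy as np

# Assumed sizes, kept small so the sketch runs in well under a second.
N_ROWS, N_COLS = 1_000, 16

rng = np.random.default_rng(seed=0)  # seeded so the data is reproducible
array = rng.integers(low=100, high=10_000, size=(N_ROWS, N_COLS))

# Hypothetical output location: a fresh temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), "synthetic.csv")
np.savetxt(csv_path, array, delimiter=",", fmt="%d")

print(array.shape, "written to", csv_path)
```

Writing integers with `fmt="%d"` keeps the CSV compact; without it, `np.savetxt` writes values in scientific notation by default.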
Now we will convert the ndarray into a pandas dataframe and display the first five rows. For pandas, the dataframe is stored as pandas_df; for Modin, the same data is stored in a separate Modin dataframe.
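As a small-scale sketch of this conversion step (not the run measured in this article), the helper below builds a dataframe from an ndarray with whichever pandas-compatible module it is given. `make_frame` and the `col{i}` column names are hypothetical, and the array size is an assumption.

```python
import numpy as np
import pandas


def make_frame(pd_module, array):
    """Build a dataframe from a 2-D ndarray using a pandas-compatible module."""
    columns = [f"col{i}" for i in range(array.shape[1])]  # assumed naming
    return pd_module.DataFrame(array, columns=columns)


rng = np.random.default_rng(seed=0)
array = rng.integers(low=100, high=10_000, size=(1_000, 8))  # assumed size

pandas_df = make_frame(pandas, array)
print(pandas_df.head())  # head() returns the first five rows by default

# With Modin installed, the same helper can be reused unchanged:
#   import modin.pandas as mpd
#   make_frame(mpd, array).head()
```

Passing the module in as a parameter keeps the pandas and Modin code paths identical, which is the point of Modin's drop-in API.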
In the above case, you notice that pandas took 11.7 seconds while Modin took 2.92 seconds. Modin thus gives us a roughly 4x speedup for this task!
Now let’s compare various function calls in pandas vs Modin
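One way to structure such a comparison is a small timing harness that accepts any pandas-compatible module. This is a sketch under stated assumptions: `time_call` and `benchmark` are hypothetical helpers, the four operations are just examples, and the array size is arbitrary. A single wall-clock run is only indicative, so repeat the measurement before drawing conclusions.

```python
import time

import numpy as np
import pandas


def time_call(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def benchmark(pd_module, array):
    """Time a few common dataframe operations for a pandas-compatible module."""
    df = pd_module.DataFrame(array)
    operations = {
        "mean": lambda: df.mean(),
        "isnull": lambda: df.isnull(),
        "count": lambda: df.count(),
        "fillna": lambda: df.fillna(0),
    }
    return {name: time_call(op)[1] for name, op in operations.items()}


rng = np.random.default_rng(seed=0)
array = rng.integers(low=100, high=10_000, size=(10_000, 32))  # assumed size

pandas_times = benchmark(pandas, array)
for name, seconds in pandas_times.items():
    print(f"pandas {name}: {seconds:.4f}s")

# With Modin installed, the same function can be reused unchanged:
#   import modin.pandas as mpd
#   modin_times = benchmark(mpd, array)
```

On a dataset this small, Modin's scheduling overhead may outweigh any gain; its benefit is expected on larger data.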
As you can see, Modin offers a significant performance boost compared to pandas, and this will accelerate the ETL stage of your data analytics pipeline.