Speeding up BERT model inference through Quantization with the Intel Neural Compressor

The Intel Neural Compressor (INC) is an open-source Python library offering network compression techniques such as quantization, pruning, and knowledge distillation. In this blog post we will focus on quantization; pruning and knowledge distillation will be covered in later posts.

Neural networks are resource-hungry algorithms that require significant compute and consume a lot of memory due to the large number of parameters used. This can be a challenge when we want to run our models on edge devices which are constrained by these factors. Quantization can help address this.

Quantization refers to techniques that enable inference and training at lower precision by performing computations with fixed-point integers (such as INT8), which use fewer bits than floating-point values (such as FP32). This often leads to smaller models and faster inference.
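To make the idea concrete, here is a minimal sketch of one common scheme, affine (scale plus zero-point) quantization to INT8, written in plain Python. The function names and the sample weights are made up for illustration, and this is not how the Intel Neural Compressor implements quantization internally; it only shows how floats can be mapped to 8-bit codes and back with a small error.

```python
def quantize_int8(values):
    """Map floats to int8 codes with an affine (scale + zero-point) scheme."""
    lo, hi = min(values), max(values)
    # Always include 0.0 in the range so that it is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255 or 1.0  # guard against an all-zero input
    zero_point = round(-128 - lo / scale)
    # Round to the nearest code and clamp to the int8 range [-128, 127].
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Recover approximate floats from the int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # hypothetical FP32 weights
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

Each restored value differs from the original by at most about one quantization step (`scale`), which is the precision we trade for the smaller representation.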

There is a caveat to note with quantization. It can result in an accuracy loss as we are transitioning from a higher precision to a lower precision and thus reducing the number of bits for representing the neural network’s parameters. There are steps to mitigate this such as setting a threshold for the accuracy loss by validating with the testing data and experimenting with various tuning strategies.

There are two ways of performing quantization:

Post-training: as you can infer from the name, we train the model using FP32 weights and inputs, then quantize the weights after training. This is the approach we will use with the Intel Neural Compressor.

Quantization-aware training: here we quantize the weights during training.

Quantization with the Intel Neural Compressor is done through its API. We first define all parameters, such as the deep learning framework, the accuracy metric, and the tuning strategy, inside a YAML file.

The available tuning strategies include basic, Bayesian, and random. In the basic strategy, the Intel Neural Compressor tries to quantize the entire model and then falls back to FP32 for certain layers if the accuracy threshold is not met. The Bayesian strategy uses Bayesian optimization to choose the next configuration to try.
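The fallback idea behind the basic strategy can be sketched with a toy model. Everything in this example is hypothetical: the layer names, the per-layer accuracy penalties, and the assumption that penalties simply add up. The real tool measures accuracy by running the model on evaluation data, so treat this only as an illustration of "quantize everything, then revert layers until the threshold is met".

```python
def tune_with_fallback(layer_penalty, baseline_acc, relative_tolerance):
    """Quantize every layer, then revert layers to FP32 until accuracy is acceptable.

    layer_penalty: hypothetical accuracy points lost when a layer runs in INT8.
    """
    int8_layers = set(layer_penalty)
    fp32_layers = []

    def accuracy():
        # Toy accuracy model: penalties of the INT8 layers simply add up.
        return baseline_acc - sum(layer_penalty[name] for name in int8_layers)

    def acceptable():
        return (baseline_acc - accuracy()) / baseline_acc <= relative_tolerance

    while int8_layers and not acceptable():
        # Revert the layer that hurts accuracy the most.
        worst = max(int8_layers, key=layer_penalty.get)
        int8_layers.remove(worst)
        fp32_layers.append(worst)
    return sorted(int8_layers), fp32_layers, accuracy()


penalties = {"embedding": 0.1, "attention": 0.3, "softmax": 1.5, "output": 0.2}
int8, fp32, acc = tune_with_fallback(penalties, baseline_acc=93.0, relative_tolerance=0.01)
print("INT8 layers:", int8)
print("FP32 fallback:", fp32)
```

With these made-up numbers, a single fallback is enough to bring the relative accuracy loss under the 1% tolerance.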

A sample YAML configuration file for quantizing an AlexNet model is shown below.

For this demo, we will quantize a BERT model from FP32 precision to INT8 precision and measure the resulting speedup. The YAML configuration we use is shown below:

# Copyright (c) 2021 Intel Corporation
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#   http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

model:
  name: bert
  framework: tensorflow
  inputs: input_file, batch_size
  outputs: IteratorGetNext:3, unstack:0, unstack:1

evaluation:
  accuracy:
    metric:
      SquadF1:
    dataloader:
      dataset:
        bert:
          root: eval.tf_record
          label_file: /home/rmallela/neural-compressor/examples/tensorflow/nlp/bert_large_squad/quantization/ptq/data/dev-v1.1.json
      batch_size: 64
    postprocess:
      transform:
        SquadV1:
          label_file: /home/rmallela/neural-compressor/examples/tensorflow/nlp/bert_large_squad/quantization/ptq/data/dev-v1.1.json
          vocab_file: /home/rmallela/neural-compressor/examples/tensorflow/nlp/bert_large_squad/quantization/ptq/data/vocab.txt
  performance:
    iteration: 10
    configs:
        num_of_instance: 4
        cores_per_instance: 7
    dataloader:
      dataset:
        bert:
          root: eval.tf_record
          label_file: /home/rmallela/neural-compressor/examples/tensorflow/nlp/bert_large_squad/quantization/ptq/data/dev-v1.1.json
      batch_size: 64

quantization:
  calibration:
    sampling_size: 500
    dataloader:
      dataset:
        bert:
          root: eval.tf_record
          label_file: /home/rmallela/neural-compressor/examples/tensorflow/nlp/bert_large_squad/quantization/ptq/data/dev-v1.1.json
      batch_size: 64
  model_wise:
    weight:
      granularity: per_channel

tuning:
  accuracy_criterion:
    relative: 0.01
  exit_policy:
    timeout: 0
    max_trials: 100
  random_seed: 9527

The quantizer code for the above configuration is shown below:

        from neural_compressor.quantization import Quantization
        quantizer = Quantization('./bert.yaml')
        quantizer.model = FLAGS.input_model
        q_model = quantizer.fit()

We can then evaluate the performance of the FP32 model against the INT8 model with the script below:

        import numpy as np
        from neural_compressor.experimental import Benchmark

        # Build the benchmark from the same YAML file used for quantization
        evaluator = Benchmark('./bert.yaml')
        evaluator.model = FLAGS.input_model
        results = evaluator()
        for mode, result in results.items():
            acc, batch_size, result_list = result
            # Mean time per batch divided by the batch size gives per-sample latency
            latency = np.array(result_list).mean() / batch_size
            print('\n{} mode benchmark result:'.format(mode))
            print('Accuracy is {:.3f}'.format(acc))
            print('Batch size = {}'.format(batch_size))
            print('Latency: {:.3f} ms'.format(latency * 1000))
            print('Throughput: {:.3f} samples/sec'.format(1./ latency))

We will run the quantization on a 2nd Gen Intel Xeon Scalable processor (Cascade Lake); the system info is displayed below.

We then execute the quantization script with the YAML configuration. This should take a couple of minutes and, once completed, you should get the output below indicating that the process was successful.

The output also provides a breakdown of the FP32 layers that were quantized to INT8, as shown below.

As you can see, all the layers except one were quantized to INT8.

We also get the performance data, which is what we really care about here, so that we can observe the speedup that results from quantizing the BERT model.

Two important metrics to observe here are the accuracy and the duration. For the FP32 model, which is the baseline, the accuracy was 92.9861%, and for the INT8 model the accuracy was 92.4100%. This is a small loss in accuracy (about 0.58 percentage points), and it is expected with quantization since we are transitioning to a lower precision.

For the performance, the duration of the FP32 model, which is the baseline, is 1557.2677s, and for the INT8 model it is 1200.9974s. This represents a reduction in run time of roughly 22.9%, or about a 1.30x speedup!
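The arithmetic behind these figures is easy to check. The short sketch below uses only the numbers quoted above (the variable names are mine) and also compares the relative accuracy drop with the 0.01 relative accuracy criterion from the configuration.

```python
fp32_acc, int8_acc = 92.9861, 92.4100        # accuracy in percent
fp32_secs, int8_secs = 1557.2677, 1200.9974  # duration in seconds

# Accuracy lost, relative to the FP32 baseline.
relative_acc_drop = (fp32_acc - int8_acc) / fp32_acc
# Fraction of the baseline run time that was saved.
time_saved = (fp32_secs - int8_secs) / fp32_secs
# How many times faster the INT8 run is.
speedup = fp32_secs / int8_secs

print(f"Relative accuracy drop: {relative_acc_drop:.2%}")
print(f"Run time reduction:     {time_saved:.2%}")
print(f"Speedup factor:         {speedup:.2f}x")
print("Within 1% relative criterion:", relative_acc_drop <= 0.01)
```

Note that "percent of run time saved" and "speedup factor" are two different ways of expressing the same result, so it helps to state which one is being quoted.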

In conclusion, quantization can speed up your neural networks by performing computations with lower-precision fixed-point integers instead of floating-point values. It can be a complex process, but the Intel Neural Compressor greatly simplifies it for you.

Link to the tool below:


Accelerating scikit-learn algorithms with the Intel Extension for Scikit-learn

Scikit-learn is one of the most widely used packages for data science and machine learning (ML), so it is important that developers get the best performance out of it. Scikit-learn offers simple and efficient algorithms and tools for data mining and data analysis on various tasks such as classification, regression, and clustering.

The Intel Extension for Scikit-learn provides optimized implementations of many scikit-learn algorithms. It uses the Intel oneAPI Data Analytics Library (oneDAL) to achieve its acceleration. This library takes advantage of the latest vector instructions, such as Intel Advanced Vector Extensions 512 (Intel AVX-512). It also uses cache-friendly data blocking, fast BLAS operations with the Intel oneAPI Math Kernel Library (oneMKL), and scalable multithreading with the Intel oneAPI Threading Building Blocks (oneTBB).

The Intel Extension for Scikit-learn offers a mechanism to dynamically patch scikit-learn estimators so that they use the optimizations from the Intel oneAPI Data Analytics Library. When you use algorithms that are not supported by the extension, the package simply falls back to the original scikit-learn implementation, ensuring a seamless workflow.

You can install the package through pip or conda as shown below:

pip install scikit-learn-intelex
conda install scikit-learn-intelex -c conda-forge

To use the Intel Extension for Scikit-learn, all you need to do is add the two lines of code below to your Python script, before importing anything from scikit-learn:

from sklearnex import patch_sklearn
patch_sklearn()

Performance comparison:

For the performance comparison we will use the Color Quantization using K-Means sample from the scikit-learn website, shown below:

# Authors: Robert Layton <robertlayton@gmail.com>
#          Olivier Grisel <olivier.grisel@ensta.org>
#          Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle
from time import time

n_colors = 64

# Load the Summer Palace photo
china = load_sample_image("china.jpg")

# Convert to floats instead of the default 8 bits integer coding. Dividing by
# 255 is important so that plt.imshow works well on float data (values need
# to be in the range [0-1])
china = np.array(china, dtype=np.float64) / 255

# Load Image and transform to a 2D numpy array.
w, h, d = original_shape = tuple(china.shape)
assert d == 3
image_array = np.reshape(china, (w * h, d))

print("Fitting model on a small sub-sample of the data")
t0 = time()
image_array_sample = shuffle(image_array, random_state=0, n_samples=1_000)
kmeans = KMeans(n_clusters=n_colors, random_state=0).fit(image_array_sample)
print(f"done in {time() - t0:0.3f}s.")

# Get labels for all points
print("Predicting color indices on the full image (k-means)")
t0 = time()
labels = kmeans.predict(image_array)
print(f"done in {time() - t0:0.3f}s.")

codebook_random = shuffle(image_array, random_state=0, n_samples=n_colors)
print("Predicting color indices on the full image (random)")
t0 = time()
labels_random = pairwise_distances_argmin(codebook_random, image_array, axis=0)
print(f"done in {time() - t0:0.3f}s.")

def recreate_image(codebook, labels, w, h):
    """Recreate the (compressed) image from the code book & labels"""
    return codebook[labels].reshape(w, h, -1)

# Display all results, alongside original image
plt.figure(1)
plt.axis("off")
plt.title('Original image (96,615 colors)')
plt.imshow(china)

plt.figure(2)
plt.axis("off")
plt.title(f'Quantized image ({n_colors} colors, K-Means)')
plt.imshow(recreate_image(kmeans.cluster_centers_, labels, w, h))

plt.figure(3)
plt.axis("off")
plt.title(f'Quantized image ({n_colors} colors, Random)')
plt.imshow(recreate_image(codebook_random, labels_random, w, h))
plt.show()

We modify the above script by adding the scikit-learn patch discussed earlier at the top of the script, before the scikit-learn imports, in order to get the daal4py optimizations:

from sklearnex import patch_sklearn
patch_sklearn()

Our system info is displayed below:

You can run the scripts to compare the timing results for Fitting model on a small sub-sample of the data, Predicting color indices on the full image (k-means) and Predicting color indices on the full image (random).

Below is a graph of the timing results using regular scikit-learn and the Intel Extension for Scikit-learn.

From the above results, we can see that the Intel Extension outperforms regular scikit-learn with the speedups below:

10.44X on Fitting model on a small sub-sample of the data

4X on Predicting color indices on the full image (k-means)

1.52X on Predicting color indices on the full image (random)

Memory Traffic Optimization to Improve Application Performance

Memory traffic optimization often yields the greatest speedup of all the optimization techniques you can apply to an application, the other techniques being vectorization and multithreading. If we really want to leverage the power of vectorization, we have to optimize data re-use in caches. It is also important to understand that vector arithmetic in modern processors is cheap; it's memory access that's expensive. It's therefore paramount that we optimize memory access for bandwidth-bound applications.

But first, before looking at locality of memory access in space and time, let's have a quick refresher on vectorization and multithreading.

Vectorization basically means having a single instruction operate on multiple data elements. The speedup you get from vectorizing your application will depend on the instruction set of your hardware, since the number and width of the vector registers differ across platforms. Vectorization and multithreading can easily be implemented using OpenMP pragmas within your application. You can also rely on automatic vectorization by the compiler. The caveat with automatic vectorization is that compilers will only try to vectorize the innermost loop when you have multiple nested loops. You can override this with a #pragma omp simd in your code. You also need to take care of vector dependence within your code to ensure that the code is vectorized correctly. An example of true vector dependence is shown below:

for (int i = 1; i < n; i++) {
    a[i] += a[i-1];  /* a[i] needs the a[i-1] written by the previous iteration */
}

In the above snippet, each a[i] depends on the a[i-1] computed in the previous iteration, so the compiler cannot safely vectorize the loop. It is also important to understand that vectorization generally only works if the number of iterations is known before the loop starts. This is why you should avoid while loops whose iteration count cannot be determined up front, and use the good ol' for loops instead. If the compiler does not see any vectorization opportunities, you can expose them through a technique called strip mining. Strip mining is a programming technique that turns one loop into two nested loops. This technique not only presents the compiler with vectorization opportunities but also helps achieve data locality in time.
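The loop transformation itself is language-independent, so here is a small sketch of strip mining in Python. Python will not vectorize anything here; the point is only to show how one loop becomes two nested loops that do the same work. The function names and the strip length of 4 are arbitrary choices for illustration. In C or Fortran, the short inner loop is the part a compiler would consider for SIMD.

```python
def scale_plain(data, factor):
    """Original single loop."""
    out = [0.0] * len(data)
    for i in range(len(data)):
        out[i] = data[i] * factor
    return out


def scale_strip_mined(data, factor, strip=4):
    """Same computation, with the loop split into strips of `strip` elements."""
    n = len(data)
    out = [0.0] * n
    for start in range(0, n, strip):    # outer loop: one pass per strip
        stop = min(start + strip, n)    # the last strip may be shorter
        for i in range(start, stop):    # inner loop: short, fixed-size chunk
            out[i] = data[i] * factor
    return out


values = [float(i) for i in range(10)]
same = scale_strip_mined(values, 2.0) == scale_plain(values, 2.0)
print("Results match:", same)
```

The `min(start + strip, n)` bound handles the case where the array length is not a multiple of the strip length.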

If you really want to leverage vectorization, you will have to optimize data re-use in cache by achieving locality of memory access in both space and time. To achieve locality of memory access in space, ensure that your application accesses memory with unit stride! This makes optimal use of the cache lines during data transfer: if you access back-to-back memory elements in a loop, then with a 64-byte cache line each double-precision access brings in 7 neighboring elements for free, and each single-precision access brings in 15. Another important aspect of unit-stride access is that if you have a 2D container stored in row-major order (as in C), you should traverse the columns within a row first and then move on to the next row, so that each cache line is loaded with back-to-back data neighbors. Spatial locality can also be improved by moving from an Array of Structures (AoS) to a Structure of Arrays (SoA) in your data containers.
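The numbers above follow from simple arithmetic, sketched below. The 64-byte cache line is an assumption (typical on x86, not universal), and the helper names are invented for this example. The second helper lists the distance, in elements, between consecutive accesses to a row-major 2D array for the two possible loop orders.

```python
CACHE_LINE_BYTES = 64  # assumed cache-line size


def elements_per_line(element_bytes):
    """How many elements of the given size fit in one cache line."""
    return CACHE_LINE_BYTES // element_bytes


def access_strides(n_rows, n_cols, inner="col"):
    """Set of distances between consecutive accesses in a row-major layout."""
    if inner == "col":   # column index varies fastest (unit stride)
        order = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    else:                # row index varies fastest (strided access)
        order = [(i, j) for j in range(n_cols) for i in range(n_rows)]
    offsets = [i * n_cols + j for i, j in order]
    return {b - a for a, b in zip(offsets, offsets[1:])}


free_doubles = elements_per_line(8) - 1  # neighbors loaded with one 8-byte access
free_floats = elements_per_line(4) - 1   # neighbors loaded with one 4-byte access
row_wise = access_strides(4, 8, inner="col")
col_wise = access_strides(4, 8, inner="row")
print(free_doubles, free_floats, row_wise, col_wise)
```

With the column index in the innermost loop every step moves by exactly one element, while the other loop order jumps a full row (8 elements here) on almost every access, so most of each loaded cache line goes unused.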

The main aim of achieving data locality in time is to reduce the chance of cache misses. One of the most effective techniques for improving the cache hit rate within a loop is loop tiling. Loop tiling is quite complex, but it basically involves two steps: strip mining and permutation. If you have two nested loops and you detect that one of the data containers is being re-used but the re-use is not optimal, you strip mine the loop that accesses the data container and permute the outer two loops.
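As a rough illustration of those two steps, the sketch below tiles a matrix-vector product in Python. The example problem, the function names, and the tile size of 4 are my own choices. In the plain version the whole vector b is re-read for every row; in the tiled version the j loop is strip mined and the resulting tile loop is moved outermost, so a small piece of b is re-used across all rows before moving on. Python will not show the cache benefit, only the restructuring.

```python
def matvec_plain(a, b):
    """c[i] = sum_j a[i][j] * b[j], with the whole of b re-read for every i."""
    m, n = len(a), len(b)
    c = [0.0] * m
    for i in range(m):
        for j in range(n):
            c[i] += a[i][j] * b[j]
    return c


def matvec_tiled(a, b, tile=4):
    """Same product after strip mining the j loop and permuting the outer loops."""
    m, n = len(a), len(b)
    c = [0.0] * m
    for jj in range(0, n, tile):                    # strip-mined j loop, now outermost
        for i in range(m):                          # i loop runs inside each tile
            for j in range(jj, min(jj + tile, n)):  # one tile of b, re-used for all i
                c[i] += a[i][j] * b[j]
    return c


a = [[float(i + j) for j in range(10)] for i in range(6)]
b = [float(j) for j in range(10)]
same = matvec_tiled(a, b) == matvec_plain(a, b)
print("Results match:", same)
```

In a compiled language the tile size would be chosen so that one tile of b fits comfortably in cache.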

The purpose of this blog post was to provide a quick primer on various performance optimization techniques and how they are applied. I would recommend that you dig deeper into each of these areas (vectorization, multithreading, and memory traffic optimization) in order to get a better understanding and be able to apply them to your stencil codes and applications.

CERN…Science will save us all!

I would like to begin this post with a quote from Monstrous Regiment by Terry Pratchett: “The presence of those seeking the truth is infinitely to be preferred to the presence of those who think they’ve found it.” Hopefully this will make sense once you read this post.

I had the pleasure of contributing to the DEEP NLP project for document analysis and classification at CERN in Switzerland/France. Yes, CERN sits astride the Franco-Swiss border, and here physicists and engineers are tasked with probing the fundamental structure of the universe.


CERN is where the Higgs boson, also known as the God particle, was discovered. The science of particle discovery relies mainly on purpose-built particle accelerators and detectors. Accelerators boost beams of particles to high energies before the beams are made to collide with each other or with stationary targets. Detectors observe and record the results of these collisions. According to CERN, the particles are so tiny that the task of making them collide is akin to firing two needles 10 kilometers apart with such precision that they meet halfway!

The LHC (Large Hadron Collider), which is located at CERN, is the world’s largest and most powerful particle accelerator. The LHC consists of a 27-kilometre ring of superconducting magnets with a number of accelerating structures to boost the energy of the particles along the way. The LHC has a number of experiments for particle detection, the most notable ones being ATLAS and CMS. ATLAS was crucial in the discovery of the Higgs boson, and the interactions in the ATLAS detectors create an enormous flow of data. ATLAS generates ~1 petabyte of data per second, which is approximately four times the internet’s output.

Twelve Ways to Fool the Masses when giving Performance results on Parallel Computers

I came across one of the most interesting and humorous research papers while doing my nightly reads. The paper is titled Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, by David H. Bailey, and was published in 1991. You can download the full paper here.
The title describes exactly what the paper is about, and I’ll just share some interesting snippets from the document.

To quote in part the abstract:
“Many of us in the field of highly parallel scientific computing recognize that it is often quite difficult to match the run time performance of the best conventional supercomputers. But since lay persons usually don’t appreciate these difficulties and therefore don’t understand when we quote mediocre performance results, it is often necessary for us to adopt some advanced techniques in order to deflect attention from possibly unfavorable facts.”
