Speeding up BERT model inference through Quantization with the Intel Neural Compressor

The Intel Neural Compressor (INC) is an open-source Python library that offers network compression techniques such as quantization, pruning, and knowledge distillation. This blog post focuses on quantization; pruning and knowledge distillation will be covered in later posts.

Neural networks are resource-hungry algorithms that require significant compute and consume a lot of memory due to the large number of parameters used. This can be a challenge when we want to run our models on edge devices which are constrained by these factors. Quantization can help address this.

Quantization refers to techniques that enable lower-precision inference and training by performing computations with fixed-point integers (for example INT8), which use fewer bits than floating-point formats such as FP32. This often leads to smaller model sizes and faster inference.
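
To make the idea concrete, here is a minimal NumPy sketch of mapping FP32 values to INT8 and back. It is only an illustration of the general technique, not the Intel Neural Compressor's implementation; the function names `quantize_int8` and `dequantize` are made up for this example.

```python
import numpy as np

def quantize_int8(x):
    """Map an FP32 array to INT8 using one scale and zero point for the whole tensor."""
    # Extend the range to include 0 so that zero is exactly representable.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / 255.0 or 1.0  # avoid a zero scale for all-zero input
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an FP32 approximation of the original values."""
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical FP32 "weights" used only for this demonstration.
weights = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

print("bytes as FP32:", weights.nbytes, "bytes as INT8:", q.nbytes)
print("max rounding error:", float(np.max(np.abs(weights - restored))))
```

The INT8 copy takes a quarter of the memory, and each value is off by at most about half a quantization step (`scale / 2`), which is where the accuracy loss discussed next comes from.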

There is a caveat to note with quantization: it can reduce accuracy, because moving from a higher precision to a lower precision reduces the number of bits available to represent the neural network’s parameters. This can be mitigated by setting a threshold for the acceptable accuracy loss, validated against the test data, and by experimenting with different tuning strategies.

There are two ways to perform quantization:

Post-training quantization: as the name suggests, we train the model using FP32 weights and inputs, then quantize the weights after training. This is the approach we will use with the Intel Neural Compressor.

Quantization-aware training: here we quantize the weights during training.

Quantization with the Intel Neural Compressor is done through its API. We first define all the parameters, such as the source deep learning framework, the accuracy metric, and the tuning strategy, in a YAML file.

The available tuning strategies include basic, Bayesian, and random. With the basic strategy, the Intel Neural Compressor tries to quantize the entire model and then falls back to FP32 for certain layers if the accuracy threshold is not met. The Bayesian strategy uses Bayesian optimization to choose the next configuration to try.
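
The "quantize everything, then fall back" idea behind the basic strategy can be sketched as a toy loop. This is not the Intel Neural Compressor's actual algorithm; the function `basic_tuning`, the layer names, and the per-layer accuracy penalties are all invented for illustration.

```python
def basic_tuning(layers, evaluate, baseline_acc, relative_tol=0.01):
    """Toy fallback loop: start fully INT8, revert layers to FP32 until accuracy is acceptable.

    `evaluate` is a hypothetical callable that takes the set of INT8 layers
    and returns the resulting accuracy.
    """
    int8_layers = set(layers)
    target = baseline_acc * (1 - relative_tol)
    acc = evaluate(int8_layers)
    while acc < target and int8_layers:
        # Revert the single layer whose fallback to FP32 helps accuracy the most.
        best = max(sorted(int8_layers), key=lambda name: evaluate(int8_layers - {name}))
        int8_layers.remove(best)
        acc = evaluate(int8_layers)
    return int8_layers, acc

# Made-up accuracy cost of quantizing each layer (not measured values).
penalty = {"embedding": 0.001, "attention": 0.002, "softmax": 0.02, "dense": 0.001}
baseline = 0.93

def evaluate(int8_layers):
    return baseline - sum(penalty[name] for name in sorted(int8_layers))

kept_int8, final_acc = basic_tuning(list(penalty), evaluate, baseline)
print(sorted(kept_int8), round(final_acc, 4))
```

In this toy setup only the most sensitive layer is returned to FP32, and the remaining layers stay in INT8 because the accuracy target is already met.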

A sample YAML configuration file for quantizing an AlexNet model is shown below.

For this demo, we will quantize a BERT model from FP32 precision to INT8 precision and measure the resulting speedup.

#
# Copyright (c) 2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

model:
  name: bert
  framework: tensorflow
  inputs: input_file, batch_size
  outputs: IteratorGetNext:3, unstack:0, unstack:1

evaluation:
  accuracy:
    metric:
      SquadF1:
    dataloader:
      dataset:
        bert:
          root: eval.tf_record
          label_file: /home/rmallela/neural-compressor/examples/tensorflow/nlp/bert_large_squad/quantization/ptq/data/dev-v1.1.json
      batch_size: 64
    postprocess:
      transform:
        SquadV1:
          label_file: /home/rmallela/neural-compressor/examples/tensorflow/nlp/bert_large_squad/quantization/ptq/data/dev-v1.1.json
          vocab_file: /home/rmallela/neural-compressor/examples/tensorflow/nlp/bert_large_squad/quantization/ptq/data/vocab.txt
  performance:
    iteration: 10
    configs:
        num_of_instance: 4
        cores_per_instance: 7
    dataloader:
      dataset:
        bert:
          root: eval.tf_record
          label_file: /home/rmallela/neural-compressor/examples/tensorflow/nlp/bert_large_squad/quantization/ptq/data/dev-v1.1.json
      batch_size: 64

quantization:            
  calibration:
    sampling_size: 500
    dataloader:
      dataset:
        bert:
          root: eval.tf_record
          label_file: /home/rmallela/neural-compressor/examples/tensorflow/nlp/bert_large_squad/quantization/ptq/data/dev-v1.1.json
      batch_size: 64
  model_wise:
    weight:
      granularity: per_channel
tuning:
  accuracy_criterion:
    relative:  0.01   
  exit_policy:
    timeout: 0       
    max_trials: 100 
  random_seed: 9527
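
The configuration above sets the weight `granularity` to `per_channel`. As a rough illustration of why that can matter, the NumPy sketch below compares one scale for the whole weight tensor against one scale per output channel. It is a generic symmetric-quantization example under our own assumptions (the helper `symmetric_quantize` and the random weights are hypothetical), not the Intel Neural Compressor's code.

```python
import numpy as np

def symmetric_quantize(w, axis=None):
    """Symmetric INT8 quantization.

    axis=None computes a single scale for the whole tensor; axis=0 reduces
    over rows and therefore yields one scale per column (output channel).
    """
    max_abs = np.max(np.abs(w), axis=axis, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / 127.0)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Hypothetical weight matrix whose four output channels have very different ranges.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 4)) * np.array([0.01, 0.1, 1.0, 10.0])

q_t, s_t = symmetric_quantize(w)           # per tensor
q_c, s_c = symmetric_quantize(w, axis=0)   # per channel

err_tensor = float(np.mean(np.abs(w - q_t * s_t)))
err_channel = float(np.mean(np.abs(w - q_c * s_c)))
print("mean error per tensor: ", err_tensor)
print("mean error per channel:", err_channel)
```

With a single scale, the channels with small values are rounded very coarsely; giving each channel its own scale keeps the rounding error proportional to that channel's range.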

The quantization code for the configuration above is shown below:

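        from neural_compressor.quantization import Quantization

        # Load the YAML configuration shown above
        quantizer = Quantization('./bert.yaml')
        # FLAGS is assumed to hold the script's command-line flags (input and output model paths)
        quantizer.model = FLAGS.input_model
        # Calibrate and tune, then save the quantized INT8 model
        q_model = quantizer.fit()
        q_model.save(FLAGS.output_model)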

We can then compare the performance of the FP32 model and the INT8 model with the script below:

        import numpy as np
        from neural_compressor.experimental import Benchmark

        # Benchmark the model with the settings from the YAML file
        evaluator = Benchmark('./bert.yaml')
        evaluator.model = FLAGS.input_model
        results = evaluator()
        for mode, result in results.items():
            acc, batch_size, result_list = result
            # Mean time per batch divided by the batch size gives the per-sample latency in seconds
            latency = np.array(result_list).mean() / batch_size
            print('\n{} mode benchmark result:'.format(mode))
            print('Accuracy is {:.3f}'.format(acc))
            print('Batch size = {}'.format(batch_size))
            print('Latency: {:.3f} ms'.format(latency * 1000))
            print('Throughput: {:.3f} samples/sec'.format(1. / latency))
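
To make the latency and throughput arithmetic in the script above concrete, here is the same calculation on made-up per-batch timings. The numbers are hypothetical and chosen only so the result is easy to follow; they are not measurements from this demo.

```python
import numpy as np

batch_size = 64
# Hypothetical wall-clock time, in seconds, for three batches.
batch_seconds = [3.2, 3.0, 3.4]

# Average time per batch, spread over the samples in a batch.
latency = np.array(batch_seconds).mean() / batch_size   # seconds per sample
throughput = 1.0 / latency                               # samples per second

print('Latency: {:.3f} ms'.format(latency * 1000))
print('Throughput: {:.3f} samples/sec'.format(throughput))
```

Here the average batch takes 3.2 seconds, so each of the 64 samples costs about 50 ms, which corresponds to about 20 samples per second.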

We will run the quantization on a 2nd Gen Intel Xeon Scalable processor (Cascade Lake); the system information is displayed below.

We then execute the quantization script with the YAML configuration. This should take a couple of minutes, and once it completes you should get the output below, indicating that the process was successful.

The output also provides a breakdown of the FP32 layers that were quantized to INT8, as shown below.

As you can see, all but one of the layers were quantized to INT8.

We also get the performance data, which is what we really care about here, since it shows the speedup from quantizing the BERT model.

Two important metrics to observe here are the accuracy and the duration. The FP32 model, which is the baseline, reached an accuracy of 92.9861%, and the INT8 model reached 92.4100%. This is a drop of about 0.58 percentage points (roughly 0.62% relative), which stays within the 1% relative accuracy criterion set in the configuration. A small loss like this is expected with quantization, since we are moving to a lower precision.
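
As a quick check, the reported accuracies can be compared against the `relative: 0.01` accuracy criterion from the YAML file. The variable names below are ours, and we assume the criterion is applied as a relative drop from the FP32 baseline.

```python
fp32_accuracy = 92.9861   # baseline, in percent (from the run above)
int8_accuracy = 92.4100   # quantized model, in percent (from the run above)
relative_tolerance = 0.01  # tuning.accuracy_criterion.relative in the YAML file

absolute_drop = fp32_accuracy - int8_accuracy
relative_drop = absolute_drop / fp32_accuracy
meets_criterion = relative_drop <= relative_tolerance

print('Absolute drop: {:.4f} percentage points'.format(absolute_drop))
print('Relative drop: {:.2%}'.format(relative_drop))
print('Within tolerance:', meets_criterion)
```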

For performance, the FP32 baseline took 1557.2677s and the INT8 model took 1200.9974s. This is a reduction of about 22.9% in run time, or roughly a 1.30x speedup.
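
The two ways of expressing the gain come from the same pair of durations. A short sketch of the arithmetic, using only the numbers reported above (the variable names are ours):

```python
fp32_seconds = 1557.2677   # FP32 baseline duration
int8_seconds = 1200.9974   # INT8 duration

time_saved = fp32_seconds - int8_seconds
reduction = time_saved / fp32_seconds    # fraction of run time removed
speedup = fp32_seconds / int8_seconds    # how many times faster INT8 is

print('Time saved: {:.4f} s'.format(time_saved))
print('Run time reduction: {:.2%}'.format(reduction))
print('Speedup: {:.2f}x'.format(speedup))
```

The reduction is measured relative to the FP32 duration, while the speedup is the ratio of the two durations, which is why the two figures differ.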

In conclusion, quantization can speed up your neural networks by performing computations with lower-precision fixed-point integers instead of floating-point numbers. It can be a complex process, but the Intel Neural Compressor greatly simplifies it.

Link to the tool below:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html
