Keras Gradient Based Post Training Quantization

model_compression_toolkit.gptq.keras_gradient_post_training_quantization(in_model, representative_data_gen, gptq_config, gptq_representative_data_gen=None, target_resource_utilization=None, core_config=CoreConfig(), target_platform_capabilities=DEFAULT_KERAS_TPC)

Quantize a trained Keras model using post-training quantization. The model is quantized using symmetric quantization with power-of-two thresholds. The model is first optimized using several transformations (e.g. BatchNormalization folding to preceding layers). Then, using a given dataset, statistics (e.g. min/max, histogram, etc.) are collected for each layer's output (and input, depending on the quantization configuration). For each possible bit width (per layer), a threshold is then calculated using the collected statistics. Then, if a mixed-precision configuration is given in core_config, an ILP solver is used to find a mixed-precision configuration and set a bit width for each layer. The model is then quantized (both coefficients and activations by default). To limit the maximal model size, a target resource utilization needs to be passed with weights_memory set (in bytes). Finally, the quantized weights are optimized using gradient-based post-training quantization by comparing points between the float and quantized models and minimizing the observed loss.

Parameters:
  • in_model (Model) – Keras model to quantize.

  • representative_data_gen (Callable) – Dataset used for calibration.

  • gptq_config (GradientPTQConfig) – Configuration for using gptq (e.g. optimizer).

  • gptq_representative_data_gen (Callable) – Dataset used for GPTQ training. If None, defaults to representative_data_gen.

  • target_resource_utilization (ResourceUtilization) – ResourceUtilization object to limit the search of the mixed-precision configuration as desired.

  • core_config (CoreConfig) – Configuration object containing parameters of how the model should be quantized, including mixed precision parameters.

  • target_platform_capabilities (TargetPlatformCapabilities) – TargetPlatformCapabilities to optimize the Keras model according to.

Returns:

A quantized model and information the user may need to handle the quantized model.

Examples

Import a Keras model:

>>> from tensorflow.keras.applications.mobilenet import MobileNet
>>> model = MobileNet()

Create a random dataset generator for the required number of calibration iterations (num_calibration_batches). In this example, a random dataset of 10 batches, each containing 4 images, is used.

>>> import numpy as np
>>> num_calibration_batches = 10
>>> def repr_datagen():
>>>     for _ in range(num_calibration_batches):
>>>         yield [np.random.random((4, 224, 224, 3))]
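
The snippets below assume the toolkit is imported under the mct alias:

>>> import model_compression_toolkit as mct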

Create an MCT core config, containing the quantization configuration:

>>> config = mct.core.CoreConfig()

If mixed precision is desired, create an MCT core config with a mixed-precision configuration to quantize a model with different bit widths for different layers. The candidate bit widths for quantization should be defined in the target platform model:

>>> config = mct.core.CoreConfig(mixed_precision_config=mct.core.MixedPrecisionQuantizationConfig(num_of_images=1))

For mixed precision, set a target resource utilization object: create a resource utilization object to limit the returned model's size. Note that this value affects only coefficients that should be quantized (for example, the kernel of Conv2D in Keras will be affected by this value, while the bias will not):

>>> ru = mct.core.ResourceUtilization(model.count_params() * 0.75)  # About 0.75 of the model size when quantized with 8 bits.
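
As a rough sanity check, here is a sketch of the same budget written out explicitly, assuming each 8-bit coefficient occupies one byte and that ResourceUtilization accepts the weights-memory budget (in bytes) as its weights_memory argument:

>>> weights_memory_8bit = model.count_params() * 1  # bytes, if every coefficient is stored with 8 bits
>>> ru = mct.core.ResourceUtilization(weights_memory=0.75 * weights_memory_8bit)  # same ~75% target as above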

Create GPTQ config:

>>> gptq_config = mct.gptq.get_keras_gptq_config(n_epochs=1)
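
get_keras_gptq_config also accepts an optimizer. As a sketch (the epoch count and learning rate here are illustrative assumptions, not recommended values), a longer fine-tuning run could be configured as:

>>> import tensorflow as tf
>>> gptq_config = mct.gptq.get_keras_gptq_config(n_epochs=5, optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4))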

Pass the model with the representative dataset generator to get a quantized model:

>>> quantized_model, quantization_info = mct.gptq.keras_gradient_post_training_quantization(model, repr_datagen, gptq_config, target_resource_utilization=ru, core_config=config)
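
The returned object is a regular Keras model, so it can be called directly. For instance, a quick inference sanity check on random data (a sketch; the input shape is assumed to match MobileNet's default 224x224x3):

>>> images = np.random.random((4, 224, 224, 3)).astype(np.float32)
>>> predictions = quantized_model(images)  # run the quantized model like any Keras model
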
Return type:

Tuple[Model, UserInformation]