Quickstart
Run a model from start to finish with embedl-hub
Read this guide if you'd like a comprehensive overview of how to benchmark models with Embedl Hub. If you're eager to start benchmarking your own models or if you prefer to learn by doing, skip ahead to Your first benchmark.
This guide shows you how to go from having an idea for an application to benchmarking a model on remote hardware. To showcase this, we will optimize and profile a model that will run on a Samsung Galaxy S24 mobile phone.
You will learn how to quantize, compile, and benchmark a model using the Embedl Hub Python library.
Prerequisites
If you haven’t already done so, follow the instructions in the setup guide to:
- Create an Embedl Hub account
- Install the Embedl Hub Python library
- Configure an API Key
Create a project and experiment
Create a project and experiment for the application:
```bash
embedl-hub init \
    --project "Quickstart" \
    --experiment "Samsung Galaxy S24 image classifier"
```

This also sets the project and experiment as defaults for future commands. You can view your current settings at any time:
```bash
embedl-hub show
```

For alternative ways to configure projects and experiments, see the configuration guide.
Compile the model from ONNX to TFLite
Now that we’ve created a project and experiment, let’s verify that the model runs as expected on the target hardware. This process requires a series of steps:
- Compile: ONNX -> TFLite
- (Optional) Quantize: TFLite -> TFLite
- Benchmark: TFLite

The compile step expects an ONNX file.
You can save your existing PyTorch model in ONNX format using torch.onnx.export.
For this guide, we will convert the Torchvision MobileNet V2 model to ONNX using torch.onnx.export:
```python
import torch
from torchvision.models import mobilenet_v2

# Define the model and example input
model = mobilenet_v2(weights="IMAGENET1K_V2")
example_input = torch.rand(1, 3, 224, 224)

# Save model in ONNX format
torch.onnx.export(
    model,
    example_input,
    "path/to/mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,
)
```
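Before compiling, you can optionally sanity-check the exported ONNX file by running it with ONNX Runtime and comparing the result to the PyTorch output. This check is not part of the Embedl Hub workflow; it is a minimal sketch that assumes the onnxruntime package is installed and reuses the `model` and `example_input` objects from the export script above.

```python
import numpy as np
import onnxruntime as ort
import torch

# Run the original PyTorch model for reference
model.eval()
with torch.no_grad():
    torch_out = model(example_input).numpy()

# Run the exported ONNX model with ONNX Runtime
session = ort.InferenceSession("path/to/mobilenet_v2.onnx")
onnx_out = session.run(None, {"input": example_input.numpy()})[0]

# The two outputs should agree to within floating-point tolerance
np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)
print("ONNX export matches the PyTorch model.")
```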
Compile the saved model to TFLite format for use in later steps:

```bash
embedl-hub compile \
    --model /path/to/mobilenet_v2.onnx
```

Since we haven't set an output name, embedl-hub compile will save the model as mobilenet_v2.tflite.
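If you want to inspect the compiled TFLite model locally before moving on, you can load it with the standard TensorFlow Lite interpreter. This step is outside the Embedl Hub workflow and assumes the tensorflow package is installed; the file name matches the default output mentioned above.

```python
import numpy as np
import tensorflow as tf

# Load the compiled model and allocate its tensors
interpreter = tf.lite.Interpreter(model_path="mobilenet_v2.tflite")
interpreter.allocate_tensors()

# Inspect the input and output signatures
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print(input_details[0]["shape"], input_details[0]["dtype"])

# Run one inference with random data shaped like the model input
dummy = np.random.rand(*input_details[0]["shape"]).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]).shape)
```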
(Optional) Quantize the model
Quantizing a model can drastically reduce its inference latency on hardware, so we recommend completing this step.
Quantization lowers the number of bits used to represent the weights and activations in a neural network, which reduces both the memory and compute needed to run the model.
Although lowering the model's precision can also reduce its accuracy, you can mitigate this by calibrating the model on example data. You don't need a large dataset to achieve good quantization accuracy; a few hundred samples are usually more than enough.
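To build intuition for what quantization does and why calibration data helps, here is a purely illustrative sketch of 8-bit uniform quantization in PyTorch. It is not how embedl-hub quantize works internally; the weight tensor is a made-up example, and the scale and zero-point are derived from the observed value range in the same way calibration data is used to estimate activation ranges.

```python
import torch

def quantize_dequantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate uniform affine quantization: map floats to integer codes and back."""
    qmin, qmax = 0, 2**num_bits - 1
    # The scale and zero-point come from the observed value range; for real
    # models, activation ranges are estimated from calibration samples.
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)  # integer codes
    return (q - zero_point) * scale  # dequantized approximation

weights = torch.randn(256, 256)         # made-up "layer weights"
approx = quantize_dequantize(weights)   # what the layer sees after int8 quantization
error = (weights - approx).abs().mean()
print(f"Mean absolute quantization error: {error:.6f}")
```

A few hundred calibration samples simply give a better estimate of these ranges for the activations, which is why a small dataset is usually enough.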
Quantize the compiled model, pointing it at a small calibration dataset:

```bash
embedl-hub quantize \
    --model /path/to/mobilenet_v2.tflite \
    --data /path/to/dataset \
    --num-samples 100
```

Note: Some models contain operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).
Benchmark the model on remote hardware
Let’s evaluate how well the model performs using remote hardware:
```bash
embedl-hub benchmark \
    --model /path/to/mobilenet_v2.quantized.tflite \
    --device "Samsung Galaxy S24"
```

Benchmarking the model reports useful information such as the model's latency on the hardware platform, which layers are slowest, how many layers execute on each compute unit type, and more. We can use this information for advanced debugging and for iterating on the model's design (a sketch of this kind of analysis follows the questions below). It helps us answer questions like:
- Can we optimize the slowest layer?
- Why aren't certain layers executed on the expected compute unit?
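As an illustration of how you might dig into these questions, the sketch below summarizes per-layer timings in plain Python. The layer names, latencies, and compute-unit assignments are made-up placeholders; substitute the per-layer results reported by your benchmark run.

```python
from collections import Counter

# Hypothetical per-layer results; replace with the values from your benchmark report.
layers = [
    {"name": "Conv_0", "latency_ms": 0.42, "unit": "NPU"},
    {"name": "Conv_12", "latency_ms": 1.87, "unit": "GPU"},
    {"name": "Softmax_out", "latency_ms": 3.10, "unit": "CPU"},
]

# Which layers dominate the total latency?
slowest = sorted(layers, key=lambda l: l["latency_ms"], reverse=True)[:5]
for layer in slowest:
    print(f"{layer['name']:<16} {layer['latency_ms']:.2f} ms on {layer['unit']}")

# How many layers run on each compute unit? Layers that fall back to the CPU
# are often good candidates for further optimization.
print(Counter(layer["unit"] for layer in layers))
```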
Next steps
To learn more, check out the benchmarking models guide for a more detailed walkthrough and guidance on interpreting benchmark results.