ONNX Runtime over SSH

Compile, profile, and invoke ONNX Runtime models on your own hardware.

This guide walks you through compiling, profiling, and invoking an ONNX Runtime model on your own hardware over SSH using the embedl-onnxruntime backend.

You will learn how to:

  • Install and configure embedl-onnxruntime on the target device
  • Compile an ONNX model with quantization on the target device
  • Profile the compiled model
  • Invoke the model with real input data

Prerequisites

Make sure you have completed the setup guide and the prerequisites for your own hardware, including passwordless SSH access to the target device.
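
If you still need passwordless access, one common approach is to copy your SSH public key to the device (the host and username here match the examples used throughout this guide):

# On your local machine. Assumes an existing key pair in ~/.ssh;
# generate one with ssh-keygen first if needed.
$ ssh-copy-id pi@192.168.1.42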

Installing embedl-onnxruntime on the target device

The embedl-onnxruntime provider requires the embedl-onnxruntime package to be installed on the target device. We recommend installing it in a virtual environment:

# On the target device:
$ python3 -m venv ~/embedl-ort-env
$ source ~/embedl-ort-env/bin/activate
$ pip install embedl-onnxruntime

If you installed into a virtual environment, note the full path to the embedl-onnxruntime binary — you will need it when compiling later:

realpath ~/embedl-ort-env/bin/embedl-onnxruntime
/home/pi/embedl-ort-env/bin/embedl-onnxruntime

If the binary is already on the device’s $PATH, you can skip this step.

Creating a project

from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
ctx = HubContext(
    project_name="ONNX Runtime SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
)

The HubContext is your entry point. It manages the project, artifact directory, devices, and tracking. We’ll register a device in the next section.

The artifact_base_dir is where compiled models, profiling results, and other outputs are stored on disk. If omitted, HubContext creates a temporary directory when used as a context manager (with ctx:), and cleans it up automatically when the context exits. This is convenient for scripts where you only need the in-memory results and don’t need to persist artifacts to disk.
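
For example, a throwaway script can rely entirely on the temporary directory (a minimal sketch of the behavior described above):

from embedl_hub.core import HubContext

# No artifact_base_dir: a temporary directory is created when the
# context is entered and cleaned up when the `with` block exits.
with HubContext(project_name="ONNX Runtime SSH") as ctx:
    ...  # run compile, profile, and invoke steps here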

For alternative ways to configure project context, see the configuration guide.

Connecting to your device

Next, configure a connection to your target device over SSH.

from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.device import EmbedlONNXRuntimeConfig
from embedl_hub.core import LocalPath
device = DeviceManager.get_embedl_onnxruntime_device(
    SSHConfig(host="192.168.1.42", username="pi"),
    name="rpi",
    provider_config=EmbedlONNXRuntimeConfig(
        embedl_onnxruntime_path="/home/pi/embedl-ort-env/bin/embedl-onnxruntime",
    ),
)
ctx = HubContext(
    project_name="ONNX Runtime SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

If embedl-onnxruntime is on the device’s $PATH, you can omit embedl_onnxruntime_path.

The name parameter is a label you choose for this device; you reference it by that label when creating components later (e.g. device="rpi").

Preparing a model

The compile step expects an ONNX file. You can save your existing PyTorch model in ONNX format using torch.onnx.export:

import torch
from torchvision.models import mobilenet_v2
model = mobilenet_v2(weights="IMAGENET1K_V2")
example_input = torch.rand(1, 3, 224, 224)
torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,
    dynamo=False,
)
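
Before moving the model to the device, you can optionally sanity-check the exported file by running it locally with the onnxruntime package (assuming it is installed on your development machine; this is not required by the rest of the guide):

import numpy as np
import onnxruntime as ort

# Run the exported model locally on CPU and inspect the output shape.
session = ort.InferenceSession("mobilenet_v2.onnx")
outputs = session.run(None, {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
print(outputs[0].shape)  # (1, 1000) class scores for MobileNetV2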

Compiling a model

Compile the ONNX model with quantization on the target device. The model is transferred to the device over SSH, compiled there, and the result is fetched back.

from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.device import EmbedlONNXRuntimeConfig
from embedl_hub.core import LocalPath
from embedl_hub.core.compile import ONNXRuntimeCompiler
device = DeviceManager.get_embedl_onnxruntime_device(
    SSHConfig(host="192.168.1.42", username="pi"),
    name="rpi",
    provider_config=EmbedlONNXRuntimeConfig(
        embedl_onnxruntime_path="/home/pi/embedl-ort-env/bin/embedl-onnxruntime",
    ),
)
ctx = HubContext(
    project_name="ONNX Runtime SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)
compiler = ONNXRuntimeCompiler(device="rpi")
with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    print(compiled.path.file_path)

The embedl-onnxruntime provider quantizes the model as part of compilation, applying INT8 post-training quantization to lower the precision of weights and activations. This reduces memory usage and inference latency on the target device.
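
To illustrate the idea (a generic sketch, not the provider's exact scheme), affine INT8 quantization maps floating-point values to 8-bit integers through a scale and a zero point:

import numpy as np

# Generic affine INT8 quantization of one tensor; the provider handles
# calibration and any per-channel details internally.
x = np.random.randn(4, 4).astype(np.float32)
scale = (x.max() - x.min()) / 255.0
zero_point = -128 - np.round(x.min() / scale).astype(np.int32)
q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
x_hat = (q.astype(np.float32) - zero_point) * scale  # dequantized values
print(np.abs(x - x_hat).max())  # error is at most about scale / 2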

Providing calibration data

Although quantization reduces the model’s precision, you can mitigate the accuracy loss by providing calibration data — a small set of representative input samples. You don’t need a large dataset; usually, a few hundred samples are more than enough. If no calibration data is provided, random data is used.

compiler = ONNXRuntimeCompiler(
    device="rpi",
    calibration_data=LocalPath("path/to/dataset"),
)

The calibration_data parameter accepts a path to a directory of .npy files, or a dictionary of NumPy arrays where keys are the model input names and values have shape (num_samples, *input_shape).

For file-based calibration with a single-input model, place one .npy file per sample directly in the directory. For multi-input models, create one subdirectory per input tensor (named after the input), each containing the same number of files. Both forms are sketched below.
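
For example, either of the following works for the MobileNetV2 model exported above, whose single input is named input. The calib directory name is illustrative, the random arrays stand in for real samples, and the shapes assume *input_shape excludes the batch dimension:

import os
import numpy as np
from embedl_hub.core import LocalPath
from embedl_hub.core.compile import ONNXRuntimeCompiler

# Option 1: a dict of NumPy arrays; keys are input names and the
# first axis indexes samples.
calibration = {"input": np.random.rand(200, 3, 224, 224).astype(np.float32)}
compiler = ONNXRuntimeCompiler(device="rpi", calibration_data=calibration)

# Option 2: a directory with one .npy file per sample (single-input model).
os.makedirs("calib", exist_ok=True)
for i in range(200):
    np.save(f"calib/sample_{i:03d}.npy", np.random.rand(3, 224, 224).astype(np.float32))
compiler = ONNXRuntimeCompiler(device="rpi", calibration_data=LocalPath("calib"))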

Note: Some models have operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).

Profiling a model

Profile the compiled model on the target device:

from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.device import EmbedlONNXRuntimeConfig
from embedl_hub.core import LocalPath
from embedl_hub.core.compile import ONNXRuntimeCompiler
from embedl_hub.core.profile import ONNXRuntimeProfiler
device = DeviceManager.get_embedl_onnxruntime_device(
    SSHConfig(host="192.168.1.42", username="pi"),
    name="rpi",
    provider_config=EmbedlONNXRuntimeConfig(
        embedl_onnxruntime_path="/home/pi/embedl-ort-env/bin/embedl-onnxruntime",
    ),
)
ctx = HubContext(
    project_name="ONNX Runtime SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)
compiler = ONNXRuntimeCompiler(device="rpi")
profiler = ONNXRuntimeProfiler(device="rpi")
with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    result = profiler.run(ctx, compiled)
    print("Latency:", result.latency.value)
    print("FPS:", result.fps.value)

Your runs are automatically synced to your project on hub.embedl.com.

Profiling gives you the model’s latency on the target hardware, which layers are slowest, the number of layers executed on each compute unit type, and more. You can use this information to iterate on the model’s design and answer questions like:

  • Can we optimize the slowest layer?
  • Why aren’t certain layers running on the expected compute unit?

Invoking a model

Invoke the compiled model with real input data to get inference outputs:

import numpy as np
from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.device import EmbedlONNXRuntimeConfig
from embedl_hub.core import LocalPath
from embedl_hub.core.compile import ONNXRuntimeCompiler
from embedl_hub.core.invoke import ONNXRuntimeInvoker
device = DeviceManager.get_embedl_onnxruntime_device(
    SSHConfig(host="192.168.1.42", username="pi"),
    name="rpi",
    provider_config=EmbedlONNXRuntimeConfig(
        embedl_onnxruntime_path="/home/pi/embedl-ort-env/bin/embedl-onnxruntime",
    ),
)
ctx = HubContext(
    project_name="ONNX Runtime SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)
compiler = ONNXRuntimeCompiler(device="rpi")
invoker = ONNXRuntimeInvoker(device="rpi")
input_data = dict(input=np.random.rand(1, 3, 224, 224).astype(np.float32))
with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    invocation = invoker.run(ctx, compiled, input_data)
    print(invocation.output)

The input_data dictionary maps input tensor names to NumPy arrays.
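
For a classifier such as MobileNetV2, you would typically take the argmax over the class scores. The exact structure of invocation.output depends on the provider; this sketch assumes it behaves like a mapping from output tensor names (as set in torch.onnx.export above) to NumPy arrays:

import numpy as np

# Assumes invocation.output maps output names to NumPy arrays.
logits = invocation.output["output"]
predicted_class = int(np.argmax(logits, axis=-1)[0])
print(predicted_class)  # index of the top ImageNet class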

Next steps