ONNX Runtime over SSH
Compile, profile, and invoke ONNX Runtime models on your own hardware.
This guide walks you through compiling, profiling, and invoking an ONNX Runtime model on your own hardware over SSH using the embedl-onnxruntime backend.
You will learn how to:
- Install and configure embedl-onnxruntime on the target device
- Compile an ONNX model with quantization on the target device
- Profile the compiled model
- Invoke the model with real input data
Prerequisites
Make sure you have completed the setup guide and the prerequisites for your hardware, including passwordless SSH access to the target device.
Installing embedl-onnxruntime on the target device
The embedl-onnxruntime provider requires the embedl-onnxruntime package
to be installed on the target device. We recommend installing it in a
virtual environment:
```shell
# On the target device:
$ python3 -m venv ~/embedl-ort-env
$ source ~/embedl-ort-env/bin/activate
$ pip install embedl-onnxruntime
```
If you installed into a virtual environment, note the full path to the embedl-onnxruntime binary; you will need it when compiling later:
```shell
$ realpath ~/embedl-ort-env/bin/embedl-onnxruntime
/home/pi/embedl-ort-env/bin/embedl-onnxruntime
```
If the binary is already on the device’s $PATH, you can skip this step.
Creating a project
```python
from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath

ctx = HubContext(
    project_name="ONNX Runtime SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
)
```
The HubContext is your entry point. It manages the project, artifact
directory, devices, and tracking. We’ll register a device in the next
section.
The artifact_base_dir is where compiled models, profiling results, and
other outputs are stored on disk. If omitted, HubContext creates a
temporary directory when used as a context manager (with ctx:), and
cleans it up automatically when the context exits. This is convenient
for scripts where you only need the in-memory results and don’t need to
persist artifacts to disk.
For alternative ways to configure project context, see the configuration guide.
Connecting to your device
Next, configure a connection to your target device over SSH.
```python
from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.device import EmbedlONNXRuntimeConfig
from embedl_hub.core import LocalPath

device = DeviceManager.get_embedl_onnxruntime_device(
    SSHConfig(host="192.168.1.42", username="pi"),
    name="rpi",
    provider_config=EmbedlONNXRuntimeConfig(
        embedl_onnxruntime_path="/home/pi/embedl-ort-env/bin/embedl-onnxruntime",
    ),
)

ctx = HubContext(
    project_name="ONNX Runtime SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)
```
If embedl-onnxruntime is on the device’s $PATH, you can omit embedl_onnxruntime_path.
The name parameter is a label you choose for this device; you reference
it by that label when creating components later (e.g. device="rpi").
Preparing a model
The compile step expects an ONNX file. You can save
your existing PyTorch model in ONNX format using torch.onnx.export:
```python
import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2(weights="IMAGENET1K_V2")
example_input = torch.rand(1, 3, 224, 224)
torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,
    dynamo=False,
)
```
Compiling a model
Compile the ONNX model with quantization on the target device. The model is transferred to the device over SSH, compiled there, and the result is fetched back.
```python
from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.device import EmbedlONNXRuntimeConfig
from embedl_hub.core import LocalPath
from embedl_hub.core.compile import ONNXRuntimeCompiler

device = DeviceManager.get_embedl_onnxruntime_device(
    SSHConfig(host="192.168.1.42", username="pi"),
    name="rpi",
    provider_config=EmbedlONNXRuntimeConfig(
        embedl_onnxruntime_path="/home/pi/embedl-ort-env/bin/embedl-onnxruntime",
    ),
)

ctx = HubContext(
    project_name="ONNX Runtime SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

compiler = ONNXRuntimeCompiler(device="rpi")

with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    print(compiled.path.file_path)
```
The embedl-onnxruntime provider quantizes the model as part of
compilation, applying INT8 post-training quantization to lower the
precision of weights and activations. This reduces memory usage and
inference latency on the target device.
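The provider performs this quantization for you, but the underlying idea can be illustrated in a few lines of NumPy. The following is a simplified sketch of symmetric per-tensor INT8 quantization, not the provider's actual implementation:

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0  # a single scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the INT8 representation."""
    return q.astype(np.float32) * scale

weights = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(weights)
# Round-trip error is bounded by half a quantization step (scale / 2).
error = np.abs(weights - dequantize(q, scale)).max()
print(q.dtype, error)
```

Each INT8 weight takes a quarter of the memory of a float32 weight, which is where the memory and latency savings come from.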
Providing calibration data
Although quantization reduces the model’s precision, you can mitigate the accuracy loss by providing calibration data — a small set of representative input samples. You don’t need a large dataset; usually, a few hundred samples are more than enough. If no calibration data is provided, random data is used.
```python
compiler = ONNXRuntimeCompiler(
    device="rpi",
    calibration_data=LocalPath("path/to/dataset"),
)
```
The calibration_data parameter accepts a path to a directory of .npy files, or a dictionary of NumPy arrays where keys are the model input
names and values have shape (num_samples, *input_shape).
For file-based calibration, place one .npy file per sample directly
in the directory for single-input models. For multi-input models, create
one subdirectory per input tensor (named after the input), each with the
same number of files.
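The file-based layout described above can be sketched as follows. The directory names, sample counts, and tensor shapes are illustrative; match the shapes to your own model's inputs (and check whether your export expects a batch dimension per file):

```python
import os
import numpy as np

# Single-input model: one .npy file per sample, directly in the directory.
os.makedirs("calib_single", exist_ok=True)
for i in range(8):  # a handful here; a few hundred samples is typical
    sample = np.random.rand(3, 224, 224).astype(np.float32)
    np.save(f"calib_single/sample_{i:03d}.npy", sample)

# Multi-input model: one subdirectory per input tensor, named after the
# input, each holding the same number of files.
for input_name, shape in [("input_ids", (128,)), ("attention_mask", (128,))]:
    os.makedirs(f"calib_multi/{input_name}", exist_ok=True)
    for i in range(8):
        np.save(f"calib_multi/{input_name}/sample_{i:03d}.npy",
                np.ones(shape, dtype=np.int64))
```

In practice you would save real preprocessed samples from your dataset rather than random arrays, since the calibration data should reflect the inputs the model will see in production.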
Note: Some models have operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).
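A small NumPy experiment illustrates why softmax outputs are hard to represent in INT8: most probabilities are crushed near zero, so a single per-tensor scale spends nearly all of the 8-bit range on the few large values. This is an illustrative sketch, not how the provider quantizes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.linspace(-8, 8, 64)  # a typical spread of attention scores
probs = softmax(logits)

scale = probs.max() / 127.0      # symmetric per-tensor INT8 scale
q = np.clip(np.round(probs / scale), -127, 127)

# Probabilities smaller than half a quantization step collapse to exactly 0.
collapsed = np.sum(q == 0)
print(f"{collapsed} of {probs.size} probabilities quantize to 0")
```

Most of the distribution's tail is rounded to zero, which is why attention softmax layers often need higher precision or special handling.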
Profiling a model
Profile the compiled model on the target device:
```python
from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.device import EmbedlONNXRuntimeConfig
from embedl_hub.core import LocalPath
from embedl_hub.core.compile import ONNXRuntimeCompiler
from embedl_hub.core.profile import ONNXRuntimeProfiler

device = DeviceManager.get_embedl_onnxruntime_device(
    SSHConfig(host="192.168.1.42", username="pi"),
    name="rpi",
    provider_config=EmbedlONNXRuntimeConfig(
        embedl_onnxruntime_path="/home/pi/embedl-ort-env/bin/embedl-onnxruntime",
    ),
)

ctx = HubContext(
    project_name="ONNX Runtime SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

compiler = ONNXRuntimeCompiler(device="rpi")
profiler = ONNXRuntimeProfiler(device="rpi")

with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    result = profiler.run(ctx, compiled)
    print("Latency:", result.latency.value)
    print("FPS:", result.fps.value)
```
Your runs are automatically synced to your project on hub.embedl.com.
Profiling gives you the model’s latency on the target hardware, which layers are slowest, the number of layers executed on each compute unit type, and more. You can use this information to iterate on the model’s design and answer questions like:
- Can we optimize the slowest layer?
- Why aren’t certain layers running on the expected compute unit?
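The latency and FPS figures reported above are two views of the same measurement. As a rough sanity check, assuming latency is reported in milliseconds per inference and execution is single-stream:

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Throughput implied by a per-inference latency, single-stream."""
    return 1000.0 / latency_ms

# e.g. a 20 ms/inference model sustains at most 50 inferences per second
print(fps_from_latency_ms(20.0))
```

If the reported FPS is much higher than this back-of-the-envelope number, the runtime is likely batching or pipelining inferences.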
Invoking a model
Invoke the compiled model with real input data to get inference outputs:
```python
import numpy as np

from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.device import EmbedlONNXRuntimeConfig
from embedl_hub.core import LocalPath
from embedl_hub.core.compile import ONNXRuntimeCompiler
from embedl_hub.core.invoke import ONNXRuntimeInvoker

device = DeviceManager.get_embedl_onnxruntime_device(
    SSHConfig(host="192.168.1.42", username="pi"),
    name="rpi",
    provider_config=EmbedlONNXRuntimeConfig(
        embedl_onnxruntime_path="/home/pi/embedl-ort-env/bin/embedl-onnxruntime",
    ),
)

ctx = HubContext(
    project_name="ONNX Runtime SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

compiler = ONNXRuntimeCompiler(device="rpi")
invoker = ONNXRuntimeInvoker(device="rpi")

input_data = dict(input=np.random.rand(1, 3, 224, 224).astype(np.float32))

with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    invocation = invoker.run(ctx, compiled, input_data)
    print(invocation.output)
```
The input_data dictionary maps input tensor names to NumPy arrays.
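The random input above is fine for a smoke test; for a real prediction, the input must be preprocessed the same way the model was trained. A sketch for an ImageNet-trained MobileNetV2, using the standard ImageNet normalization constants (your own pipeline may differ, and resizing to 224x224 is assumed to have happened already):

```python
import numpy as np

# Standard ImageNet per-channel statistics (RGB order).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """Turn a 224x224x3 uint8 RGB image into a (1, 3, 224, 224) float32 tensor."""
    x = image_hwc_uint8.astype(np.float32) / 255.0   # scale to [0, 1]
    x = (x - MEAN) / STD                             # normalize per channel
    x = x.transpose(2, 0, 1)[np.newaxis]             # HWC -> NCHW, add batch dim
    return np.ascontiguousarray(x, dtype=np.float32)

# Stand-in for a real decoded photo; use e.g. Pillow to load one in practice.
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
input_data = {"input": preprocess(image)}
```

The key "input" matches the input_names passed to torch.onnx.export earlier; if you exported with different names, use those instead.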
Next steps
- Learn how to view, name, and tag your runs, and how to interpret profiling results in the exploring results guide.
- See the providers guide for the full reference of supported provider and toolchain combinations.