January 20, 2026 · 9 min read

Hardware-software co-design for inference at the edge

Running inference close to the data — on embedded hardware, in a factory, on a vehicle — forces every assumption about cloud-native AI architecture to be re-examined. The constraints are real and the engineering is interesting.

Most of the current discussion about deploying AI in production assumes a cloud inference endpoint, a reliable network connection, and an essentially unlimited compute budget constrained only by cost. These assumptions are false for a substantial and growing category of applications: quality inspection on a manufacturing line, anomaly detection in industrial equipment, document processing in a facility without reliable connectivity, real-time decision support in a vehicle or a drone.

Edge inference forces every default assumption about AI deployment to be re-examined. The hardware is constrained and fixed for a product lifecycle measured in years. The software stack needs to match that hardware precisely. The power envelope is a hard limit, not a line item. The network may not exist. The failure modes are physical and sometimes irreversible. Engineering for this environment is a different discipline from deploying a containerized inference service on a cloud provider, and the gap between the two is worth examining carefully.

The hardware envelope shapes everything

Edge inference hardware exists on a spectrum from microcontrollers running quantized models under one watt of power to GPU-equipped embedded boxes running full-precision models at fifty watts. The right point on that spectrum for a given application is determined before a line of model code is written, and getting it wrong is expensive: hardware procurement cycles are long, and a model that still needs more memory or memory bandwidth than the target device provides after quantization will not meet its latency target there, no matter how well the rest of the stack is tuned.

The model architecture, the quantization scheme, and the inference runtime all need to be chosen in conjunction with the target hardware, not sequentially. A transformer with large embedding dimensions that fits comfortably on an NVIDIA H100 may require careful INT4 quantization and operator fusion to run acceptably on an embedded ARM SoC with no dedicated matrix-multiply hardware. The performance characteristics of the target hardware (memory bandwidth, TOPS, cache size, operator support in the target runtime) should be understood before the model is selected, not after it's trained.

This is what hardware-software co-design means in practice: neither the hardware nor the software is fixed while the other is optimized. Both are chosen jointly, with explicit performance models for how the software will run on the hardware before any physical prototyping happens.
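An explicit performance model can start very small. The sketch below is a roofline-style lower bound, assuming you already have rough per-inference operation and memory-traffic counts for the model and sustained throughput figures for the device. The `Device` and `Workload` types and all the numbers are hypothetical placeholders, not measurements of any real part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device description; fill in from datasheets or microbenchmarks."""
    peak_ops_per_s: float             # sustained arithmetic throughput
    mem_bandwidth_bytes_per_s: float  # sustained memory bandwidth
    memory_bytes: int                 # memory available to the model


@dataclass(frozen=True)
class Workload:
    """Hypothetical per-inference cost of one model at one precision."""
    ops_per_inference: float
    bytes_moved_per_inference: float
    resident_bytes: int               # weights + activations + runtime overhead


def fits(dev: Device, wl: Workload) -> bool:
    """The model must fit in memory before latency is even worth estimating."""
    return wl.resident_bytes <= dev.memory_bytes


def roofline_latency_s(dev: Device, wl: Workload) -> float:
    """Lower bound on latency: the slower of compute time and memory-traffic time."""
    compute_s = wl.ops_per_inference / dev.peak_ops_per_s
    memory_s = wl.bytes_moved_per_inference / dev.mem_bandwidth_bytes_per_s
    return max(compute_s, memory_s)


def bound_by(dev: Device, wl: Workload) -> str:
    """Report which resource sets the lower bound."""
    compute_s = wl.ops_per_inference / dev.peak_ops_per_s
    memory_s = wl.bytes_moved_per_inference / dev.mem_bandwidth_bytes_per_s
    return "compute" if compute_s >= memory_s else "memory"


# Illustrative numbers only: a 4 TOPS device with 10 GB/s of bandwidth.
soc = Device(peak_ops_per_s=4e12, mem_bandwidth_bytes_per_s=10e9,
             memory_bytes=2 * 1024**3)
model = Workload(ops_per_inference=2e9, bytes_moved_per_inference=500e6,
                 resident_bytes=600 * 1024**2)

print(fits(soc, model), bound_by(soc, model), roofline_latency_s(soc, model))
```

A real device will be slower than this bound, so the estimate is most useful for ruling combinations out early: if the lower bound already misses the latency target, no amount of runtime tuning will rescue that pairing.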

Quantization: correctness before compression

Quantization, representing model weights and activations at lower numerical precision than the training precision, is the primary tool for fitting inference into constrained compute envelopes. INT8 is now well understood and widely supported. INT4 is increasingly viable for weights with careful calibration. Techniques like GPTQ and AWQ have made post-training weight quantization to INT4 practical for large language models without catastrophic quality loss on most tasks, and SmoothQuant has done the same for INT8 quantization of both weights and activations.
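To make the precision trade-off concrete, here is a minimal sketch of the simplest scheme: symmetric, per-tensor quantization with a single shared scale. It is not GPTQ, AWQ, or SmoothQuant, which add calibration and error compensation on top of this basic mapping; the function names are illustrative.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers using one shared scale (symmetric, per-tensor)."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [qi * scale for qi in q]


def max_abs_error(values, bits):
    """Largest round-trip error introduced at the given bit width."""
    q, scale = quantize_symmetric(values, bits)
    return max(abs(v - d) for v, d in zip(values, dequantize(q, scale)))


weights = [0.5, -1.0, 0.25, 0.75]
for bits in (8, 4):
    print(bits, max_abs_error(weights, bits))
```

The round-trip error is bounded by half a quantization step, and the step grows by roughly 18x when moving from INT8 to INT4 (127 levels per side down to 7). A single outlier also stretches the scale for every other value in the tensor, which is one reason per-channel or per-group scales are common in practice.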

The critical discipline is to evaluate quantization effects on your specific task distribution, not on generic benchmarks. A model that scores well on standard language modeling benchmarks after INT4 quantization may degrade significantly on the structured extraction task you actually need it to perform, particularly if that task involves numerical reasoning or precise pattern matching. Run your eval set (the same one you run in CI for software changes) against each quantization target before committing to a hardware platform or a model architecture.
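One way to wire that check into CI is a gate that compares each quantized candidate against the full-precision baseline on the task eval set. The sketch below assumes a classification-style task with exact-match scoring; the function names, the candidate labels, and the `max_drop` threshold are all hypothetical and should be replaced by whatever metric and tolerance your task actually needs.

```python
def accuracy(predictions, labels):
    """Exact-match accuracy over the task eval set."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def quantization_gate(labels, baseline_preds, candidates, max_drop=0.01):
    """Return {name: (accuracy, passed)} for each quantized candidate.

    A candidate passes when its accuracy drop relative to the
    full-precision baseline is at most `max_drop`.
    """
    base = accuracy(baseline_preds, labels)
    report = {}
    for name, preds in candidates.items():
        acc = accuracy(preds, labels)
        report[name] = (acc, (base - acc) <= max_drop)
    return report


# Toy data standing in for real eval-set outputs.
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
baseline = list(labels)
int8_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # one miss
int4_preds = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]   # three misses

report = quantization_gate(labels, baseline,
                           {"int8": int8_preds, "int4": int4_preds},
                           max_drop=0.15)
print(report)
```

Running the gate per hardware target, not just per bit width, matters because the same quantized model can produce different outputs under different runtimes and kernels.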

Quantization-aware training (QAT), where the model is fine-tuned with simulated quantization noise during training, produces models that are more robust to quantization at inference time than post-training quantization applied to a model trained at full precision. If the target hardware is known at training time, QAT is worth the additional training cost, particularly for tasks where INT4 post-training quantization shows noticeable quality degradation.

Latency, throughput, and the real-time constraint

Cloud inference latency is dominated by network round-trip time and queue depth at the serving infrastructure. Edge inference latency is dominated by the model's arithmetic intensity, the hardware's memory bandwidth, and the efficiency of the inference runtime's kernel implementations for the target device. These are different problems with different solutions.

Real-time constraints on edge applications are often hard constraints, not soft targets. A quality inspection system that must classify an image in under 20 milliseconds because the production line moves at a fixed speed cannot be optimized post-hoc with caching or request batching. The entire inference pipeline — from sensor to decision — must be designed to fit within that budget, with margin for the variance the real world introduces.
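A budget like this can be checked mechanically once each stage has been timed on the target device. The following sketch sums a high percentile of every stage and compares it with the budget minus a safety margin. The stage names, sample values, and the 20 percent margin are illustrative assumptions, and summing per-stage percentiles is only a rough, usually pessimistic, stand-in for the end-to-end tail.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of a list of latency samples."""
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100   # integer ceil(0.99 * n)
    return ordered[rank - 1]


def fits_budget(stage_samples_ms, budget_ms, margin=0.2):
    """Return (estimated worst case, whether it fits the budget with margin).

    `stage_samples_ms` maps a stage name to latency samples measured on
    the target hardware. `margin` reserves a fraction of the budget for
    variance that the measurements did not capture.
    """
    worst_case = sum(p99(samples) for samples in stage_samples_ms.values())
    return worst_case, worst_case <= budget_ms * (1 - margin)


# Toy measurements for a sensor-to-decision pipeline.
stages = {
    "capture": [2.0] * 100,
    "preprocess": [1.0] * 99 + [3.0],
    "inference": [10.0] * 98 + [12.0, 14.0],
}

print(fits_budget(stages, budget_ms=20.0))
```

Because the budget covers everything from sensor to decision, capture and preprocessing belong in the table alongside the model. A pipeline that passes only with zero margin is a pipeline that will miss deadlines in the field.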

Profiling at the operator level — identifying which layers and which operations account for the bulk of the latency on the target hardware — is essential before any optimization work. The distribution of latency across operators on embedded hardware is often surprising. Attention mechanisms that are efficient on GPU may be bottlenecks on hardware without efficient attention operator support. Identifying the true bottleneck before optimizing avoids a common failure mode: carefully optimizing a part of the pipeline that accounts for fifteen percent of the runtime while the true bottleneck is elsewhere.
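The arithmetic behind that failure mode is Amdahl's law, and it is worth running against a real profile before picking a target. The sketch below assumes you already have per-operator timings from the target device; the operator names and millisecond figures are made up for illustration.

```python
def latency_shares(op_ms):
    """Return (operator, share of total latency) pairs, largest first."""
    total = sum(op_ms.values())
    shares = [(name, ms / total) for name, ms in op_ms.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


def amdahl_speedup(share, local_speedup):
    """Overall speedup when a fraction `share` of runtime gets `local_speedup`x faster."""
    return 1.0 / ((1.0 - share) + share / local_speedup)


# Hypothetical per-operator timings in milliseconds.
profile = {"attention": 12.0, "matmul": 5.0, "layernorm": 3.0}

for name, share in latency_shares(profile):
    print(f"{name:10s} {share:5.1%}  2x faster -> {amdahl_speedup(share, 2.0):.2f}x overall")
```

In this toy profile, doubling the speed of the operator with a fifteen percent share yields about a 1.08x overall gain, and even making it free caps out below 1.18x. Doubling the speed of the sixty percent operator yields roughly 1.43x.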

Reliability in uncontrolled environments

Edge hardware operates in conditions that cloud infrastructure never faces: temperature extremes, vibration, power instability, and physical access by people who are not software engineers. Inference reliability in this environment requires defensive engineering at every layer.

Model outputs need to be validated before they drive any action, regardless of model confidence. A classifier that returns 'anomaly detected' with 0.97 probability should not directly trigger a machine shutdown without a deterministic check that the input was valid (not corrupted by a sensor fault), the output was within the expected range (not a numerical artifact of a quantized model on an unexpected input), and the historical context warrants the action (not a false positive in a burst of sensor noise). Layering deterministic validation around probabilistic model outputs is not a workaround for model unreliability — it is sound engineering practice for any system where actions have real-world consequences.
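Those three checks translate directly into a small deterministic gate that sits between the model and the actuator. The sketch below is one possible shape for it, assuming scalar sensor readings with known physical limits; the class name, the thresholds, and the "three hits in the last five frames" persistence rule are all hypothetical choices, not a recommendation for any particular machine.

```python
import math
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ShutdownGate:
    """Deterministic checks wrapped around a probabilistic anomaly score."""
    sensor_min: float
    sensor_max: float
    threshold: float = 0.9      # score that counts as an anomaly hit
    required_hits: int = 3      # hits needed within the window to act
    window: int = 5             # number of recent frames considered
    _history: deque = field(init=False, repr=False, default=None)

    def __post_init__(self):
        self._history = deque(maxlen=self.window)

    def decide(self, sensor_values, anomaly_prob):
        # 1. Input validity: reject frames corrupted by a sensor fault.
        input_ok = bool(sensor_values) and all(
            math.isfinite(v) and self.sensor_min <= v <= self.sensor_max
            for v in sensor_values
        )
        # 2. Output range: reject numerical artifacts from the model.
        output_ok = math.isfinite(anomaly_prob) and 0.0 <= anomaly_prob <= 1.0
        if not (input_ok and output_ok):
            self._history.append(False)   # an invalid frame is never evidence
            return "reject_frame"
        # 3. Historical context: act only on a persistent signal.
        self._history.append(anomaly_prob >= self.threshold)
        if sum(self._history) >= self.required_hits:
            return "shutdown"
        return "continue"


gate = ShutdownGate(sensor_min=0.0, sensor_max=100.0)
print(gate.decide([50.0], 0.97))          # a single hit is not enough
print(gate.decide([float("nan")], 0.97))  # corrupted input is rejected
```

Returning a distinct `reject_frame` result, instead of silently ignoring bad frames, lets the surrounding system count sensor faults and raise its own alarm when they become frequent.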

Update management at the edge is a distinct operational problem. Cloud services update continuously with no coordination required from the field. Edge devices need deliberate update mechanisms: atomic rollout strategies that can be reversed, version pinning when a hardware-model combination has been validated and should not change, and testing pipelines that qualify a model update against the specific hardware targets it will run on before rollout. The operational discipline for edge AI updates has more in common with embedded firmware management than with continuous deployment of cloud services.
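A minimal sketch of those three mechanisms, assuming a device that records its active model version in a small JSON state file: a qualification manifest acts as the version pin, the switch is an atomic file replace, and the previous version is kept for rollback. The file layout, the manifest format, and every function name here are hypothetical; a production system would also verify signatures and handle the model artifacts themselves.

```python
import json
import os
import tempfile


def read_state(path):
    """Load the device's activation state, or an empty state if none exists."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def _write_atomic(path, state):
    """Write to a temp file in the same directory, then atomically replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)   # readers see either the old state or the new one


def is_qualified(manifest, hardware_id, model_version):
    """A version may run only on hardware it was explicitly validated against."""
    return model_version in manifest.get(hardware_id, [])


def activate(state_path, hardware_id, model_version, manifest):
    """Switch the active model, remembering the previous one for rollback."""
    if not is_qualified(manifest, hardware_id, model_version):
        raise ValueError(f"{model_version} is not qualified for {hardware_id}")
    previous = read_state(state_path).get("active")
    _write_atomic(state_path, {"active": model_version, "previous": previous})


def rollback(state_path):
    """Return to the previously active model version."""
    state = read_state(state_path)
    if state.get("previous") is None:
        raise RuntimeError("no previous version to roll back to")
    _write_atomic(state_path, {"active": state["previous"], "previous": None})


with tempfile.TemporaryDirectory() as workdir:
    state_file = os.path.join(workdir, "model_state.json")
    manifest = {"board-a": ["1.0.0", "1.1.0"]}   # validated combinations
    activate(state_file, "board-a", "1.0.0", manifest)
    activate(state_file, "board-a", "1.1.0", manifest)
    rollback(state_file)
    print(read_state(state_file))
```

Keeping the manifest keyed by hardware target makes the qualification step explicit: a model version that passed on one board revision is not assumed to be safe on another.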
