PHBD.xyz

Experiments at the Network Edge:

Learnings from Trial-and-Error


Early Draft

TL;DR

  1. Edge inference moves server load to the client side and can alleviate scaling headaches.
  2. Creative edge ML deployments can yield mostly serverless, self-learning systems.
  3. Surprisingly many ML models can be brought closer to the user, if not all the way to the edge.


The following is a broad survey of what I've learned building numeral recognition in the browser: first trying to rewrite Micrograd in JavaScript with aspirations of both training and running MNIST inference in the browser, failing, and then discovering ONNX.

I began this exploration into edge ML with the pedagogical goal of better understanding JS through a project familiar to me: writing a simple neural network library from scratch. At the time, I was merely trying to brush up on my JS skills, but I have come to find running ML across various form factors to be a fascinating challenge. Furthermore, I have been on a recent cultural crusade against JS dependency hell, and wanted to see how far I could push my abilities with no external dependencies.

Beyond the surprising lack of operator overloading and concise list comprehensions, I found the rewrite of Micrograd to be very straightforward. After understanding some quirks of the language (and how to cope with the lack of NumPy), the JS was nearly idiomatic. Once I had rewritten Micrograd, I ran a few training loops and successfully trained a simple neural network on toy regression and classification data, entirely in the browser.
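
To make the rewrite concrete, below is a minimal sketch of what a Micrograd-style scalar autograd node looks like in dependency-free JavaScript. The class and method names here are my own for illustration rather than the exact code from my rewrite, and arithmetic has to be spelled out as method calls since JS has no operator overloading.

    // Illustrative sketch of a Micrograd-style scalar autograd node.
    // Not the exact code from my rewrite; names are chosen for clarity.
    class Value {
      constructor(data, children = [], localGrads = []) {
        this.data = data;               // scalar value of this node
        this.grad = 0;                  // accumulated gradient (d loss / d this)
        this.children = children;       // nodes this value was computed from
        this.localGrads = localGrads;   // d(this) / d(child) for each child
      }

      add(other) {
        return new Value(this.data + other.data, [this, other], [1, 1]);
      }

      mul(other) {
        return new Value(this.data * other.data, [this, other], [other.data, this.data]);
      }

      backward() {
        // Reverse-mode autodiff: topologically sort the graph, then
        // push this node's gradient back toward its inputs.
        const topo = [];
        const visited = new Set();
        const build = (v) => {
          if (visited.has(v)) return;
          visited.add(v);
          v.children.forEach(build);
          topo.push(v);
        };
        build(this);
        this.grad = 1;
        for (const v of topo.reverse()) {
          v.children.forEach((child, i) => {
            child.grad += v.localGrads[i] * v.grad;
          });
        }
      }
    }

    // Usage: d/dx of (x * y + x) at x = 2, y = 3 should be y + 1 = 4.
    const x = new Value(2), y = new Value(3);
    const out = x.mul(y).add(x);
    out.backward();
    console.log(out.data, x.grad); // 8 4

Stacking thousands of these scalar nodes into layers is a big part of why the performance trouble described below appears: every multiply-add becomes its own heap-allocated object.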

Success! If I could do this, surely training more complex networks (e.g. numeral recognition on MNIST) would be an easy next step, right? Unfortunately, not quite: I quickly ran into performance issues arising from a suboptimal backpropagation implementation and unoptimized linear algebra. Beyond this, I also started to feel the pain of building everything on a low-level library and moving data from point A to point B without concise NumPy-style transformations. Dependencies had won this battle, and I was left with a library that was too slow and unwieldy to train anything interesting. One day I hope to return with a better grasp of Wasm or WebGL to accelerate the calculations, but this time around I found a great middle ground.

ONNX to the Rescue


After a few hours of trying things and failing to find alternative approaches to model training (scouring for other Micrograd implementations that trained MNIST, exploring the idea of pretraining a network and then importing its weights, transferring weights from a higher-level library, etc.), I stumbled upon ONNX. ONNX is a universal model format that allows models to be transferred between frameworks. Popular frameworks like PyTorch, TensorFlow, and Caffe2 all support ONNX export and import, and there is a range of optimized runtimes for running pretrained models in many languages. Particularly interesting to me was its extensive use of Wasm + WebGL to accelerate inference in the browser. The ONNX runtime essentially lets you write complex networks in high-level frameworks like PyTorch while retaining the flexibility to run them in diverse runtime environments, from the browser to embedded systems.
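
To give a feel for the workflow, here is a rough sketch of browser-side inference with ONNX Runtime Web. It assumes the runtime has been loaded via a script tag (exposing the global ort object) and that the model was exported with an input named "input" and an output named "output"; the file name and tensor names are assumptions here, since they depend entirely on how the model was exported.

    // Illustrative sketch: classify a 28x28 grayscale digit with an ONNX model.
    // Assumes onnxruntime-web is loaded from a CDN <script> tag (global `ort`)
    // and that the model file and tensor names below match your export.
    async function classifyDigit(pixels /* Float32Array of length 28 * 28 */) {
      const session = await ort.InferenceSession.create('./mnist.onnx', {
        executionProviders: ['wasm'],   // 'webgl' is another option
      });

      // NCHW tensor: batch of 1, 1 channel, 28x28 image.
      const input = new ort.Tensor('float32', pixels, [1, 1, 28, 28]);
      const results = await session.run({ input });

      // The predicted digit is the index of the largest logit.
      const logits = Array.from(results.output.data);
      return logits.indexOf(Math.max(...logits));
    }

In practice you would create the session once at page load and reuse it for every call, since downloading and initializing the model is far more expensive than a single forward pass.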

On the pure performance side, I was able to easily achieve under 20ms inference time on my M1 MacBook Air for a ~100k-parameter CNN, and about 100ms on my iPhone. These numbers likely reflect a suboptimal implementation of both the network and my client-side use of the ONNX runtime. Some easily achievable latency wins could come from model quantization, smaller models, and more parameter-efficient architectures. I am currently loading the ONNX runtime from a CDN via a script tag, and the current model comes out to 26KB; both add negligible load time.
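
For reference, a simple way to measure this kind of inference latency is to wrap session.run in performance.now(). A rough sketch, reusing the session and input-name assumptions from the previous snippet:

    // Rough latency measurement for a single model (illustrative).
    // `session` and `inputTensor` are assumed to already exist.
    async function timeInference(session, inputTensor, runs = 50) {
      const times = [];
      for (let i = 0; i < runs; i++) {
        const start = performance.now();
        await session.run({ input: inputTensor });
        times.push(performance.now() - start);
      }
      times.sort((a, b) => a - b);
      return { best: times[0], median: times[Math.floor(runs / 2)] };
    }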

The ease of deploying small models had me thinking about the limits of the network periphery: How large can models living on the edge get without killing the user experience? What practical applications can be built within these constraints? What architectures get you out of these constraints?