Show HN: Dual YOLOv8n UAV Detection on RK3588S at 42 FPS Using NPU
Posted by alebal123bal 3 days ago
Comments
Comment by robinduckett 3 days ago
Comment by snovv_crash 3 days ago
Comment by robinduckett 3 days ago
Comment by steinvakt2 2 days ago
Comment by alebal123bal 3 days ago
The goal here was an end-to-end RK3588S pipeline rather than comparing detector families: training/export, ONNX graph fixing, INT8 RKNN conversion, C++ postprocessing, and runtime inference across the 3 NPU cores. YOLOv8 has known-good export paths and Rockchip examples, so it was the most practical baseline.
Newer YOLO versions may be possible, but usually require more work around RKNN export compatibility.
Comment by alebal123bal 3 days ago
The main trick is not the YOLO model itself, but the pipeline structure: MIPI capture through the ISP, resize/color conversion through RGA, and YOLOv8n inference through all 3 NPU cores with one RKNN context per core. With a 3-thread inference pool the pipeline goes from ~31 FPS to the OS08A10 camera’s 46 FPS ceiling.
The memory footprint is also small: roughly 137–152 MB RSS for one 1080p stream, using a fixed preallocated buffer pool rather than per-frame allocations. Two streams are roughly 276–304 MB RSS.
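A minimal sketch of the fixed preallocated pool idea, not the repo's actual code: `FramePool` and its methods are hypothetical names, and it uses plain heap memory where a real pipeline would more likely hand out DMA or RGA-compatible buffers. All storage is allocated once up front; per-frame work only moves slot indices around.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical fixed-size frame buffer pool: every byte is allocated once in
// the constructor, so the steady-state frame loop never calls the allocator.
class FramePool {
public:
    FramePool(std::size_t slots, std::size_t bytes_per_slot)
        : storage_(slots * bytes_per_slot),
          slots_(slots),
          bytes_per_slot_(bytes_per_slot) {
        free_.reserve(slots);  // the free list never grows past this capacity
        for (std::size_t i = 0; i < slots; ++i) free_.push_back(i);
    }

    // Writes a free slot index to `slot` and returns true, or returns false
    // when every buffer is in flight (the caller can drop the frame instead
    // of allocating a new buffer).
    bool try_acquire(std::size_t& slot) {
        std::lock_guard<std::mutex> lock(mu_);
        if (free_.empty()) return false;
        slot = free_.back();
        free_.pop_back();
        return true;
    }

    // Hands a slot back once the consumer is done with it.
    void release(std::size_t slot) {
        assert(slot < slots_);
        std::lock_guard<std::mutex> lock(mu_);
        free_.push_back(slot);
    }

    // Start of the memory that belongs to `slot`.
    std::uint8_t* data(std::size_t slot) {
        assert(slot < slots_);
        return storage_.data() + slot * bytes_per_slot_;
    }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t slots_;
    std::size_t bytes_per_slot_;
    std::vector<std::size_t> free_;
    std::mutex mu_;
};
```

With a pool like this, RSS stays flat as long as the slot count covers the number of frames in flight; when `try_acquire` fails, the pipeline skips a frame rather than growing memory.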
The repo also has a multi-process side of the pipeline: detections are published over Unix-domain sockets to tracking, temporal features, a presence FSM, and an optional Qwen2.5-0.5B summary step. For the LLM step, the camera pipeline can temporarily black out and then resume, so RKLLM gets the whole NPU.
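As a rough illustration of publishing detections over a Unix-domain socket: the `Detection` layout and the helper names below are assumptions, since the thread does not describe the repo's message format. `socketpair` stands in for the bound socket path a real multi-process setup would use, so the sketch runs in one process.

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

// Hypothetical wire format for one detection. Sending the raw struct is only
// reasonable because both ends run on the same machine with the same ABI.
struct Detection {
    std::uint32_t frame_id;
    float x, y, w, h;  // bounding box in pixels
    float score;       // detector confidence
};

// Creates a connected pair of Unix-domain datagram sockets.
inline bool make_channel(int fds[2]) {
    return socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) == 0;
}

// Sends one detection as a single datagram. Returns true on success.
inline bool publish(int fd, const Detection& d) {
    assert(fd >= 0);
    return send(fd, &d, sizeof(d), 0) == static_cast<ssize_t>(sizeof(d));
}

// Blocks until one datagram arrives. Returns true if it was a full message.
inline bool receive(int fd, Detection& out) {
    assert(fd >= 0);
    return recv(fd, &out, sizeof(out), 0) == static_cast<ssize_t>(sizeof(out));
}
```

Datagram sockets keep message boundaries, so a consumer such as the tracker reads exactly one detection per `recv` without extra framing.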
I split the work into three repos:
- runtime dual-stream YOLOv8n RK3588S pipeline: https://github.com/alebal123bal/khadas_yolov8n_multithread
- train/export/INT8 RKNN conversion for YOLOv8/YOLOv5: https://github.com/alebal123bal/RKNN_TRAIN_YOLO
- Qwen on RK3588S, via RKLLM/NPU or llama.cpp/CPU: https://github.com/alebal123bal/RKLLM_LLAMA_QWEN
The demo class is UAV/drone, but this is meant as a general edge-inference pipeline example, not an operational/surveillance/defense system.
Comment by throwa356262 2 days ago
Sad they are mostly sitting there unused because very few people know how to program them.
Comment by ancientmoth 3 days ago
Comment by stefan_ 3 days ago
Comment by alebal123bal 3 days ago
This pipeline is processing live camera frames and displaying/streaming annotated output, so latency and frame freshness matter. Increasing batch size would add queueing latency and tends to make the output older, especially when the sensor is producing frames continuously.
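The batching cost can be put in numbers with a simple model. This assumes a steady sensor rate and counts only the time a batch spends filling, not the inference time itself; `batch_fill_delay_ms` is a name made up for the example.

```cpp
#include <cassert>

// Extra wait, in milliseconds, that the oldest frame in a batch incurs while
// the remaining frames are still being captured at a steady sensor rate.
inline double batch_fill_delay_ms(double sensor_fps, int batch_size) {
    assert(sensor_fps > 0.0 && batch_size >= 1);
    const double frame_interval_ms = 1000.0 / sensor_fps;
    return (batch_size - 1) * frame_interval_ms;
}
```

At 46 FPS the frame interval is about 21.7 ms, so a batch of 4 would make the oldest frame roughly 65 ms staler before inference even starts, while a batch of 1 adds nothing.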
The “multithreading” here is not treating the NPU like a CPU in the usual sense. The RK3588S NPU is exposed as 3 cores, and RKNN supports using separate contexts with `rknn_dup_context` and assigning them with `rknn_set_core_mask`. The point was to keep the 3 NPU cores fed while capture, RGA preprocessing, inference, and display are pipelined.
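The one-context-per-core structure can be sketched without the RKNN SDK. In the code below `CoreContext` is a stand-in (an assumption for illustration): in the real pipeline each worker would own a context created with `rknn_dup_context` and pinned with `rknn_set_core_mask`, and `run` would call into the NPU. The queue is also pre-filled for simplicity, whereas live capture would need a blocking queue.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Stand-in for a per-core inference context; it only remembers which core it
// represents so the sketch runs on any machine.
struct CoreContext {
    int core_id;
    int run(int frame_id) const { return frame_id * 2; }  // placeholder "inference"
};

struct Result {
    int core_id = -1;  // which context handled the frame
    int output = 0;    // placeholder inference output
};

// Runs frames [0, frame_count) through one worker thread per context and
// returns results indexed by frame id, so output order matches capture order
// even though frames finish out of order.
inline std::vector<Result> run_pool(int core_count, int frame_count) {
    assert(core_count > 0 && frame_count >= 0);

    std::queue<int> pending;
    for (int f = 0; f < frame_count; ++f) pending.push(f);

    std::vector<Result> results(static_cast<std::size_t>(frame_count));
    std::mutex mu;  // protects `pending` only

    auto worker = [&](CoreContext ctx) {
        for (;;) {
            int frame;
            {
                std::lock_guard<std::mutex> lock(mu);
                if (pending.empty()) return;
                frame = pending.front();
                pending.pop();
            }
            // Each frame id is popped exactly once, so each result slot has
            // a single writer and needs no lock.
            results[frame].core_id = ctx.core_id;
            results[frame].output = ctx.run(frame);
        }
    };

    std::vector<std::thread> threads;
    for (int c = 0; c < core_count; ++c) threads.emplace_back(worker, CoreContext{c});
    for (auto& t : threads) t.join();
    return results;
}
```

Calling `run_pool(3, n)` mirrors the 3-thread inference pool: whichever worker is idle takes the next frame, so no core waits on another.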
In the single-context loop I was seeing ~31 FPS. With one context per NPU core and pipelined frame handling, it reaches the camera ceiling, around 42–46 FPS depending on the mode. So in this particular real-time streaming setup, parallel contexts/core masks were the practical way to saturate the hardware without adding batch latency.
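A back-of-the-envelope model of why the camera becomes the limit. It assumes throughput scales at best linearly with the number of contexts, which is an idealization, and treats the ~31 FPS single-context figure as the per-context rate; `pipeline_fps_bound` is a name invented for the example.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on pipeline throughput: inference scales at best linearly with
// the number of contexts, and the sensor rate caps whatever the NPU can do.
inline double pipeline_fps_bound(double single_context_fps, int contexts, double sensor_fps) {
    assert(contexts >= 1);
    return std::min(single_context_fps * contexts, sensor_fps);
}
```

With one context the bound is the 31 FPS inference rate; with three, the ideal 93 FPS exceeds the sensor's 46 FPS, so the sensor sets the ceiling.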
Comment by stefan_ 2 days ago
(You are not even measuring latency correctly)
Comment by alebal123bal 2 days ago