Release Notes - 0.10.0
Furiosa SDK 0.10.0 is a major release that includes the following:

- Adds the next-generation runtime engine (FuriosaRT), with higher performance and multi-device features
- Improves the usability of optimization for vision models by removing quantization operators from models
- Supports the OpenMetrics format in the Metrics Exporter and provides more metrics, such as NPU utilization
- Improves furiosa-litmus to collect and dump diagnostic information for error reporting
- Removes Python dependencies from the furiosa-compiler command
- Adds the new benchmark tool furiosa-bench
This release also includes a number of other feature additions, bug fixes, and performance improvements.
Package Name | Version
---|---
NPU Driver | 1.9.2
NPU Firmware Tools | 1.5.1
NPU Firmware Image | 1.7.3
HAL (Hardware Abstraction Layer) | 0.11.0
Furiosa Compiler | 0.10.0
Furiosa Quantizer | 0.10.0
Furiosa Runtime | 0.10.0
Python SDK (furiosa-server, furiosa-serving, ..) | 0.10.0
NPU Toolkit (furiosactl) | 0.11.0
NPU Device Plugin | 0.10.1
NPU Feature Discovery | 0.2.0
Installing the latest SDK or Upgrading
If you are using the APT repository, the upgrade process is simple; just run the commands below. If you are not familiar with FuriosaAI's APT repository, you can find more details in Driver, Firmware, and Runtime Installation.
apt-get update && apt-get upgrade
You can also upgrade specific packages as follows:
apt-get update && \
apt-get install -y furiosa-driver-warboy furiosa-libnux
You can upgrade firmware as follows:
apt-get update && \
apt-get install -y furiosa-firmware-tools furiosa-firmware-image
You can upgrade Python package as follows:
pip install --upgrade pip setuptools wheel
pip install --upgrade furiosa-sdk
Warning
When installing or upgrading the furiosa-sdk without updating pip to the latest version, you may encounter the following errors.
ERROR: Could not find a version that satisfies the requirement furiosa-quantizer-impl==0.9.* (from furiosa-quantizer==0.9.*->furiosa-sdk) (from versions: none)
ERROR: No matching distribution found for furiosa-quantizer-impl==0.9.* (from furiosa-quantizer==0.9.*->furiosa-sdk)
Major changes
Next Generation Runtime Engine, FuriosaRT
SDK 0.10.0 includes the next-generation runtime engine called FuriosaRT. FuriosaRT is a newly designed runtime library that offers more advanced features and high performance across various workloads. Many components, such as furiosa-litmus, furiosa-bench, and furiosa-serving, are based on FuriosaRT, so the benefits of the new runtime engine are reflected in these components. FuriosaRT provides backward compatibility with the previous runtime and includes the following new features:
New Runtime API
FuriosaRT introduces a native asynchronous API based on Python's asyncio (https://docs.python.org/3/library/asyncio.html). The existing APIs were sufficient for batch applications, but they required extra code to implement high-performance serving applications that handle many concurrent individual requests. The new API natively supports asynchronous execution, so users can easily write applications running on existing web frameworks such as FastAPI.
The new API introduces many advanced features, and you can learn more about the details at Furiosa SDK API Reference - furiosa.runtime.
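As a rough sketch of the asyncio-native usage pattern: the `infer` coroutine below is a stand-in for an actual FuriosaRT runner (the real class and function names are in the API reference, not shown here); the point is that many in-flight requests can be awaited concurrently instead of serialized.

```python
import asyncio

# Stand-in for an asynchronous inference call. In the real SDK this would
# invoke a FuriosaRT runner created from a compiled model; here we only
# simulate the latency so the concurrency pattern itself is visible.
async def infer(request_id: int) -> str:
    await asyncio.sleep(0.01)  # simulates NPU inference latency
    return f"result-{request_id}"

async def main() -> list[str]:
    # All eight requests are awaited together; an asyncio-native API lets a
    # serving application overlap them rather than process one at a time.
    return await asyncio.gather(*(infer(i) for i in range(8)))

results = asyncio.run(main())
print(results[0])  # prints "result-0"
```

The same `await`-based structure is what allows this style of API to plug directly into async web frameworks such as FastAPI.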
Multi-device Support and Improvement on Device Configuration
FuriosaRT natively supports multiple devices within a single session. This feature enables high-performance inference using multiple devices without extra implementation. Furthermore, FuriosaRT adopts a more abstract way to specify NPU devices. Up to the 0.9.0 release, users had to set device file names (e.g., npu0pe0-1) explicitly in the NPU_DEVNAME environment variable or in session.create(.., device="..."). This was inconvenient in many cases because users needed to find all available device files and specify them manually. FuriosaRT allows users to specify the NPU architecture and the number of NPUs in a textual representation, which can be set in the new FURIOSA_DEVICES environment variable as follows:
export FURIOSA_DEVICES="warboy(2)*8"
The above example lets FuriosaRT find 8 Warboys in the system, each configured as a fusion of two PEs. Specific devices can also be listed explicitly:
export FURIOSA_DEVICES="npu:0:0-1,npu:1:0-1"
For backward compatibility, FuriosaRT still supports the NPU_DEVNAME environment variable. However, NPU_DEVNAME will be deprecated in a future release.
You can find more details about the device configuration at Furiosa SDK API Reference - Device Specification.
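To make the textual representation concrete, here is a hypothetical parser for the `arch(pes)*count` form shown above (this parser is purely illustrative and is not part of the SDK; FuriosaRT interprets FURIOSA_DEVICES internally):

```python
import re

def parse_device_spec(spec: str) -> dict:
    """Parse an 'arch(pes)*count' string such as 'warboy(2)*8'.

    Mirrors the documented meaning of FURIOSA_DEVICES for illustration only:
    arch = NPU architecture, pes = PEs fused per NPU, count = number of NPUs.
    """
    m = re.fullmatch(r"(\w+)\((\d+)\)\*(\d+)", spec)
    if m is None:
        raise ValueError(f"unrecognized device spec: {spec!r}")
    arch, pes, count = m.groups()
    return {"arch": arch, "fused_pes": int(pes), "npu_count": int(count)}

print(parse_device_spec("warboy(2)*8"))
# prints {'arch': 'warboy', 'fused_pes': 2, 'npu_count': 8}
```

So "warboy(2)*8" reads as: eight Warboy NPUs, each running as a two-PE fusion.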
Higher Throughput
According to our benchmarks, FuriosaRT shows significantly improved throughput compared to the previous runtime. In particular, the worker_num configuration became more effective in FuriosaRT. For example, in the previous runtime, a worker_num higher than 2 did not yield significant performance improvement in most cases. In FuriosaRT, however, we observed significant performance improvement even with worker_num >= 10.

We benchmarked the ResNet50, YOLOv5m, YOLOv5L, SSD ResNet34, and SSD MobileNet models using the furiosa-bench command introduced in this release, and observed performance improvements of up to tens of percent, depending on the model, with worker_num >= 4.
Model Server and Serving Framework
furiosa-server and furiosa-serving are, respectively, a web server and a web framework for serving models. The improvements in FuriosaRT are also reflected in the model server and serving framework:

- The multi-device support and improved device configuration can be used to configure multiple NPU devices in furiosa-server and furiosa-serving.
- The new asyncio-based API that FuriosaRT offers is used to handle more concurrent requests with fewer resources.
- The model server and serving framework inherit the performance characteristics of FuriosaRT; in addition, a larger worker_num can be used to improve the performance of the model server.

Please refer to Model Server (Serving Framework) to learn more about the model server and serving framework.
Model Quantization Tool
The furiosa-quantizer is a library that transforms trained models into quantized models through post-training quantization. In the 0.10.0 release, the usability of the quantization tool has been improved, and some parameters of the furiosa.quantizer.quantize() API have breaking changes.
Motivation for Change
The furiosa.quantizer.quantize() function is the core function of the model quantization tool. It transforms an ONNX model into a quantized model and returns it. The function had a with_quantize parameter that allowed the model to directly accept the uint8 type instead of float32, skipping the quantization step during inference when the original data type (e.g., pixel values) is uint8. This option can result in significant performance improvements. For instance, with this option, YOLOv5 Large can dramatically reduce the execution time from 60.639 ms to 0.277 ms.

Similarly, the normalized_pixel_outputs option allowed outputs to directly use the uint8 type instead of float32. This option can be useful when the model output is an image in RGB format or when it can be used directly as an integer value, and it also shows significant performance boosts. In certain applications, the two options can reduce execution time by several times to hundreds of times. However, they had the following limitations:

- The parameter name normalized_pixel_outputs did not express its purpose clearly.
- normalized_pixel_outputs assumed the output tensor values ranged from 0 to 1 in floating point, which limited its use in real applications.
- with_quantize and normalized_pixel_outputs supported only the uint8 type, not int8.
What Changed
- Removed the parameters with_quantize and normalized_pixel_outputs from furiosa.quantizer.quantize().
- Instead, added the ModelEditor class, which allows more options for model input/output types and offers the following optimizations:
  - The convert_input_type(tensor_name, TensorType) method takes a tensor name, removes the corresponding Quantize operator, and changes the input type to the given TensorType.
  - The convert_output_type(tensor_name, TensorType, tensor_range) method takes a tensor name, removes the corresponding Dequantize operator, changes the output type to the given TensorType, and modifies the scale of the model output according to the given tensor_range.

Since the convert_{input,output}_type methods are based on tensor names, users need to be able to find tensor names in the original ONNX model. For that purpose, the furiosa.quantizer module provides the get_pure_input_names(ModelProto) and get_output_names(ModelProto) functions to retrieve tensor names from the original ONNX model.
Note
The removal of the with_quantize and normalized_pixel_outputs parameters from furiosa.quantizer.quantize() is a breaking change that requires modifying existing code.
Please refer to ModelEditor to learn more about the ModelEditor API and find examples from Tutorial and Code Examples.
Compiler
Since this release, the compiler supports NPU acceleration for the Dequantize operator, so the latency and throughput of models that include Dequantize operators can be improved. More details of this performance optimization can be found in Performance Optimization.
Since 0.10.0, the default lifetime of the compiler cache has increased from 2 days to 30 days. Please refer to Compiler Cache to learn the details of the compiler cache feature.
The furiosa-compiler command in the 0.10.0 release also has the following improvements:

- Added the furiosa-compiler command in addition to the furiosa-compile command.
- The furiosa-compiler and furiosa-compile commands are native executables and do not require any Python runtime environment.
- furiosa-compiler is now available as an APT package; you can install it via apt install furiosa-compiler.
- furiosa compile is kept for backward compatibility, but it will be removed in a future release.

Please visit furiosa-compiler to learn more about the furiosa-compiler command.
Performance Profiler
The performance profiler is a tool that helps users analyze performance by measuring the actual execution time of inferences. Since 0.10.0, Tracing via Profiler Context API provides pause/resume features. They allow users to skip unnecessary steps, such as pre/post-processing or warm-up time, reducing both the profiling overhead and the size of the profile result files. Calling the profile.pause() method immediately stops the profiling, and profile.resume() resumes it; the profiler does not collect any profiling information between the two calls.

Please refer to Pause/Resume of Profiler Context to learn more about the profiling API.
furiosa-litmus
furiosa-litmus is a command-line tool that checks the compatibility of models with the NPU and the Furiosa SDK. Since 0.10.0, furiosa-litmus has a new feature to collect logs, profiling information, and environment information for error reporting. This feature is enabled when the --dump <OUTPUT_PREFIX> option is specified, and the collected data is saved into a zip file named <OUTPUT_PREFIX>-<unix_epoch>.zip.
$ furiosa-litmus <MODEL_PATH> --dump <OUTPUT_PREFIX>
The collected information does not include the model itself; it contains only metadata of the model, memory usage, and environment information (e.g., Python version, SDK and compiler versions, and dependency library versions). You can unzip the file directly to check its contents. When reporting bugs, attaching this file is very helpful for error diagnosis and analysis.
New Benchmark Tool ‘furiosa-bench’
The new benchmark tool furiosa-bench has been added in 0.10.0. The furiosa-bench command offers various options to run diverse workloads with specific runtime settings. Users can choose either a latency-oriented or a throughput-oriented workload, and can specify the number of devices, how long to run, and other runtime settings. furiosa-bench accepts both ONNX and TFLite models, as well as ENF files compiled by furiosa-compiler. More details about the command can be found at furiosa-bench (Benchmark Tool).
An example of a throughput benchmark
$ furiosa-bench ./model.onnx --workload throughput -n 10000 --devices "warboy(1)*2" --workers 8 --batch 8
An example of a latency benchmark
$ furiosa-bench ./model.onnx --workload latency -n 10000 --devices "warboy(2)*1"
furiosa-bench can be installed through the APT package manager as follows:
$ apt install furiosa-bench
furiosa-toolkit
furiosa-toolkit is a collection of command-line tools for NPU management and NPU device monitoring. Since 0.10.0, furiosa-toolkit includes the following improvements:
Improvements of furiosactl
Before 0.10.0, sub-commands like list and info printed only tabular text. Since 0.10.0, furiosactl provides a new --format option, allowing the result to be printed in a structured format such as json or yaml. This is useful when a user implements a shell pipeline or a script to process the output of furiosactl.
$ furiosactl info --format json
[{"dev_name":"npu7","product_name":"warboy","device_uuid":"<device_uuid>","device_sn":"<device_sn>","firmware":"1.6.0, 7a3b908","temperature":"47°C","power":"0.99 W","pci_bdf":"0000:d6:00.0","pci_dev":"492:0"}]
$ furiosactl info --format yaml
- dev_name: npu7
product_name: warboy
device_uuid: <device_uuid>
device_sn: <device_sn>
firmware: 1.6.0, 7a3b908
temperature: 47°C
power: 0.98 W
pci_bdf: 0000:d6:00.0
pci_dev: 492:0
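Because --format json emits machine-readable output, a script can consume it directly. A minimal sketch, assuming output shaped like the sample above (in practice the string would come from running `furiosactl info --format json` via a subprocess or shell pipeline):

```python
import json

# A trimmed sample in the shape shown above; fields beyond these are omitted
# for brevity. Real output would be read from the furiosactl process.
raw = '[{"dev_name":"npu7","product_name":"warboy","temperature":"47°C","power":"0.99 W"}]'

devices = json.loads(raw)
for dev in devices:
    print(f'{dev["dev_name"]}: {dev["temperature"]}, {dev["power"]}')
# prints: npu7: 47°C, 0.99 W
```

The same approach works for the yaml format with a YAML parser in place of the json module.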
Also, the info subcommand now reports two more metrics:

- NPU clock frequency
- Total power consumption of the card
Improvements of furiosa-npu-metrics-exporter
furiosa-npu-metrics-exporter is an HTTP server that exports NPU metrics and status in the OpenMetrics format. The metrics that furiosa-npu-metrics-exporter exports can be collected by Prometheus and other OpenMetrics-compatible collectors. Since 0.10.0, furiosa-npu-metrics-exporter includes NPU clock frequency and NPU utilization as metrics. NPU utilization is still an experimental feature and is disabled by default. To enable this feature, specify the --enable-npu-utilization option as follows:
furiosa-npu-metrics-exporter --enable-npu-utilization
Additionally, furiosa-npu-metrics-exporter is now available as an APT package in addition to the Docker image. You can install it as follows:
apt install furiosa-toolkit