Release Notes - 0.8.0

Furiosa SDK 0.8.0 is a major release, including many performance enhancements, additional functions, and bug fixes. 0.8.0 also includes the serving framework, a core tool of user application development, as well as major improvements to the Model Zoo.

Component Version Information

Package Name

Version

NPU Driver

1.4.0

NPU Firmware Tools

1.2.0

NPU Firmware Image

1.2.0

HAL (Hardware Abstraction Layer)

0.9.0

Furiosa Compiler

0.8.0

Python SDK (furiosa-runtime, furiosa-server, furiosa-serving, furiosa-quantizer, ..)

0.8.0

NPU Device Plugin

0.10.1

NPU Feature Discovery

0.2.0

NPU Management CLI (furiosactl)

0.10.0

Installing the latest SDK

If you are using APT repository, the upgrade process is simpler.

apt-get update && apt-get upgrade

If you wish to designate a specific package for upgrade, execute as below: You can find more details about APT repository setup at Driver, Firmware, and Runtime Installation.

apt-get update && \
apt-get install -y furiosa-driver-pdma furiosa-libhal-warboy furiosa-libnux furiosactl

You can upgrade firmware as following:

apt-get update && \
apt-get install -y furiosa-firmware-tools furiosa-firmware-image

You can upgrade Python package as following:

pip install --upgrade furiosa-sdk

Major changes

Improvements to serving framework API

The furiosa-serving is a FastAPI-based serving framework. With the furiosa-serving, users can quickly develop Python-based high performance web service applications that utilize NPUs. The 0.8.0 release includes the following major updates.

Session pool that enables model serving with multiple NPUs

Session pools improve significantly throughput of model APIs by using multiple NPUs. If large inputs can be divided into a number of small inputs, this improvement can be used to reduce the latency of serving applications.

model: NPUServeModel = synchronous(serve.model("nux"))(
    "MNIST",
    location="./assets/models/MNISTnet_uint8_quant_without_softmax.tflite",
    # Specify multiple devices
    npu_device="npu0pe0,npu0pe1,npu1pe0"
    worker_num=4,
)

Shift from thread-based to asyncio-based NPU query processing

Small and frequent NPU inference queries may now be processed with lower latency. As shown in the example below, applications requiring multiple NPU inferences in a single API query can be processed with better performance.

async def inference(self, tensors: List[np.ndarray]) -> List[np.ndarray]:
    # The following code runs multiple inferences at the same time and wait until all requests are completed.
    return await asyncio.gather(*(self.model.predict(tensor) for tensor in tensors))

Added expanded support for external device & runtime

In complex serving scenarios, additional/external device and runtime programs may be required, in addition to NPU-based Furiosa Runtime. In this release, the framework has been expanded such that external device and runtime may be used. The first external runtime added is OpenVINO.

imagenet: ServeModel = synchronous(serve.model("openvino"))(
    'imagenet',
    location='./examples/assets/models/image_classification.onnx'
)

Support for S3 cloud storage repository

Set model location as S3 URL.

# Load model from S3 (Auth environment variable for aioboto library required)
densenet: ServeModel = synchronous(serve.model("nux"))(
    'imagenet',
 location='s3://furiosa/models/93d63f654f0f192cc4ff5691be60fb9379e9d7fd'
)

Support for OpenTelemetry compatible tracing

With the OpenTelemetry Collector function, you can now track the execution time of specific code sections of the serving applications.

To use this function, you can activate trace.get_tracer(), reset the tracer, activate the tracer.start_as_current_span() function, and designate the section.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

class Application:

        async def process(self, image: Image.Image) -> int:
            with tracer.start_as_current_span("preprocess"):
                input_tensors = self.preprocess(image)
            with tracer.start_as_current_span("inference"):
                output_tensors = await self.inference(input_tensors)
            with tracer.start_as_current_span("postprocess"):
                return self.postprocess(output_tensors)

The specification of OpenTelemetry Collector can be done through the configuration of FURIOSA_SERVING_OTLP_ENDPOINT, as shown below. The following diagram is an example that visualizes the tracing result with Grafana.

An example of visualization with Grafana

Other major improvements are as follows:

  • Several inference requests can be executed at once, with serving API now supporting compiler setting batch_size

  • More threads can share the NPU, with serving API now supporting session option worker_num

Profiler

You can now analyze the profiler tracing results with Pandas, a data analysis framework. With this function, you can analyze the tracing result data, allowing you to quickly identify bottlenecks and reasons for model performance changes. More detailed instructions can be found at Trace analysis using Pandas DataFrame.

from furiosa.runtime import session, tensor
from furiosa.runtime.profiler import RecordFormat, profile

with profile(format=RecordFormat.PandasDataFrame) as profiler:
    with session.create("MNISTnet_uint8_quant_without_softmax.tflite") as sess:
        input_shape = sess.input(0)

        with profiler.record("record") as record:
            for _ in range(0, 2):
                sess.run(tensor.rand(input_shape))

df = profiler.get_pandas_dataframe()
print(df[df["name"] == "trace"][["trace_id", "name", "thread.id", "dur"]])

Quantization tool

Model Quantization is a tool that converts pre-trained models to quantized models. This release includes the following major updates.

  • Accuracy improvement when processing SiLU operator

  • Improved usability of compiler setting without_quantize

  • Accuracy improvement when processing MatMul/Gemm operators

  • Accuracy improvement when processing Add/Sub/Mul/Div operators

  • NPU acceleration now added for more auto_pad properties, when processing Conv/ConvTranspose/MaxPool operators

  • NPU acceleration support for PRelu operator

furiosa-toolkit

The furiosactl command line tool, which has been added to the furiosa-toolkit 0.10.0 release, includes the following improvements.

The newly added furiosactl ps command allows you to print the OS processes which are occupying the NPU device.

# furiosactl ps
+-----------+--------+------------------------------------------------------------+
| NPU       | PID    | CMD                                                        |
+-----------+--------+------------------------------------------------------------+
| npu0pe0-1 | 132529 | /usr/bin/python3 /usr/local/bin/uvicorn image_classify:app |
+-----------+--------+------------------------------------------------------------+

The furiosactl info command now prints the unique UUID for each device.

$ furiosactl info
+------+--------+--------------------------------------+-----------------+-------+--------+--------------+---------+
| NPU  | Name   | UUID                                 | Firmware        | Temp. | Power  | PCI-BDF      | PCI-DEV |
+------+--------+--------------------------------------+-----------------+-------+--------+--------------+---------+
| npu0 | warboy | 72212674-61BE-4FCA-A2C9-555E4EE67AB5 | v1.1.0, 12180b0 |  49°C | 3.12 W | 0000:24:00.0 | 235:0   |
+------+--------+--------------------------------------+-----------------+-------+--------+--------------+---------+
| npu1 | warboy | DF80FB54-8190-44BC-B9FB-664FA36C754A | v1.1.0, 12180b0 |  54°C | 2.53 W | 0000:6d:00.0 | 511:0   |
+------+--------+--------------------------------------+-----------------+-------+--------+--------------+---------+

Detailed instructions on installation and usage for furiosactl can be found in furiosa-toolkit.

Model Zoo API improvements, added models, and added native post-processing code

furioa-models is a public Model Zoo project, providing FuriosaAI NPU-optimized models. The 0.8.0 release includes the following major updates.

YOLOv5 Large/Medium models added

Support for YOLOv5l, YOLOv5m, which are SOTA object detection models, have been added. The total list of available models can be found in Model List.

Improvements to model class and loading API

The model class has been improved to include pre and post-processing code, while the model loading API has been improved as shown below.

More explanation on model class and the API can be found at Model Object.

Before update

from furiosa.models.vision import MLCommonsResNet50

resnet50 = MLCommonsResNet50()

Updated code

from furiosa.models.vision import ResNet50

resnet50 = ResNet50.load()

The model post-processing process converts the inference ouput tensor into structural data, which is more accessible for the application. Depending on the model, this may require a longer execution time. The 0.8.0 release includes native post-processing code for ResNet50, SSD-MobileNet, and SSD-ResNet34. Based on internal benchmarks, native post-processing code can reduce latency by up to 70%, depending on the model.

The following is a complete example of ResNet50, utilizing native post-processing code. More information can be found at Pre/Postprocessing.

from furiosa.models.vision import ResNet50
from furiosa.models.vision.resnet50 import NativePostProcessor, preprocess
from furiosa.runtime import session

model = ResNet50.load()

postprocessor = NativePostProcessor(model)
with session.create(model) as sess:
    image = preprocess("tests/assets/cat.jpg")
    output = sess.run(image).numpy()
    postprocessor.eval(output)

Other changes and updates can be found at Furiosa Model - 0.8.0 Changelogs.