Model Server (Serving Framework)
To serve DNN models through GRPC and REST APIs, you can use the Furiosa Model Server. The Model Server provides endpoints compatible with the KServe Predict Protocol Version 2.
Its major features are:
REST/GRPC endpoints support
Multiple model serving using multiple NPU devices
Installation
Its requirements are:
Ubuntu 20.04 LTS (Debian bullseye) or higher
Python 3.8 or higher version
If you need a Python environment, please refer to Python execution environment setup first.
Run the following command:
$ pip install 'furiosa-sdk[server]'
Alternatively, to install from source, check out the source code and run the following commands:
$ git clone https://github.com/furiosa-ai/furiosa-sdk.git
$ cd furiosa-sdk/python/furiosa-server
$ pip install .
Running a Model Server
You can launch a model server by running furiosa server in your shell.
To serve a single tflite or onnx model, you only need to specify the model path and its name as follows:
$ cd furiosa-sdk
$ furiosa server \
--model-path examples/assets/quantized_models/MNISTnet_uint8_quant_without_softmax.tflite \
--model-name mnist
The --model-path option specifies the path of a model file.
If you want to use a specific binding address and port, you can additionally use the --host and --http-port options, as in the sketch below.
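For illustration only, the following sketch serves the same MNIST model bound to the loopback interface on port 8888 (both values are arbitrary example choices):
$ furiosa server \
    --model-path examples/assets/quantized_models/MNISTnet_uint8_quant_without_softmax.tflite \
    --model-name mnist \
    --host 127.0.0.1 \
    --http-port 8888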
Please run furiosa server --help if you want to learn more about the command and its various options.
$ furiosa server --help
libfuriosa_hal.so --- v0.11.0, built @ 43c901f
Usage: furiosa-server [OPTIONS]
Start serving models from FuriosaAI model server
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --log-level [ERROR|INFO|WARN|DEBUG|TRACE] [default: LogLevel.INFO] │
│ --model-name TEXT Model name [default: None] │
│ --model-path TEXT Path to a model file (tflite, onnx are │
│ supported) │
│ [default: None] │
│ --model-version TEXT Model version [default: default] │
│ --host TEXT IPv4 address to bind [default: 0.0.0.0] │
│ --http-port INTEGER HTTP port to bind [default: 8080] │
│ --model-config FILENAME Path to a model config file │
│ [default: None] │
│ --server-config FILENAME Path to a server config file │
│ [default: None] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Running a Model Server with a Configuration File
If you need more advanced configurations, such as compilation options and device options, you can use a YAML-based configuration file.
model_config_list:
  - name: mnist
    path: "samples/data/MNISTnet_uint8_quant.tflite"
    version: 1
    npu_device: npu0pe0       # serve this model on PE 0 of NPU 0
    compiler_config:          # compilation options passed to the compiler
      keep_unsignedness: true
      split_unit: 0
  - name: ssd
    path: "samples/data/tflite/SSD512_MOBILENET_V2_BDD_int_without_reshape.tflite"
    version: 1
    npu_device: npu0pe1       # serve this model on PE 1 of NPU 0
When you run a model server with a configuration file, you need to specify --model-config as follows.
You can find the model files described in the above example in furiosa-models/samples.
$ cd furiosa-sdk/python/furiosa-server
$ furiosa server --model-config samples/model_config_example.yaml
libfuriosa_hal.so --- v0.11.0, built @ 43c901f
Saving the compilation log into /root/.local/state/furiosa/logs/compile-20230509151914-axpfej.log
Using furiosa-compiler 0.9.0 (rev: e626c458c built at 2023-04-19T13:49:26Z)
2023-05-09T06:19:14.560585Z INFO nux::npu: Npu (npu0pe0) is being initialized
2023-05-09T06:19:14.565216Z INFO nux: NuxInner create with pes: [PeId(0)]
Saving the compilation log into /root/.local/state/furiosa/logs/compile-20230509151914-d063sw.log
Using furiosa-compiler 0.9.0 (rev: e626c458c built at 2023-04-19T13:49:26Z)
2023-05-09T06:19:14.591795Z INFO nux::npu: Npu (npu0pe1) is being initialized
2023-05-09T06:19:14.595298Z INFO nux: NuxInner create with pes: [PeId(0)]
INFO: Started server process [1184080]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
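Once both models from the configuration file have been loaded, you can optionally confirm that each is ready to serve, for example via the per-model readiness endpoints listed in the Endpoints section below (a sketch assuming the default 0.0.0.0:8080 binding):
$ curl http://localhost:8080/v2/models/mnist/versions/1/ready
$ curl http://localhost:8080/v2/models/ssd/versions/1/ready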
Once a model server starts up, you can send inference requests over HTTP.
If the model name is mnist and its version is 1, the endpoint of the model will be
http://<host>:<port>/v2/models/mnist/versions/1/infer, accepting POST HTTP requests.
The following is an example using curl to send an inference request and print the response.
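This is only a sketch: it assumes the server from the earlier example is listening on localhost:8080 and, for brevity, sends a dummy all-zero 28x28 image rather than real MNIST pixel data.
$ DATA=$(python3 -c "print([0] * 784)")  # placeholder pixels; a real request would carry actual image data
$ curl -X POST http://localhost:8080/v2/models/mnist/versions/1/infer \
    -H "Content-Type: application/json" \
    -d "{\"inputs\": [{\"name\": \"mnist\", \"datatype\": \"UINT8\", \"shape\": [1, 1, 28, 28], \"data\": $DATA}]}"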
The following is a Python example, doing the same as the curl example above.
import requests
import mnist
import numpy as np

# Load the MNIST training images as (N, 1, 28, 28) uint8 tensors
mnist_images = mnist.train_images().reshape((60000, 1, 28, 28)).astype(np.uint8)

url = 'http://localhost:8080/v2/models/mnist/versions/1/infer'

# Send the first image as a KServe Predict Protocol Version 2 inference request
data = mnist_images[0:1].flatten().tolist()
request = {
    "inputs": [{
        "name": "mnist",
        "datatype": "UINT8",
        "shape": (1, 1, 28, 28),
        "data": data
    }]
}
response = requests.post(url, json=request)
print(response.json())
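The response body also follows the KServe Predict Protocol Version 2 and carries an outputs list with the result tensors. Continuing from the example above, the following rough sketch extracts a prediction from it; it assumes the model returns a single output tensor of 10 per-digit scores, which depends on the actual model:
# Sketch: read the first output tensor from the KServe V2 response and
# pick the index with the highest score as the predicted digit.
output = response.json()["outputs"][0]
scores = np.array(output["data"]).reshape(output["shape"])
print("predicted digit:", int(scores.flatten().argmax()))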
Endpoints
The following table shows the REST API endpoints and their descriptions. The model server follows KServe Predict Protocol Version 2, so you can find more details in KServe Predict Protocol Version 2 - HTTP/REST.
| Method and Endpoint | Description |
|---|---|
| GET /v2/health/live | Returns HTTP Ok (200) if the inference server is able to receive and respond to metadata and inference requests. This API can be directly used for the Kubernetes livenessProbe. |
| GET /v2/health/ready | Returns HTTP Ok (200) if all the models are ready for inferencing. This API can be directly used for the Kubernetes readinessProbe. |
| GET /v2/models/${MODEL_NAME}/versions/${MODEL_VERSION} | Returns a model's metadata. |
| GET /v2/models/${MODEL_NAME}/versions/${MODEL_VERSION}/ready | Returns HTTP Ok (200) if a specific model is ready for inferencing. |
| POST /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer | Inference request. |
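For example, assuming the mnist model from the earlier examples is being served on the default 0.0.0.0:8080 binding, the health and metadata endpoints above can be exercised with curl as in this sketch:
$ curl http://localhost:8080/v2/health/live
$ curl http://localhost:8080/v2/health/ready
$ curl http://localhost:8080/v2/models/mnist/versions/1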