Model Server (Serving Framework)

To serve DNN models through GRPC and REST APIs, you can use Furiosa Model Server. The Model Server provides endpoints compatible with the KServe Predict Protocol Version 2.

Its major features are:

  • REST/GRPC endpoints support

  • Multiple model serving using multiple NPU devices

Installation

Its requirements are:

  • A Python execution environment; if you need to set one up, please refer to Python execution environment setup first.

Run the following command:

$ pip install 'furiosa-sdk[server]'

Running a Model Server

You can run the model server by executing the furiosa server command in your shell.

To simply run a model server with a tflite or onnx model, you only need to specify the model path and its name as follows:

$ cd furiosa-sdk
$ furiosa server \
--model-path examples/assets/quantized_models/MNISTnet_uint8_quant_without_softmax.tflite \
--model-name mnist

The --model-path option specifies the path of a model file. If you want to use a specific binding address and port, you can additionally use the --host and --http-port options.
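For example, the following command (a minimal sketch reusing the MNIST model above; 127.0.0.1 and 8000 are arbitrary values chosen for illustration) binds the server to a specific address and port:

$ furiosa server \
--model-path examples/assets/quantized_models/MNISTnet_uint8_quant_without_softmax.tflite \
--model-name mnist \
--host 127.0.0.1 \
--http-port 8000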

Please run furiosa server --help if you want to learn more about the command and its various options.

$ furiosa server --help
Usage: furiosa server [OPTIONS]

    Start serving models from FuriosaAI model server

Options:
    --log-level [ERROR|INFO|WARN|DEBUG|TRACE]
                                    [default: LogLevel.INFO]
    --model-path TEXT               Path to Model file (tflite, onnx are
                                    supported)
    --model-name TEXT               Model name used in URL path
    --model-version TEXT            Model version used in URL path  [default:
                                    default]
    --host TEXT                     IP address to bind  [default: 0.0.0.0]
    --http-port INTEGER             HTTP port to listen to requests  [default:
                                    8080]
    --model-config FILENAME         Path to a config file about models with
                                    specific configurations
    --server-config FILENAME        Path to Model file (tflite, onnx are
                                    supported)
    --install-completion [bash|zsh|fish|powershell|pwsh]
                                    Install completion for the specified shell.
    --show-completion [bash|zsh|fish|powershell|pwsh]
                                    Show completion for the specified shell, to
                                    copy it or customize the installation.
    --help                          Show this message and exit.

Running a Model Server with a Configuration File

If you need more advanced configurations, such as compilation options and device options, you can use a YAML-based configuration file.

model_config_list:
  - name: mnist
    model: "samples/data/MNISTnet_uint8_quant.tflite"
    version: "1"
    platform: npu
    npu_device: warboy(1)*1
    compiler_config:
      keep_unsignedness: true
      split_unit: 0
  - name: ssd
    model: "samples/data/SSD512_MOBILENET_V2_BDD_int_without_reshape.tflite"
    version: "1"
    platform: npu
    npu_device: warboy(1)*1

When you run a model server with a configuration file, you need to specify --model-config as follows. You can find the model files described in the above example in furiosa-models/samples.

$ cd furiosa-sdk/python/furiosa-server
$ furiosa server --model-config samples/model_config_example.yaml
libfuriosa_hal.so --- v0.11.0, built @ 43c901f
2023-08-02T07:42:42.263133Z  INFO furiosa_rt_core::driver::event_driven::device: DeviceManager has detected 1 NPUs
2023-08-02T07:42:42.267247Z  INFO furiosa_rt_core::driver::event_driven::device: [1] npu:6:1 (warboy-b0, 64dpes)
2023-08-02T07:42:42.267264Z  INFO furiosa_rt_core::driver::event_driven::coord: furiosa-rt (v0.10.0-rc6, rev: d021ff71d, built_at: 2023-07-31T19:05:26Z) is being initialized
2023-08-02T07:42:42.267269Z  INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:1-io-thread-0 thread has started
2023-08-02T07:42:42.267398Z  INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:1-commit-thread thread has started
2023-08-02T07:42:42.267405Z  INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:1-io-thread-1 thread has started
2023-08-02T07:42:42.270837Z  INFO furiosa_rt_core::driver::event_driven::coord: Loaded libcompiler 0.10.0 (rev: f8f05c built: 2023-07-26T09:49:17Z)
2023-08-02T07:42:42.270851Z  INFO furiosa_rt_core::driver::event_driven::coord: Loaded libhal-warboy 0.11.0 (rev: 43c901f built: 2023-04-19T14:04:55Z)
2023-08-02T07:42:42.271144Z  INFO furiosa_rt_core::driver::event_driven::coord: [NONAME] Runtime has started
2023-08-02T07:42:42.273772Z  INFO furiosa_rt_core::driver::event_driven::coord: Model#0001 is being loaded to npu:6:1
2023-08-02T07:42:42.283260Z  INFO furiosa_rt_core::driver::event_driven::coord: Compiling Model#0001 (target: warboy-b0, 64dpes, file: MNISTnet_uint8_quant.tflite, size: 18.2 kiB)
2023-08-02T07:42:42.299091Z  INFO furiosa_rt_core::driver::event_driven::coord: Model#0001 has been compiled successfully (took 0 secs)
2023-08-02T07:42:42.299293Z  INFO furiosa_rt_core::dag: Task Statistics: TaskStats { cpu: 5, npu: 1, alias: 0, coalesce: 0 }
2023-08-02T07:42:42.300701Z  INFO furiosa_rt_core::driver::event_driven::coord: NpuApi (AsyncNpuApiImpl) has started..
2023-08-02T07:42:42.300721Z  INFO furiosa_rt_core::driver::event_driven::coord: Creating 1 Contexts on npu:6:1 (DRAM usage: 6.0 kiB / 16.0 GiB, SRAM usage: 124.0 kiB / 64.0 MiB)
2023-08-02T07:42:42.300789Z  INFO furiosa_rt_core::driver::event_driven::coord: npu:6:1 has scheduled to Model#0001
2023-08-02T07:42:42.304216Z  WARN furiosa_rt_core::consts::envs: NPU_DEVNAME will be deprecated. Use FURIOSA_DEVICES instead.
2023-08-02T07:42:42.313084Z  INFO furiosa_rt_core::driver::event_driven::device: DeviceManager has detected 1 NPUs
2023-08-02T07:42:42.315470Z  INFO furiosa_rt_core::driver::event_driven::device: [1] npu:6:0 (warboy-b0, 64dpes)
2023-08-02T07:42:42.315483Z  INFO furiosa_rt_core::driver::event_driven::coord: furiosa-rt (v0.10.0-rc6, rev: d021ff71d, built_at: 2023-07-31T19:05:26Z) is being initialized
2023-08-02T07:42:42.315560Z  INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:0-io-thread-1 thread has started
2023-08-02T07:42:42.315610Z  INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:0-io-thread-0 thread has started
2023-08-02T07:42:42.315657Z  INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:0-commit-thread thread has started
2023-08-02T07:42:42.319127Z  INFO furiosa_rt_core::driver::event_driven::coord: Loaded libcompiler 0.10.0 (rev: f8f05c built: 2023-07-26T09:49:17Z)
2023-08-02T07:42:42.319141Z  INFO furiosa_rt_core::driver::event_driven::coord: Loaded libhal-warboy 0.11.0 (rev: 43c901f built: 2023-04-19T14:04:55Z)
2023-08-02T07:42:42.319364Z  INFO furiosa_rt_core::driver::event_driven::coord: [NONAME] Runtime has started
2023-08-02T07:42:42.324283Z  INFO furiosa_rt_core::driver::event_driven::coord: Model#0002 is being loaded to npu:6:0
2023-08-02T07:42:42.333521Z  INFO furiosa_rt_core::driver::event_driven::coord: Compiling Model#0002 (target: warboy-b0, 64dpes, file: SSD512_MOBILENET_V2_BDD_int_without_reshape.tflite, size: 5.2 MiB)
2023-08-02T07:42:42.814260Z  INFO furiosa_rt_core::driver::event_driven::coord: Model#0002 has been compiled successfully (took 0 secs)
2023-08-02T07:42:42.815406Z  INFO furiosa_rt_core::dag: Task Statistics: TaskStats { cpu: 26, npu: 1, alias: 0, coalesce: 0 }
2023-08-02T07:42:42.893745Z  INFO furiosa_rt_core::driver::event_driven::coord: NpuApi (AsyncNpuApiImpl) has started..
2023-08-02T07:42:42.893772Z  INFO furiosa_rt_core::driver::event_driven::coord: Creating 1 Contexts on npu:6:0 (DRAM usage: 1.0 MiB / 16.0 GiB, SRAM usage: 14.8 MiB / 64.0 MiB)
2023-08-02T07:42:42.894265Z  INFO furiosa_rt_core::driver::event_driven::coord: npu:6:0 has scheduled to Model#0002
INFO:     Started server process [2448540]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

Once a model server starts up, you can send inference requests over HTTP. If the model name is mnist and its version is 1, the endpoint of the model will be http://<host>:<port>/v2/models/mnist/versions/1/infer, which accepts POST requests. The following is an example using curl to send an inference request and receive the response.
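As a minimal sketch, assuming the server is running on localhost:8080 and the JSON request body (in the same format as the Python example below) has been saved to a file named mnist_request.json, the request could look like this:

$ curl -X POST http://localhost:8080/v2/models/mnist/versions/1/infer \
-H "Content-Type: application/json" \
-d @mnist_request.json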

The following is a Python example that does the same as the curl command in the above example.

import requests
import mnist
import numpy as np

mnist_images = mnist.train_images().reshape((60000, 1, 28, 28)).astype(np.uint8)
url = 'http://localhost:8080/v2/models/mnist/versions/1/infer'

data = mnist_images[0:1].flatten().tolist()
request = {
    "inputs": [{
        "name":
        "mnist",
        "datatype": "UINT8",
        "shape": (1, 1, 28, 28),
        "data": data
    }]
}

response = requests.post(url, json=request)
print(response.json())

Endpoints

The following table shows the REST API endpoints and their descriptions. The model server follows the KServe Predict Protocol Version 2, so you can find more details in KServe Predict Protocol Version 2 - HTTP/REST.

Endpoints of KServe Predict Protocol Version 2

  • GET /v2/health/live
    Returns HTTP Ok (200) if the inference server is able to receive and respond to metadata and inference requests. This API can be used directly for the Kubernetes livenessProbe.

  • GET /v2/health/ready
    Returns HTTP Ok (200) if all the models are ready for inferencing. This API can be used directly for the Kubernetes readinessProbe.

  • GET /v2/models/${MODEL_NAME}/versions/${MODEL_VERSION}
    Returns the model metadata.

  • GET /v2/models/${MODEL_NAME}/versions/${MODEL_VERSION}/ready
    Returns HTTP Ok (200) if the specified model is ready for inferencing.

  • POST /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
    Sends an inference request.
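For example, with the MNIST model served above on the default host and port, you could check server liveness, model readiness, and model metadata as follows (a sketch assuming localhost:8080):

$ curl -i http://localhost:8080/v2/health/live
$ curl -i http://localhost:8080/v2/models/mnist/versions/1/ready
$ curl http://localhost:8080/v2/models/mnist/versions/1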