.. _ModelServing:

**********************************************************
Model Server (Serving Framework)
**********************************************************

To serve DNN models through GRPC and REST APIs, you can use `Furiosa Model Server `_.
Model Server provides endpoints compatible with the `KServe Predict Protocol Version 2 `_.

Its major features are:

* REST/GRPC endpoint support
* Serving multiple models on multiple NPU devices

Installation
============================

Its requirements are:

* Ubuntu 20.04 LTS (Debian bullseye) or later
* :ref:`RequiredPackages`
* Python 3.8 or later

If you need to set up a Python environment, please refer to :ref:`SetupPython` first.

.. tabs::

  .. tab:: Installation using PIP

    Run the following command:

    .. code-block:: sh

      $ pip install 'furiosa-sdk[server]'

  .. tab:: Installation from source code

    Check out the source code and run the following command:

    .. code-block:: sh

      $ git clone https://github.com/furiosa-ai/furiosa-sdk.git
      $ cd furiosa-sdk/python/furiosa-server
      $ pip install .


Running a Model Server
============================

You can run a model server by running the ``furiosa server`` command in your shell.

To simply serve a ``tflite`` or ``onnx`` model, you only need to specify the model path and its name, as follows:

.. code-block:: sh

  $ cd furiosa-sdk
  $ furiosa server \
    --model-path examples/assets/quantized_models/MNISTnet_uint8_quant_without_softmax.tflite \
    --model-name mnist

The ``--model-path`` option specifies the path of a model file. If you want to use a specific binding address and port, you can additionally use the ``--host`` and ``--http-port`` options.

Please run ``furiosa server --help`` if you want to learn more about the command and its various options.

.. code-block:: sh

  $ furiosa server --help
  Usage: furiosa server [OPTIONS]

    Start serving models from FuriosaAI model server

  Options:
    --log-level [ERROR|INFO|WARN|DEBUG|TRACE]
                                    [default: LogLevel.INFO]
    --model-path TEXT               Path to Model file (tflite, onnx are
                                    supported)
    --model-name TEXT               Model name used in URL path
    --model-version TEXT            Model version used in URL path
                                    [default: default]
    --host TEXT                     IP address to bind  [default: 0.0.0.0]
    --http-port INTEGER             HTTP port to listen to requests
                                    [default: 8080]
    --model-config FILENAME         Path to a config file about models with
                                    specific configurations
    --server-config FILENAME        Path to Model file (tflite, onnx are
                                    supported)
    --install-completion [bash|zsh|fish|powershell|pwsh]
                                    Install completion for the specified
                                    shell.
    --show-completion [bash|zsh|fish|powershell|pwsh]
                                    Show completion for the specified shell,
                                    to copy it or customize the installation.
    --help                          Show this message and exit.

Running a Model Server with a Configuration File
=============================================================

If you need more advanced configurations, such as compilation options and device options, you can use a YAML-based configuration file.

.. code-block:: yaml

  model_config_list:
    - name: mnist
      model: "samples/data/MNISTnet_uint8_quant.tflite"
      version: "1"
      platform: npu
      npu_device: warboy(1)*1
      compiler_config:
        keep_unsignedness: true
        split_unit: 0
    - name: ssd
      model: "samples/data/SSD512_MOBILENET_V2_BDD_int_without_reshape.tflite"
      version: "1"
      platform: npu
      npu_device: warboy(1)*1
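If such a configuration file is produced by a deployment script rather than written by hand, the same structure can also be generated programmatically. The following is a minimal sketch that writes the ``mnist`` entry of the configuration above with PyYAML; PyYAML is an assumed extra dependency (``pip install pyyaml``), and the field values simply mirror the example.

.. code-block:: python

  import yaml  # PyYAML, assumed to be installed separately

  # Same structure as the "mnist" entry in the YAML example above
  config = {
      "model_config_list": [
          {
              "name": "mnist",
              "model": "samples/data/MNISTnet_uint8_quant.tflite",
              "version": "1",
              "platform": "npu",
              "npu_device": "warboy(1)*1",
              "compiler_config": {"keep_unsignedness": True, "split_unit": 0},
          },
      ]
  }

  # Write the configuration so it can be passed via --model-config
  with open("model_config_example.yaml", "w") as f:
      yaml.safe_dump(config, f, sort_keys=False)

Additional entries, such as the ``ssd`` model above, can be appended to ``model_config_list`` in the same way.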
When you run a model server with a configuration file, you need to specify ``--model-config`` as follows. You can find the model files used in the above example in `furiosa-models/samples `_.

.. code-block:: sh

  $ cd furiosa-sdk/python/furiosa-server
  $ furiosa server --model-config samples/model_config_example.yaml
  libfuriosa_hal.so --- v0.11.0, built @ 43c901f
  2023-08-02T07:42:42.263133Z INFO furiosa_rt_core::driver::event_driven::device: DeviceManager has detected 1 NPUs
  2023-08-02T07:42:42.267247Z INFO furiosa_rt_core::driver::event_driven::device: [1] npu:6:1 (warboy-b0, 64dpes)
  2023-08-02T07:42:42.267264Z INFO furiosa_rt_core::driver::event_driven::coord: furiosa-rt (v0.10.0-rc6, rev: d021ff71d, built_at: 2023-07-31T19:05:26Z) is being initialized
  2023-08-02T07:42:42.267269Z INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:1-io-thread-0 thread has started
  2023-08-02T07:42:42.267398Z INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:1-commit-thread thread has started
  2023-08-02T07:42:42.267405Z INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:1-io-thread-1 thread has started
  2023-08-02T07:42:42.270837Z INFO furiosa_rt_core::driver::event_driven::coord: Loaded libcompiler 0.10.0 (rev: f8f05c built: 2023-07-26T09:49:17Z)
  2023-08-02T07:42:42.270851Z INFO furiosa_rt_core::driver::event_driven::coord: Loaded libhal-warboy 0.11.0 (rev: 43c901f built: 2023-04-19T14:04:55Z)
  2023-08-02T07:42:42.271144Z INFO furiosa_rt_core::driver::event_driven::coord: [NONAME] Runtime has started
  2023-08-02T07:42:42.273772Z INFO furiosa_rt_core::driver::event_driven::coord: Model#0001 is being loaded to npu:6:1
  2023-08-02T07:42:42.283260Z INFO furiosa_rt_core::driver::event_driven::coord: Compiling Model#0001 (target: warboy-b0, 64dpes, file: MNISTnet_uint8_quant.tflite, size: 18.2 kiB)
  2023-08-02T07:42:42.299091Z INFO furiosa_rt_core::driver::event_driven::coord: Model#0001 has been compiled successfully (took 0 secs)
  2023-08-02T07:42:42.299293Z INFO furiosa_rt_core::dag: Task Statistics: TaskStats { cpu: 5, npu: 1, alias: 0, coalesce: 0 }
  2023-08-02T07:42:42.300701Z INFO furiosa_rt_core::driver::event_driven::coord: NpuApi (AsyncNpuApiImpl) has started..
  2023-08-02T07:42:42.300721Z INFO furiosa_rt_core::driver::event_driven::coord: Creating 1 Contexts on npu:6:1 (DRAM usage: 6.0 kiB / 16.0 GiB, SRAM usage: 124.0 kiB / 64.0 MiB)
  2023-08-02T07:42:42.300789Z INFO furiosa_rt_core::driver::event_driven::coord: npu:6:1 has scheduled to Model#0001
  2023-08-02T07:42:42.304216Z WARN furiosa_rt_core::consts::envs: NPU_DEVNAME will be deprecated. Use FURIOSA_DEVICES instead.
  2023-08-02T07:42:42.313084Z INFO furiosa_rt_core::driver::event_driven::device: DeviceManager has detected 1 NPUs
  2023-08-02T07:42:42.315470Z INFO furiosa_rt_core::driver::event_driven::device: [1] npu:6:0 (warboy-b0, 64dpes)
  2023-08-02T07:42:42.315483Z INFO furiosa_rt_core::driver::event_driven::coord: furiosa-rt (v0.10.0-rc6, rev: d021ff71d, built_at: 2023-07-31T19:05:26Z) is being initialized
  2023-08-02T07:42:42.315560Z INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:0-io-thread-1 thread has started
  2023-08-02T07:42:42.315610Z INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:0-io-thread-0 thread has started
  2023-08-02T07:42:42.315657Z INFO furiosa_rt_core::npu::async_impl::threaded: npu:6:0-commit-thread thread has started
  2023-08-02T07:42:42.319127Z INFO furiosa_rt_core::driver::event_driven::coord: Loaded libcompiler 0.10.0 (rev: f8f05c built: 2023-07-26T09:49:17Z)
  2023-08-02T07:42:42.319141Z INFO furiosa_rt_core::driver::event_driven::coord: Loaded libhal-warboy 0.11.0 (rev: 43c901f built: 2023-04-19T14:04:55Z)
  2023-08-02T07:42:42.319364Z INFO furiosa_rt_core::driver::event_driven::coord: [NONAME] Runtime has started
  2023-08-02T07:42:42.324283Z INFO furiosa_rt_core::driver::event_driven::coord: Model#0002 is being loaded to npu:6:0
  2023-08-02T07:42:42.333521Z INFO furiosa_rt_core::driver::event_driven::coord: Compiling Model#0002 (target: warboy-b0, 64dpes, file: SSD512_MOBILENET_V2_BDD_int_without_reshape.tflite, size: 5.2 MiB)
  2023-08-02T07:42:42.814260Z INFO furiosa_rt_core::driver::event_driven::coord: Model#0002 has been compiled successfully (took 0 secs)
  2023-08-02T07:42:42.815406Z INFO furiosa_rt_core::dag: Task Statistics: TaskStats { cpu: 26, npu: 1, alias: 0, coalesce: 0 }
  2023-08-02T07:42:42.893745Z INFO furiosa_rt_core::driver::event_driven::coord: NpuApi (AsyncNpuApiImpl) has started..
  2023-08-02T07:42:42.893772Z INFO furiosa_rt_core::driver::event_driven::coord: Creating 1 Contexts on npu:6:0 (DRAM usage: 1.0 MiB / 16.0 GiB, SRAM usage: 14.8 MiB / 64.0 MiB)
  2023-08-02T07:42:42.894265Z INFO furiosa_rt_core::driver::event_driven::coord: npu:6:0 has scheduled to Model#0002
  INFO:     Started server process [2448540]
  INFO:     Waiting for application startup.
  INFO:     Application startup complete.
  INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

Once a model server starts up, you can send inference requests over HTTP. If the model name is ``mnist`` and its version is ``1``, the endpoint of the model will be ``http://<host>:<port>/v2/models/mnist/versions/1/infer``, and it accepts ``POST`` HTTP requests. The following is an example that uses ``curl`` to send an inference request and receive the response.

.. code-block:: sh

  $ curl -X POST -H "Content-Type: application/json" \
    -d "@samples/mnist_input_sample_01.json" \
    http://localhost:8080/v2/models/mnist/versions/1/infer

  {"model_name":"mnist","model_version":"1","id":null,"parameters":null,"outputs":[{"name":"0","shape":[1,10],"datatype":"UINT8","parameters":null,"data":[0,0,0,1,0,255,0,0,0,0]}]}
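For reference, a response like the one above can be decoded by reshaping ``data`` according to ``shape`` and taking the index of the largest score as the predicted class. The following is a minimal sketch based only on the sample output above.

.. code-block:: python

  import numpy as np

  # The "outputs" entry copied from the sample response above
  output = {
      "name": "0",
      "shape": [1, 10],
      "datatype": "UINT8",
      "data": [0, 0, 0, 1, 0, 255, 0, 0, 0, 0],
  }

  # Reshape the flat data according to the reported shape: (1, 10)
  scores = np.array(output["data"], dtype=np.uint8).reshape(output["shape"])

  # The predicted digit is the class with the largest score
  predicted = int(scores.argmax(axis=1)[0])
  print(predicted)  # 5 for this sample response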
The following Python example does the same as the ``curl`` command in the above example.

.. code-block:: python

  import requests
  import mnist
  import numpy as np

  mnist_images = mnist.train_images().reshape((60000, 1, 28, 28)).astype(np.uint8)

  url = 'http://localhost:8080/v2/models/mnist/versions/1/infer'

  data = mnist_images[0:1].flatten().tolist()
  request = {
      "inputs": [{
          "name": "mnist",
          "datatype": "UINT8",
          "shape": (1, 1, 28, 28),
          "data": data
      }]
  }

  response = requests.post(url, json=request)
  print(response.json())

Endpoints
=======================================

The following table shows the REST API endpoints and their descriptions. The model server follows the KServe Predict Protocol Version 2, so you can find more details in `KServe Predict Protocol Version 2 - HTTP/REST `_.

.. list-table:: Endpoints of KServe Predict Protocol Version 2
   :widths: 50 50
   :header-rows: 1

   * - Method and Endpoint
     - Description
   * - GET /v2/health/live
     - Returns HTTP OK (200) if the inference server is able to receive and respond to metadata and inference requests. This API can be used directly for the Kubernetes livenessProbe.
   * - GET /v2/health/ready
     - Returns HTTP OK (200) if all the models are ready for inferencing. This API can be used directly for the Kubernetes readinessProbe.
   * - GET /v2/models/${MODEL_NAME}/versions/${MODEL_VERSION}
     - Returns the metadata of a model
   * - GET /v2/models/${MODEL_NAME}/versions/${MODEL_VERSION}/ready
     - Returns HTTP OK (200) if a specific model is ready for inferencing.
   * - POST /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
     - Runs an inference request and returns the result.
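As a usage example, the health and metadata endpoints above can be called with plain HTTP requests. The following is a minimal sketch using ``requests``; it assumes a server listening on ``localhost:8080`` that serves the ``mnist`` model with version ``1``, as in the earlier examples.

.. code-block:: python

  import requests

  base_url = "http://localhost:8080"  # host and port from the earlier examples

  # Server-level liveness and readiness (usable as Kubernetes probes)
  print(requests.get(f"{base_url}/v2/health/live").status_code)   # 200 when the server is up
  print(requests.get(f"{base_url}/v2/health/ready").status_code)  # 200 when all models are ready

  # Per-model readiness and metadata for the mnist model
  print(requests.get(f"{base_url}/v2/models/mnist/versions/1/ready").status_code)
  print(requests.get(f"{base_url}/v2/models/mnist/versions/1").json())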