Model Server (Serving Framework)

When a prepared model is deployed to a production environment, it is commonly exposed through gRPC or a REST API. For this use case, the FuriosaAI SDK provides a model server that supports KServe Predict Protocol Version 2.

The model server provides the following main features.

REST API support

Serving multiple models on multiple NPU devices from a single server

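Installing the Model Server

The minimum requirements for installing the model server are as follows.

Ubuntu 20.04 LTS (Debian bullseye) or later
FuriosaAI SDK required packages
Python 3.8 or later

If you need to set up a Python environment, see Python execution environment setup.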

To install with pip, run the following command.

$ pip install 'furiosa-sdk[server]'

Alternatively, download the source from GitHub and install it as follows.

$ git clone https://github.com/furiosa-ai/furiosa-sdk.git
$ cd furiosa-sdk/python/furiosa-server
$ pip install .

Running the Model Server

The model server is started with the furiosa server command. Running furiosa server --help prints the following help message.

$ furiosa server --help
libfuriosa_hal.so --- v0.11.0, built @ 43c901f

 Usage: furiosa-server [OPTIONS]

 Start serving models from FuriosaAI model server

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --log-level            [ERROR|INFO|WARN|DEBUG|TRACE]  [default: LogLevel.INFO]                   │
│ --model-name           TEXT                           Model name [default: None]                 │
│ --model-path           TEXT                           Path to a model file (tflite, onnx are     │
│                                                       supported)                                 │
│                                                       [default: None]                            │
│ --model-version        TEXT                           Model version [default: default]           │
│ --host                 TEXT                           IPv4 address to bind [default: 0.0.0.0]    │
│ --http-port            INTEGER                        HTTP port to bind [default: 8080]          │
│ --model-config         FILENAME                       Path to a model config file                │
│                                                       [default: None]                            │
│ --server-config        FILENAME                       Path to a server config file               │
│                                                       [default: None]                            │
│ --help                                                Show this message and exit.                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

To serve a single model, pass the path to a model file in tflite or onnx format and a model name on the command line.

$ cd furiosa-sdk
$ furiosa server \
--model-path examples/assets/quantized_models/MNISTnet_uint8_quant_without_softmax.tflite \
--model-name mnist

The --model-path option specifies a model stored on the local file system. To make the model server listen on a specific address and port, set --host and --http-port, respectively.
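When the server is launched from a deployment script, the same options can be assembled programmatically. The following is a minimal sketch, assuming the furiosa CLI is available on PATH. build_server_command is a hypothetical helper, not part of the SDK, and it uses only flags that appear in the help output above.

```python
import shlex
from typing import List, Optional


def build_server_command(
    model_path: str,
    model_name: str,
    host: Optional[str] = None,
    http_port: Optional[int] = None,
) -> List[str]:
    """Build the argv list for `furiosa server` (hypothetical helper)."""
    cmd = ["furiosa", "server", "--model-path", model_path, "--model-name", model_name]
    # Only add the bind options when they are given, so the CLI defaults apply otherwise.
    if host is not None:
        cmd += ["--host", host]
    if http_port is not None:
        cmd += ["--http-port", str(http_port)]
    return cmd


cmd = build_server_command(
    "examples/assets/quantized_models/MNISTnet_uint8_quant_without_softmax.tflite",
    "mnist",
    host="127.0.0.1",
    http_port=9000,
)
# Print the command as a shell-quoted string for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.Popen(cmd) to start the server as a child process.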

Running the Model Server with a Model Configuration

If you need more advanced compiler or serving settings, or if you repeatedly use the same options, you can use a model configuration file.

model_config_list:
  - name: mnist
    path: "samples/data/MNISTnet_uint8_quant.tflite"
    version: 1
    npu_device: npu0pe0
    compiler_config:
      keep_unsignedness: true
      split_unit: 0
  - name: ssd
    path: "samples/data/tflite/SSD512_MOBILENET_V2_BDD_int_without_reshape.tflite"
    version: 1
    npu_device: npu0pe1

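Once the configuration above is ready, start the server with the --model-config option pointing to the configuration file. The models used in this example can be found in the samples directory of the Furiosa Server GitHub repository. With the models and the model configuration in place, you can run the example below.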

$ cd furiosa-sdk/python/furiosa-server
$ furiosa server --model-config samples/model_config_example.yaml
libfuriosa_hal.so --- v0.11.0, built @ 43c901f
Saving the compilation log into /root/.local/state/furiosa/logs/compile-20230509151914-axpfej.log
Using furiosa-compiler 0.9.0 (rev: e626c458c built at 2023-04-19T13:49:26Z)
2023-05-09T06:19:14.560585Z  INFO nux::npu: Npu (npu0pe0) is being initialized
2023-05-09T06:19:14.565216Z  INFO nux: NuxInner create with pes: [PeId(0)]
Saving the compilation log into /root/.local/state/furiosa/logs/compile-20230509151914-d063sw.log
Using furiosa-compiler 0.9.0 (rev: e626c458c built at 2023-04-19T13:49:26Z)
2023-05-09T06:19:14.591795Z  INFO nux::npu: Npu (npu0pe1) is being initialized
2023-05-09T06:19:14.595298Z  INFO nux: NuxInner create with pes: [PeId(0)]
INFO:     Started server process [1184080]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

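Once the model server has started, you can run inference by sending ordinary HTTP requests. To send an inference request to the model whose name is mnist and whose version is 1 in the model configuration, send a POST request to http://<host>:<port>/v2/models/mnist/versions/1/infer.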

The following example sends the same request from Python code.

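import mnist
import numpy as np
import requests

# Load the MNIST training images as uint8 tensors of shape (N, 1, 28, 28).
mnist_images = mnist.train_images().reshape((60000, 1, 28, 28)).astype(np.uint8)
url = 'http://localhost:8080/v2/models/mnist/versions/1/infer'

# Send only the first image; the tensor data is flattened into a one-dimensional list.
data = mnist_images[0:1].flatten().tolist()
request = {
    "inputs": [{
        "name": "mnist",
        "datatype": "UINT8",
        "shape": [1, 1, 28, 28],
        "data": data
    }]
}

# POST the request as JSON and print the decoded response body.
response = requests.post(url, json=request)
print(response.json())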
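In KServe Predict Protocol Version 2, the inference response carries an outputs list whose entries have name, shape, datatype, and data fields. The sketch below shows one way to read the predicted digit from such a response. It assumes that the model returns a single output holding 10 class scores as a flat list. The helper name predicted_class and the sample response are illustrative only and are not actual server output.

```python
from typing import Any, Dict


def predicted_class(response_json: Dict[str, Any]) -> int:
    """Return the index of the highest score in the first output tensor."""
    scores = response_json["outputs"][0]["data"]
    return max(range(len(scores)), key=lambda i: scores[i])


# Made-up response for illustration only (not real server output).
sample_response = {
    "model_name": "mnist",
    "model_version": "1",
    "outputs": [{
        "name": "output_0",
        "datatype": "UINT8",
        "shape": [1, 10],
        "data": [0, 0, 3, 1, 0, 250, 0, 0, 2, 0],
    }],
}

print(predicted_class(sample_response))  # prints 5
```

With a live server, the same helper could be applied to response.json() from the request above, provided the output layout matches these assumptions.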

Endpoint Information

The following table lists the main REST API endpoints provided by the model server. The model server follows KServe Predict Protocol Version 2 - HTTP/REST, so more details can be found at that link.

Endpoints of KServe Predict Protocol Version 2
Method and Endpoint	Description
GET /v2/health/live	Returns HTTP status OK if the server can handle requests (corresponds to the Kubernetes livenessProbe)
GET /v2/health/ready	Returns HTTP status OK once all models are ready for inference (corresponds to the Kubernetes readinessProbe)
GET /v2/models/${MODEL_NAME}/versions/${MODEL_VERSION}	Returns the model metadata
GET /v2/models/${MODEL_NAME}/versions/${MODEL_VERSION}/ready	Returns HTTP status OK if the given model version is ready to handle inference requests
POST /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer	Sends an inference request
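As an illustration of how a client might use these endpoints, the sketch below builds model URLs from the patterns in the table and polls /v2/health/ready before sending traffic. It uses only the Python standard library. The helper names (model_url, http_ok, wait_until_ready) and the timeout values are assumptions made for this example and are not part of the SDK.

```python
import time
import urllib.request
from typing import Callable, Optional


def model_url(base: str, name: str, version: Optional[str] = None, suffix: str = "") -> str:
    """Build a model endpoint URL; the version segment is included only if given."""
    url = f"{base.rstrip('/')}/v2/models/{name}"
    if version is not None:
        url += f"/versions/{version}"
    return url + suffix


def http_ok(url: str) -> bool:
    """Return True if a GET request to url is answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors and non-2xx HTTP errors
        return False


def wait_until_ready(
    base: str,
    timeout: float = 30.0,
    interval: float = 0.5,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll /v2/health/ready until it reports OK or the timeout elapses."""
    url = f"{base.rstrip('/')}/v2/health/ready"
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


base = "http://localhost:8080"
print(model_url(base, "mnist", "1", "/infer"))
print(model_url(base, "mnist", "1", "/ready"))
```

A client would typically call wait_until_ready(base) once after starting the server and send inference requests only if it returns True. The probe parameter lets the HTTP call be replaced in tests.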