모델 서버 (서빙 프레임워크)

준비된 모델을 실제 서비스 환경에 배포할 때 모델을 GRPC나 REST API를 통하는 경우가 일반적이다. 이런 유스케이스를 위해 FuriosaAI SDK는 KServe Predict Protocol Version 2 를 지원하는 모델 서버를 제공한다.

모델 서버는 다음 주요 기능을 제공한다.

REST API 지원

다수의 NPU 디바이스 및 다수의 모델를 하나의 서버로 서빙 지원

모델 서버 설치

모델 서버 설치를 위한 최소 요구사항은 다음과 같다.

Ubuntu 18.04 LTS (Debian buster) 또는 상위 버전
FuriosaAI SDK 필수 패키지
Python 3.7 또는 상위 버전

Python 실행환경 준비가 필요하다면 Python 실행 환경 구성 를 참고한다.

간단하게 다음 커맨드를 실행해주세요.

$ pip install 'furiosa-sdk[server]'

아래와 같이 Github에서 소스를 다운받아 설치한다.

$ git clone https://github.com/furiosa-ai/furiosa-sdk.git
$ cd furiosa-sdk/python/furiosa-server
$ pip install .

모델 서버 실행

모델 서버는 명령행 도구 furiosa server 커맨드를 통해 실행 할 수 있다. furiosa server --help 을 실행하면 아래와 같은 도움말을 볼 수 있다.

$ furiosa server --help
Usage: furiosa server [OPTIONS]

    Start serving models from FuriosaAI model server

Options:
    --log-level [ERROR|INFO|WARN|DEBUG|TRACE]
                                    [default: LogLevel.INFO]
    --model-path TEXT               Path to Model file (tflite, onnx are
                                    supported)
    --model-name TEXT               Model name used in URL path
    --model-version TEXT            Model version used in URL path  [default:
                                    default]
    --host TEXT                     IP address to bind  [default: 0.0.0.0]
    --http-port INTEGER             HTTP port to listen to requests  [default:
                                    8080]
    --model-config FILENAME         Path to a config file about models with
                                    specific configurations
    --server-config FILENAME        Path to Model file (tflite, onnx are
                                    supported)
    --install-completion [bash|zsh|fish|powershell|pwsh]
                                    Install completion for the specified shell.
    --show-completion [bash|zsh|fish|powershell|pwsh]
                                    Show completion for the specified shell, to
                                    copy it or customize the installation.
    --help                          Show this message and exit.

간단한 모델 서빙은 커맨드로 tflite, onnx 포맷의 모델 이미지의 패스와 모델 이름을 지정하면 실행할 수 있다.

$ cd furiosa-sdk
$ furiosa server \
--model-path examples/assets/quantized_models/MNISTnet_uint8_quant_without_softmax.tflite \
--model-name mnist

--model-path 옵션으로 로컬 파일 시스템에 저장된 모델을 지정할 수 있다. 또한 모델 서버가 지정한 호스트 이름과 포트로 연결 요청을 대기하기 원하는 경우 --host, --host-port 로 각각 설정할 수 있다.

모델 설정을 이용한 모델 서버 실행

컴파일 옵션이나 서빙에 대한 더 고급 설정이 필요한 경우 또는 반복적으로 같은 옵션을 사용하는 경우 모델 설정을 활용할 수 있다.

model_config_list:
- name: mnist
    path: "samples/data/MNISTnet_uint8_quant.tflite"
    version: 1
    npu_device: npu0pe0
    compiler_config:
        keep_unsignedness: true
        split_unit: 0
- name: ssd
    path: "samples/data/tflite/SSD512_MOBILENET_V2_BDD_int_without_reshape.tflite"
    version: 1
    npu_device: npu0pe1

위와 같은 설정을 준비한 뒤에 아래와 같이 --model-config 옵션을 이용하여 설정 파일의 패스를 지정하여 실행할 수 있다. 위 예제 실행을 위해서는 모델이 필요한데 위 모델은 Furiosa Server Github 저장소 의 samples 디렉토리에서 찾을 수 있다. 모델과 모델 설정이 준비되어 있다면 아래 예제를 실행해볼 수 있다.

$ cd furiosa-sdk/python/furiosa-server
$ furiosa server --model-config samples/model_config_example.yaml

Saving the compilation log into /Users/hyunsik/.local/state/furiosa/logs/compile-20211126143917-2731kz.log
Using furiosa-compiler 0.5.0 (rev: 407c0c51f built at 2021-11-26 12:05:30)
2021-11-26T22:39:17.819518Z  INFO Npu (npu0pe0) is being initialized
2021-11-26T22:39:17.823511Z  INFO NuxInner create with pes: [PeId(0)]
...
INFO:     Started server process [62087]
INFO:uvicorn.error:Started server process [62087]
INFO:     Waiting for application startup.
INFO:uvicorn.error:Waiting for application startup.
INFO:     Application startup complete.
INFO:uvicorn.error:Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO:uvicorn.error:Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

모델 서버가 시작되고 나면 일반적인 HTTP 요청을 보내 모델의 추론 작업을 해볼 수 있다. 모델 설정에서 모델 이름이 mnist 이고 버전이 1 인 모델에 추론 요청을 보낼 때는 http://<host>:<port>/v2/models/mnist/version/1/infer 에 POST 요청을 보내면 된다.

아래 예제는 위와 동일한 요청을 Python 코드를 통해 보내는 예제이다.

import requests
import mnist
import numpy as np

mnist_images = mnist.train_images().reshape((60000, 1, 28, 28)).astype(np.uint8)
url = 'http://localhost:8080/v2/models/mnist/versions/1/infer'

data = mnist_images[0:1].flatten().tolist()
request = {
    "inputs": [{
        "name":
        "mnist",
        "datatype": "UINT8",
        "shape": (1, 1, 28, 28),
        "data": data
    }]
}

response = requests.post(url, json=request)
print(response.json())

엔드포인트(Endpoint) 정보

다음 테이블은 모델 서버가 제공하는 주요 REST API 엔드포인트 정보이다. 모델 서버는 KServe Predict Protocol Version 2 - HTTP/REST 를 따르고 있으므로 더 자세한 정보는 링크에서 찾아볼 수 있다.

Endpoints of KServe Predict Protocol Version 2
Method and Endpoint	Description
GET /v2/health/live	서버가 요청을 처리할 수 있는 상태면 HTTP 상태 Ok 리턴 (Kubernetes livenessProbe에 해당)
GET /v2/health/ready	모든 모델이 추론 작업을 위한 준비가 되면 HTTP 상태 Ok 리턴 (Kubernetes readinessProbe에 해당)
GET /v2/models/${MODEL_NAME}/versions/${MODEL_VERSION}	모델 메타데이터 반환
GET /v2/models/${MODEL_NAME}/versions/${MODEL_VERSION}/ready	특정 버전의 모델이 추론 요청을 처리할 준비가 되었다면 HTTP 상태 Ok 리턴
POST /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer	추론 요청