Compiler
The FuriosaAI compiler compiles TFLite and ONNX models (OpSet 13 or lower) into programs that execute inference using a FuriosaAI NPU together with the resources (CPU, memory, etc.) of the host machine. In this process, the compiler analyzes the model at the operator level and optimizes it, generating a program that maximizes NPU acceleration and host resource utilization. Even for models that are not well known, you can design models optimized for the NPU as long as supported operators are used well.
You can find the list of operators supported for NPU acceleration in List of Supported Operators for Warboy Acceleration.
furiosa-compiler
The most common way to use the compiler is to have it invoked automatically while the inference API or the NPU is being initialized. However, you can also compile a model and generate a program directly with the command line tool furiosa-compiler in a shell. You can install the furiosa-compiler command via the APT package manager.
$ apt install furiosa-compiler
The usage of furiosa-compiler is as follows:
$ furiosa-compiler --help
Furiosa SDK Compiler v0.10.0 (f8f05c8ea 2023-07-31T19:30:30Z)

Usage: furiosa-compiler [OPTIONS] <SOURCE>

Arguments:
  <SOURCE>
        Path to source file (tflite, onnx, and other IR formats, such as dfg, cdfg, gir, lir)

Options:
  -o, --output <OUTPUT>
        Writes output to <OUTPUT>
        [default: output.<TARGET_IR>]
  -b, --batch-size <BATCH_SIZE>
        Specifies the batch size which is effective when SOURCE is TFLite, ONNX, or DFG
  --target-ir <TARGET_IR>
        (experimental) Target IR - possible values: [enf]
        [default: enf]
  --target-npu <TARGET_NPU>
        Target NPU family - possible values: [warboy, warboy-2pe]
        [default: warboy-2pe]
  --dot-graph <DOT_GRAPH>
        Filename to write DOT-formatted graph to
  --analyze-memory <ANALYZE_MEMORY>
        Analyzes the memory allocation and save the report to <ANALYZE_MEMORY>
  -v, --verbose
        Shows details about the compilation process
  --no-cache
        Disables the compiler result cache
  -h, --help
        Print help (see a summary with '-h')
  -V, --version
        Print version
SOURCE is the file path of a TFLite or ONNX model. For NPU acceleration, you must use models quantized through Model Quantization.
You can omit the -o OUTPUT option, or you can designate the output file name yourself. When omitted, the default output file name is output.enf, where enf stands for Executable NPU Format. So, if you run the command as shown below, it will generate an output.enf file.
furiosa-compiler foo.onnx
If you designate the output file name as below, it will generate a foo.enf file.
furiosa-compiler foo.onnx -o foo.enf
The --target-npu option designates the target NPU of the generated binary.
NPU Family | Number of PEs | Value
---|---|---
Warboy | 1 | warboy
Warboy | 2 | warboy-2pe
If the generated program should target a Warboy that uses one PE independently, run the following command.
furiosa-compiler foo.onnx --target-npu warboy
When 2 PEs are fused, execute as follows.
furiosa-compiler foo.onnx --target-npu warboy-2pe
The --batch-size option specifies the batch size, i.e., the number of samples passed as input in a single inference call through the inference API. The larger the batch size, the higher the NPU utilization, since more data is processed at once and the cost of each inference run is amortized across the batch. However, if a larger batch size causes the required memory to exceed the NPU DRAM size, memory I/O between the host and the NPU may increase and lead to significant performance degradation. The default batch size is 1; an appropriate value can usually be found through trial and error.
For reference, the optimal batch sizes for some models included in the
MLPerf™ Inference Edge v2.0 benchmark are as follows.
Model | Optimal Batch
---|---
SSD-MobileNets-v1 | 2
Resnet50-v1.5 | 1
SSD-ResNet34 | 1
If your desired batch size is two, you can run the following command.
furiosa-compiler foo.onnx --batch-size 2
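Once compiled with a batch size of two, the program expects two samples per inference call. Below is a minimal sketch of running such a model through the Python SDK; it assumes foo.onnx was compiled into foo.enf with --batch-size 2 and that the model takes a single float32 tensor whose first dimension is the batch (the 2x3x224x224 shape is a hypothetical example):

import numpy as np
from furiosa.runtime import sync

with sync.create_runner("foo.enf") as runner:
    # Two samples stacked along the batch (first) dimension.
    batch = np.random.rand(2, 3, 224, 224).astype(np.float32)
    outputs = runner.run([batch])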
Using ENF files
After the compilation process, the final output of the FuriosaAI compiler is ENF (Executable NPU Format) data. In general, compilation takes from a few seconds to several minutes depending on the model. Once you have an ENF file, you can reuse it and skip the compilation process entirely. This is useful if you need to frequently create sessions or serve one model across several machines in a production environment.
For example, you can first create an ENF file as described in furiosa-compiler. Then, with the Python SDK, you can instantly create a runner without recompiling by passing the ENF file as an argument to the create_runner() function, as follows:
from furiosa.runtime import sync

# Loading an ENF file skips compilation; the runner is ready immediately.
with sync.create_runner("path/to/model.enf") as runner:
    outputs = runner.run(inputs)
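Note that an ENF file is the compiler's final output, so compile-time options such as --target-npu and --batch-size are already baked into it; to change them, you need to recompile from the original TFLite or ONNX model.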
Compiler Cache
The compiler cache allows user applications to reuse once-compiled results. It is especially helpful while developing applications, because compilation usually takes at least a couple of minutes.
By default, the compiler cache uses the local file system ($HOME/.cache/furiosa/compiler) as its cache storage. With additional configuration, you can also use Redis as a remote, distributed cache storage.
The compiler cache is enabled by default, but you can explicitly enable or disable it by setting FC_CACHE_ENABLED. This setting is effective in CLI tools, the Python SDK, and serving frameworks.
# Enable Compiler Cache
export FC_CACHE_ENABLED=1
# Disable Compiler Cache
export FC_CACHE_ENABLED=0
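If it is more convenient, the same variable can be set from Python instead of the shell. This is a minimal sketch, assuming the variable takes effect as long as it is set before the SDK triggers its first compilation:

import os

# Disable the compiler cache for this process; set before any compilation runs.
os.environ["FC_CACHE_ENABLED"] = "0"

from furiosa.runtime import sync

# Compiling an ONNX model here will neither consult nor populate the cache.
with sync.create_runner("foo.onnx") as runner:
    ...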
The default cache location is $HOME/.cache/furiosa/compiler, but you can explicitly specify the cache storage by setting the shell environment variable FC_CACHE_STORE_URL. If you want to use Redis as the cache storage, specify a URL starting with redis:// or rediss:// (over SSL).
# When you want to specify a cache directory
export FC_CACHE_STORE_URL=/tmp/cache
# When you want to specify a Redis cluster as the cache storage
export FC_CACHE_STORE_URL=redis://:<PASSWORD>@127.0.0.1:6379
# When you want to specify a Redis cluster over SSL as the cache storage
export FC_CACHE_STORE_URL=rediss://:<PASSWORD>@127.0.0.1:25945
The cache is valid for 30 days by default, but you can explicitly specify the cache lifetime, in seconds, with the environment variable FC_CACHE_LIFETIME.
# 2 hours cache lifetime
export FC_CACHE_LIFETIME=7200
You can further control the cache behavior according to your purpose as follows:
Value (secs) | Description | Example
---|---|---
N > 0 | Cache entries remain valid for N seconds | 7200 (2 hours)
0 | All previous cache entries are invalidated (compile the model without the cache) | 0
N < 0 | Cache entries never expire (useful for a read-only cache) | -1