FuriosaAI NPU
FuriosaAI NPU is a chip whose architecture is optimized for deep learning inference, delivering high inference performance while maintaining cost-efficiency. The chip is tuned for low batch sizes: for such inference requests, all of the chip's resources are fully utilized to achieve low latency.
The large on-chip memory can also hold most major CNN models in their entirety, eliminating memory bottlenecks and achieving high energy efficiency.
FuriosaAI NPU supports key CNN models used in various vision tasks, including Image Classification, Object Detection, OCR, and Super Resolution. In particular, the chip demonstrates superior performance in computations such as depthwise/group convolution, which drive high accuracy and computational efficiency in state-of-the-art CNN models.
FuriosaAI Warboy
Warboy, FuriosaAI's first-generation NPU, delivers 64 TOPS of performance and includes 32 MB of SRAM. Warboy consists of two processing elements (PEs), each of which delivers 32 TOPS and can be deployed independently. When response time for a model must be minimized, the two PEs can be fused into a single, larger PE that aggregates the full 64 TOPS. Depending on the model size or performance requirements, users may either 1) fuse the PEs to optimize for response time, or 2) use them independently to optimize for throughput, as illustrated in the sketch below.
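The following is a minimal sketch of the two operating modes, assuming the furiosa.runtime Python session API and a device naming scheme in which "npu0pe0" and "npu0pe1" address individual PEs and "npu0pe0-1" addresses the fused pair; the exact device names and call signatures may differ by SDK version and host configuration.

```python
# Illustrative sketch only: device names and API details are assumptions,
# not a definitive reference for the FuriosaAI SDK.
from furiosa.runtime import session

MODEL_PATH = "model.onnx"  # a model exported for inference (ONNX or TensorFlow Lite)

# Latency-oriented mode: fuse both PEs into one larger PE.
latency_session = session.create(MODEL_PATH, device="npu0pe0-1")

# Throughput-oriented mode: run two independent sessions, one per PE.
throughput_session_a = session.create(MODEL_PATH, device="npu0pe0")
throughput_session_b = session.create(MODEL_PATH, device="npu0pe1")
```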
FuriosaAI SDK provides the compiler, runtime software, and profiling tools for the FuriosaAI NPU. It also supports the INT8 quantization scheme used as a standard in TensorFlow and PyTorch, and provides tools to convert floating-point models through post-training quantization. With the FuriosaAI SDK, users can compile trained or exported models in formats commonly used for inference (TensorFlow Lite or ONNX) and accelerate them on the FuriosaAI NPU.
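As an example of this workflow, the sketch below loads an already-quantized INT8 ONNX model and runs it on the NPU through the furiosa.runtime Python package. The file name, tensor metadata accessors, and run call are assumptions made for illustration and should be checked against the installed SDK's documentation.

```python
# Hedged sketch: compile-and-run of a quantized model on the FuriosaAI NPU.
# "quantized_model.onnx" is a hypothetical, pre-quantized INT8 model file.
import numpy as np
from furiosa.runtime import session

# Creating a session compiles the model for the NPU (assumed behavior).
sess = session.create("quantized_model.onnx")

# Build a dummy input matching the model's first input tensor shape.
input_desc = sess.inputs()[0]
dummy_input = np.zeros(input_desc.shape, dtype=np.uint8)

# Run inference on the NPU and inspect the first output tensor.
outputs = sess.run(dummy_input)
print(outputs[0].numpy().shape)

sess.close()
```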
FuriosaAI Warboy HW Specifications
The chip is built with 5 billion transistors on a 180 mm² die, runs at a clock speed of 2.0 GHz, and delivers a peak performance of 64 TOPS (INT8). It supports LPDDR4X memory at up to 4266 Mbps, provides 66 GB/s of DRAM bandwidth, and connects to the host over PCIe Gen4 x8.
| Item | Specification |
| --- | --- |
| Peak Performance | 64 TOPS |
| On-chip SRAM | 32 MB |
| Host Interface | PCIe Gen4 8-lane |
| Form Factor | Full-Height Half-Length (FHHL), Half-Height Half-Length (HHHL) |
| Thermal Solution | Passive Fan, Active Fan |
| TDP | 40 - 60 W (configurable) |
| Operating Temperature | 0 ~ 50℃ |
| Clock Speed | 2.0 GHz |
| DDR Speed | 4266 Mbps |
| Memory Type | LPDDR4X |
| Memory Size | 16 GB (max. 32 GB) |
| Peak Memory Bandwidth | 66 GB/s |
FuriosaAI Warboy Performance
Results submitted to MLCommons can be found in the MLPerf™ Inference Edge v2.0 results.