Each tile consists of a five-port router, a processing engine (PE), a 32-entry register file (six read ports and four write ports), 2 KB of data memory, 3 KB of instruction memory, and two floating-point units (FPUs). Each FPU is a single-precision floating-point multiply-accumulator (FMAC) with a nine-stage pipeline and a sustained rate of 2 FLOPs/cycle.
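As a rough illustration of what these figures imply, the sketch below computes the per-tile peak throughput from the two values given above (two FPUs, 2 FLOPs/cycle each). The function names, and the tile count and clock frequency passed in the usage lines, are hypothetical placeholders chosen for the arithmetic, not figures taken from this text.

```python
# Values stated in the text above.
FPUS_PER_TILE = 2         # two floating-point units per tile
FLOPS_PER_FPU_CYCLE = 2   # each FMAC sustains 2 FLOPs/cycle (multiply + accumulate)


def tile_flops_per_cycle() -> int:
    """Peak single-precision FLOPs one tile can sustain per clock cycle."""
    return FPUS_PER_TILE * FLOPS_PER_FPU_CYCLE


def peak_gflops(num_tiles: int, clock_ghz: float) -> float:
    """Aggregate peak in GFLOPS for an assumed tile count and clock.

    num_tiles and clock_ghz are illustrative inputs, not values from the text.
    """
    return num_tiles * tile_flops_per_cycle() * clock_ghz


if __name__ == "__main__":
    print(tile_flops_per_cycle())                   # 4
    print(peak_gflops(num_tiles=4, clock_ghz=1.0))  # 16.0
```

This is a peak figure only: it assumes both FMAC pipelines are kept full every cycle, which the nine-stage pipeline depth makes dependent on having enough independent operations in flight.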
The PE implements a non-x86 VLIW ISA with a 96-bit instruction word that encodes up to 8 operations per cycle. The instruction latencies for the different functional blocks in the PE are as follows: