When Tachyum unveiled the concept of its Prodigy universal processor at Hot Chips 2018, it caused a stir with a chip designed to run any code using a dynamic binary translator, demonstrating high performance on both native and translated code. It took the company a while to design the actual hardware, but it is now taking pre-orders on review kits and has divulged the exact specifications of Prodigy. They certainly look impressive, but they are also scary, with a thermal design power of 950W per chip.
Tremendous performance at tremendous power
Each Tachyum Prodigy processor has up to 128 proprietary cores coupled with 16 channels of DDR5 memory (for a 1024-bit interface) supporting a data transfer rate of up to 7200 MT/s (and thus providing up to 921.6 GB/s of bandwidth) as well as 64 PCIe 5.0 lanes. Additionally, the chip supports up to 8TB of DDR5 memory in total, which matches what we’ll see with upcoming server processors from other manufacturers. As for clock rates, Tachyum’s Prodigy is designed to operate at up to 5.7 GHz and is a product of TSMC’s performance-optimized N5P process technology.
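The quoted bandwidth figure follows directly from the channel count and transfer rate. As a back-of-envelope check (assuming 64-bit channels, which is what the stated 1024-bit aggregate interface implies):

```python
# Back-of-envelope check of Prodigy's quoted peak DRAM bandwidth.
channels = 16             # DDR5 channels, per the published specs
bus_width_bits = 64       # bits per channel -> 1024-bit aggregate interface
transfers_per_sec = 7200e6  # DDR5-7200

bandwidth_bytes = channels * (bus_width_bits // 8) * transfers_per_sec
print(bandwidth_bytes / 1e9)  # -> 921.6 (GB/s)
```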
When it comes to performance, Tachyum expects its flagship Prodigy T16128-AIX processor to deliver up to 90 FP64 TFLOPS for HPC as well as up to 12 ‘AI PetaFLOPS’ for inference and training, presumably when running native code, while consuming up to 950 W (and requiring liquid cooling), according to specifications published by the company and by Golem.de. Meanwhile, Tachyum’s Prodigy processors can operate in both 2-way and 4-way configurations. To put the numbers into context, AMD’s Instinct MI250X has a peak throughput of 96 FP64 TFLOPS for HPC at around 560W, while Nvidia’s H100 SXM5 can deliver up to 2 INT8/FP8 PetaOPS/PetaFLOPS for AI (up to 4 PetaOPS/PetaFLOPS with sparsity) at 700W. Still, neither of those compute GPUs works for general-purpose workloads. And this is exactly where it gets interesting.
A new processor is born
Tachyum’s Prodigy is a universal homogeneous processor containing up to 128 proprietary 64-bit VLIW cores that feature two 1024-bit vector units and one 4096-bit matrix unit per core. Additionally, each core has a 64 KB instruction cache, a 64 KB data cache, and 1 MB of L2 cache, and can use other cores’ unused L2 caches as a victim L3 cache.
Tachyum’s VLIW cores are in-order cores, but with proper compiler optimizations they support 4-way instruction issue, according to Tachyum CEO and co-founder Radoslav Danilak in an interview with Golem.de. He also re-emphasized that the Prodigy instruction set architecture can achieve very high instruction-level parallelism in software using so-called poison bits.
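The poison-bit mechanism Danilak mentions is a classic way to let an in-order machine hoist loads above branches: a speculatively executed load that would fault does not trap immediately; instead it marks its destination register as poisoned, and the trap is raised only if that value is actually consumed. A minimal sketch of the idea (hypothetical names, not Prodigy’s actual ISA semantics):

```python
class Reg:
    """A register value carrying a poison bit for speculative loads."""
    def __init__(self, value=0, poisoned=False):
        self.value = value
        self.poisoned = poisoned

def speculative_load(memory, addr):
    # A load hoisted above a branch: a bad address poisons the destination
    # register instead of trapping right away.
    if addr not in memory:
        return Reg(poisoned=True)
    return Reg(memory[addr])

def consume(reg):
    # The fault is deferred until the speculative value is actually used.
    if reg.poisoned:
        raise RuntimeError("deferred fault: poisoned register consumed")
    return reg.value

memory = {0x100: 42}
r = speculative_load(memory, 0x100)
print(consume(r))  # -> 42
bad = speculative_load(memory, 0xBAD)
# consume(bad) raises only if the branch actually needs the value
```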
These cores run native code written and explicitly optimized for Prodigy (where the VLIW architecture promises to shine) as well as x86, Arm, and RISC-V binaries via software emulation with, according to the company, no significant performance degradation. Historically, all attempts to get VLIW processors to run x86 code have failed (e.g. Transmeta’s Crusoe, Intel’s Itanium), primarily due to architectural peculiarities and emulation inefficiencies. Danilak admits that QEMU’s binary translation degrades performance by 30%–40% (without disclosing benchmarks), but hopes real-world performance will still be high enough to be competitive. Meanwhile, some programs are already natively supported.
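Dynamic binary translators of the kind Prodigy relies on generally work by translating short blocks of guest instructions into host code once, caching the result keyed by the guest program counter, and reusing it on every later execution. A toy sketch of that translate-and-cache loop (illustrative only, unrelated to Tachyum’s or QEMU’s actual translators):

```python
# Toy dynamic binary translator: guest "instructions" are (op, arg) tuples,
# translated once per basic block into Python closures and cached.

GUEST_PROGRAM = [("add", 5), ("mul", 3), ("add", 1)]  # one basic block

translation_cache = {}  # guest PC of block start -> compiled host function

def translate_block(pc):
    # Slow path: turn each guest instruction into a host closure.
    ops = []
    for op, arg in GUEST_PROGRAM[pc:]:
        if op == "add":
            ops.append(lambda acc, a=arg: acc + a)
        elif op == "mul":
            ops.append(lambda acc, a=arg: acc * a)
    def host_block(acc):
        for f in ops:
            acc = f(acc)
        return acc
    return host_block

def execute(pc, acc):
    # Fast path: reuse previously translated code if the block is cached.
    if pc not in translation_cache:
        translation_cache[pc] = translate_block(pc)
    return translation_cache[pc](acc)

print(execute(0, 0))  # (0 + 5) * 3 + 1 -> 16
```

Subsequent calls with the same guest PC skip translation entirely, which is why translated code can approach native speed on hot loops even though the first pass over each block is expensive.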
“We support GCC and Linux natively, and FreeBSD now also works [on Prodigy]”, Danilak said. “Apache, MongoDB or Python already work natively, Pytorch and Tensorflow frameworks are also available.”
Tachyum points out that Prodigy is not an accelerator but a real processor that will compete with AMD, Intel and others. To ensure that the processor can deliver competitive performance on general-purpose and AI workloads, the company has made numerous changes to its design implementation since it was first introduced in 2018.
“We are a CPU replacement not an AI accelerator company, we target cloud/hyperscalers and telcos,” Danilak said. “Over time, we expect to gain supercomputer customers, so we have doubled the width of vector/MAC units from 512-bit to 1024-bit [which also brings in necessary data paths for the 4,096-bit matrix operations for artificial intelligence].”
Indeed, one particular benefit Tachyum’s Prodigy promises is its ability to run different kinds of code. Assuming it can provide decent performance at decent power on general-purpose workloads (instances), this may give AWS, Microsoft Azure, and others additional flexibility, as they will be able to use the same machines for AI, HPC, and general-purpose instances as needed. This will, of course, require some actual software work from various parties, but it could work, at least in theory.
Still not here
It should be noted that Tachyum still does not have Prodigy silicon. As a result, all performance projections are the product of simulations, and the only thing the company currently has is an FPGA prototype of its processor.
Meanwhile, the company recently started taking pre-orders for its Prodigy review platform, which will be based on actual Prodigy silicon. Companies must place orders by July 31, 2022, and delivery of actual hardware is approximately “six to nine months after receipt of order.”
Tachyum expects to tape out the first Prodigy silicon (which could measure less than 500mm^2) in mid-August if all goes according to plan. After that, the company expects to get the first samples of its chip around December, and if the chip works fine, the company plans to start sampling (i.e. ship evaluation kits). Typically, silicon development takes about a year after the initial chip comes back from the fab, but Tachyum hopes its first processor will perform as expected so it can start actual mass production in the first half of 2023.
In the future, Danilak envisions a Prodigy 2 processor made using one of TSMC’s N3 nodes that will deliver twice the performance at the same power with PCIe Gen6 support.