Simultaneous heterogeneous multithreading (SHMT) may be a way to harness the power of a device’s CPU, GPU, and AI accelerators all at once, according to a research paper from the University of California, Riverside. The paper claims that this new multithreading technique doubles performance, halves power consumption, and quadruples efficiency. But don’t get too excited just yet: the work is still a proof of concept in its early stages.
Many devices already use multithreading techniques such as simultaneous multithreading (SMT), which splits a single processor core into two threads for more efficient computing. SHMT, by contrast, spans multiple devices: CPUs, GPUs, and at least one AI accelerator. The idea is to have each processor carry out separate tasks simultaneously, while also splitting GPU and AI resources across multiple tasks.
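To make the contrast with SMT concrete, here is a minimal Python sketch (not from the paper) of the core idea: shards of one computation running on several compute units at once. The worker functions are hypothetical stand-ins for a real CPU routine, GPU kernel, and accelerator op:

```python
import concurrent.futures

# Hypothetical stand-ins: in a real SHMT runtime these would be a CPU
# routine, a GPU kernel launch, and an AI-accelerator op, each operating
# on its own shard of the same problem.
def run_on_cpu(shard):
    return sum(x * x for x in shard)

def run_on_gpu(shard):
    return sum(x * x for x in shard)   # stand-in for a GPU kernel

def run_on_npu(shard):
    return sum(x * x for x in shard)   # stand-in for an accelerator op

data = list(range(9_000))
shards = [data[0:3000], data[3000:6000], data[6000:9000]]
workers = [run_on_cpu, run_on_gpu, run_on_npu]

# All three "devices" chew on pieces of the same task simultaneously,
# the heterogeneous analogue of SMT's two threads sharing one core.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(fn, shard) for fn, shard in zip(workers, shards)]
    result = sum(f.result() for f in futures)

print(result)  # combined result from all "devices"
```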
According to the paper, written by Hung-Wei Tseng and Kuan-Chieh Hsu, SHMT can improve performance by a factor of 1.95 and reduce power consumption by 51%. These results were recorded on a Maxwell-era Nvidia Jetson Nano with a quad-core Arm Cortex-A57 CPU, 4GB of LPDDR4 memory, and a 128-core GPU. The researchers also installed a Google Edge TPU in the Jetson’s M.2 slot to serve as the AI accelerator, since the Jetson Nano doesn’t come with one.
The researchers achieved this result by creating a quality-aware work-stealing (QAWS) scheduler. Essentially, the scheduler is tuned to avoid high error rates while balancing the workload evenly across all components. Under the QAWS policy, tasks that require high precision and accuracy are not assigned to error-prone AI accelerators, and tasks that fail to meet performance expectations are dynamically reassigned to other components.
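As a rough illustration, here is a toy Python sketch of a quality-aware work-stealing policy in the spirit of the paper’s description. The device names, error rates, and thresholds are all illustrative assumptions, not the authors’ actual implementation:

```python
import random

random.seed(0)  # reproducible demo

# Toy device table: the AI accelerator ("npu") is modeled as fast but
# error-prone; the error rates and limit below are made-up numbers.
DEVICES = {
    "cpu": {"error_rate": 0.00, "queue": []},
    "gpu": {"error_rate": 0.01, "queue": []},
    "npu": {"error_rate": 0.05, "queue": []},  # error-prone accelerator
}
PRECISION_LIMIT = 0.02  # max tolerable error for high-precision tasks

def assign(task):
    # Quality-aware placement: keep precision-critical work off devices
    # whose approximation error exceeds the task's tolerance.
    limit = PRECISION_LIMIT if task["precise"] else 1.0
    eligible = [d for d, info in DEVICES.items() if info["error_rate"] <= limit]
    # Balance load by picking the shortest eligible queue.
    target = min(eligible, key=lambda d: len(DEVICES[d]["queue"]))
    DEVICES[target]["queue"].append(task)

def steal():
    # Work stealing: an idle device takes a task from the most loaded one,
    # so a component falling behind its performance target gets relieved.
    busiest = max(DEVICES, key=lambda d: len(DEVICES[d]["queue"]))
    idle = min(DEVICES, key=lambda d: len(DEVICES[d]["queue"]))
    if len(DEVICES[busiest]["queue"]) - len(DEVICES[idle]["queue"]) > 1:
        task = DEVICES[busiest]["queue"].pop()
        if not task["precise"] or DEVICES[idle]["error_rate"] <= PRECISION_LIMIT:
            DEVICES[idle]["queue"].append(task)
        else:
            DEVICES[busiest]["queue"].append(task)  # precise work can't move here

for i in range(12):
    assign({"id": i, "precise": random.random() < 0.3})
steal()
print({d: [t["id"] for t in info["queue"]] for d, info in DEVICES.items()})
```

The two rules mirror the paper’s description of the policy: placement respects each task’s accuracy tolerance, and stealing rebalances work away from a component that has fallen behind.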
You might be wondering what the catch is with twice the performance, half the power, and four times the efficiency. According to the paper, “The limitations of SHMT lie not in the model itself, but in the ability of programmers to rethink the algorithm to allow for the type of parallelism that makes SHMT easier to exploit.” In other words, software has to be written specifically with SHMT in mind, and not all software will be able to take full advantage of it.
Rewriting software is a notorious pain. For example, when Apple switched its Macs from Intel processors to its own Arm-based chips, developers had to put in a lot of work. Multithreading in particular can take time to adjust to: it took years for software to take advantage of multi-core CPUs, and we may see a similar timeline before developers can routinely spread a single task across multiple types of components.
Additionally, the paper details how SHMT’s performance improvement depends on the problem size. The 1.95x figure comes from the largest problem size tested; as the problem size shrinks, so does the gain. At the smallest problem size there is essentially no performance advantage, because smaller problems give the components less opportunity to work in parallel.
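A rough back-of-the-envelope model (illustrative numbers only, not measurements from the paper) shows why: every dispatch across devices carries a fixed coordination cost, and that cost dominates small problems:

```python
# Illustrative only: fixed coordination overhead vs. problem size.
# With three devices sharing the work, speedup approaches 3x only when
# the compute time dwarfs the overhead.
OVERHEAD = 5.0   # assumed fixed cost (ms) to partition, dispatch, and merge
DEVICES = 3

for work_ms in (3, 30, 300, 3000):            # total single-device compute time
    parallel = work_ms / DEVICES + OVERHEAD   # ideal split plus coordination
    print(f"problem={work_ms:>4} ms  speedup={work_ms / parallel:.2f}x")
```

In this toy model, the 3 ms problem actually gets slower (0.50x) because the overhead swamps the work, while the 3,000 ms problem approaches the ideal 3x, which matches the pattern the paper reports across problem sizes.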
As computers of all types increasingly include multiple computing devices, such as AI processors, it is perhaps inevitable that developers will want to use more of that hardware to speed things up. Even if SHMT doesn’t live up to the best-case scenario outlined in the paper, it could still give PCs and smartphones a boost if SHMT or a similar technology goes mainstream.