> Genuine question. It appears you can run AI workloads on the GPU via NVidia's CUDA, on the CPU using AVX-512 instructions, and on dedicated processing units (an NPU inside an SoC, a Tensor accelerator, etc.). How can you compare the performance across all these implementations?

Something tells me there might be an extremely recent article on Ars that goes into this.
> What is the actual use case for an NPU at this point?

Image processing and object identification, for one. If you have a forward-facing camera in your car, you likely have an NPU in its main computer.
> What is the actual use case for an NPU at this point?

The intent is to offload tasks that would either make the system less responsive (if they ran on the CPU) or use more power (in the case of a discrete GPU). On Windows, that means things like Windows Studio Effects for cameras, transcription, translation, voice isolation, etc. Whether it actually ends up working the way AI enthusiasts think it will is anybody's guess, but the idea is to have dedicated hardware that doesn't use a ton of power but can do a lot of the math that is required for AI to work.
> The one thing more expensive than real estate on a microchip is unused real estate on a microchip. I fear this will make for higher prices to enable doing things that purchasers have no strong interest in doing. Start writing 2026's articles now: "After spending $HORRIBLE_NUMBER to add NPU capabilities to their product lines, only to be met with consumer indifference, $COMPANY faces a hostile takeover attempt from Worms 'N' Bucks Hedge Fund and Bait Shop."

How much real estate do you think these things take?
> Genuine question. It appears you can run AI workloads on the GPU via NVidia's CUDA, on the CPU using AVX-512 instructions, and on dedicated processing units (an NPU inside an SoC, a Tensor accelerator, etc.). How can you compare the performance across all these implementations?

It doesn't sound that hard to me. If you want to measure how fast each of these computes primes, you measure the number of primes computed in a given timebox. If you want to measure how much AI they can crunch, you measure the number of weighted matrix operations they can do in a timebox.
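To make the "weighted matrix operations in a timebox" idea concrete, here is a minimal sketch of that kind of measurement (CPU-only, using NumPy; the 1024x1024 size, fp32 dtype, and five-second timebox are arbitrary assumptions, and the CUDA or NPU paths would each need their own framework to run an equivalent test):

[code]
import time
import numpy as np

# Rough throughput measurement: count multiply-accumulate operations
# completed in a fixed timebox. Matrix size and dtype are arbitrary choices.
N = 1024
a = np.random.rand(N, N).astype(np.float32)
b = np.random.rand(N, N).astype(np.float32)

ops_per_matmul = 2 * N**3   # one multiply and one add per output element per k
timebox = 5.0               # seconds
done = 0
start = time.perf_counter()
while time.perf_counter() - start < timebox:
    a @ b
    done += 1
elapsed = time.perf_counter() - start

print(f"{done * ops_per_matmul / elapsed / 1e9:.1f} GFLOPS (fp32 matmul)")
[/code]

The catch, as others note further down, is that raw matmul throughput doesn't tell you how a specific quantized model will actually behave on a specific accelerator.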
> Genuine question. It appears you can run AI workloads on the GPU via NVidia's CUDA, on the CPU using AVX-512 instructions, and on dedicated processing units (an NPU inside an SoC, a Tensor accelerator, etc.). How can you compare the performance across all these implementations?

By looking at the resulting Geekbench scores, presumably.
> Video noise reduction, audio noise reduction, image background replacement (for video conferencing, etc), video enhancements (adding bokeh to webcam feed, etc), face tracking for video conferencing, image enhancement routines for photo editing software ("one-click photo adjustment")

Why do you need an NPU for this? There have always been dedicated image, audio and video processing chips embedded for accomplishing these enhancements.
> Why do you need an NPU for this? There have always been dedicated image, audio and video processing chips embedded for accomplishing these enhancements.

Using DNNs for these tasks provides much better results, and requires only one blob of dedicated silicon...
> Using DNNs for these tasks provides much better results, and requires only one blob of dedicated silicon...

I agree with requiring less silicon space. However, I am skeptical that it provides better results. A dedicated chip is always capable of outperforming a chip with a mishmash of functionalities.
> I agree with requiring less silicon space. However, I am skeptical that it provides better results. A dedicated chip is always capable of outperforming a chip with a mishmash of functionalities.
> It's analogous to an iPhone camera and a dedicated camera. The iPhone camera can be better than most cameras but will never beat the high-end dedicated cameras. It's the same for digital signal processors.
> Perhaps the image generation capabilities and smaller silicon footprint are the only true standout features of an NPU.

That's an intuitive but wrong understanding of the situation.
I’ve seen tests of various smartphones doing ML tasks and the results are all over the place. Is this something that can even be accurately measured? What about the models companies like Apple or Google run on their devices which may be optimized specifically for their NPU?
Or am I missing something?
> Why do you need an NPU for this? There have always been dedicated image, audio and video processing chips embedded for accomplishing these enhancements.

Because a generalized thing that can do multiple tasks well is better in a myriad of ways than lots of individual specialized things that can only do a single task well? If I can replace all those chips with a single component that does as well or better than any of them, I've saved cost, energy use, space...
> What is the actual use case for an NPU at this point?

Face recognition, object recognition, scene recognition, fingerprint recognition, text recognition, image enhancement, object selection, voice recognition, transcription, translation, song recognition, and various transformations.
> Genuine question. It appears you can run AI workloads on the GPU via NVidia's CUDA, on the CPU using AVX-512 instructions, and on dedicated processing units (an NPU inside an SoC, a Tensor accelerator, etc.). How can you compare the performance across all these implementations?

(a) Are you training or inferring?
> There are a TON of potential uses, but the software isn’t there yet. Photo editing, filtering, etc. Spell checking. Also, AI upscaling. Sound file cleanup, I could go on. Oh, even virus scanning.

Sounds like almost everything Apple already uses an NPU for today.
> Man, I love metrics that have little relevance to the real world!

Justify that statement.
> It doesn't sound that hard to me. If you want to measure how fast each of these computes primes, you measure the number of primes computed in a given timebox. If you want to measure how much AI they can crunch, you measure the number of weighted matrix operations they can do in a timebox.

That's an exceptionally dumb statement.
> I’ve seen tests of various smartphones doing ML tasks and the results are all over the place. Is this something that can even be accurately measured? What about the models companies like Apple or Google run on their devices which may be optimized specifically for their NPU?
> Or am I missing something?

You are correct that this (the bolded part) is a HUGE problem with all these benchmarks.
> I agree with requiring less silicon space. However, I am skeptical that it provides better results. A dedicated chip is always capable of outperforming a chip with a mishmash of functionalities.
> It's analogous to an iPhone camera and a dedicated camera. The iPhone camera can be better than most cameras but will never beat the high-end dedicated cameras. It's the same for digital signal processors.
> Perhaps the image generation capabilities and smaller silicon footprint are the only true standout features of an NPU.

That's a terrible example, because there's no real difference between a dedicated camera and an iPhone camera except size. You're making the analogy that a four-door car is better than a motorcycle at being a four-door car.
> Why do you need an NPU for this? There have always been dedicated image, audio and video processing chips embedded for accomplishing these enhancements.

Developing an ISP is harder than developing an ML model.
> This will be very interesting to play with. I use GPT4all on several devices, and it is remarkable to see the differences in TPS results. For instance, on an Intel Core i7-13700HX, which is a pretty powerful chip, with an RTX 3060 and 32 GB RAM, running Llama 3.1 8B on the GPU I get 5 TPS.
> When running on a MacBook Pro M2 with the same model, also with 32 GB RAM, I get 33 TPS, or over 6x the performance.
> The i7 benches slightly higher in most "normal" benchmarks, and I don't think the Mac is using the NPU.
> But when using Phi-3 Mini, I get 30 TPS on the i7 and about 60 TPS on the M2, for a 2x ratio.
> So the chip architectural differences are really quite remarkable for this type of workload.

For the particular case you describe, the model is probably running on the CPU or GPU - you can look at a tool like powermetrics or asitop on Apple Silicon to see what hardware is being used.
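If anyone wants to reproduce that kind of TPS comparison, a rough sketch with the gpt4all Python bindings could look like the following (the model filename is a placeholder for whatever .gguf file you have downloaded, and counting streamed chunks is only an approximation of the model's true token count):

[code]
import time
from gpt4all import GPT4All

# Placeholder model file; substitute whatever .gguf model you actually have.
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

prompt = "Explain what an NPU is in one paragraph."
tokens = 0
start = time.perf_counter()
# streaming=True yields generated text chunks one at a time; counting the
# chunks gives an approximate tokens-per-second figure.
for _ in model.generate(prompt, max_tokens=256, streaming=True):
    tokens += 1
elapsed = time.perf_counter() - start

print(f"{tokens} chunks in {elapsed:.1f} s = roughly {tokens / elapsed:.1f} TPS")
[/code]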
As an additional data point using Win11, ONNX and DirectML:

Radeon RX 7900 GRE: [benchmark result screenshots attached]

16" MacBook Pro M1 Max: [benchmark result screenshots attached]
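For anyone curious how the DirectML path gets selected in a setup like that, a minimal onnxruntime sketch is below (the "model.onnx" path and the dummy input are placeholders, and the onnxruntime-directml package provides DmlExecutionProvider on Windows):

[code]
import numpy as np
import onnxruntime as ort

# Placeholder model; any ONNX model with a single float32 input works the same way.
session = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
# get_providers() shows which of the requested providers the session actually enabled.
print("Active providers:", session.get_providers())

# Build a dummy input matching the model's declared shape
# (symbolic/dynamic dimensions are replaced with 1 here).
meta = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in meta.shape]
dummy = np.random.rand(*shape).astype(np.float32)

outputs = session.run(None, {meta.name: dummy})
print("Output shapes:", [o.shape for o in outputs])
[/code]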
> Radeon RX 7900 GRE: [benchmark result screenshots]
> 16" MacBook Pro M1 Max: [benchmark result screenshots]

Numbers without giving the OS are meaningless.
> What is the actual use case for an NPU at this point?

All of Apple's computational photography, including Face ID, is done on the NPU. It's why they added one six years ago, and it's why the iPhone has double the NPU compute of the corresponding M-series processor.
> Image processing and object identification, for one. If you have a forward-facing camera in your car, you likely have an NPU in its main computer.

Yes, I was asking about PCs. Embedded use cases are pretty straightforward, especially in environments like you note where milliseconds count. I am not as impressed by edge computing on a PC - the cloud is "pretty close" at that point, and I don't think the savings would be anything much to talk about. That is, content creators are probably going to want the all-up versions you get in the cloud anyway, aside from some fairly nice applications -- I think. If I'm wrong, I'd like to know what the applications really are at this point.
If you're asking what's the use case for an NPU in a desktop PC, besides video and audio processing, there's also running language and inference models locally, rather than piping everything up to the cloud.
> Yes, I was asking about PCs. Embedded use cases are pretty straightforward, especially in environments like you note where milliseconds count. I am not as impressed by edge computing on a PC - the cloud is "pretty close" at that point, and I don't think the savings would be anything much to talk about. That is, content creators are probably going to want the all-up versions you get in the cloud anyway, aside from some fairly nice applications -- I think. If I'm wrong, I'd like to know what the applications really are at this point.

It's a chicken-and-egg thing here. If NPUs don't exist, then SW using NPUs won't be written.