Arm Announces Neoverse V1, N2 Platforms & CPUs, CMN-700 Mesh: More Performance, More Cores, More Flexibility
by Andrei Frumusanu on April 27, 2021 9:00 AM EST- Posted in
- CPUs
- Arm
- Servers
- Infrastructure
- Neoverse N1
- Neoverse V1
- Neoverse N2
- CMN-700
The SVE Factor - More Than Just Vector Size
We’ve talked a lot about SVE (Scalable Vector Extensions) over the past few years, and the new Arm ISA feature has been most known as being employed for the first time in Fujitsu’s A64FX processor core, which now powers the world’s most performance supercomputer.
Traditionally, employing CPU microarchitectures with wider SIMD vector capabilities always came with the caveat that you needed to use a new instruction set to make use of these wider vectors. For example, in the x86 world, we’ve seen the move from 128b (SSE-SSE4.2 & AVX) to 256b (AVX & AVX2) to 512b (AVX512) vectors always be coupled with a need for software to be redesigned and recompiled to make use of newer wider execution capabilities.
SVE on the other hand is hardware vector execution unit width agnostic, meaning that from a software perspective, the programmer doesn’t actually know the length of the vector that the software will end up running at. On the hardware side, CPU designers can implement execution units in 128b increments from 128b to 2048b in width. As noted earlier, the Neoverse N2 uses this smaller implementation of 128b units, while the Neoverse V1 uses 256b implementations.
Generally speaking, the actual execution width of the vector isn’t as important as the total execution width of a microarchitecture, 2x256b isn’t necessarily faster than 4x128b, however it does play a larger role on the software side of things where the same binary and code path can now be deployed to very different target products, which is also very important for Arm and their mobile processor designs.
More important than the actual scalable nature of the vectors in SVE, is the new addition of helper instructions and features such as gather-loads, scatter-stores, per-lane predication, predicate-driven loop control (conditional execution depending on SIMD data), and many other features.
Where these things particularly come into play is for allowing compilers to generate better auto-vectorised code, meaning the compiler would now be capable of emitting SIMD instructions on SVE where previously it wasn’t possible with NEON – regardless of the vector length changes.
Arm here discloses that the performance advantages on auto-vectorizable code can be quite significant. In a 2x128b comparison between the N1 and the N2, we can see around 40th-percentile gains of at least 20% of performance, with some code reaching even much higher gains of up to +90%.
The V1 versus N1 increase being higher comes natural from the fact that the core has double the vector execution capabilities over the N1.
In general, both the N2, but particularly the V1, promise quite large increase in HPC workloads with vector heavy compute characteristics. It’ll definitely be interesting to see how these future designs play out and how SVE auto-vectorisation plays out in more general purpose workloads.
95 Comments
View All Comments
mode_13h - Wednesday, April 28, 2021 - link
Ah, yes! wikichip says of Zen 1:> Accordingly the peak throughput is four SSE/AVX-128 instructions
> or two AVX-256 instructions per cycle.
And Zen 2:
> This improvement doubles the peak throughput of AVX-256 instructions to four per cycle
Wow!
mode_13h - Tuesday, April 27, 2021 - link
What's SLC? I figured it was Second-Level Cache, until I saw the slide referencing "SLC -> L2 traffic"."System Level Cache", maybe? Could it be the term they use instead of L3 or LLC?
Thala - Tuesday, April 27, 2021 - link
I think you are totally right - SLC == LLC.Thala - Tuesday, April 27, 2021 - link
Quick addition. The term SLC is more popular lately, as it emphasize that the cache is not only shared among the cores but also with the system (GPU, DMAs etc).mode_13h - Wednesday, April 28, 2021 - link
Thanks. I guess I should've just waited until I'd finished reading it, because the interconnect slide made it abundantly clear.Now, I'm wondering about this "snoop filter" and why so much RAM is needed for it, when Graviton 2 & Altra have so little SLC. So, I gather it's not like tag RAM, then? Does it index the L2 of the adjacent cores, or something like that?
mode_13h - Tuesday, April 27, 2021 - link
Question and corrections on Page 6: PPA & ISO Performance ProjectionsWhat do the colors on the chip plots mean?
> Only losing out 10% IPC versus the N1
I'm sure that's meant to say "V1".
> In terms of absolute IPC improvements
Huh? These are definitely "relative IPC improvements" or just "IPC improvements".
Calin - Wednesday, April 28, 2021 - link
AWS share by vendor type: It should have been "Vendor A" and "Vendor I"mode_13h - Wednesday, April 28, 2021 - link
That slide was provided by ARM and I think they're trying to have at least the *appearance* of maintaining anonymity, even if the identities are abundantly clear.Also, you realize that their Vendor A is your Vendor I, right?
serendip - Wednesday, April 28, 2021 - link
How does the narrower front end and shallower pipeline of the N2 compare to Apple's M1? I'm thinking about how this could translate to the A78 successor, if that uses an evolution of the X1 core with improvements from N2 brought in.mode_13h - Thursday, April 29, 2021 - link
Good point. It suggests the A78+1 will perform < N2.Although, a derivative X-core would likely be > N2.