Comparing ARM vs RISC-V vs x86_64 with GCC vs Clang

1 year ago

How do GCC and Clang compare when generating ARM, RISC-V and x86_64 code?
Link to Compiler Explorer examples: compiler-explorer.com/z/PoEaT...
WANT MORE JASON?
► My Training Classes: emptycrate.com/training.html
► Follow me on twitter: / lefticus

COMMENTS: 43

@vytah 1 year ago

20:20 On some CPUs, POPCNT has a false dependency on the destination register, so the XOR lets the CPU schedule POPCNT earlier. GCC added this workaround in 2014; see issue 62011 on GCC Bugzilla. The comments there suggest that this might be fixed on newer CPUs, but I haven't found direct confirmation yet.
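For readers who want to reproduce this in Compiler Explorer, here is a minimal sketch. The function name `count_bits` is made up for illustration and is not from the video; `__builtin_popcount` is the GCC/Clang builtin. Whether the extra XOR appears depends on the compiler version and on the `-march`/`-mtune` target, so treat the comment about the generated code as an expectation rather than a guarantee.

```cpp
#include <cstdint>

// Counts the set bits in x using the GCC/Clang builtin.
// With POPCNT enabled on x86-64 (for example -O2 -mpopcnt), GCC typically
// emits a single popcnt, sometimes preceded by "xor eax, eax" to break the
// false dependency on the destination register described above.
int count_bits(std::uint32_t x) {
    return __builtin_popcount(x);
}
```

Compiling the same function with different `-mtune` values is one way to see whether a given compiler still applies the workaround.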

@ycombinator765 1 year ago

This was so fun to watch even tho I only understood 30% or so of it. This tells me how much I have yet to learn. Thanks for posting this

@johanmyreen1027 1 year ago

05:09 The RISC-V architecture defines 32 registers named x0 to x31. x0 is always read as 0, and writes to x0 are discarded. The a0 register is actually an alias for x10. Every register has an alias name, and the aliases reflect the usage of the registers in the standard calling convention ABI. For example, the return address register is x1 and has an ABI name ra, x2 is used as the stack pointer and has the ABI name sp, and so on. The "a" registers a0-a7 (x10-x17) are used for function arguments and return values (a0-a1).
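The x-register to ABI-name mapping described above can be written down as a small lookup table. This is only a sketch following the standard RISC-V calling convention; the names `kAbiNames` and `abi_name` are invented for this example.

```cpp
#include <array>
#include <string_view>

// ABI names for x0..x31 in the standard RISC-V calling convention.
// x8 is also known as fp (frame pointer); only s0 is listed here.
constexpr std::array<std::string_view, 32> kAbiNames = {
    "zero", "ra", "sp",  "gp",  "tp", "t0", "t1", "t2",
    "s0",   "s1", "a0",  "a1",  "a2", "a3", "a4", "a5",
    "a6",   "a7", "s2",  "s3",  "s4", "s5", "s6", "s7",
    "s8",   "s9", "s10", "s11", "t3", "t4", "t5", "t6",
};

// Returns the ABI name of register x<n>, or "?" if n is out of range.
constexpr std::string_view abi_name(unsigned n) {
    return n < kAbiNames.size() ? kAbiNames[n] : std::string_view{"?"};
}
```

For example, `abi_name(10)` yields `a0`, the first argument and return-value register seen in the video's output.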

@FlippieCoetser 5 months ago

More of this stuff, please! Awesome!

@theIpatix 1 year ago

24:35 I think this is because clang believes the target RISC-V doesn't have the hardware floating-point extensions. I'm fairly certain that, for other architectures, neither gcc nor clang usually inlines software emulation of missing hardware features. Probably because this allows embedded developers to use custom implementations for things like division or floating point.

@bobweiram6321 1 year ago

Was it "Computer Architecture: A Quantitative Approach" by John Hennessy and David Patterson?

@Arthur-qv8np 1 year ago

@21:35 RISC-V actually has a popcount instruction, it's located in extension B (Bitmanip). In order to use it you must add "b_zbb" to the "march" option.
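To illustrate what the compiler has to do when no popcount instruction is enabled, here is a sketch of the classic bit-twiddling fallback. The function name `popcount_fallback` is hypothetical, and the code a given compiler actually emits (an inlined sequence like this one or a call into its runtime library) may differ. On current toolchains the extension is usually enabled with something like `-march=rv64gc_zbb`, though the exact spelling depends on the compiler version.

```cpp
#include <cstdint>

// Software population count (SWAR): sums bits in parallel within the word.
int popcount_fallback(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                  // 2-bit partial sums
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  // 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  // 8-bit partial sums
    // The multiply accumulates all four byte sums into the top byte.
    return static_cast<int>((x * 0x01010101u) >> 24);
}
```

With the bit-manipulation extension enabled, a popcount written with the compiler builtin should collapse to a single instruction instead of a sequence of this kind.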

@coshvjicujmlqef6047 1 year ago

That is why Riscv is retarded. It is not even a real ISA. It is a bunch of disjunct ISAs which are not compatible with each other at all. What's the point of this shit when you can just make your own ISA to satisfy your demand? ukposts.info/have/v-deo/hYKkZ2x_ZGuBo4U.html

@coshvjicujmlqef6047 1 year ago

riscv is just objectively worse than loongarch

@Fill_In_The_Blank_Programmer 1 year ago

Off air I did try really hard to find that instruction, but was unable to. I also played with different architectures, but got a little lost in the options. Thanks for the info!

@coshvjicujmlqef6047 1 year ago

@@Fill_In_The_Blank_Programmer The problem is that libraries including glibc and libstdc++ are not compiled with the mb_zbb thing, which means it runs extremely slowly for the majority of tasks. It is just like -march=native for x86_64, which people rarely use.

@coshvjicujmlqef6047 1 year ago

@@Fill_In_The_Blank_Programmer BTW. It is just a proposal for bit extensions for RISCV. It is not mandatory.

@Malephex 1 year ago

Was it "Universal Assembly Language", Fitz and Crocket 1986?

@Fill_In_The_Blank_Programmer 1 year ago

I don't think so. I feel like it was more of a "computer architecture" title

@bobweiram6321 1 year ago

@@Fill_In_The_Blank_Programmer David Patterson

@pipony8939 1 year ago

@@Fill_In_The_Blank_Programmer David Patterson

@theexplosionist2019 5 months ago

popcnt had a false dependency on Skylake, fixed in Sunny Cove and onwards, hence xor eax,eax

@Quarky_ 1 year ago

I think it's RISC 5 since it's the fifth version of the original RISC ISA.

@Fill_In_The_Blank_Programmer 1 year ago

Yup, I figured that out a few minutes into the episode :D

@Omnifarious0 1 year ago

18:52 - I'm guessing that gcc's Risc-V code will be faster. There is a micro-parallelization opportunity here involving doing the add and shift at the same time in different ALU units. If you do things the way clang is doing them, you're forcing a false serialization because of the way a register is being re-used. I suspect (from my limited understanding of exactly how the hardware works) that a really good instruction pipeline handler would notice this and re-arrange things to remove the false data dependence. But, a less sophisticated one wouldn't.

@Arthur-qv8np 1 year ago

First, modern superscalar out-of-order processors do not check whether instructions can strictly be executed in parallel; they check whether instructions can be reordered, which is a stronger property (if they can be reordered, then they can be run in parallel). _Note: superscalar in-order processors, which have lower performance, do strictly check whether the instructions can be executed in parallel. But compilers here must target out-of-order processors, and my conclusion here is identical for in-order processors._
To determine whether instructions can be reordered, we must identify the dependencies between them. For basic arithmetic instructions like ADD/SHIFT there are only register dependencies. There are 4 types of register dependencies:
WAW: Write After Write, a register is written after another instruction has written this register;
WAR: Write After Read, a register is written after another instruction has read this register;
RAW: Read After Write, a register is read after another instruction has written this register;
RAR: Read After Read, a register is read after another instruction has read this register.
Among these dependencies there are false dependencies. First, it's obvious that the RAR dependency is a false dependency: instructions that read the same register can be reordered (and executed in parallel). Less obviously, WAR and WAW dependencies are also false dependencies: they could have been avoided by using another register for the last register write (the destination register). They are naming dependencies. To get rid of them we can rename the destination register and reorder the instructions. That's the job of register renaming in modern processors.
Example:
(1) addiw a1, a1, 42
(2) addiw a1, a2, 33
Instruction (2) writes register 'a1' after instruction (1) both reads (WAR dependency) and writes (WAW dependency) this register. This prevents reordering instruction (2) before instruction (1), so you must be careful.
By renaming the destination registers the problem disappears, and the result is semantically equivalent (as long as you know that p2 corresponds to the new a1 register):
(1) addiw p1, a1, 42
(2) addiw p2, a2, 33
Once renamed, it's safe to reorder them.
The last dependency is the RAW dependency. This is the only true register dependency, the one that prevents reordering and parallel execution: you can't reorder two instructions if the second instruction depends on the result of the first one.
@18:52 both GCC and Clang output instructions that have RAW dependencies, because the destination register of each instruction is always read by the following instruction. It is therefore impossible to reorder them. But we can rewrite the code to expose parallelism:
(1) slliw a0, a0, 3
(2) addiw a1, a1, 44
(3) addu a0, a0, a1
Here instructions (1) and (2) are independent, but instruction (3) depends on (1) and (2). However, there is a subtlety here that prevents the compiler from doing this rewrite/reordering.
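The dependency rules in this comment can be sketched as a small classifier. This is an illustrative model only: `Instr`, `Hazards`, and `classify` are invented names, instructions are reduced to one destination and up to two source registers, and special cases such as the always-zero x0 register are ignored.

```cpp
#include <string>

// Hypothetical three-operand instruction: one destination, up to two sources.
// Register names are plain strings; an empty string means "no register".
struct Instr {
    std::string dest;
    std::string src1;
    std::string src2;
};

// Which register dependencies the second instruction has on the first.
struct Hazards {
    bool raw;  // true dependency: second reads what first wrote
    bool war;  // naming dependency: second overwrites what first read
    bool waw;  // naming dependency: both write the same register
};

static bool reads(const Instr& i, const std::string& reg) {
    return !reg.empty() && (i.src1 == reg || i.src2 == reg);
}

// Classifies the dependencies between two instructions in program order.
Hazards classify(const Instr& first, const Instr& second) {
    Hazards h{};
    h.raw = reads(second, first.dest);
    h.war = reads(first, second.dest);
    h.waw = !first.dest.empty() && first.dest == second.dest;
    return h;
}
```

Applied to the two `addiw` instructions from the comment, `classify` reports WAR and WAW but no RAW; after renaming the destinations to `p1` and `p2` it reports no dependency at all, which is why the renamed pair can be reordered.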

@Omnifarious0 1 year ago

@@Arthur-qv8np - That was very educational. Thank you!

@NoX-512 1 year ago

@@Arthur-qv8np They are talking about macro-op fusion, not parallel execution.

@Arthur-qv8np 1 year ago

@@NoX-512 Not sure who you are referring to with "they". The message I'm responding to mentions "doing the add and shift at the same time in different ALU units". That's parallel execution. And macro-op fusion has nothing to do with executing things in parallel on different units. In fact it's quite the opposite: fused operations usually remain fused, except for load/store operations, but they do not split with respect to the original macro-ops. Macro-op fusion is more about doing several operations on the same unit (e.g. div/rem).

@Arthur-qv8np 1 year ago

@@NoX-512 Oh I think you are talking about the code example in the video at the timecode. Yes macro-op fusion could be relevant here. But I doubt this is a macro-op fusion pattern that exists in the current riscv processors

@dascandy 1 year ago

@5:30 you're returning a 64-bit integer, but the instruction for loading a 32-bit value into the 32-bit half also zeroes the top half, making it smaller to encode.

@Arthur-qv8np 1 year ago

You can check that int corresponds to 4 bytes (aka 32 bits) in this situation. Just open Compiler Explorer and write: int sizeof_int() { return sizeof(int); } All functions will return 4, so int is 4 bytes here.
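A slightly expanded version of that check, as a sketch: `sizeof_int` is the function from the comment (with an explicit cast added), while `bits_in_int` is an extra helper invented here. The value 4 is what the commenter reports for the targets in the video; the C++ standard itself does not guarantee it.

```cpp
#include <climits>

// Size of int in bytes on the target being compiled for.
int sizeof_int() { return static_cast<int>(sizeof(int)); }

// Size of int in bits (CHAR_BIT is the number of bits per byte).
int bits_in_int() { return static_cast<int>(sizeof(int) * CHAR_BIT); }
```

In Compiler Explorer, each function compiles down to loading a constant into the return register, so the size can be read directly from the assembly for every target.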

@waynes84 1 year ago

Still don't understand why ARM is efficient. It still has more instructions on average to handle, right?

@miroslavbrabec94 1 year ago

Try "ARM64 is much better ISA. x86 is old garbage. Deep dive ISA comparison and future outlook."

@piotrc966 1 year ago

"Still don't understand why ARM is efficient." Because the processors are built for mobile, not desktop. Efficiency does not depend on the ISA.

@piotrc966 1 year ago

@@miroslavbrabec94 "ARM64 is much better ISA. x86 is old garbage." Generally yes, but it doesn't matter. E.g. x86 CPUs are faster and more efficient than RISC processors like IBM POWER, and the POWER ISA is more modern than ARM.

@volodymyrdobrovolsky8610 1 year ago

Dear Mark, without this "magnificent eight", RISC-V will show very low performance and will not reach the technical level of Intel/AMD/ARM. My MHP RISC architecture is far ahead, for it does not need the "magnificent eight" at all.

@landspide 1 year ago

risc five

@AndrewRoberts11 5 months ago

For future videos, can you please turn off DARK MODE, as it makes viewing on a mobile screen almost impossible.

@volodymyrdobrovolsky8610 1 year ago

RISC-V is based on 40-year-old ideas, as the RISC-V Foundation itself claims. There is no sense in porting the huge x86 and ARM software ecosystems to it. Thus, RISC-V will never win out over x86 and ARM. Most of the positives claimed for RISC-V processors are arbitrary speculation. The advantage of RISC-V is its open architecture. RISC-V has instructions of variable length. This is bad; it is a departure from RISC architecture principles. Contemporary microprocessors contain 8 specific hardware components: (1) SMT (Simultaneous Multithreading), (2) register renaming, (3) instruction reordering, (4) out-of-order execution, (5) speculative execution, (6) superscalar execution, (7) delayed branch, (8) branch prediction. These components make up a kind of “magnificent eight” that substantially raises the performance of microprocessors. Unfortunately, they are very complex. A processor core that has these components is a full-fledged one; otherwise it is suitable only for simple applications, e.g. embedded systems. The “magnificent eight” is very hard to design. Only experienced firms and developers are able to do it; much know-how has been accumulated, and some effective solutions are patented. SMT is particularly complex. Only powerful and advanced firms like Intel, AMD and IBM are able to equip their processors with the “magnificent eight”. It is not surprising that some Intel processors and Apple's famous M1 processor do not have SMT. If a company were able to create a full-fledged RISC-V processor with all of the “magnificent eight” components, it would be a serious achievement, and such a RISC-V processor would be considered world class, comparable with x86 and ARM, but no more than that. As far as I understand, most of the RISC-V processors developed so far have none of the “magnificent eight” components and are intended for embedded systems.
Pursuing further development of RISC-V is the wrong way and leads computer architecture into a dead end. RISC-V has no prospects in the computer industry. The world demands an absolutely novel microprocessor with much higher performance than all contemporary ones. Novel and effective ideas on computer architecture do exist! Here is one such novel processor architecture: V. K. Dobrovolskyi. Microprocessor Based on the Minimal Hardware Principle. Electronic Modeling, 2019, vol 41, No 6. pp. 77-90. The article is posted (under the Cyrillic name добровольский.pdf) at www.emodel.org.ua/en/: select ARCHIVE, then go to 2019, then to VOL 41, NO 6(2019) pp. 77-90. This processor does not have the “magnificent eight”; it is not necessary at all. This comment reflects a different view of the RISC-V architecture, and the computer community has a right to become familiar with such a view. I’m Volodymyr Dobrovolskyi.

@Amadi605 1 year ago

I can only read the abstract. How can I read the entire document?

@markteague8889 1 year ago

It seems like the Chinese would be very eager to adopt something like the Risc-V ISA for their CPU designs going forward; and subsequently, abandon Intel and ARM to: 1) start from a clean slate, 2) avoid paying the exorbitant prices that Intel and ARM demand. In conjunction with this, they have no need to observe Western intellectual property rights laws with respect to any "magnificent eight" set of features that may or may not be incorporated into an actual silicon implementation of Risc-V.

@rosomak8244 8 days ago

@@markteague8889 They already did just that.