Leading in to Machine Code: Why?

I’m going to write a few posts about programming in machine language. It seems that many more people are interested in learning about the ARM processor, so that’s what I’ll be writing about. In particular, I’m going to be working with the Raspberry Pi running Raspbian linux. For those who aren’t familiar with it, the Pi is a super-inexpensive computer that’s very easy to program, and very easy to interface with the outside world. It’s a delightful little machine, and you can get one for around $50!

Anyway, before getting started, I wanted to talk about a few things. First of all, why learn machine language? And then, just what the heck is the ARM thing anyway?

Why learn machine code?

My answer might surprise you. Or, if you’ve been reading this blog for a while, it might not.

Let’s start with the wrong reason. Most of the time, people say that you should learn machine language for speed: programming at the machine code level gets you right down to the hardware, eliminating any layers of junk that would slow you down. For example, one of the books that I bought to learn ARM assembly (Raspberry Pi Assembly Language RASPBIAN Beginners: Hands On Guide) said:

“even the most efficient languages can be over 30 times slower than their machine code equivalent, and that’s on a good day!”

This is pure, utter rubbish. I have no idea where he came up with that 30x figure, but it’s got no relationship to reality. (It’s a decent book, if a bit elementary in approach; this silly statement isn’t representative of the book as a whole!)

In modern CPUs – and the ARM definitely does count as modern! – the fact is, for real world programs, writing code by hand in machine language will probably result in slower code!

If you’re talking about writing a single small routine, humans can be very good at that, and they often do beat compilers. But once you get beyond that and start looking at whole programs, any human advantage in machine language goes out the window. The constraints that actually affect performance have become incredibly complex – too complex for us to juggle effectively. We’ll look at some of them in more detail later, but let me explain one example now.

The CPU needs to fetch instructions from memory. But memory is dead slow compared to the CPU! In the best case, your CPU can execute a couple of instructions in the time it takes to fetch a single value from memory. This leads to an obvious problem: it can execute (or at least start executing) one instruction for each clock tick, but it takes several ticks to fetch an instruction!

To get around this, CPUs play a couple of tricks. Basically, they don’t fetch single instructions, but instead grab entire blocks of instructions; and they start retrieving instructions before they’re needed, so that by the time the CPU is ready to execute an instruction, it’s already been fetched.

So the instruction-fetching hardware is constantly looking ahead, and fetching instructions so that they’ll be ready when the CPU needs them. What happens when your code contains a conditional branch instruction?

The fetch hardware doesn’t know whether the branch will be taken or not. It can make an educated guess, through a process called branch prediction. But if it guesses wrong, the CPU is stalled until the correct instructions can be fetched! So you want to write your code so that the CPU’s branch prediction hardware is likely to guess correctly. Many of the tricks that humans use to hand-optimize code actually have the effect of confusing branch prediction: they shave off a couple of instructions, but in doing so they force the CPU to sit idle while it waits for instructions to be fetched. That misprediction penalty frequently outweighs the cycles they saved!
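If you’d like to see branch prediction at work for yourself, here’s a minimal sketch in C that you can try on the Pi; the file name, array size, and repetition count are just my own choices for illustration. It runs exactly the same loop twice: once over random data, where the branch is nearly impossible to predict, and once over the same data after sorting it, where the branch becomes almost perfectly predictable. (Depending on how the compiler handles the branch, for instance by turning it into ARM conditional instructions at higher optimization levels, the gap can shrink, but with a plain build the sorted pass should be noticeably faster.)

/* branch_demo.c: a rough sketch; names and sizes are my own, for illustration.
 * Build on the Pi with:   gcc branch_demo.c -o branch_demo
 * It times the same loop over random data and then over the same data sorted.
 * The branch "v[i] >= 128" is hard to predict in the first pass and almost
 * perfectly predictable in the second. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)      /* about a million elements */
#define REPS 100

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static long sum_big(const int *v, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] >= 128)          /* the conditional branch the predictor must guess */
            sum += v[i];
    }
    return sum;
}

static double time_passes(const int *v, long *out) {
    clock_t t0 = clock();
    long s = 0;
    for (int r = 0; r < REPS; r++)
        s += sum_big(v, N);
    *out = s;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    if (v == NULL)
        return 1;
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;          /* random: branch taken about half the time */

    long s1, s2;
    double unsorted = time_passes(v, &s1);
    qsort(v, N, sizeof *v, cmp_int);  /* same data, same work, predictable branch */
    double sorted = time_passes(v, &s2);

    printf("unsorted: %.2fs   sorted: %.2fs   (sums: %ld %ld)\n",
           unsorted, sorted, s1, s2);
    free(v);
    return 0;
}

The two passes execute exactly the same instructions on the same values; the only difference is how often the branch predictor guesses right.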

That’s one simple example. There are many more, and they’re much more complicated. And to write efficient code, you need to keep all of those in mind, and fully understand every tradeoff. That’s incredibly hard, and no matter how smart you are, you’ll probably blow it for large programs.

If not for efficiency, then why learn machine code? Because it’s how your computer really works! You might never actually use it, but it’s interesting and valuable to know what’s happening under the covers. Think of it like your car: most of us will never actually modify the engine, but it’s still good to understand how the engine and transmission work.

Your computer is an amazingly complex machine. It’s literally got billions of tiny little parts, all working together in an intricate dance to do what you tell it to. Learning machine code gives you an idea of just how it does that. When you’re programming in another language, understanding machine code lets you understand what your program is really doing under the covers. That’s a useful and fascinating thing to know!
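One easy way to start peeking under those covers on the Pi is to ask the compiler to show you what it generates. Here’s a minimal sketch, with a file and function name of my own choosing, that you can feed to gcc on Raspbian:

/* peek.c: a tiny function to inspect at the machine level.
 * "gcc -O2 -S peek.c" writes the generated ARM assembly to peek.s;
 * "gcc -O2 -c peek.c" followed by "objdump -d peek.o" shows the actual
 * machine code bytes alongside the assembly. */
int add_scaled(int a, int b) {
    /* On ARM this will likely become a single add with a shifted operand,
     * something like "add r0, r0, r1, lsl #2", followed by "bx lr" to return. */
    return a + 4 * b;
}

Reading a few lines of compiler output like that is a gentle first step toward the hand-written assembly we’ll be getting to in the upcoming posts.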

What is this ARM thing?

As I said, we’re going to look at machine language coding on the ARM processor. What is this ARM beast anyway?

It’s probably not the CPU in your laptop. Most desktop and laptop computers today are based on a direct descendant of the first microprocessor: the Intel 4004.

Yes, seriously: the Intel CPUs that drive most PCs are, really, direct descendants of the first CPU designed for desktop calculators! That’s not an insult to the Intel CPUs, but rather a testament to the value of a good design: they’ve just kept on growing and being enhanced. It’s hard to see the resemblance unless you follow the design path, where each step follows directly from its predecessor.

The Intel 4004, released in 1971, was a 4-bit processor designed for use in calculators. Nifty chip, state of the art in 1971, but not exactly what we’d call flexible by modern standards. Even by the standards of the day, Intel recognized its limits. So following on its success, they created an 8-bit version, which they called the 8008. Then they extended the instruction set, and called the result the 8080. The 8080, in turn, yielded successors in the 8086 and 8088 (and the Z80, from a rival chipmaker).

The 8086 was the processor chosen by IBM for its newfangled personal computers. Chip designers kept making it better, producing the 80286, 386, Pentium, and so on – up to today’s CPUs, like the Core i7 that drives my MacBook.

The ARM comes from a different design path. At the time that Intel was producing the 8008 and 8080, other companies were getting into the same game. From the PC perspective, the most important was the 6502, which was used by the original Apple, Commodore, and BBC microcomputers. The 6502 was, incidentally, the first CPU that I learned to program!

The ARM isn’t a descendant of the 6502, but it is a product of the 6502-based family of computers. In the early 1980s, the BBC decided to create an educational computer to promote computer literacy. They hired a company called Acorn to develop a computer for their program, and Acorn developed a beautiful little system that they called the BBC Micro.

The BBC Micro was a huge success. Acorn wanted to capitalize on that success and move from the educational market into the business market. But the 6502 was underpowered for what they wanted to do. So they decided to add a companion processor: they’d have a computer which could still run all of the BBC Micro programs, but which could do fancy graphics and fast computation with this other processor.

In a typical tech-industry NIH (Not Invented Here) moment, they decided that none of the other commercially available CPUs were good enough, so they set out to design their own. They were impressed by the work done by the Berkeley RISC (Reduced Instruction Set Computer) project, so they adopted the RISC principles and designed their own CPU, which they called the Acorn RISC Machine, or ARM.

The ARM design was absolutely gorgeous. It was simple but flexible and powerful, operating on very low power and generating very little heat. It had lots of registers and an extremely simple instruction set, which made it a pleasure to program. Acorn built a lovely computer around the ARM, with a great operating system called RiscOS, but it never really caught on. (If you’d like to try RiscOS, you can run it on your Raspberry Pi!)

But the ARM didn’t disappear. It never caught on in the desktop computing world, but it rapidly took over the world of embedded devices. Everything from your cellphone to your dishwasher to your iPad is running on an ARM CPU.

Just like the Intel family, the ARM has continued to evolve: the ARM family has gone through 8 major design changes, and dozens of smaller variations. It’s no longer an Acorn product – the design is maintained by a separate company, ARM Holdings (originally a joint venture between Acorn, Apple, and VLSI), and ARM chips are now produced under license by dozens of different manufacturers – Motorola, Apple, Samsung, and many others.

Recently, ARM has even started to expand beyond embedded platforms: some Chromebook laptops are ARM-based, and several companies are starting to market ARM-based server boxes for datacenters! I’m looking forward to the day when I can buy a nice high-powered ARM laptop.

21 thoughts on “Leading in to Machine Code: Why?”

  1. SWT

    Fascinating post … ages ago I learned (sort of) IBM Assembler and 8080 assembler and was still under the impression that coding in assembler was usually faster, or would be if it were reasonable to do so. Thanks for setting me straight on that.

    The down side, of course, is that I must now procure a Raspberry Pi …

    1. Simon Farnsworth

      It depends on the assumptions you make; roughly speaking, there are two axes: “how good is the compiler” and “how good is the programmer”.

      If you assume a perfect compiler, hand-written assembler will never improve runtimes – the compiler will do what you meant, not what you wrote, and get the best performance possible for your problem.

      If you assume a perfect programmer, compiler output will never improve runtimes – the programmer will always write code that gets the best performance possible for your problem.

      In practice, it depends on the quality of the compiler and the quality of the programmer; we’ve reached the point where the average compiler is better than all but the very best of programmers in terms of performance. Of course, most programmers think they’re in the very best category, even when they’re not, and very few programmers are aware of how complex compilers have become.

    2. UserGoogol

      My impression is that it was true “ages ago.” Back in the olden days, (however long ago that was) compilers weren’t as good and the hardware wasn’t as complicated, so that gave programmers more of an advantage than they have nowadays.

  2. Stuart

    Nitpick: it was the 8088, not the 8086, that was chosen by IBM for its PC. The 8086 had a 16 bit external data bus, whereas the 8088 had an 8 bit external data bus; this made for somewhat simpler designs, at the cost of some performance.

    The instruction set, however, was identical.

  3. medivh

    Any word on major OSes being ported to ARM? I doubt Microsoft would want the hassle of maintaining a forked code base for Windows after the whole Alpha business, but I think it would be easier for Apple, given the core of OSX is based off Linux. I’d imagine that the Raspberry Pi indicates what kind of changes need to be made to the Darwin core, and most of the rest of the OS should be cross-compile-able. Maybe.

    1. Keith Gaughan

      OSX isn’t based off of Linux: it contains large amounts of BSD-derived code (especially FreeBSD), and a number of GNU utilities in userland, but nothing from the Linux kernel.

    2. MarkCC Post author

      The linux kernel is already ported, and there are multiple linux distributions supporting ARM. In particular, there’s a great Ubuntu port, which is the basis for the Raspbian distro used by the Pi; and Fedora is starting to provide a supported ARM distro.

      For OSX, the kernel is already ported: iOS uses the same darwin kernel as MacOS, and iOS runs on ARM. iOS Jailbreaks run the full Berkeley stack on top of the iOS darwin/mach kernel on ARM.

  4. Tristan

    I believe Raspbian is Debian based – Ubuntu made the decision not to support the ARMv6 instruction set which the Raspberry Pi has.

    Linux has been supported on ARM for many years now (I had it running on Acorn hardware many years ago), and of course, Android is Linux based and usually runs on ARM based processors.

    Windows sort-of runs on ARM – Windows RT, which is a cut-down Windows 8 for tablets, only runs on ARM, as does Windows Phone (quite what relationship this bears to desktop Windows I don’t know). I don’t see either gaining much traction though.

  5. rdb

    http://www.raspberrypi.org/archives/5282
    The Wolfram Language and Mathematica on Raspberry Pi, for free
    21 November 2013

    Today, at the CBM education summit in New York, we announced a partnership with Wolfram Research to bundle a free copy of Mathematica and the Wolfram Language into future Raspbian images.
    (Raspbian is the Debian linux distribution for Raspberry Pi)

  6. Simon Farnsworth

    Just as a historic note: the reason Acorn decided that none of the existing CPUs were good enough for their needs is that the BBC Micro depended heavily on fast response to interrupts for its interactive “feel”; the 6502’s worst case interrupt latency was 20 clock cycles (7 cycles to complete the current operation [1], 7 to call your IRQ vector, 6 for the RTI [2]), plus the code you needed to execute to handle the interrupt.

    In contrast, something like the 68000 took hundreds of clock cycles to enter the interrupt handler in the worst case [3]. Given that a 6502 would be clocked at 2MHz, you’d need the 68000s available when the Archimedes launched to run at 20MHz to get the same interrupt latency.

    There were a few commercial RISC CPUs around at this point (notably the MIPS R2000), but they were targeted at “workstation” uses, and were expensive (more expensive than Acorn felt they could afford – one CPU plus chipset would cost more than the low-end A305).

    And as a final note, ARM was never designed to be low power, that was a happy accident. Acorn got into the CPU design game when they went to visit WDC to find out what the plans were for successors to the 65816 (which they’d used before), and discovered that the CPU design team was not the hundreds of engineers they’d believed, but a group of under 10. They had as many electronic engineers as WDC did, so decided to give CPU design a go; the result was the ARM, which has turned out more successful than they could have imagined.

    [1] http://www.masswerk.at/6502/6502_instruction_set.html
    [2] http://www.6502.org/tutorials/interrupts.html
    [3] http://patpend.net/technical/68000/68000faq.txt

  7. _Arthur

    A tidbit: Apple was one of the founders of ARM, with ACORN and VLSI Technologies.
    At the time, Apple direly needed a chip to power its upcoming Newton tablet; each Newton contained two ARM chips.

    After the Newton flop, Apple divested itself of most of its 43% ARM stake in the early 2000s. Apple’s initial $1.5M stake was by then worth $1.1 billion.

  8. Wyrd Smythe

    I can attest to the value of learning machine code. I was a software designer for over thirty years, and I found an almost 100% correlation between the best programmers and those who’d learned machine code at some point.

  9. MrFancyPants

    A minor nitpick: “machine code” is 1’s and 0’s, while their symbolic representation is “assembly code”. Programming directly in machine code would be tedious beyond imagination. That triviality aside, I agree wholeheartedly with gaining a deeper understanding of how processors work by coding at a lower level, and I laughed at the quote of code produced by higher level languages as being “30x slower”. Optimizing compilers are marvels of efficiency. One very interesting thing to do is to write a simple program in assembly (say, “hello world”), and then write it in C and disassemble the compiler’s product to compare it to the handmade version. This is especially interesting with slightly more complex code that optimizing compilers are very good at getting right, such as loops, tail recursion, etc.

    1. Bob Munck

      “Programming directly in machine code would be tedious beyond imagination.”

      That’s the way Woz wrote the Integer Basic interpreter for the Apple I. (Well, he wrote it on paper in assembler, then hand-assembled it into bits.) Note that there is reason to believe that he is actually a Space Alien in disguise. (I had lunch with Woz and Jobs in the Homebrew Computer Club days; at the time, I thought it more likely that Jobs was the Martian.) Read about the 100-transistor calculator that Woz designed and built at the age of 13 (human years).

      We wrote microcode for the Interdata Mod III in bits and entered it into the machine by scraping dots of printed circuit material off of a board containing 128×128 dots. This was indeed tedious. We wrote a program (in PL/I) that determined how to arrange code, variables, and constants so as to minimize the amount of scraping needed.

      Years later I wrote the Real-Time Executive for the Navy AN/UYK-20 in Ada and, because there were no Ada compilers yet, hand-compiled it into assembler. That code is still running in a lot of ships, planes, and subs.

  10. Bob Munck

    Random points:

    1. The IBM FORTRAN H Optimizing compiler for the Sys/360 was widely believed to produce better code than would be likely to be created by hand programming — in 1965.

    2. In the 60’s, we taught the introductory programming course at Brown using a home-brew assembler and interpreter for a simplified version of the S/360 instruction set. My mentor Andy van Dam and I presented a paper on it at the 1967 ACM National Conference, which turned into a panel discussion on teaching assembler coding to novices. Taking a position against it were Alan Perlis, Bernie Galler, and Kelly Gotlieb.

    The discussion was inconclusive. I was distracted by a gorgeous girl sitting about ten rows back. 47 years later, she’s sitting across the room from me playing Farmtown on her Thinkpad. Coincidentally, Gotlieb was her thesis advisor at Toronto.

  11. Colin Howell

    I know this is an old posting, but I’d like to set you straight on a commonly repeated misconception here:

    “Most desktop and laptop computers today are based on a direct descendant of the first microprocessor: the Intel 4004.”

    It turns out this isn’t true at all.

    There’s no question that the processors in most PCs are directly descended from the 8086 and 8088. Before then matters are somewhat sketchy. The 8086 and 8088 were designed so that 8-bit 8080 and 8085 assembly code could be automatically translated into (suboptimal) 8086 assembly code, but that’s not the same as direct descent. If you compare the instruction set architectures, they have relatively little in common. The 8086 has a large number of features completely absent from the 8080, such as string instructions, integer multiply and divide, and multi-bit shifts and rotates. It also has a set of general registers which can be used flexibly for many operations (though still with a lot of restrictions); in the 8080, the single accumulator must be used for most operations. Memory addressing between the two architectures is also quite different. So the 8086’s 8080-heritage is quite limited.

    But if you accept that the 8086 family does have some of the 8080 in its background, this leads to the question of what the 8080’s ancestry is. It was a direct improvement of the 8-bit 8008, Intel’s second microprocessor. However, the 8008 was NOT derived from Intel’s first microprocessor, the 4-bit 4004. The two processors were developed completely independently for separate customers, and it turns out they are as different as night and day. (Seriously, just compare the assembly language manuals for the two chips if you have any doubts.)

    The 8008 was developed to control an intelligent terminal, the Datapoint 2200 built by Computer Terminal Corporation (CTC). CTC dictated the processor’s architecture to Intel, who set about developing a chip that would implement it. In the end, Intel could not finish the 8008 fast enough for CTC, who ended up using their own implementation built from discrete TTL chips. Intel was left to market the 8008 on its own.

    All this happened concurrently with the 4004’s development, though the 4004 started earlier and was completed sooner. The 4004’s architecture was determined by the Japanese calculator maker Busicom and the needs of its calculator designs. To modern eyes, the 4004 appears to be a strange beast indeed, with separate instruction and data memories and a very peculiar memory addressing scheme. To my knowledge, the 4004 has only had one descendant, Intel’s improved 4-bit 4040 microprocessor.

