
Hello World in ARM Assembly Language

Since K&R’s book on C, it’s become traditional to start any tutorial on a new language by presenting a program that prints “Hello world”. For ARM assembly running on Raspbian Linux, that traditional program looks like:

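.global _start
_start:
  MOV R7, #4        @ system call number 4: write ("print")
  MOV R0, #1        @ file descriptor 1: standard output
  MOV R2, #12       @ number of bytes to print
  LDR R1, =string   @ address of the first byte of the string
  SWI 0             @ hand control to the operating system
  MOV R7, #1        @ system call number 1: exit
  SWI 0
  .data
string:
  .ascii "Hello World\n"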

It’s definitely a bit more cryptic than most languages, but it doesn’t look all that bad, now does it? Before I can explain how that works, we’ll need to talk a bit about what we’re programming, and how we can program it. We’re going to go through a bunch of introductory material here; everything that we touch on, I’ll come back to in more detail in a later post.


[Diagram: the important parts of an ARM CPU]
In the diagram to the right, you can see my attempt to illustrate the important parts of an ARM CPU from our initial perspective. As we learn more about it, we’ll gradually refine this picture, adding more details – but for now, this is what things look like.

For now, we’ll say that the CPU has 5 main parts:

  1. A collection of 16 registers. A register is a memory cell that’s built in to the CPU. On an ARM processor, any time that you want to do any kind of operation – arithmetic, logic, comparison, you name it! – you’ll need to have the values in a register. The first thirteen registers are available to you, to use for whatever you want. The last three are special; R13 is called the stack pointer (SP), R14 is called the link register (LR), and R15 is called the program counter (PC). We’ll talk about what those three mean as we learn to program.
  2. An arithmetic/logic unit (ALU). This is where the CPU does integer arithmetic and logic. Most of our programs will work exclusively with the ALU. (Floating point is important, but it’s possible to do an awful lot of programming without it.)
  3. A floating point unit (FPU). This is where the CPU does floating point arithmetic.
  4. A status register. This is, like the other registers, a chunk of internal storage. But you can’t manipulate it or access it directly. It’s automatically updated by the ALU/FPU. Individual bits of the status register get updated to reflect various conditions about the current status of the CPU, and the results of the previous instruction. For example, the way that you can compare two values in the ARM is to subtract one from the other. If the two values were equal, then the ZERO flag in the status register will be set to 1; otherwise it will be set to 0. There’s a branch instruction that only actually branches if the ZERO flag is set.
  5. A data channel, called the bus. The bus connects the CPU to the rest of the computer. Memory, other storage devices, and input and output devices are all connected to the CPU via the bus. Doing anything that involves communicating through the bus is slow compared to doing anything that doesn’t. For now, we’ll say that memory is the only thing on the bus.
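To make the status register a little more concrete, here is a minimal sketch of comparing two values and branching on the ZERO flag. It assumes the same Raspbian Linux setup as the hello world program; the labels equal and done are made up for illustration, and the result is reported through the exit status, which Linux's exit call takes in R0.

```asm
.global _start
_start:
  MOV R0, #7        @ first value
  MOV R1, #7        @ second value
  CMP R0, R1        @ computes R0 - R1, throws the result away, updates the flags
  BEQ equal         @ branches only if the ZERO flag is set (the values were equal)
  MOV R0, #1        @ not equal: exit status 1
  B done
equal:
  MOV R0, #0        @ equal: exit status 0
done:
  MOV R7, #1        @ system call number 1: exit, with the status in R0
  SWI 0
```

Changing one of the two values should send the program down the other path, because the subtraction no longer produces zero and the ZERO flag stays clear.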

Now that we have a bit of a clue about the basic pieces of this thing we’re going to program, we can almost start looking at our hello world program. But first, we need to talk about one other bit of background.

For a computer, on the lowest level, a “program” is just a chunk of numbers. It’s not even a chunk of instructions – it’s just numbers. The numbers can be instructions, data, or both at the same time! That last bit might sound strange, but you’ll see instructions like MOV R0, #4. That’s saying load the literal value 4 into register R0. The 4 is a value encoded as a part of an instruction. So that 4 is both literal data sitting in the middle of a collection of instructions, and it’s also a part of an instruction. The actual instruction doesn’t really say “load the value 4”; it says “load the data value that’s at this position in the instruction sequence”.
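As a small illustration of "a program is just numbers", the sketch below writes one instruction as a raw number instead of as a mnemonic. It assumes the assembler is producing classic 32-bit ARM instructions (not Thumb) on a little-endian system, which is the default on Raspbian; in that encoding, 0xE3A00004 is the number for MOV R0, #4, and its lowest eight bits are the literal 4.

```asm
.global _start
_start:
  .word 0xE3A00004  @ the same 32 bits the assembler emits for MOV R0, #4
  MOV R7, #1        @ system call number 1: exit, with the status in R0
  SWI 0
```

Because the CPU executes that word exactly as it would the MOV, the program should exit with status 4.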

We’re not going to program the ARM using the numeric instructions directly. We’re going to program the ARM using assembly language. Assembly language is a way of writing that chunk of numbers that is your program, but doing it with a syntax that’s easy for a human being to read. Then a program called an assembler will translate from that readable format into the raw numeric format used by the computer. Conceptually, the assembler sounds a lot like the compiler that you’d use with a higher level language. But it’s quite different: compilers take your code, and change it. Frequently, if you look at the code that your compiler generates, you’d have a hard time recognizing it as code generated for a program that you wrote! But an assembler doesn’t change anything. There’s no restructuring, no optimization, no changes at all. In an assembly language program, you’re describing how to lay out a bunch of instructions and data in memory, and the assembler does nothing but generate that exact memory layout.


Ok. That said, finally, we can get to the program!

Programming in assembly is quite different from programming in any reasonable programming language. There are no abstractions to make your life easier. You need to be painfully explicit about everything. It really brings home just how many abstractions you generally use in your code.

For example, in assembly language, you don’t really have variables. You can store values anywhere you want in the computer’s memory, but you have to decide where to put them, and how to lay them out, by yourself. But as I said before – all of the arithmetic and logic that makes up a program has to be done on values in registers. So a value in memory is only good if you can move it from memory into a register. It’s almost like programming in a language with a total of 16 variables – except that you’re only really allowed to use 13 of them!
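Here is a hedged sketch of what "a variable" ends up looking like in practice: a labelled word in memory that has to be loaded into a register, changed there, and stored back. The label counter and the value 41 are invented for the example.

```asm
.global _start
_start:
  LDR R1, =counter  @ R1 = address of our hand-made "variable"
  LDR R0, [R1]      @ copy its value from memory into a register
  ADD R0, R0, #1    @ arithmetic only works on registers
  STR R0, [R1]      @ copy the result back out to memory
  MOV R7, #1        @ system call number 1: exit, with the status in R0
  SWI 0
  .data
counter:
  .word 41          @ we chose where this lives and how big it is
```

The program exits with status 42, the incremented value that is still sitting in R0.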

Not only do you not have variables, but you don’t really have parameters. In a high level programming language, you can just pass things to subroutines. You don’t need to worry about how. Maybe they’re going onto a stack; maybe they’re doing some kind of fancy lambda calculus renaming thing; maybe there are some magic variables. You don’t need to know or care. But in assembly, there is no built-in notion of parameter-passing. You need to use the computer’s registers and memory to build a parameter passing system. In the simplest form of that, which is what we’re using here, you designate certain registers as carrying certain parameters. There’s nothing in assembly to enforce that: if your program puts something into register R3, and a function was expecting it to be in R4, you won’t get any kind of error.
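A minimal sketch of register-based parameter passing, under a convention made up for this example: the caller puts the two parameters in R0 and R1, and the subroutine (hypothetically named add_two) leaves its result in R0. Nothing checks that both sides agree.

```asm
.global _start
_start:
  MOV R0, #2        @ first parameter, by our own convention
  MOV R1, #3        @ second parameter
  BL add_two        @ branch to add_two, saving the return address in LR
  MOV R7, #1        @ system call number 1: exit; the result is still in R0
  SWI 0

add_two:
  ADD R0, R0, R1    @ result goes back in R0, again only by convention
  BX LR             @ return to the instruction after the BL
```

The program exits with status 5. If the caller had put the second parameter in R2 instead, it would still assemble and run without complaint; add_two would simply add whatever happened to be in R1.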

In our “Hello world” program above, the first three instructions are loading specific values into registers expected by the operating system “print” function. For example, MOV R7, #4 means move the specific number 4 into register R7.

Loading literal values into registers is done using the MOV instruction. It’s got two operands: the register to move the data into, and the source of the data. The source of the data can be either a literal value, or another register. If you want to load data from memory, you need to use a different instruction – LDR.
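The three kinds of source can be put side by side in a short sketch. The label two and the values used are invented for illustration.

```asm
.global _start
_start:
  MOV R1, #40       @ literal source: R1 = 40
  MOV R0, R1        @ register source: R0 = R1
  LDR R2, =two      @ R2 = address of the word labelled "two"
  LDR R3, [R2]      @ memory source needs LDR: R3 = the word stored at that address
  ADD R0, R0, R3    @ R0 = 40 + 2
  MOV R7, #1        @ system call number 1: exit, with the status in R0
  SWI 0
  .data
two:
  .word 2
```

The program exits with status 42.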

With the LDR instruction, we can see one of the conveniences of using assembly language. We want to print the string “Hello world”. So we need to have that string in memory somewhere. The assembler lets us do that using a .ascii directive. The directive isn’t an ARM instruction; it’s an instruction to the assembler telling it “I want to put this string data into a block in memory”. The .ascii directive is prefaced with a label, which allows us to refer to the beginning of the memory block populated by the directive. Now we can use “string” to refer to the memory block. So the instruction LDR R1, =string loads into R1 the address of the memory location where the first byte of the string is stored.

These four instructions have been preparation for calling a function provided by the operating system. R7 is used by the operating system to figure out what function we want to call: 4 means “print”. R0, R1, and R2 are being used to pass parameters to the function. The print function expects R0 to say where to print (1 means standard output), R1 to contain the memory location of the first byte in the string we want to print, and R2 to contain the number of characters in the string.

We call the function using SWI 0. SWI is the software interrupt instruction. We can’t call the operating system directly! One of the purposes of the operating system is to provide a safe environment, where different programs can’t accidentally interfere with one another. If you could just branch into an OS function directly, any program would be able to do anything it wanted! But we don’t allow that, so the program can’t directly call anything in the OS. Instead, what it does is send a special kind of signal called an interrupt. Before it runs our program, the operating system has already told the CPU that any time it gets an interrupt, control should be handed to the OS. So the operating system gets called by the interrupt. It sees the 4 in R7, and recognizes that the interrupt is a request to run the “print” function, so it does that. Then it returns from the interrupt – and execution continues at the first instruction after the SWI call.

Now it’s returned from the print, and we don’t want to do anything else. If we didn’t put something here to tell the operating system that we’re done, the CPU would just proceed to the next memory address after our SWI, and interpret that as an instruction! We need to specifically say “We’re done”, so that the operating system takes control away from our program. The way we do that is with another SWI call. This SWI is the operating system “exit” call. To exit a program and kill the process, you call SWI with R7=1. The exit call takes the program’s exit status in R0; our program doesn’t set R0 again, so the status is whatever the print call left there.

And that’s it. That’s hello-world in assembly.
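For anyone who wants to try it, here is a sketch of the same program with one optional addition: it sets R0 to 0 before the exit call so that the exit status is explicit. The file name hello.s and the build commands in the comments are assumptions, based on the GNU assembler and linker being installed, as they normally are on Raspbian.

```asm
@ hello.s (hypothetical file name)
@ Build and run, assuming the GNU assembler and linker are available:
@   as -o hello.o hello.s
@   ld -o hello hello.o
@   ./hello
.global _start
_start:
  MOV R7, #4        @ system call number 4: write ("print")
  MOV R0, #1        @ file descriptor 1: standard output
  LDR R1, =string   @ address of the first byte to print
  MOV R2, #12       @ number of bytes: "Hello World" plus the newline
  SWI 0
  MOV R0, #0        @ exit status 0
  MOV R7, #1        @ system call number 1: exit
  SWI 0
  .data
string:
  .ascii "Hello World\n"
```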

Leading in to Machine Code: Why?

I’m going to write a few posts about programming in machine language. It seems that many more people are interested in learning about the ARM processor, so that’s what I’ll be writing about. In particular, I’m going to be working with the Raspberry Pi running Raspbian Linux. For those who aren’t familiar with it, the Pi is a super-inexpensive computer that’s very easy to program, and very easy to interface with the outside world. It’s a delightful little machine, and you can get one for around $50!

Anyway, before getting started, I wanted to talk about a few things. First of all, why learn machine language? And then, just what the heck is the ARM thing anyway?

Why learn machine code?

My answer might surprise you. Or, if you’ve been reading this blog for a while, it might not.

Let’s start with the wrong reason. Most of the time, people say that you should learn machine language for speed: programming at the machine code level gets you right down to the hardware, eliminating any layers of junk that would slow you down. For example, one of the books that I bought to learn ARM assembly (Raspberry Pi Assembly Language RASPBIAN Beginners: Hands On Guide) said:

even the most efficient languages can be over 30 times slower than their machine code equivalent, and that’s on a good day!

This is pure, utter rubbish. I have no idea where he came up with that 30x figure, but it’s got no relationship to reality. (It’s a decent book, if a bit elementary in approach; this silly statement isn’t representative of the book as a whole!)

In modern CPUs – and the ARM definitely does count as modern! – the fact is, for real world programs, writing code by hand in machine language will probably result in slower code!

If you’re talking about writing a single small routine, humans can be very good at that, and they often do beat compilers. But once you get beyond that, and start looking at whole programs, any human advantage in machine language goes out the window. The constraints that actually affect performance have become incredibly complex – too complex for us to juggle effectively. We’ll look at some of these in more detail, but I’ll explain one example.

The CPU needs to fetch instructions from memory. But memory is dead slow compared to the CPU! In the best case, your CPU can execute a couple of instructions in the time it takes to fetch a single value from memory. This leads to an obvious problem: it can execute (or at least start executing) one instruction for each clock tick, but it takes several ticks to fetch an instruction!

To get around this, CPUs play a couple of tricks. Basically, they don’t fetch single instructions, but instead grab entire blocks of instructions; and they start retrieving instructions before they’re needed, so that by the time the CPU is ready to execute an instruction, it’s already been fetched.

So the instruction-fetching hardware is constantly looking ahead, and fetching instructions so that they’ll be ready when the CPU needs them. What happens when your code contains a conditional branch instruction?

The fetch hardware doesn’t know whether the branch will be taken or not. It can make an educated guess by a process called branch prediction. But if it guesses wrong, then the CPU is stalled until the correct instructions can be fetched! So you want to make sure that your code is written so that the CPU’s branch prediction hardware is more likely to guess correctly. Many of the tricks that humans use to hand-optimize code actually have the effect of confusing branch prediction! They shave off a couple of instructions, but by doing so, they also force the CPU to sit idle while it waits for instructions to be fetched. That branch prediction failure penalty frequently outweighs the cycles that they saved!
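To show what the fetch hardware is up against, here is a minimal sketch of a conditional branch in ARM assembly: a loop that counts down from 5. The label loop and the count are made up for the example. Every time execution reaches the BNE, the hardware has to guess whether the next instruction to fetch is the one at loop or the one after the branch.

```asm
.global _start
_start:
  MOV R0, #5        @ loop counter
loop:
  SUBS R0, R0, #1   @ subtract 1 and update the status flags
  BNE loop          @ conditional branch: taken while the result is not zero
  MOV R7, #1        @ system call number 1: exit; R0 is 0 here, so the status is 0
  SWI 0
```

In this loop the branch is taken four times and falls through once, so a guess of "taken" would be right most of the time; the hard cases are branches whose outcome depends on data that doesn't follow a simple pattern.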

That’s one simple example. There are many more, and they’re much more complicated. And to write efficient code, you need to keep all of those in mind, and fully understand every tradeoff. That’s incredibly hard, and no matter how smart you are, you’ll probably blow it for large programs.

If not for efficiency, then why learn machine code? Because it’s how your computer really works! You might never actually use it, but it’s interesting and valuable to know what’s happening under the covers. Think of it like your car: most of us will never actually modify the engine, but it’s still good to understand how the engine and transmission work.

Your computer is an amazingly complex machine. It’s literally got billions of tiny little parts, all working together in an intricate dance to do what you tell it to. Learning machine code gives you an idea of just how it does that. When you’re programming in another language, understanding machine code lets you understand what your program is really doing under the covers. That’s a useful and fascinating thing to know!

What is this ARM thing?

As I said, we’re going to look at machine language coding on the ARM processor. What is this ARM beast anyway?

It’s probably not the CPU in your laptop. Most desktop and laptop computers today are based on a direct descendant of the first microprocessor: the Intel 4004.

Yes, seriously: the Intel CPUs that drive most PCs are, really, direct descendants of the first CPU designed for desktop calculators! That’s not an insult to the Intel CPUs, but rather a testament to the value of a good design: they’ve just kept on growing and enhancing. It’s hard to see the resemblance unless you follow the design path, where each step follows directly on its predecessors.

The Intel 4004, released in 1971, was a 4-bit processor designed for use in calculators. Nifty chip, state of the art in 1971, but not exactly what we’d call flexible by modern standards. Even by the standards of the day, they recognized its limits. So following on its success, they created an 8-bit version, which they called the 8008. And then they extended the instruction set, and called the result the 8080. The 8080, in turn, yielded successors in the 8088 and 8086 (and the Z80, from a rival chipmaker).

The 8086 was the processor chosen by IBM for its newfangled personal computers. Chip designers kept making it better, producing the 80286, 386, Pentium, and so on – up to today’s CPUs, like the Core i7 that drives my MacBook.

The ARM comes from a different design path. At the time that Intel was producing the 8008 and 8080, other companies were getting into the same game. From the PC perspective, the most important was the 6502, which was used by the original Apple, Commodore, and BBC microcomputers. The 6502 was, incidentally, the first CPU that I learned to program!

The ARM isn’t a descendant of the 6502, but it is a product of the 6502-based family of computers. In the early 1980s, the BBC decided to create an educational computer to promote computer literacy. They hired a company called Acorn to develop a computer for their program. Acorn developed a beautiful little system that they called the BBC Micro.

The BBC Micro was a huge success. Acorn wanted to capitalize on its success, and try to move it from the educational market to the business market. But the 6502 was underpowered for what they wanted to do. So they decided to add a companion processor: they’d have a computer which could still run all of the BBC Micro programs, but which could do fancy graphics and fast computation with this other processor.

In a typical tech-industry NIH (Not Invented Here) moment, they decided that none of the other commercially available CPUs were good enough, so they set out to design their own. They were impressed by the work done by the Berkeley RISC (Reduced Instruction Set Computer) project, and so they adopted the RISC principles, and designed their own CPU, which they called the Acorn RISC Machine, or ARM.

The ARM design was absolutely gorgeous. It was simple but flexible and powerful, able to operate on very low power and generating very little heat. It had lots of registers and an extremely simple instruction set, which made it a pleasure to program. Acorn built a lovely computer with a great operating system called RiscOS around the ARM, but it never really caught on. (If you’d like to try RiscOS, you can run it on your Raspberry Pi!)

But the ARM didn’t disappear. It didn’t catch on in the desktop computing world, but it rapidly took over the world of embedded devices. Everything from your cellphone to your dishwasher to your iPad is running on an ARM CPU.

Just like the Intel family, the ARM has continued to evolve: the ARM family has gone through 8 major design changes, and dozens of smaller variations. They’re no longer just produced by Acorn – the ARM design is maintained by a consortium, and ARM chips are now produced by dozens of different manufacturers – Motorola, Apple, Samsung, and many others.

Recently, they’ve even started to expand beyond embedded platforms: the Chromebook laptops are ARM based, and several companies are starting to market server boxes for datacenters that are ARM based! I’m looking forward to the day when I can buy a nice high-powered ARM laptop.