sources/tech/20201028 Program in Arm6 assembly language on a Raspberry Pi.md
39 KiB
Program in Arm6 assembly language on a Raspberry Pi
Assembly language offers special insights into how machines work and how they can be programmed.
The Arm website touts the processor's underlying architecture as "the keystone of the world's largest compute ecosystem," which is plausible given the number of handheld and embedded devices with Arm processors. Arm processors are prevalent in the Internet of Things (IoT), but they are also used in desktop machines, servers, and even high-performance computers, such as the Fugaku HPC. But why look at Arm machines through the lens of assembly language?
Assembly language is the symbolic language immediately above machine code and thereby offers special insights into how machines work and how they can be programmed efficiently. In this article, I hope to illustrate this point with the Arm6 architecture using a Raspberry Pi 4 mini-desktop machine running Debian.
The Arm6 family of processors supports two instruction sets:
- The Arm set, with 32-bit instructions throughout.
- The Thumb set, with a mix of 16-bit and 32-bit instructions.
The examples in this article use the Arm instruction set. The Arm assembly code is in lowercase, and, for contrast, pseudo-assembly code is in uppercase.
Load-store machines
The RISC/CISC distinction is often seen when comparing the Arm family and the Intel x86 family of processors, both of which are commercial products competing on the market. The terms RISC (reduced instruction set computer) and CISC (complex instruction set computer) date from the middle 1980s. Even then the terms were misleading, in that both RISC (e.g., MIPS) and CISC (e.g., Intel) processors had about 300 instructions in their instruction sets; today the counts for core instructions in Arm and Intel machines are close, although both types of machines have extended their instruction sets. A sharper distinction between Arm and Intel machines draws on an architectural feature other than instruction count.
An instruction set architecture (ISA) is an abstract model of a computing machine. Processors from Arm and Intel implement different ISAs: Arm processors implement a load-store ISA, whereas their Intel counterparts implement a register-memory ISA. The difference in ISAs can be described as:
- In a load-store machine, only two instructions move data between a CPU and the memory subsystem:
- A load instruction copies bits from memory into a CPU register.
- A store instruction copies bits from a CPU register into memory.
- Other instructions—in particular, the ones for arithmetic-logic operations—use only CPU registers as source and destination operands. For example, here is pseudo-assembly code on a load-store machine to add two numbers originally in memory, storing their sum back in memory (comments start with
##
): [code] ## R0 is a CPU register, RAM[32] is a memory location LOAD R0, RAM[32] ## R0 = RAM[32] LOAD R1, RAM[64] ## R1 = RAM[64] ADD R2, R0, R1 ## R2 = R0 + R1 STORE R2, RAM[32] ## RAM[32] = R2 [/code] The task requires four instructions: two LOADs, one ADD, and one STORE. - By contrast, a register-memory machine allows the operands for arithmetic-logic instructions to be registers or memory locations, usually in any combination. For example, here is pseudo-assembly code on a register-memory machine to add two numbers in memory: [code]
ADD RAM[32], RAM[32], RAM[64] ## RAM[32] += RAM[64]
[/code] The task can be accomplished with a single instruction, although the bits to be added must still be fetched from memory to a CPU, and the sum then must be copied back to memory location RAM[32].
Any ISA comes with tradeoffs. As the example above illustrates, a load-store ISA has what architects call "low instruction density": relatively many instructions may be required to perform a task. A register-memory machine has high instruction density, which is an upside. There are upsides, as well, to the load-store ISA.
Load-store design is an effort to simplify an architecture. For instance, consider the case in which a register-memory machine has an instruction with mixed operands:
COPY R2, RAM[64] ## R2 = RAM[64]
ADD RAM[32], RAM[32], R2 ## RAM[32] = RAM[32] + R2
Executing the ADD instruction is tricky in that the access times for the numbers to be added differs—perhaps significantly if the memory operand happens to be only in main memory rather than also in a cache thereof. Load-store machines avoid the problem of mixed access times in arithmetic-logic operations: all operands, as registers, have the same access time.
Furthermore, load-store architectures emphasize fixed-sized instructions (e.g., 32-bits apiece), limited formats (e.g., one, two, or three fields per instruction), and relatively few addressing modes. These design constraints mean that the processor's control unit (CU) and arithmetic-logic unit (ALU) can be simplified: fewer transistors and wires, less required power and generated heat, and so on. Load-store machines are designed to be architecturally sparse.
My aim is not to step into the debate over load-store versus register-memory machines but rather to set up a code example in the load-store Arm6 architecture. This first look at load-store helps to explain the code that follows. The two programs (one in C, one in Arm6 assembly) are available on my website.
The hstone program in C
Among my favorite short code examples is the hailstone function, which takes a positive integer as an argument. (I used this example in an earlier article on WebAssembly.) This function is rich enough to highlight important assembly-language details. The function is defined as:
3N+1 if N is odd
hstone(N) =
N/2 if N is even
For example, hstone(12) evaluates to 6, whereas hstone(11) evaluates to 34. If N is odd, then 3N+1 is even; but if N is even, then N/2 could be either even (e.g., 4/2 = 2) or odd (e.g., 6/2 = 3).
The hstone function can be used iteratively by passing the returned value as the next argument. The result is a hailstone sequence, such as this one, which starts with 24 as the original argument, the returned value 12 as the next argument, and so on:
`24,12,6,3,10,5,16,8,4,2,1,4,2,1,...`
It takes 10 steps for the sequence to converge to 1, at which point the sequence of 4,2,1 repeats indefinitely: (3x1)+1 is 4, which is halved to yield 2, which is halved to yield 1, and so on. For an explanation of why "hailstone" seems an appropriate name for such sequences, see "Mathematical mysteries: Hailstone sequences."
Note that powers of 2 converge quickly: 2N requires just N divisions by 2 to reach 1. For example, 32 = 25 has a convergence length of 5, and 512 = 29 has a convergence length of 9. If the hailstone function returns any power of 2, then the sequence converges to 1. Of interest here is the sequence length from the initial argument to the first occurrence of 1.
The Collatz conjecture is that a hailstone sequence converges to 1 no matter what the initial argument N > 0 happens to be. Neither a counterexample nor a proof has been found. The conjecture, simple as it is to illustrate with a program, remains a profoundly challenging problem in number theory.
Below is the C source code for the hstoneC program, which computes the length of the hailstone sequence whose starting value is given as user input. The assembly-language version of the program (hstoneS) is provided after an overview of Arm6 basics. For clarity, the two programs are structurally similar.
Here is the C source code:
#include <stdio.h>
/* Compute steps from n to 1.
-- update an odd n to (3 * n) + 1
-- update an even n to (n / 2) */
unsigned hstone(unsigned n) {
unsigned len = 0; /* counter */
while (1) {
if (1 == n) break;
n = (0 == (n & 1)) ? n / 2 : (3 * n) + 1;
len++;
}
return len;
}
int main() {
[printf][7]("Integer > 0: ");
unsigned num;
[scanf][8]("%u", &num);
[printf][7]("Steps from %u to 1: %u\n", num, hstone(num));
return 0;
}
When the program is run with an input of 9, the output is:
`Steps from 9 to 1: 19`
The hstoneC program has a simple structure. The main
function prompts the user for an input N (an integer > 0) and then calls the hstone
function with this input as an argument. The hstone
function loops until the sequence from N reaches the first 1, returning the number of steps required.
The most complicated statement in the program involves C's conditional operator, which is used to update N:
`n = (0 == (n & 1)) ? n / 2 : (3 * n) + 1;`
This is a terse form of an if-then construct. The test (0 == (n & 1))
checks whether the C variable n
(representing N) is even or odd depending on whether the bitwise AND of N and 1 is zero: an integer value is even just in case its least-significant (rightmost) bit is zero. If N is even, N/2 becomes the new value; otherwise, 3N+1 becomes the new value. The assembly-language version of the program (hstoneS) likewise avoids an explicit if-else construct in updating its implementation of N.
My Arm6 mini-desktop machine includes the GNU C toolset, which can generate the corresponding code in assembly language. With %
as the command-line prompt, the command is:
`% gcc -S hstoneC.c ## -S flag produces and saves assembly code`
This produces the file hstoneC.s, which is about 120 lines of assembly-language source code, including a nop
("no operation") instruction. Compiler-generated assembly tends to be hard to read and may have inefficiencies such as the nop
. A hand-crafted version, such as hstoneS.s
(below), can be easier to follow and even significantly shorter (e.g., hstoneS.s
has about 50 lines of code).
Assembly language basics
Arm6, like most modern architectures, is byte-addressable: a memory address is of a byte, even if the addressed item (e.g., a 32-bit instruction) consists of multiple bytes. Instructions are addressed in little-endian fashion: the address is of the low-order byte. Data items are addressed in little-endian fashion by default, but this can be changed to big-endian so that the address of a multi-byte data item points to the high-order byte. By tradition, the low-order byte is depicted as the rightmost one and the high-order byte as the leftmost one:
high-order low-order
/ /
+----+----+----+----+
| b1 | b2 | b3 | b4 | ## 4 bytes = 32 bits
+----+----+----+----+
Addresses are 32-bits in size, and data items come in three standard sizes:
- A byte is 8 bits in size.
- A halfword is 16 bits in size.
- A word is 32 bits in size.
Aggregates of bytes, halfwords, and words (e.g., arrays and structures) are supported. CPU registers are 32-bits in size.
Assembly languages, in general, have three key features with a syntax that is close and, at times, identical:
- Directives in both Arm6 and Intel assembly start with a period. Here are two Arm6 examples, which happen to work in Intel as well:
.data
.align 4
The first directive indicates that the following section holds data items rather than code. The .align 4
directive specifies that data items should be laid out, in memory, on 4-byte boundaries, which is common in modern architectures. As the name suggests, a directive gives direction to the translator (the "assembler") as this does its work.
By contrast, this directive indicates a code rather than a data section:
`.text`
The term "text" is traditional, and its meaning, in this context, is "read-only": during program execution, code is read-only, whereas data can be read and written.
- Labels in both recent Arm and Intel assembly end with colons. A label is a memory address for either data items (e.g., variables) or code blocks (e.g., functions). Assembly languages, in general, rely heavily on addresses, which means that manipulating pointers—in particular, dereferencing them to get the values to which they point—takes front stage in assembly-language programming. Here are two labels in the hstoneS program:
collatz: /* label */
mov r0, #0 /* instruction */
loop_start: /* label */
...
The first label marks the start of the collatz
function, whose first instruction copies the value zero (#0
) into the register r0
. (The mov
for "move" opcode occurs across assembly languages but really means "copy.") The second label, loop_start:
, is the address of the loop that computes the length of the hailstone sequence. The register r0
serves as the sequence counter.
- Instructions, which assembly-sensitive editors usually indent along with directives, specify the operations to be performed (e.g.,
mov
) together with operands (in this case,r0
and#0
). There are instructions with no operands and others with several.
The mov
instruction above does not violate the load-store principle about memory access. In general, a load instruction (ldr
in Arm6) loads memory contents into a register. By contrast, a mov
instruction can be used to copy an "immediate value," such as an integer constant, into a register:
`mov r0, #0 /* copy zero into r0 */`
A mov
instruction also can be used to copy the contents of one register into another:
`mov r1, r0 /* r1 = r0 */`
The load opcode ldr
would be inappropriate in both cases because a memory location is not in play. Examples of Arm6 ldr
("load register") and str
("store register") instructions are forthcoming.
The Arm6 architecture has 16 primary CPU registers (each 32-bits in size), a mix of general-purpose and special-purpose. Table 1 gives a summary, listing special features and uses beyond scratchpad:
Table 1. Primary CPU registers
Register | Special features |
---|---|
r0 | 1st arg to library function, retval |
r1 | 2nd arg to library function |
r2 | 3rd arg to library function |
r3 | 4th arg to library function |
r4 | callee-saved |
r5 | callee-saved |
r6 | callee-saved |
r7 | callee-saved, system calls |
r8 | callee-saved |
r9 | callee-saved |
r10 | callee-saved |
r11 | callee-saved, frame pointer |
r12 | intra-procedure |
r13 | stack pointer |
r14 | link register |
r15 | program counter |
In general, CPU registers serve as a backup for the stack, the area of main memory that provides reusable scratchpad storage for the arguments passed to functions and the local variables used in functions and other code blocks (e.g., the body of a loop). Given that CPU registers reside on the same chip as the CPU, access time is fast. Access to the stack is significantly slower, with the details depending on the particularities of a system. However, registers are scarce. In the case of Arm6, there are only 16 primary CPU registers, and some of these have special uses beyond scratchpad.
The first four registers, r0
through r3
, are used for scratchpad but also to pass arguments along to library functions. For example, calling a library function such as printf
(used in both the hstoneC and hstoneS programs) requires that the expected arguments be in the expected registers. The printf
function takes at least one argument (a format string) but usually takes others, as well (the values to be formatted). The address of the format string has to be in register r0
for the call to succeed. A programmer-defined function can implement its own register strategy, of course, but using the first four registers for function arguments is common in Arm6 programming.
Register r0
also has special uses. For example, it typically holds the value returned from a function, as in the collatz
function of the hstoneS program. If a program calls the syscall
function, which is used to invoke system functions such as read
and write
, register r0
holds the integer identifier of the system function to be called (e.g., function write
has 4 as its identifier). In this respect, register r0
is similar in purpose to register r7
, which holds such an identifier when function svc
("supervisor call") is used instead of syscall
.
Registers r4
through r11
are general-purpose and "callee saved" (aka "non-volatile" or "call-preserved"). Consider the case in which function F1 calls function F2 using registers to pass arguments to F2. The registers r0
through r3
are "caller saved" (aka "volatile" or "call-clobbered") in that, for example, the called function F2 might call some other function F3 by using the very same registers that F1 did—but with new values therein:
27 13 191 437
\ \ \ \
r0, r1 r0, r1
F1-------->F2-------->F3
After F1 calls F2, the contents of registers r0
and r1
get changed for F2's call to F3. Accordingly, F1 must not assume that its values in r0
and r1
(27 and 13, respectively) have been preserved; instead, these values have been overwritten—clobbered by the new values 191 and 437. Because the first four registers are not "callee saved," called function F2 is not responsible for preserving and later restoring the values in the registers set by F1.
Callee-saved registers bring responsibility to a called function. For example, if F1 used callee-saved registers r4
and r5
in its call to F2, then F2 would be responsible for saving the contents of these registers (typically on the stack) and then restoring the values before returning to F1. F2's code then might start and end as follows:
push {r4, r5} /* save r4 and r5 values on the stack */
... /* reuse r4 and r5 for some other task */
pop {r4, r5} /* restore r4 and r5 values */
The push
operation saves the values in r4
and r5
to the stack. The matching pop
operation then recovers these values from the stack and puts them into r4
and r5
.
Other registers in Table 1 can be used as scratchpad, but some have a special use, as well. As noted earlier, register r7
can be used to make system calls (e.g., to function write
), which a later example shows in detail. In an svc
instruction, the integer identifier for a particular system function must be in register r7
(e.g., 4 to identify the write
function).
Register r11
is aliased as fp
for "frame pointer," which points to the start of the current call frame. When one function calls another, the called function gets its own area of the stack (a call frame) for use as scratchpad. A frame pointer, unlike the stack pointer described below, typically remains fixed until a called function returns.
Register r12
, also known as ip
("intra-procedure"), is used by the dynamic linker. Between calls to dynamically linked library functions, however, a program can use this register as scratchpad.
Register r13
, which has sp
("stack pointer") as an alias, points to the top of the stack and is updated automatically through push
and pop
operations. The stack pointer also can be used as a base address with an offset; for example, sp - #4
points 4 bytes below where the sp
points. The Arm6 stack, like its Intel counterpart, grows from high to low addresses. (Some authors accordingly describe the stack pointer as pointing to the bottom rather than the top of the stack.)
Register r14
, with lr
as an alias, serves as the "link register" that holds a return address for a function. However, a called function can call another with a bl
("branch with link") or bx
("branch with exchange") instruction, thereby clobbering the contents of the lr
register. For example, in the hstoneS program, the function main
calls four others. Accordingly, function main
saves the lr
of its caller on the stack and later restores this value. The pattern occurs regularly in Arm6 assembly language:
push {lr} /* save caller's lr */
... /* call some functions */
pop {lr} /* restore caller's lr */
Register r15
is also the pc
("program counter"). In most architectures, the program counter points to the "next" instruction to be executed. For historical reasons, the Arm6 pc
points to two instructions beyond the current one. The pc
can be manipulated directly (for example, to call a function), but the recommended approach is to use instructions such as bl
that manipulate the link register.
Arm6 has the usual assortment of instructions for arithmetic (e.g., add, subtract, multiply, divide), logic (e.g., compare, shift), control (e.g., branch, exit), and input/output (e.g., read, write). The results of comparisons and other operations are saved in the special-purpose register cpsr
("current processor status register"). For example, this register records whether an addition caused an overflow or whether two compared integer values are equal.
It is worth repeating that the Arm6 has exactly two basic data movement instructions: ldr
to load memory contents into a register and str
to store register contents in memory. Arm6 includes variations of the basic ldr
and str
instructions, but the load-store pattern of moving data between registers and memory remains the same.
A code example brings these architectural details to life. The next section introduces the hailstone program in assembly language.
The hstone program in Arm6 assembly
The above overview of Arm6 assembly is enough to introduce the full code example for hstoneS. For clarity, the assembly-language program hstoneS has essentially the same structure as the C program hstoneC: two functions, main
and collatz
, and mostly straight-line code execution in each function. The behavior of the two programs is the same.
Here is the source code for hstoneS:
.data /* data versus code */
.balign 4 /* alignment on 4-byte boundaries */
/* labels (addresses) for user input, formatters, etc. */
num: .int 0 /* 4-byte integer */
steps: .int 0 /* another for the result */
prompt: .asciz "Integer > 0: " /* zero-terminated ASCII string */
format: .asciz "%u" /* %u for "unsigned" */
report: .asciz "From %u to 1 takes %u steps.\n"
.text /* code: 'text' in the sense of 'read only' */
.global main /* program's entry point must be global */
.extern [printf][7] /* library function */
.extern [scanf][8] /* ditto */
collatz: /** collatz function **/
mov r0, #0 /* r0 is the step counter */
loop_start: /** collatz loop **/
cmp r1, #1 /* are we done? (num == 1?) */
beq collatz_end /* if so, return to main */
and r2, r1, #1 /* odd-even test for r1 (num) */
cmp r2, #0 /* even? */
moveq r1, r1, LSR #1 /* even: divide by 2 via a 1-bit right shift */
addne r1, r1, r1, LSL #1 /* odd: multiply by adding and 1-bit left shift */
addne r1, #1 /* odd: add the 1 for (3 * num) + 1 */
add r0, r0, #1 /* increment counter by 1 */
b loop_start /* loop again */
collatz_end:
bx lr /* return to caller (main) */
main:
push {lr} /* save link register to stack */
/* prompt for and read user input */
ldr r0, =prompt /* format string's address into r0 */
bl [printf][7] /* call printf, with r0 as only argument */
ldr r0, =format /* format string for scanf */
ldr r1, =num /* address of num into r1 */
bl [scanf][8] /* call scanf */
ldr r1, =num /* address of num into r1 */
ldr r1, [r1] /* value at the address into r1 */
bl collatz /* call collatz with r1 as the argument */
/* demo a store */
ldr r3, =steps /* load memory address into r3 */
str r0, [r3] /* store hailstone steps at mem[r3] */
/* setup report */
mov r2, r0 /* r0 holds hailstone steps: copy into r2 */
ldr r1, =num /* get user's input again */
ldr r1, [r1] /* dereference address to get value */
ldr r0, =report /* format string for report into r0 */
bl [printf][7] /* print report */
pop {lr} /* return to caller */
Arm6 assembly supports documentation in either C style (the slash-star and star-slash syntax used here) or one-line comments introduced by the @ sign. The hstoneS program, like its C counterpart, has two functions:
- The program's entry point is the
main
function, which is identified by the labelmain:
; this label marks where the function's first instruction is found. In Arm6 assembly, the entry point must be declared as global:
`.global main`
In C, a function's name is the address of the code block that makes up the function's body, and a C function is extern
(global) by default. It is unsurprising how much C and assembly language resemble one another; indeed, C is portable assembly language.
- The
collatz
function expects one argument, which is implemented by the registerr1
to hold the user's input of an unsigned integer value (e.g., 9). This function updates registerr1
until it equals 1, keeping count of the steps involved with registerr0
, which thereby serves as the function's return value.
An early and interesting code segment in main
involves the call to the library function scanf
, a high-level input function that scans a value from the standard input (by default, the keyboard) and converts this value to the desired data type, in this case, a 4-byte unsigned integer. Here is the code segment in full:
ldr r0, =format /* address of format string into r0 */
ldr r1, =num /* address of num into r1 */
bl [scanf][8] /* call scanf (bl = branch with link) */
ldr r1, =num /* address of num into r1 */
ldr r1, [r1] /* value at the address into r1 */
bl collatz /* call collatz with r1 as the argument */
Two labels (addresses) are in play: format
and num
, both of which are defined in the .data
section at the top of the program:
num: .int 0
format: .asciz "%u"
The label num:
is the memory address of a 4-byte integer value, initialized to zero; the label format:
points to a null-terminated (the "z" in "asciz" for zero) string "%u"
, which specifies that the scanned input should be converted to an unsigned integer. Accordingly, the first two instructions in the code segment load the address of the format string (=format
) into register r0
and the address for the scanned number (=num
) into register r1
. Note that each label now starts with an equals sign ("assign address") and that the colon is dropped at the end of each label. The library function scanf
can take an arbitrary number of arguments, but the first (which scanf
expects in register r0
) should be the "address" of a format string. In this example, the second argument to scanf
is the "address" at which to save the scanned integer.
The last three instructions in the code segment highlight important assembly details. The first ldr
instruction loads the address of the memory-based integer (=num
) into register r1
. However, the collatz
function expects the value stored at this address, not the address itself; hence, the address is dereferenced to get the value:
ldr r1, =num /* load address into r1 */
ldr r1, [r1] /* dereference to get value */
The square brackets specify memory and r1
holds a memory address. The expression [r1]
thus evaluates to the value stored in memory at address r1
. The example underscores that registers can hold addresses and values stored at addresses: register r1
first holds an address and then the value stored at this address.
When the collatz
function returns to main
, this function first performs a store operation:
ldr r3, =steps /* steps is a memory address */
str r0, [r3] /* store r0 value at mem[r3] */
The label steps:
is from the .data
section and register r0
holds the steps computed in the collatz
function. The str
instruction thus saves to memory the length of the hailstone sequence. In a ldr
instruction, the first operand (a register) is the target for the load; but in a str
operation, the first operand (also a register) is the source for the store. In both cases, the second operand is a memory location.
Some additional work in main
sets up the final report:
mov r2, r0 /* save count in r2 */
ldr r1, =num /* recover user input */
ldr r1, [r1] /* dereference r1 */
ldr r0, =report /* r0 points to format string */
bl [printf][7] /* print report */
In the collatz
function, register r0
tracks how many steps are needed to reach 1 from the user's input, but the library function printf
expects its first argument (the address of a format string) to be stored in register r0
. The return value in register r0
is therefore copied into register r2
with the mov
instruction. The address of the format string for printf
is then stored in register r0
.
The argument to the collatz
function is the scanned input, which is stored in register r1
; but this register is updated in the collatz
loop unless the value happens to be 1 at the start. Accordingly, the address num:
is again copied into r1
and then dereferenced to get the user's original input. This value becomes the second argument to printf
, the starting value of the hailstone sequence. With this setup in place, main
calls printf
with the bl
("branch with link") instruction.
At the very start of the collatz
loop, the program checks whether the sequence has hit a 1:
cmp r1, #1
beq collatz_end
If register r1
has 1 as its value, there is a branch (beq
for "branch if equal") to the end of the collatz
function, which means a return to its caller main
with register r0
as the return value—the number of steps in the hailstone sequence.
The collatz
function introduces new features and opcodes, which illustrate how efficient assembly code can be. The assembly code, like the C code, checks for N's parity, with register r1
as N:
and r2, r1, #1 /* r2 = r1 & 1 */
cmp r2, #0 /* is the result even? */
The result of the bitwise AND operation on register r1
and 1 is stored in register r2
. If the least-significant (rightmost) bit of register r2
is a 1, then N (register r1
) is odd, and otherwise, it is even. The result of this comparison (saved in the special-purpose register cpsr
) is used automatically in forthcoming instructions such as moveq
("move if equal") and addne
("add if not equal").
The assembly code, like the C code, now avoids an explicit if-else construct. This code segment has the same effect as an if-else test, but the code is more efficient in that no branching is involved—the code executes in straight-line fashion because the conditional tests are built into the instruction opcodes:
moveq r1, r1, LSR #1 /* right-shift 1 bit if even */
addne r1, r1, r1, LSL #1 /* left-shift 1 bit and add otherwise */
addne r1, #1 /* add 1 for the + 1 in N = 3N + 1 */
The moveq
(eq
for "if equal") instruction checks the outcome of the earlier cmp
test, which determines whether the current value of register r1
(N) is even or odd. If the value in register r1
is even, this value must be updated to its half, which is done by a 1-bit right-shift (LSR #1
). In general, right-shifting an integer is more efficient than explicitly dividing it by two. For example, suppose that register r1
currently holds 4, whose least significant four bits are:
`...0100 ## 4 in binary`
Shifting right by 1 bit yields:
`...0010 ## 2 in binary`
The LSR
stands for "logical shift right" and contrasts with ASR
for "arithmetic shift right." An arithmetic shift is sign-preserving (most significant bit of 1 for negative and 0 for non-negative), whereas a logical shift is not, but the hailstone programs deal exclusively with unsigned (hence, non-negative) values. In a logical shift, the shifted bits are replaced by zeros.
If register r1
holds a value with odd parity, similar straight-line code occurs:
addne r1, r1, r1, LSL #1 /* r1 = r1 * 3 */
addne r1, #1 /* r1 = r1 + 1 */
The two addne
instructions (ne
for "if not equal") execute only if the earlier check for parity indicates an odd value. The first addne
instruction does multiplication through a 1-bit left-shift and addition. In general, shifting and adding are more efficient than explicitly multiplying. The second addne
then adds 1 to register r1
so that the update is from N to 3N+1.
Assembling the hstoneS program
The assembly source code for the hstoneS program needs to be translated ("assembled") into a binary object module, which is then linked with appropriate libraries to become executable. The simplest approach is to use the GNU C compiler in the same way as it is used to compile a C program such as hstoneC:
`% gcc -o hstoneS hstoneS.s`
This command does the assembling and linking.
A slightly more efficient approach is to use the as utility that ships with the GNU toolset. This approach separates the assembling and the linking. Here is the assembling step:
`% as -o hstoneS.o hstoneS.s ## assemble`
The extension .o
is traditional for object modules. The system utility ld then could be used for the linking, but an easier and equally efficient approach is to revert to the gcc command:
`% gcc -o hstoneS hstoneS.o ## link`
This approach highlights again that the C compiler handles any mix of C and assembly, whether source files or object modules.
Wrapping up with an explicit system call
The two hailstone programs use the high-level input/output functions scanf
and printf
. These functions are high-level in that they deal with formatted types (in this case, unsigned integers) rather than with raw bytes. In an embedded system, however, these functions might not be available; the low-level input/output functions read
and write
, which ultimately implement their high-level counterparts, then would be used instead. These two system functions are low-level in that they work with raw bytes.
In Arm6 assembly, a program explicitly calls a system function such as write
in an indirect manner—by invoking one of the aforementioned functions svc
or syscall
, for example:
calls calls
program------->svc------->write
The integer identifier for a particular system function (e.g., 4 identifies write
) goes into the appropriate register (register r7
for svc
and register r0
for syscall
). The code segments below illustrate, first with svc
and then with syscall
.
The two code segments write the traditional greeting to the standard output, which is the screen by default. The standard output has a file descriptor, a non-negative integer value that identifies it. The three predefined descriptors are:
standard input: 0 (keyboard by default)
standard output: 1 (screen by default)
standard error: 2 (screen by default)
Here is the code segment for a sample system call with svc
:
msg: .asciz "Hello, world!\n" /* greeting */
...
mov r0, #1 /* 1 = standard output */
ldr r1, =msg /* address of bytes to write */
mov r2, #14 /* message length (in bytes) */
mov r7, #4 /* write has 4 as an id */
svc #0 /* system call to write */
The function write
takes three arguments and, when called via svc
, the function's arguments go into the following registers:
r0
holds the target for the write operation, in this case, the standard output (#1
).r1
has the address of the byte(s) to write (=msg
).r2
specifies how many bytes are to be written (#14
).
In the case of the svc
instruction, register r7
identifies, with a non-negative integer (in this case, #4
), which system function to call. The svc
call returns zero (the #0
) to signal success but usually a negative value to signal an error.
The syscall
and svc
functions differ in detail, but using either to invoke a system function requires the same two steps:
- Specify the system function to be called (in
r7
forsvc
, inr0
forsyscall
). - Put the arguments for the system function in the appropriate registers, which differ between the
svc
andsyscall
variants.
Here is the syscall
example of invoking the write
function:
msg: .asciz "Hello, world!\n" /* greeting */
...
mov r1, #1 /* standard output */
ldr r2, =msg /* address of message */
mov r3, #14 /* byte count */
mov r0, #4 /* identifier for write */
syscall
C has a thin wrapper not only for the syscall
function but for the system functions read
and write
as well. The C wrapper for syscall
gives the gist at a high level:
syscall(SYS_write, /* 4 is the id for write */
STDOUT_FILENO, /* 1 is the standard output */
"Hello, world!\n", /* message */
14); /* byte length */
The direct approach in C uses the wrapper for the system write
function:
`write(STDOUT_FILENO, "Hello, world!\n", 14);`
via: https://opensource.com/article/20/10/arm6-assembly-language
作者:Marty Kalin 选题:lujun9972 译者:译者ID 校对:校对者ID