Assembly language
Assembly language, commonly called assembly, asm or symbolic machine code, is a human-readable notation for the machine language that a specific computer architecture uses. Machine language, a pattern of bits encoding machine operations, is made readable by replacing the raw values with symbols called mnemonics. Assembly is derived from a similar representation called short code, whose programming 'language' had the same name; contrast this with speed code ('speedcoding').
For example, a computer with the appropriate processor will understand this x86/IA-32 machine language:
10110000 01100001
For programmers, however, it is easier to remember the equivalent assembly language representation:
mov al, 061h
which means to move the hexadecimal value 61 (97 decimal) into the processor register with the name "al". The mnemonic "mov" is short for "move", and a comma-separated list of arguments or parameters follows it; this is a typical assembly language statement.
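The correspondence between the mnemonic form and the bit pattern above can be sketched in a few lines of Python. This is only an illustration of one x86 encoding (opcode B0h plus a register number, followed by a one-byte immediate); the helper names `encode_mov_r8_imm8` and `to_bits` are made up for this example and are not part of any real assembler.

```python
# Register numbers used by the x86 "MOV r8, imm8" encoding.
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}

def encode_mov_r8_imm8(reg, value):
    """Encode `mov reg, value` as two bytes: 0xB0 + register number, then the immediate."""
    if not 0 <= value <= 0xFF:
        raise ValueError("immediate must fit in one byte")
    return bytes([0xB0 + REG8[reg], value])

def to_bits(code):
    """Render machine code bytes as space-separated groups of eight bits."""
    return " ".join(f"{b:08b}" for b in code)

machine_code = encode_mov_r8_imm8("al", 0x61)
print(to_bits(machine_code))  # 10110000 01100001
```

A real assembler handles many more instruction forms, but each follows the same idea: a table lookup from symbols to bit fields.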
Transforming assembly into machine language is accomplished by an assembler, and the reverse by a disassembler. Unlike in high-level languages, there is usually a 1-to-1 correspondence between simple assembly statements and machine language instructions. However, in some cases an assembler may provide pseudoinstructions which expand into several machine language instructions to provide commonly needed functionality. For example, for a machine that lacks a "branch if greater or equal" instruction, an assembler may provide a pseudoinstruction that expands to the machine's "set if less than" and "branch if zero (on the result of the set instruction)". Most full-featured assemblers also allow programmers to define their own pseudoinstructions of this sort, usually called macros.
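The pseudoinstruction expansion described above might look like the following sketch. It assumes a MIPS-like syntax (`slt`, `beq`, the scratch register `$at` and the always-zero register `$zero`); the function name `expand` is hypothetical, and real assemblers implement this step in their own ways.

```python
def expand(line, scratch="$at"):
    """Expand `bge a, b, target` into real instructions; pass other lines through."""
    parts = line.replace(",", " ").split()
    if parts and parts[0] == "bge":
        _, a, b, target = parts
        return [
            f"slt {scratch}, {a}, {b}",         # scratch = 1 if a < b, else 0
            f"beq {scratch}, $zero, {target}",  # branch when scratch == 0, i.e. a >= b
        ]
    return [line]

for out in expand("bge $t0, $t1, done"):
    print(out)
```

One statement in the source becomes two machine instructions, which is why the 1-to-1 correspondence holds only for simple statements.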
Every computer architecture has its own machine language, and therefore its own assembly language. Computers differ by the number and type of operations that they support. They may also have different sizes and numbers of registers, and different representations of data types in storage. While all general-purpose computers are able to carry out essentially the same functionality, the way they do it differs, and the corresponding assembly language must reflect these differences.
In addition, multiple sets of mnemonics or assembly-language syntax may exist for a single instruction set. In these cases, the most popular one is usually that used by the manufacturer in their documentation.
Contents |
Machine instructions
Instructions in assembly language are generally very simple, unlike in a high-level language. Any instruction that references memory (for data or as a jump target) will also have an addressing mode to determine how to calculate the required memory address. More complex operations must be built up out of these simple operations. Some operations available in most instruction sets include:
- moving
  - set a register (a temporary "scratchpad" location in the CPU itself) to a fixed constant value
  - move data from a memory location to a register, or vice versa. This is done to obtain the data to perform a computation on it later, or to store the result of a computation.
  - read and write data from hardware devices
- computing
  - add, subtract, multiply, or divide the values of two registers, placing the result in a register
  - perform bitwise operations, taking the conjunction/disjunction (and/or) of corresponding bits in a pair of registers, or the negation (not) of each bit in a register
  - compare two values in registers (for example, to see if one is less, or if they are equal)
- affecting program flow
  - jump to another location in the program and execute instructions there
  - jump to another location if a certain condition holds
  - jump to another location, but save the location of the next instruction as a point to return to (a call)
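To show how these three kinds of operation combine, here is a minimal sketch of a toy register machine in Python. The opcodes `set`, `add`, `sub` and `jnz` and the register names are invented for this illustration and do not correspond to any real instruction set.

```python
def run(program):
    """Execute a list of (opcode, *operands) tuples on a toy register machine."""
    regs = {"r0": 0, "r1": 0, "r2": 0}
    pc = 0  # index of the next instruction to execute
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "set":      # moving: load a constant into a register
            regs[args[0]] = args[1]
        elif op == "add":    # computing: dst = dst + src
            regs[args[0]] += regs[args[1]]
        elif op == "sub":    # computing: dst = dst - src
            regs[args[0]] -= regs[args[1]]
        elif op == "jnz":    # program flow: jump if the register is non-zero
            if regs[args[0]] != 0:
                pc = args[1]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs

# Sum 5 + 4 + 3 + 2 + 1 into r0 using only the simple operations above.
program = [
    ("set", "r0", 0),     # 0: accumulator
    ("set", "r1", 5),     # 1: loop counter
    ("set", "r2", 1),     # 2: the constant one
    ("add", "r0", "r1"),  # 3: r0 += r1
    ("sub", "r1", "r2"),  # 4: r1 -= 1
    ("jnz", "r1", 3),     # 5: repeat from instruction 3 while r1 != 0
]
print(run(program)["r0"])  # 15
```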
Some computers include one or more "complex" instructions in their instruction set. A single "complex" instruction does something that may take many instructions on other computers. Such instructions typically take multiple steps, may issue to multiple functional units, or otherwise appear to be a design exception to the simplest instructions implemented for the given processor. Some examples of such instructions include:
- saving many registers on the stack at once
- moving large blocks of memory
- complex and/or floating-point arithmetic (sine, cosine, square root, etc.)
- performing an atomic test-and-set instruction
- instructions that combine an ALU operation with an operand from memory rather than a register
A complex instruction type that has become particularly popular recently is the SIMD operation or vector instruction, an operation that performs the same arithmetic operation on multiple pieces of data at the same time. SIMD instructions allow easy parallelization of algorithms commonly involved in sound, image, and video processing. Various SIMD implementations have been brought to market under trade names such as MMX, 3DNow! and AltiVec.
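The idea of one operation acting on several data items at once can be emulated in plain Python. The sketch below packs four 8-bit lanes into one integer and adds them lane by lane with wraparound; the function name `packed_add_u8` and the choice of four lanes are assumptions made for illustration, and real SIMD hardware performs the lane additions in parallel rather than in a loop.

```python
def packed_add_u8(a, b, lanes=4):
    """Add corresponding 8-bit lanes of two packed integers, wrapping each lane modulo 256."""
    result = 0
    for i in range(lanes):
        shift = 8 * i
        lane_a = (a >> shift) & 0xFF
        lane_b = (b >> shift) & 0xFF
        # A carry out of one lane must not spill into its neighbour.
        result |= ((lane_a + lane_b) & 0xFF) << shift
    return result

a = 0x01_02_03_FF
b = 0x10_20_30_02
print(hex(packed_add_u8(a, b)))  # 0x11223301
```

Note that the lowest lane wraps from FFh + 02h to 01h without disturbing the lane above it, which is what distinguishes a packed add from an ordinary 32-bit add.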
The design of instruction sets is a complex issue, with a simpler instruction set (generally grouped under the concept RISC) perhaps offering the potential for higher speeds, while a more complex one (traditionally called CISC) may offer particularly fast implementations of common performance-demanding tasks, may use memory (and thus cache) more efficiently, and be somewhat easier to program directly in assembly. See instruction set for a fuller discussion of this issue.
Assembly language directives
In addition to codes for machine instructions, assembly languages have additional directives for assembling blocks of data (e.g. numbers and strings) and for assigning mnemonic symbols (often called labels) to address locations for code (such as subroutine entry points) or data areas. Symbols can also typically be assigned specific values or even the results of simple calculations. These all add to the assembler's capacity to allow the programmer to write code that is easier to read and maintain.
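How an assembler might track labels and assigned symbols can be sketched as a first pass over the source. The example borrows the `.org` and `.equ` directive names and the `label:` convention from the example listing in this article; the function `assemble_symbols` and its fixed four-byte statement size are simplifying assumptions, not a description of any particular assembler.

```python
def assemble_symbols(lines, origin=0, word_size=4):
    """First pass of a toy assembler: build the symbol table.

    Supports `name .equ value`, `.org address` and `label:` prefixes.
    """
    symbols = {}
    address = origin
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if len(tokens) == 3 and tokens[1] == ".equ":
            symbols[tokens[0]] = int(tokens[2])  # symbol assigned a fixed value
            continue
        if tokens[0] == ".org":
            arg = tokens[1]
            address = symbols[arg] if arg in symbols else int(arg)
            continue
        if tokens[0].endswith(":"):
            symbols[tokens[0][:-1]] = address    # label names the current address
            tokens = tokens[1:]
        if tokens:
            address += word_size                 # each statement occupies one word
    return symbols

source = [
    ".org 2048",
    "a_start .equ 3000",
    "ld [length],%r1",
    "done: jmpl %r15+4,%r0",
    "length: 20",
    ".org a_start",
    "a: 25",
]
print(assemble_symbols(source))
```

A second pass would then substitute these addresses wherever a symbol such as `length` appears as an operand.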
As in most computer languages, comments can be added to assembly source code; these often provide useful additional information to human readers of the code, but are ignored by the assembler and so may be used freely. Assembly source code without comments and meaningful symbols and data definitions, such as that generated by compilers or disassemblers, is quite difficult to read.
Most assemblers have an embedded macro language to make it easier to generate code or data. For example, with 8-bit processors it is common to use a macro that increments or decrements a 16-bit quantity stored in two consecutive bytes, a common operation which would normally require three or four instructions on, for example, the 6502. On IBM mainframes, it is common to use a macro for currency formatting, typically a sequence of four instructions revolving around the Edit and Mark (EDMK) instruction. Most processor architectures have idiomatic instruction sequences of this type, and many assemblers even have built-in macros for the most common ones. Some assemblers have quite sophisticated macro languages, allowing for example conditionals and even "looping", i.e. assembling some number of instructions repeatedly while varying some operand on each repetition, thereby allowing easy generation of "unrolled" loops.
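The two macro features mentioned above, an idiomatic instruction sequence and assembly-time repetition, can be imitated with Python functions that emit source text. The 16-bit increment uses the usual 6502 idiom (increment the low byte, and increment the high byte only if the low byte wrapped to zero); the names `inc16` and `repeat`, the generated `skipN` labels and the `buffer` symbol are hypothetical.

```python
import itertools

_label_ids = itertools.count()  # source of unique label suffixes

def inc16(addr):
    """Expand a hypothetical INC16 macro: increment the 16-bit value at addr (low byte first)."""
    skip = f"skip{next(_label_ids)}"
    return [
        f"    INC {addr}",    # increment the low byte
        f"    BNE {skip}",    # result non-zero: no carry into the high byte
        f"    INC {addr}+1",  # low byte wrapped to zero: increment the high byte
        f"{skip}:",
    ]

def repeat(count, body):
    """Assembly-time loop: emit body(i) for i in 0..count-1, producing unrolled code."""
    lines = []
    for i in range(count):
        lines.extend(body(i))
    return lines

print("\n".join(inc16("counter")))
print("\n".join(repeat(3, lambda i: [f"    STA buffer+{i}"])))
```

Each use of the macro gets a fresh label so that several expansions can coexist in one source file.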
Usage of assembly language
Historically, a large number of programs have been written entirely in assembly language. A classic example was the early IBM PC spreadsheet program Lotus 1-2-3. Even into the 1990s, the majority of console video games were written in assembly language, including most games written for the Mega Drive/Genesis and the Super Nintendo Entertainment System. The popular arcade game NBA Jam (1993) was also coded entirely using assembly language.
There is some debate over the continued usefulness of assembly language. It is often said that modern compilers can render higher-level languages into code that runs as fast as hand-written assembly, but counter-examples can be made, and there is no clear consensus on this topic. It is reasonably certain that, given the increase in complexity of modern processors, effective hand-optimization is increasingly difficult and requires a great deal of knowledge.
However, some discrete calculations can still be rendered into faster running code with assembly, and some low-level programming is actually easier to do with assembly. Some system-dependent tasks performed by operating systems simply cannot be expressed in high-level languages. In particular, assembly is often used in writing the low-level interaction between the operating system and the hardware, for instance in device drivers. Many compilers also render high-level languages into assembly first before fully compiling, allowing the assembly code to be viewed for debugging and optimization purposes.
It is also common, especially in relatively low-level languages such as C, to be able to embed assembly language into the source code with special syntax. Programs using such facilities, such as the Linux kernel, often construct abstractions where different assembly language is used on each platform the program supports, but it is called by portable code through a uniform interface.
Many embedded systems are also programmed in assembly to obtain the absolute maximum functionality out of what are often very limited computational resources, though this is gradually changing in some areas as more powerful chips become available for the same minimal cost.
Another common area of assembly language use is in the system BIOS of a computer. This low-level code is used to initialize and test the system hardware prior to booting the OS and is stored in ROM. Once a certain level of hardware initialization has taken place, code written in higher-level languages can be used, but almost always the code running immediately after power is applied is written in assembly language.
Assembly language is also valuable in reverse engineering, since many programs are distributed only in machine code form, and machine code is usually easy to translate into assembly language and carefully examine in this form, but very difficult to translate into a higher-level language. Tools such as the Interactive Disassembler make extensive use of disassembly for such a purpose.
MenuetOS, a floppy-based operating system with a fully functional GUI, is written entirely in assembly. A 64-bit version is also available. The author claims that only through assembly language could he produce his system in less than 1.4 megabytes.
While the modern role of assembly differs greatly from the past in that most software developers do not use assembly coding for entire projects anymore, it is still a very valuable tool to use when writing frequently-accessed components of an application or an operating system; a commonly cited example is an operating system's boot loader, which is almost always written entirely in assembly language for compactness and speed.
Regardless of whether a programmer will actually use assembly in day-to-day work, a case can be made that any serious programmer should learn at least one assembly language to better understand how computers work (and to appreciate all the work high-level languages save).
Example listing of assembly language source code
| Addr | Label | Instruction | Object code[1] |
|---|---|---|---|
| | | .begin | |
| | | .org 2048 | |
| | a_start | .equ 3000 | |
| 2048 | | ld [length],%r1 | 11000010 00000000 00101000 00101100 |
| 2052 | | ld [address],%r2 | 11000100 00000000 00101000 00110000 |
| 2056 | | andcc %r3,%r0,%r3 | 10000110 10001000 11000000 00000000 |
| 2060 | loop: | andcc %r1,%r1,%r0 | 10000000 10001000 01000000 00000001 |
| 2064 | | be done | 00000010 10000000 00000000 00000110 |
| 2068 | | addcc %r1,-4,%r1 | 10000010 10000000 01111111 11111100 |
| 2072 | | addcc %r1,%r2,%r4 | 10001000 10000000 01000000 00000010 |
| 2076 | | ld %r4,%r5 | 11001010 00000001 00000000 00000000 |
| 2080 | | ba loop | 00010000 10111111 11111111 11111011 |
| 2084 | | addcc %r3,%r5,%r3 | 10000110 10000000 11000000 00000101 |
| 2088 | done: | jmpl %r15+4,%r0 | 10000001 11000011 11100000 00000100 |
| 2092 | length: | 20 | 00000000 00000000 00000000 00010100 |
| 2096 | address: | a_start | 00000000 00000000 00001011 10111000 |
| | | .org a_start | |
| 3000 | a: | 25 | 00000000 00000000 00000000 00011001 |
| 3004 | | -10 | 11111111 11111111 11111111 11110110 |
| 3008 | | 33 | 00000000 00000000 00000000 00100001 |
| 3012 | | -5 | 11111111 11111111 11111111 11111011 |
| 3016 | | 7 | 00000000 00000000 00000000 00000111 |
| | | .end | |
Example of a selection of instructions (for a virtual computer[2]) with the corresponding address in memory where each instruction will be placed. These addresses are not static; see memory management. Accompanying each instruction is the object code generated by the assembler, which corresponds to the virtual computer's architecture (or ISA).
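The relationship between an instruction and its object code in the listing can be checked by splitting a 32-bit word into fields. The sketch below assumes the SPARC-style arithmetic format on which the virtual computer appears to be modeled (2-bit op, 5-bit rd, 6-bit op3, 5-bit rs1, an immediate flag, then either a signed 13-bit constant or a 5-bit rs2); the op3 values in `OP3_ARITH` and the function name `decode_arith` are assumptions for illustration and cover only two mnemonics.

```python
# Assumed op3 codes for the arithmetic format (op = 10), as in the SPARC ISA.
OP3_ARITH = {0b010000: "addcc", 0b010001: "andcc"}

def decode_arith(bits):
    """Decode a 32-bit word, given as a bit string, in the arithmetic format."""
    word = int(bits.replace(" ", ""), 2)
    if word >> 30 != 0b10:
        raise ValueError("not an arithmetic-format instruction")
    rd = (word >> 25) & 0x1F    # destination register
    op3 = (word >> 19) & 0x3F   # operation selector
    rs1 = (word >> 14) & 0x1F   # first source register
    if (word >> 13) & 1:        # i = 1: second operand is a signed 13-bit constant
        simm13 = word & 0x1FFF
        if simm13 & 0x1000:
            simm13 -= 0x2000    # sign-extend
        src2 = str(simm13)
    else:                       # i = 0: second operand is register rs2
        src2 = f"%r{word & 0x1F}"
    return f"{OP3_ARITH[op3]} %r{rs1},{src2},%r{rd}"

print(decode_arith("10000010 10000000 01111111 11111100"))  # addcc %r1,-4,%r1
print(decode_arith("10001000 10000000 01000000 00000010"))  # addcc %r1,%r2,%r4
```

The two words decoded here are the object code shown at addresses 2068 and 2072 in the listing.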
References
- [1] Murdocca, Miles J.; Vincent P. Heuring (2000). Principles of Computer Architecture. Prentice-Hall. ISBN 0-201-43664-7.
- [2] Principles of Computer Architecture (POCA): ARCTools virtual computer available for download to execute referenced code, accessed August 24, 2005
Books
- The Art of Assembly Language Programming, by Randall Hyde
- Computer-Books.us, Online Assembly Language Books
- PC Assembly Language by Dr Paul Carter
- PC Assembly Tutorial using NASM and GCC by Paul Carter
- Programming from the Ground Up by Jonathan Bartlett
- The x86 ASM Book by the ASM Community