In this tutorial I will list some techniques for understanding a basic Linux program. I will use a simple assembly program written in NASM syntax. Most Linux tools (gcc, objdump, gdb) use GNU assembler (AT&T) syntax by default. This tutorial does not teach that syntax in detail (I might do that in the future), although it shows up in the disassembly output further down. Let's get started.
For simplicity, here is the complete code, saved as helloWorld.asm:
; for 32-bit
; nasm -felf32 helloWorld.asm
; ld -melf_i386 helloWorld.o -o helloWorld
; for 64-bit
; nasm -felf64 helloWorld.asm
; ld helloWorld.o -o helloWorld
;
global _start

section .data
string: db 'Hello, world !!', 10 ; or 0xA for newline
stringLength: equ $-string ; length

section .bss

section .text
_start:
    mov edx, stringLength ; Arg4: string length
    mov ecx, string ; Arg3: string start address
    mov ebx, 1 ; stdout
    mov eax, 4 ; write(2)
    int 0x80 ; linux/bsd system call
    mov ebx, 0 ; return 0
    mov eax, 1 ; exit(2)
    int 0x80
The comments explain what the important lines of the program do. The “global _start” directive exports the _start symbol so that the linker can see it, and ld uses _start as its default entry point. For now, think of labels as names for addresses: _start: names the first instruction of our code, and string: names the first byte of our data. stringLength is an assemble-time constant (equ), not a variable stored in memory. Its value is the address of the current location ($) minus the address of string. Data in the .data section is placed sequentially, so this difference is exactly the length of the string in bytes. The .bss section is declared but not used here. The .text section is where our code resides. Here _start plays the role of int main() in a C program. We are not using gcc (as is evident from calling ld directly, without linking the C library). If we linked with gcc and the C runtime instead, the _start label would have to be renamed to main. Under this label we set the edx, ecx, ebx and eax registers one after another. They hold the arguments of the standard write function, shown below, plus the system call number. As you might have guessed, the program loads them in reverse order: the last argument (count) first and the system call number last. That order is only a convention. What matters is that all four registers hold the right values when the interrupt is raised.
ssize_t write(int fd, const void *buf, size_t count);
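As can be seen from the code above, the file descriptor of stdout is 1 and the system call number for write is 4 (on 32-bit x86 Linux). Once eax holds the system call number and ebx, ecx and edx hold the arguments fd, buf and count, the 0x80 interrupt hands control to the kernel, which performs the operation. In this case it writes stringLength bytes, starting at the address held in ecx, to stdout. The second int 0x80 works the same way: eax = 1 selects exit and ebx = 0 is the exit status.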
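To make the register-to-argument mapping concrete, here is a rough Python sketch of the same write call. It is only an analogy, not part of the assembly program: os.write issues the write system call on our behalf, and the names message, string_length and STDOUT_FD are made up for this illustration.

```python
import os

# Same bytes as the `string` label: the text followed by a newline (10 / 0x0A).
message = b"Hello, world !!" + bytes([10])

# Counterpart of `stringLength: equ $-string`: the number of bytes in the data.
string_length = len(message)

STDOUT_FD = 1  # the value the assembly program puts in ebx

# Counterpart of eax=4, ebx=1, ecx=string, edx=stringLength followed by int 0x80.
written = os.write(STDOUT_FD, message[:string_length])

print("bytes written:", written, "hex length:", hex(string_length))
```

The length comes out as 16, which is the 0x10 that shows up later in the disassembly.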
If you are on a 64-bit computer (like mine), there are several switches that can be used to generate the final output. I have included the commands in the comment block at the top of the code above. After you run one set of commands (either the 32-bit one or the 64-bit one) you will get the desired executable. In fact, you can generate both 32-bit and 64-bit programs with these commands on a 64-bit Linux distribution. The dumps in the rest of this tutorial come from the 32-bit build.
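If you lose track of which build you have, the ELF header tells you. The sketch below looks at the fifth byte of the file (called EI_CLASS in the ELF specification), which is 1 for a 32-bit file and 2 for a 64-bit file. The function name elf_class and the sample header bytes are my own, for illustration only; pass it the first bytes of your own helloWorld file instead.

```python
def elf_class(header: bytes) -> str:
    """Return '32-bit' or '64-bit' based on the EI_CLASS byte of an ELF header."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    classes = {1: "32-bit", 2: "64-bit"}
    return classes.get(header[4], "unknown")


# Hypothetical sample: the ELF magic number plus EI_CLASS = 1,
# as a 32-bit build would have.
sample_header = b"\x7fELF\x01\x01\x01\x00"
print(elf_class(sample_header))

# With a real file (the path is an assumption):
# with open("helloWorld", "rb") as f:
#     print(elf_class(f.read(5)))
```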
In addition to running the program, we will also disassemble it using objdump. Run the following commands to generate two files.
$ objdump -D helloWorld > codedump
$ objdump -Dslx helloWorld > completeDump
This creates two files that dump the program's instructions and structure. The following is the output of the codedump file.
helloWorld:     file format elf32-i386

Disassembly of section .text:

08048080 <_start>:
 8048080: ba 10 00 00 00    mov $0x10,%edx
 8048085: b9 a4 90 04 08    mov $0x80490a4,%ecx
 804808a: bb 01 00 00 00    mov $0x1,%ebx
 804808f: b8 04 00 00 00    mov $0x4,%eax
 8048094: cd 80             int $0x80
 8048096: bb 00 00 00 00    mov $0x0,%ebx
 804809b: b8 01 00 00 00    mov $0x1,%eax
 80480a0: cd 80             int $0x80

Disassembly of section .data:

080490a4 <string>:
 80490a4: 48                dec %eax
 80490a5: 65                gs
 80490a6: 6c                insb (%dx),%es:(%edi)
 80490a7: 6c                insb (%dx),%es:(%edi)
 80490a8: 6f                outsl %ds:(%esi),(%dx)
 80490a9: 2c 20             sub $0x20,%al
 80490ab: 77 6f             ja 804911c <__bss_start+0x68>
 80490ad: 72 6c             jb 804911b <__bss_start+0x67>
 80490af: 64 20 21          and %ah,%fs:(%ecx)
 80490b2: 21 0a             and %ecx,(%edx)
I also recommend loading the helloWorld program in a hex editor called GHex to see where the code above resides in the file. The following is the output (focus on the bytes after the highlighted section).
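If you compare this image with the objdump listing above, you can find the bytes of each instruction laid out sequentially in the figure, for example the 5 bytes ba 10 00 00 00 of the first mov. The size of each instruction can also be worked out from the addresses on the left side of the listing: the first instruction is at 8048080 and the next one at 8048085, so it is 5 bytes long. If you run the program in gdb, you can also see the addresses and instructions of _start using the following commands.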
$ gdb helloWorld
(gdb) break _start
(gdb) r
(gdb) disass
Dump of assembler code for function _start:
=> 0x08048080 <+0>:  mov $0x10,%edx
   0x08048085 <+5>:  mov $0x80490a4,%ecx
   0x0804808a <+10>: mov $0x1,%ebx
   0x0804808f <+15>: mov $0x4,%eax
   0x08048094 <+20>: int $0x80
   0x08048096 <+22>: mov $0x0,%ebx
   0x0804809b <+27>: mov $0x1,%eax
   0x080480a0 <+32>: int $0x80
End of assembler dump.
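The <+0>, <+5>, <+10> offsets printed by gdb are simply each address minus the address of _start, and the gap between two neighbouring addresses is the size of the instruction. Here is a small sketch that redoes this arithmetic with the addresses copied from the dump above (the variable names are mine):

```python
# Instruction addresses copied from the disassembly of _start.
addresses = [
    0x08048080, 0x08048085, 0x0804808A, 0x0804808F,
    0x08048094, 0x08048096, 0x0804809B, 0x080480A0,
]

start = addresses[0]

# Offset of each instruction from _start, as gdb prints it in <+N>.
offsets = [addr - start for addr in addresses]

# Size of each instruction except the last: distance to the next address.
sizes = [nxt - cur for cur, nxt in zip(addresses, addresses[1:])]

for addr, off in zip(addresses, offsets):
    print(f"{addr:#010x} <+{off}>")
print("sizes:", sizes)
```

Each mov takes 5 bytes (one opcode byte plus a 4-byte value) and int 0x80 takes 2 bytes (cd 80).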
I suggest you keep a calculator handy if you would like to convert hexadecimal into decimal to resolve some values or addresses, and refer to the printable characters section of an ASCII table. Now let's compare the codedump output of objdump with the GHex screenshot above. Take for example the first line inside _start (8048080: ba 10 00 00 00 mov $0x10,%edx). Here 8048080 is the address of the instruction, and ba is the opcode for “mov edx, value” in Intel syntax, or “mov $value, %edx” in AT&T syntax. The four bytes after ba are the 32-bit value 0x10 (16 in decimal, the length of our string), stored in little-endian order, that is, least significant byte first. If, say, you wanted to store 0x12345678 in edx, the instruction bytes would be ba 78 56 34 12. To understand why ba means mov edx, you have to understand the following table.
b8 is the opcode for mov EAX with a 32-bit immediate value. To get the mov opcode for another register, add that register's number from the list below to b8.
EAX = 0
ECX = 1
EDX = 2
EBX = 3
ESP = 4
EBP = 5
ESI = 6
EDI = 7
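Putting the table to work, the sketch below builds the five bytes of a mov with a 32-bit immediate value: one opcode byte (b8 plus the register number) followed by the value in little-endian order. The helper name encode_mov_imm32 is hypothetical, and it handles only this one instruction form.

```python
import struct

# Register numbers from the table above.
REGISTER_NUMBER = {
    "eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
    "esp": 4, "ebp": 5, "esi": 6, "edi": 7,
}


def encode_mov_imm32(register: str, value: int) -> bytes:
    """Encode `mov register, value` for a 32-bit register and immediate."""
    opcode = 0xB8 + REGISTER_NUMBER[register]
    # "<I" packs the value as an unsigned 32-bit little-endian integer.
    return bytes([opcode]) + struct.pack("<I", value)


for reg, val in [("edx", 0x10), ("ecx", 0x80490A4), ("ebx", 1), ("eax", 4)]:
    print(reg, encode_mov_imm32(reg, val).hex(" "))
```

The four results match the first four lines of the objdump listing of _start.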
This explains why bb 01 00 00 00 is mov $0x1,%ebx in the listing above: EBX is 3, and b8 + 3 = bb. From the listing we can also infer that cd is the opcode for the int instruction, with the interrupt number 80 as its operand. If you look at the image carefully you can see the sequence 48 65 6C 6C 6F … These are the ASCII codes of our 'Hello, world !!', 10 string. For example, 48 is the hexadecimal value of uppercase H, 65 is the hexadecimal value of lowercase e, and so on. Because objdump -D disassembles every section, it also tries to decode these data bytes as instructions (dec %eax, gs, insb and so on). Those mnemonics are meaningless and can be ignored. If you look at the string label in the data section, you can see that a label does not occupy any memory itself. It is simply the address of the first byte that follows it. In our case that is 80490a4. Now if you revisit the _start label once more, you will see this address being loaded into the ecx register by the mov $0x80490a4,%ecx instruction.
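You can check the ASCII reading without a table. The sketch below takes the data bytes exactly as they appear in the .data part of the dump and decodes them. The variable names are made up for illustration.

```python
# Bytes of the .data section, copied from the objdump listing.
data_hex = "48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 20 21 21 0a"
data = bytes.fromhex(data_hex)

text = data.decode("ascii")
print(repr(text))

# Address of the label plus the length gives the first address after the string.
string_address = 0x80490A4
end_address = string_address + len(data)
print(hex(len(data)), hex(end_address))
```

The length is 0x10, the same value the program loads into edx.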
Although the assembly syntax of a simple Linux program (GAS syntax) is not very difficult to understand, compiled C programs make heavy use of stack push and pop instructions, both for branching and for storing arguments on the stack before calling functions such as printf from the standard C library. I hope this tutorial has prepared you to investigate this field further. I recommend the materials listed in the reference section.