# twasm

this will be a self-hosted assembler for a very minimal subset of NASM-style 64-bit assembly

### goals

I want to compile Bootler and Twasm with the Twasm assembler

### reading

- [instructions](https://www.felixcloutier.com/x86/)
- [opcodes,ModR/M,SIB](http://ref.x86asm.net/coder64.html) (no secure site available)
- [calling conventions](https://wiki.osdev.org/Calling_Conventions); I try to use System V

### tokeniser

whitespace is ignored for the sake of readability; it can go between pretty much anything

```
------------------------
        tokeniser
------------------------
 byte(s) -> next byte(s)
------------------------
Newline   -> Label
          -> Newline
          -> Komment
          -> Operator
          -> Directive

Label     -> Newline

Komment   -> Newline

Operator  -> Newline
          -> Komment
          -> Operand

Operand   -> Newline
          -> Komment
          -> Comma

Comma     -> Operand

Directive -> Newline
          -> Komment
          -> Operator
------------------------
```

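
the transition diagram above can be read as a lookup table. The following is only a sketch in Python (not the assembler's own code): the names `TRANSITIONS` and `first_bad_transition` are made up for illustration, and the token kinds are plain strings rather than real token IDs.

```python
# Allowed "kind -> next kind" transitions, copied from the tokeniser diagram.
TRANSITIONS = {
    "Newline":   {"Label", "Newline", "Komment", "Operator", "Directive"},
    "Label":     {"Newline"},
    "Komment":   {"Newline"},
    "Operator":  {"Newline", "Komment", "Operand"},
    "Operand":   {"Newline", "Komment", "Comma"},
    "Comma":     {"Operand"},
    "Directive": {"Newline", "Komment", "Operator"},
}


def first_bad_transition(kinds):
    """Return the index of the first kind that may not follow its
    predecessor, or None if every transition is allowed."""
    for i in range(1, len(kinds)):
        if kinds[i] not in TRANSITIONS[kinds[i - 1]]:
            return i
    return None


# `mov rdx, [rax]` followed by a newline is a legal sequence
line = ["Newline", "Operator", "Operand", "Comma", "Operand", "Newline"]
print(first_bad_transition(line))
```

a sequence such as `["Newline", "Operator", "Comma"]` is rejected at index 2, because a comma may only follow an operand.
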
### memory map

```
+------ 0x00100000 ------+
|  hardware, bios stuff  |
+------ 0x00080000 ------+
|     output binary      |
+------ 0x00070000 ------+
|      token table       |
+------ 0x00060000 ------+
|       test arena       |
+------ 0x00050000 ------+
|      label table       |
+------ 0x00040000 ------+
|  awaiting label table  |
+------ 0x00030000 ------+
|      stack (rsp)       |
+------------------------+
|         input          |
+------------------------+
|       assembler        |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+
```

each word represents a token on the token table.

#### token table (TT)

each token gets loaded into the token table with the following form:

```
  2 bytes
+----------+
| 15     0 |
+----------+
| token id |
+----------+
```

#### label table (LT)

label definitions are stored and recalled from this table. The memory addresses are relative to the start of the program.

```
       16 bytes
+----------+---------+
| 127   96 | 95   64 |
+----------+---------+
| reserved | address |
+----------+---------+
| 63               0 |
+--------------------+
|        hash        |
+--------------------+
```

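
to make the bit ranges concrete, here is a small Python sketch that packs and unpacks one label table entry. It assumes the 16 bytes are stored little-endian (so the hash comes first in memory); that byte order, and the helper names `pack_label` and `unpack_label`, are assumptions for illustration only.

```python
import struct

# hash (bits 63..0), address (bits 95..64), reserved (bits 127..96)
LT_ENTRY = struct.Struct("<QII")


def pack_label(label_hash, address):
    """Build one 16-byte label table entry; reserved bits stay zero."""
    return LT_ENTRY.pack(label_hash, address, 0)


def unpack_label(entry):
    """Split a 16-byte entry back into (hash, address)."""
    label_hash, address, _reserved = LT_ENTRY.unpack(entry)
    return label_hash, address


entry = pack_label(0x1122334455667788, 0x40)
print(entry.hex(" "))
print(unpack_label(entry))
```
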
#### awaiting label table (ALT)

forward references are stored in this table to be filled in after assembly is otherwise complete. The memory addresses are relative to the start of the program.

```
                      16 bytes
+----------+----------+------------------+---------+
| 127  101 | 100      | 99            96 | 95   64 |
+----------+----------+------------------+---------+
| reserved | abs flag | # bytes reserved | address |
+----------+----------+------------------+---------+
| 63                                             0 |
+--------------------------------------------------+
|                       hash                       |
+--------------------------------------------------+
```

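
the awaiting label table entry differs from a label table entry only in the top 32 bits. A Python sketch of the packing follows, again assuming little-endian storage; `pack_awaiting` and `unpack_awaiting` are hypothetical names, not functions from the assembler.

```python
import struct

# hash (bits 63..0), address (bits 95..64), flags dword (bits 127..96)
ALT_ENTRY = struct.Struct("<QII")


def pack_awaiting(label_hash, address, nbytes, absolute):
    """Build one 16-byte entry for a forward reference.

    nbytes   -> bits 99..96, number of bytes reserved for the address
    absolute -> bit 100, the abs flag
    """
    if not 0 <= nbytes <= 0xF:
        raise ValueError("# bytes reserved must fit in 4 bits")
    high = (int(bool(absolute)) << 4) | nbytes
    return ALT_ENTRY.pack(label_hash, address, high)


def unpack_awaiting(entry):
    """Return (hash, address, nbytes, absolute) from a 16-byte entry."""
    label_hash, address, high = ALT_ENTRY.unpack(entry)
    return label_hash, address, high & 0xF, bool((high >> 4) & 1)


# a 4-byte relative reference located at offset 0x10 of the program
print(unpack_awaiting(pack_awaiting(0xABCD, 0x10, 4, False)))
```
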
### internal data structures

#### `tokens.[operators|registers]`

contains tokens by their type. Intended to be searched by token name to get the token's ID.

each entry is in the following form:

```
                   6 bytes
+----------+--------------------------------+
| 47    32 | 31                           0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+
```

note that the string field is only 4 bytes wide, so tokens longer than 4 bytes are problematic :/

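
a name lookup over these 6-byte entries could look like the Python sketch below. Several things here are assumptions and not taken from the assembler: little-endian storage (string bytes first, then the ID), zero padding for names shorter than 4 bytes, and the helper names `make_entry` and `lookup`.

```python
import struct

# 4 string bytes (bits 31..0) followed by the token ID (bits 47..32)
ENTRY = struct.Struct("<4sH")


def make_entry(name, token_id):
    """Pack one entry; names shorter than 4 bytes are zero padded."""
    raw = name.encode("ascii")
    if len(raw) > 4:
        raise ValueError("token name does not fit in 4 bytes")
    return ENTRY.pack(raw, token_id)


def lookup(table, name):
    """Linear search by name; 0xFFFF means unrecognised token."""
    key = name.encode("ascii").ljust(4, b"\0")
    for offset in range(0, len(table), ENTRY.size):
        text, token_id = ENTRY.unpack_from(table, offset)
        if text == key:
            return token_id
    return 0xFFFF


table = make_entry("mov", 0x0056) + make_entry("hlt", 0x004F) + make_entry("int3", 0x0050)
print(hex(lookup(table, "hlt")))
```

the 4-byte limit shows up directly: `make_entry` has to refuse any name longer than 4 characters.
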
#### `tokens.by_id`

contains some tokens with their metadata. Some tokens have embedded information (`0x10XX` for instance). Those do not have entries in this table, being handled instead inside the assemble function itself.

each entry holds a token's metadata in the following form:

```
                    4 bytes
+----------------+----------+-------+----------+
| 31          24 | 23    20 | 19 16 | 15     0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type  | token ID |
+----------------+----------+-------+----------+
```

the `type` hex digit is defined as follows:

| hex | meaning         | examples |
|-----|-----------------|----------|
| 0x0 | ignored         |          |
| 0x1 | operator        | `mov`, `hlt` |
| 0x2 | register        | `rsp`, `al` |
| 0x3 | pseudo-operator | `db` |
| 0xF | unknown         | any token ID not represented in the lookup table |

typed metadata for the different types is as follows:

```
   1 byte
+----------+
| type 0x0 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
```

```
             1 byte
+-------------------------------+
|           type 0x1            |
+----------+--------------------+
| 31    26 | 25              24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+
```

```
             1 byte
+------------------------------+
|           type 0x2           |
+----------+-----------+-------+
| 31    29 | 28     26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+

; reg is the value that corresponds to the register in the ModR/M byte

; width:
00b ; 8 bit
01b ; 16 bit
10b ; 32 bit
11b ; 64 bit
```

```
   1 byte
+----------+
| type 0x3 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
```

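
the layouts above can be decoded with a few shifts and masks. This Python sketch treats one `tokens.by_id` entry as a 32-bit integer; `decode_by_id` is a hypothetical helper, and the two sample entries are built by hand from the layouts rather than copied from the real table.

```python
WIDTH_BITS = {0b00: 8, 0b01: 16, 0b10: 32, 0b11: 64}


def decode_by_id(entry):
    """Decode one 32-bit tokens.by_id entry into a dict."""
    info = {
        "token_id": entry & 0xFFFF,   # bits 15..0
        "type": (entry >> 16) & 0xF,  # bits 19..16
    }
    meta = (entry >> 24) & 0xFF       # bits 31..24, meaning depends on type
    if info["type"] == 0x1:           # operator
        info["operands"] = meta & 0b11
    elif info["type"] == 0x2:         # register
        info["width"] = WIDTH_BITS[meta & 0b11]
        info["reg"] = (meta >> 2) & 0b111
    return info


# hand-built samples: rdx (64 bit, reg value 010b) and mov (2 operands)
print(decode_by_id(0x0B020003))
print(decode_by_id(0x02010056))
```
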
#### `opcodes.by_id`

entries are as follows:

```
            16 bytes
+------------------------------+
|      0 operand operators     |
+---------+--------------------+
| 127 120 | 119             96 |
+---------+--------------------+
|  flags  |      reserved      |
+---------+--------------------+
| 95                        64 |
+------------------------------+
|           reserved           |
+------------------------------+
| 63                        32 |
+------------------------------+
|           reserved           |
+----------+--------+----------+
| 31    24 | 23  16 | 15     0 |
+----------+--------+----------+
| reserved | opcode | token ID |
+----------+--------+----------+

                  16 bytes
+------------------------------------------+
|           1 operand operators            |
+----------+----------+----------+---------+
| 127  120 | 119  112 | 111  104 | 103  96 |
+----------+----------+----------+---------+
|  flags   | reserved |  flags5  | flags4  |
+----------+----------+----------+---------+
| 95    88 | 87    80 | 79    72 | 71   64 |
+----------+----------+----------+---------+
|  flags3  |  flags2  | reserved | flags0  |
+----------+----------+----------+---------+
| 63    56 | 55    48 | 47    40 | 39   32 |
+----------+----------+----------+---------+
|  opcode  |  opcode  |  opcode  | opcode  |
| dst=rel8 | dst=rel  | dst=imm8 | dst=imm |
+----------+----------+----------+---------+
| 31    24 | 23    16 | 15                0 |
+----------+----------+--------------------+
| reserved |  opcode  |      token ID      |
|          | dst=r/m  |                    |
+----------+----------+--------------------+

                    16 bytes
+-----------------------------------------------+
|              2 operand operators              |
+---------+-------------------------------------+
| 127 120 | 119                              96 |
+---------+-------------------------------------+
|  flags  |              reserved               |
+---------+----------+--------------------------+
| 95   88 | 87    80 | 79                    64 |
+---------+----------+--------------------------+
| flags3  |  flags2  |         reserved         |
+---------+----------+-------+-------+----------+
| 63              48 | 47         40 | 39    32 |
+--------------------+---------------+----------+
|      reserved      |    opcode     |  opcode  |
|                    |    dst=r/m    | dst=r/m  |
|                    |   src=imm8    | src=imm  |
+---------+----------+---------------+----------+
| 31   24 | 23    16 | 15                     0 |
+---------+----------+--------------------------+
| opcode  |  opcode  |         token ID         |
|  dst=r  | dst=r/m  |                          |
| src=r/m |  src=r   |                          |
+---------+----------+--------------------------+

      1 byte
+-----------------+
|   flags byte    |
+----------+------+
| 7      1 | 0    |
+----------+------+
| reserved | 8bit |
+----------+------+

                        1 byte
+----------------------------------------------------+
|                    flagsX byte                     |
+----------+-----------+-------------+---------------+
| 7      5 | 4         | 3           | 2           0 |
+----------+-----------+-------------+---------------+
| reserved | no ModR/M | 0x0F prefix | operator flag |
+----------+-----------+-------------+---------------+

; flags key:
8bit          ; tte has opcodes for r/m8 and r8 instead of r/m and r respectively

; flagsX key:
no ModR/M     ; there is no ModR/M byte for this opcode
0x0F prefix   ; there is a 0x0F prefix for this opcode
operator flag ; contents of `reg` if applicable

; key:
r/m  ; r/m 16/32/64
r/m8 ; r/m 8
r    ; r 16/32/64
r8   ; r 8
imm  ; imm 16/32
imm8 ; imm 8
rel  ; rel 16/32
rel8 ; rel 8
```

note that there is much room to expand. If an opcode doesn't exist, it should be 0x00

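
as an illustration of how a `flagsX` byte could steer encoding, here is a Python sketch. `decode_flagsx` and `emit` are hypothetical helpers, `emit` only handles a register-direct operand without a REX prefix, and the two `flagsX` values in the usage lines are my guess at what entries for `push r/m` and `je rel` might contain, not values read from the real table.

```python
def decode_flagsx(byte):
    """Split a flagsX byte into its three fields."""
    return {
        "no_modrm": bool((byte >> 4) & 1),   # bit 4
        "prefix_0f": bool((byte >> 3) & 1),  # bit 3
        "operator_flag": byte & 0b111,       # bits 2..0, goes into ModR/M reg
    }


def emit(opcode, flagsx, rm=0):
    """Emit prefix, opcode and (if needed) a register-direct ModR/M byte."""
    fields = decode_flagsx(flagsx)
    out = bytearray()
    if fields["prefix_0f"]:
        out.append(0x0F)
    out.append(opcode)
    if not fields["no_modrm"]:
        out.append((0b11 << 6) | (fields["operator_flag"] << 3) | rm)
    return bytes(out)


print(emit(0xFF, 0b00000110, rm=0b010).hex(" "))  # push with operator flag 6
print(emit(0x84, 0b00011000).hex(" "))            # je rel: 0x0F prefix, no ModR/M
```
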
### token IDs

supported tokens are listed below

| token | id     | notes |
|-------|--------|-------|
| rax   | 0x0000 | register |
| rbx   | 0x0001 | register |
| rcx   | 0x0002 | register |
| rdx   | 0x0003 | register |
| rsi   | 0x0004 | register |
| rdi   | 0x0005 | register |
| rsp   | 0x0006 | register |
| rbp   | 0x0007 | register |
| r8    | 0x0008 | unimplemented |
| r9    | 0x0009 | unimplemented |
| r10   | 0x000A | unimplemented |
| r11   | 0x000B | unimplemented |
| r12   | 0x000C | unimplemented |
| r13   | 0x000D | unimplemented |
| r14   | 0x000E | unimplemented |
| r15   | 0x000F | unimplemented |
| eax   | 0x0010 | register |
| ebx   | 0x0011 | register |
| ecx   | 0x0012 | register |
| edx   | 0x0013 | register |
| esi   | 0x0014 | register |
| edi   | 0x0015 | register |
| esp   | 0x0016 | register |
| ebp   | 0x0017 | register |
| r8d   | 0x0018 | unimplemented |
| r9d   | 0x0019 | unimplemented |
| r10d  | 0x001A | unimplemented |
| r11d  | 0x001B | unimplemented |
| r12d  | 0x001C | unimplemented |
| r13d  | 0x001D | unimplemented |
| r14d  | 0x001E | unimplemented |
| r15d  | 0x001F | unimplemented |
| ax    | 0x0020 | register |
| bx    | 0x0021 | register |
| cx    | 0x0022 | register |
| dx    | 0x0023 | register |
| si    | 0x0024 | register |
| di    | 0x0025 | register |
| sp    | 0x0026 | register |
| bp    | 0x0027 | register |
| r8w   | 0x0028 | unimplemented |
| r9w   | 0x0029 | unimplemented |
| r10w  | 0x002A | unimplemented |
| r11w  | 0x002B | unimplemented |
| r12w  | 0x002C | unimplemented |
| r13w  | 0x002D | unimplemented |
| r14w  | 0x002E | unimplemented |
| r15w  | 0x002F | unimplemented |
| al    | 0x0030 | register |
| bl    | 0x0031 | register |
| cl    | 0x0032 | register |
| dl    | 0x0033 | register |
| sil   | 0x0034 | register |
| dil   | 0x0035 | register |
| spl   | 0x0036 | register |
| bpl   | 0x0037 | register |
| r8b   | 0x0038 | unimplemented |
| r9b   | 0x0039 | unimplemented |
| r10b  | 0x003A | unimplemented |
| r11b  | 0x003B | unimplemented |
| r12b  | 0x003C | unimplemented |
| r13b  | 0x003D | unimplemented |
| r14b  | 0x003E | unimplemented |
| r15b  | 0x003F | unimplemented |
| ah    | 0x0040 | unimplemented |
| bh    | 0x0041 | unimplemented |
| ch    | 0x0042 | unimplemented |
| dh    | 0x0043 | unimplemented |
| cs    | 0x0044 | unimplemented |
| ds    | 0x0045 | unimplemented |
| es    | 0x0046 | unimplemented |
| fs    | 0x0047 | unimplemented |
| gs    | 0x0048 | unimplemented |
| ss    | 0x0049 | unimplemented |
| cr0   | 0x004A | unimplemented |
| cr2   | 0x004B | unimplemented |
| cr3   | 0x004C | unimplemented |
| cr4   | 0x004D | unimplemented |
| cr8   | 0x004E | unimplemented |
| hlt   | 0x004F | operator |
| int3  | 0x0050 | operator |
|       | 0x0051 | deprecated; formerly `[`. Now `0x10XX` is used. |
|       | 0x0052 | deprecated; formerly `]`. |
| xor   | 0x0053 | operator |
| inc   | 0x0054 | operator |
| dec   | 0x0055 | operator |
| mov   | 0x0056 | operator |
| add   | 0x0057 | operator |
| sub   | 0x0058 | operator |
| call  | 0x0059 | operator |
| ret   | 0x005A | operator |
| cmp   | 0x005B | operator |
| jmp   | 0x005C | operator |
| je    | 0x005D | operator |
| jne   | 0x005E | operator |
| push  | 0x005F | operator |
| pop   | 0x0060 | operator |
| out   | 0x0061 | operator |
| db    | 0x0100 | pseudo-operator |
|       | 0x10XX | some memory address; `XX` is as specified below |
|       | 0x20XX | some constant; `XX` is as specified below |
|       | 0x3XXX | some label; `XXX` is its entry index in the label table |
|       | 0xFEXX | used to pass some raw value `XX` in place of a token id to a couple of functions that mention this as a feature. If the function doesn't mention it, it will lead to undefined behaviour |
|       | 0xFFFF | unrecognised token |

values of `XX` in `0x10XX`:

| XX   | description |
|------|-------------|
| 0x00 | following word is the token ID of some register |

values of `XX` in `0x20XX`:

| XX   | description |
|------|-------------|
| 0x00 | following 8 bytes are the constant's value |

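
because `0x1000` and `0x2000` pull extra data along with them, a reader of the token table has to skip that payload. The Python sketch below walks a packed token table under the assumption that words and constants are stored little-endian; `walk_tokens` is a hypothetical name, and only the `XX = 0x00` forms listed above are handled.

```python
import struct


def walk_tokens(table):
    """Yield (token_id, payload) pairs from a packed token table."""
    offset = 0
    while offset < len(table):
        (token,) = struct.unpack_from("<H", table, offset)
        offset += 2
        payload = None
        if token == 0x1000:    # memory address: next word is a register token ID
            (payload,) = struct.unpack_from("<H", table, offset)
            offset += 2
        elif token == 0x2000:  # constant: next 8 bytes are its value
            (payload,) = struct.unpack_from("<Q", table, offset)
            offset += 8
        yield token, payload


# tokens for `mov rdx, [rax]` followed by `add rax, 42`
table = struct.pack("<4H", 0x0056, 0x0003, 0x1000, 0x0000)
table += struct.pack("<3HQ", 0x0057, 0x0000, 0x2000, 42)
for token, payload in walk_tokens(table):
    print(hex(token), payload)
```
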
### example program

#### program in assembly

this program doesn't do anything useful, it's just a test

```nasm
xor eax, eax
inc rax ; inline comment
; one line comment
mov rdx, [rax]
mov [rax], rdx
hlt
```

#### tokenization

```nasm
0x0053 ; xor
0x0010 ; eax
0x0010 ; eax
0x0054 ; inc
0x0000 ; rax
0x0056 ; mov
0x0003 ; rdx
0x1000 ; memory address: register
0x0000 ; rax
0x0056 ; mov
0x1000 ; memory address: register
0x0000 ; rax
0x0003 ; rdx
0x004F ; hlt
```

#### nasm output with the above example program, bits 64

```nasm
0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; directly address the following:
     ; reg 000b ; EAX
     ; r/m 000b ; EAX

0x48 ; REX.W prefix (64 bit operand size)
0xFF ; with `reg` from ModR/M byte 000b:
     ; INC r/m16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; direct addressing
     ; reg 000b ; RAX
     ; r/m 000b ; RAX

0x48 ; REX.W prefix (64 bit operand size)
0x8B ; MOV r16/32/64 r/m16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0x48 ; REX.W prefix (64 bit operand size)
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0xF4 ; HLT
```

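
the ModR/M bytes in the listing above can be checked by hand: the byte is `mod` in bits 7..6, `reg` in bits 5..3 and `r/m` in bits 2..0. This Python sketch rebuilds the whole example from those fields; `modrm` is a helper written for this illustration, not a function of the assembler.

```python
def modrm(mod, reg, rm):
    """Pack the three ModR/M fields into one byte."""
    for name, value, limit in (("mod", mod, 0b11), ("reg", reg, 0b111), ("rm", rm, 0b111)):
        if not 0 <= value <= limit:
            raise ValueError(f"{name} out of range: {value}")
    return (mod << 6) | (reg << 3) | rm


RAX, RDX = 0b000, 0b010  # register numbers as used in the ModR/M byte
REX_W = 0x48

program = bytes([
    0x31, modrm(0b11, RAX, RAX),         # xor eax, eax
    REX_W, 0xFF, modrm(0b11, 0, RAX),    # inc rax (reg field 000b selects INC)
    REX_W, 0x8B, modrm(0b00, RDX, RAX),  # mov rdx, [rax]
    REX_W, 0x89, modrm(0b00, RDX, RAX),  # mov [rax], rdx
    0xF4,                                # hlt
])
print(program.hex(" "))
```
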
#### program output with the function `print`, each comma-separated `db` value put onto its own line

edited output of `x/512xb 0x00070000` in [gdb](https://www.sourceware.org/gdb/)

the following is somewhat correct! I just need to a) null-terminate 8-byte chars and b) define all the addresses currently represented as `0xff 0xff 0xff 0xff`

```
0x48 0xff 0xf2
0x48 0xff 0xf0
0x48 0xff 0xf6
0xc7 0xc2 0xf8 0x03 0x00 0x00
0x8a 0x06
0x80 0xf8 0x00
0x0f 0x84 0xff 0xff 0xff 0xff
0x66 0xee
0x48 0xff 0xc6
0xe9 0xff 0xff 0xff 0xff
0x48 0x8f 0xc6
0x48 0x8f 0xc0
0x48 0x8f 0xc2
0xc3
0x48 0xff 0xf6
0xc7 0xc6 0xff 0xff 0xff 0xff
0xe8 0xff 0xff 0xff 0xff
0x48 0x8f 0xc6
0xe9 0xff 0xff 0xff 0xff
0x48 0xff 0xf6
0xc7 0xc6 0xff 0xff 0xff 0xff
0xe8 0xff 0xff 0xff 0xff
0x48 0x8f 0xc6
0xe9 0xff 0xff 0xff 0xff
0x48 0xff 0xf6
0xc7 0xc6 0xff 0xff 0xff 0xff
0xe8 0xff 0xff 0xff 0xff
0x48 0x8f 0xc6
0xe9 0xff 0xff 0xff 0xff
0x48 0xff 0xf6
0xc7 0xc6 0xff 0xff 0xff 0xff
0xe8 0xff 0xff 0xff 0xff
0x48 0x8f 0xc6
0xe9 0xff 0xff 0xff 0xff
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x33 0x36 0x6d 0x00 0x00 0x00 0x00
0x7b 0x44 0x45 0x42 0x55 0x47 0x5d 0x3a
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x30 0x6d 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x31 0x3b 0x33 0x31 0x6d 0x00 0x00
0x7b 0x45 0x52 0x52 0x4f 0x52 0x5d 0x3a
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x30 0x6d 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x31 0x3b 0x33 0x33 0x6d 0x00 0x00
0x5b 0x54 0x45 0x53 0x54 0x5d 0x3a 0x20
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x30 0x6d 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x31 0x3b 0x33 0x35 0x6d 0x00 0x00
0x5b 0x57 0x41 0x52 0x4e 0x5d 0x3a 0x20
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x30 0x6d 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
```