# twasm

this will be a self-hosted, very minimal subset of nasm-style 64-bit asm

### goals

I want to compile Bootler and Twasm with the Twasm assembler

### reading

- [instructions](https://www.felixcloutier.com/x86/)
- [opcodes, ModR/M, SIB](http://ref.x86asm.net/coder64.html) (no secure site available)
- [calling conventions](https://wiki.osdev.org/Calling_Conventions); I try to use System V

### tokeniser

whitespace is ignored for the sake of readability; it can go between pretty much anything

```
------------------------
       tokeniser
------------------------
byte(s)    -> next byte(s)
------------------------
Newline    -> Newline
           -> Komment
           -> Operator
           -> Directive

Komment    -> Newline

Operator   -> Newline
           -> Komment
           -> Operand

Operand    -> Newline
           -> Komment
           -> Comma

Comma      -> Operand

Directive  -> Newline
           -> Komment
           -> Operator
------------------------
```

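the transition table above could be encoded as one bitmask per token class. The following is only a sketch in C (Twasm itself is asm); the enum and function names are made up for illustration, and starting a file in the `Newline` state is an assumption, not something the table states.

```c
#include <stdbool.h>

/* Token classes from the transition table; names are illustrative only. */
enum tok_class {
    TC_NEWLINE, TC_KOMMENT, TC_OPERATOR, TC_OPERAND, TC_COMMA, TC_DIRECTIVE,
    TC_COUNT
};

#define BIT(c) (1u << (c))

/* allowed_next[c] is the set of classes that may follow class c. */
static const unsigned allowed_next[TC_COUNT] = {
    [TC_NEWLINE]   = BIT(TC_NEWLINE) | BIT(TC_KOMMENT) | BIT(TC_OPERATOR) | BIT(TC_DIRECTIVE),
    [TC_KOMMENT]   = BIT(TC_NEWLINE),
    [TC_OPERATOR]  = BIT(TC_NEWLINE) | BIT(TC_KOMMENT) | BIT(TC_OPERAND),
    [TC_OPERAND]   = BIT(TC_NEWLINE) | BIT(TC_KOMMENT) | BIT(TC_COMMA),
    [TC_COMMA]     = BIT(TC_OPERAND),
    [TC_DIRECTIVE] = BIT(TC_NEWLINE) | BIT(TC_KOMMENT) | BIT(TC_OPERATOR),
};

/* True if a token of class `to` may directly follow one of class `from`. */
bool transition_ok(enum tok_class from, enum tok_class to)
{
    return (allowed_next[from] & BIT(to)) != 0;
}

/* Checks a whole sequence; assumes (hypothetically) an initial Newline state. */
bool sequence_ok(const enum tok_class *seq, int n)
{
    enum tok_class prev = TC_NEWLINE;
    for (int i = 0; i < n; i++) {
        if (!transition_ok(prev, seq[i]))
            return false;
        prev = seq[i];
    }
    return true;
}
```

with this encoding, `xor eax, eax` followed by a newline (Operator, Operand, Comma, Operand, Newline) is accepted, while an Operator followed directly by a Comma is rejected.
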
not yet implemented:

```
------------------------
     operand parser
------------------------
byte(s)    -> next byte(s)
------------------------
START      -> '['
           -> Register
           -> Constant

'['        -> Register
           -> Constant

']'        -> END

Register   -> IF #[, ']'
           -> Operator

Constant   -> IF #[, ']'
           -> Operator

Operator   -> IF NOT #R, Register
           -> Constant
------------------------
#R = whether a register has been found
#[ = whether a '[' has been found
------------------------
```

### memory map

```
+------ 0x00100000 ------+
|  hardware, bios stuff  |
+------ 0x00080000 ------+
|     output binary      |
+------ 0x00070000 ------+
|      token table       |
+------ 0x00060000 ------+
|       test arena       |
+------ 0x00050000 ------+
|      stack (rsp)       |
+------------------------+
|         input          |
+------------------------+
|       assembler        |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+
```

each 16-bit word in the token table represents one token.

#### token table (TT)

each token gets loaded into the token table with the following form:

```
+----------+
| 15     0 |
+----------+
| token id |
+----------+
```

### internal data structures

#### `tokens.[operators|registers]`

contains tokens by their type. Intended to be searched by token name to get the token's ID.

each entry is in the following form:

```
+----------+--------------------------------+
| 47    32 | 31                           0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+
```

example implementation:

```nasm
tokens:
.registers:
	dd "r8"      ; name, zero-padded by nasm to 4 bytes (bits 31..0)
	dw 0x0008    ; token ID (bits 47..32)
.by_name3: ; this is required for futureproofness; the caller can use this to
           ; find the size of registers.by_name2
```

note that tokens longer than 4 bytes are problematic :/

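a lookup by name over entries of this shape might look like the sketch below. It is written in C purely for illustration (Twasm itself is asm); `name_entry`, `registers` and `lookup_token` are hypothetical names, and returning `0xFFFF` for a miss follows the "unrecognised token" ID from the token ID table.

```c
#include <stdint.h>
#include <string.h>

/* One 6-byte entry: bits 31..0 hold the name, bits 47..32 the token ID. */
struct name_entry {
    char     name[4];  /* zero-padded; no null terminator when 4 bytes long */
    uint16_t id;
};

/* A few sample entries; IDs are taken from the token ID table. */
const struct name_entry registers[] = {
    { {'r', 'a', 'x', 0  }, 0x0000 },
    { {'r', '8', 0,   0  }, 0x0008 },
    { {'r', '1', '0', 'd'}, 0x001A },
};

#define TOKEN_UNRECOGNISED 0xFFFF

/* Linear search by name; `text` need not be null-terminated. */
uint16_t lookup_token(const struct name_entry *table, size_t count,
                      const char *text, size_t len)
{
    char key[4] = {0, 0, 0, 0};

    /* Names longer than 4 bytes cannot be stored in this layout. */
    if (len == 0 || len > 4)
        return TOKEN_UNRECOGNISED;

    memcpy(key, text, len);  /* pad the key the same way the table is padded */
    for (size_t i = 0; i < count; i++) {
        if (memcmp(table[i].name, key, 4) == 0)
            return table[i].id;
    }
    return TOKEN_UNRECOGNISED;
}
```

padding the search key to 4 bytes lets the comparison run over a fixed width, which is also why names longer than 4 bytes do not fit this scheme.
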
#### `tokens.by_id`

contains some tokens with their metadata. Some tokens have embedded information (`0x10XX` for instance). Those will not have entries in this table, being handled instead inside the assemble function itself.

each entry holds metadata about one token in the following form:

```
+----------------+----------+-------+----------+
| 31          24 | 23    20 | 19 16 | 15     0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type  | token ID |
+----------------+----------+-------+----------+
```

the `type` hex digit is defined as follows:

| hex | meaning  | examples |
|-----|----------|-|
| 0x0 | ignored  | `; this entire comment is 1 token` |
| 0x1 | operator | `mov`, `hlt` |
| 0x2 | register | `rsp`, `al` |
| 0xF | unknown  | any token ID not represented in the lookup table |

typed metadata for the different types is as follows:

```
+----------+
| type 0x0 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
```

```
+-------------------------------+
|           type 0x1            |
+----------+--------------------+
| 31    26 | 25              24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+
```

```
+------------------------------+
|           type 0x2           |
+----------+-----------+-------+
| 31    29 | 28     26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+

; reg is the value that corresponds to the register in the ModR/M byte

; width:
00b ; 8 bit
01b ; 16 bit
10b ; 32 bit
11b ; 64 bit
```

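the bit fields above can be unpacked with shifts and masks. The helpers below are a sketch in C for illustration only (Twasm itself is asm); every function name is hypothetical. The `rdx` values used in the usage note (token ID `0x0003`, reg value `010b`) come from the token ID table and the example program's ModR/M breakdown.

```c
#include <stdint.h>

/* Values of the `type` hex digit (entry bits 19..16). */
enum token_type {
    TYPE_IGNORED  = 0x0,
    TYPE_OPERATOR = 0x1,
    TYPE_REGISTER = 0x2,
    TYPE_UNKNOWN  = 0xF
};

/* Common fields of every entry. */
uint16_t entry_token_id(uint32_t e) { return (uint16_t)(e & 0xFFFF); }
unsigned entry_type(uint32_t e)     { return (e >> 16) & 0xF; }
unsigned entry_metadata(uint32_t e) { return (e >> 24) & 0xFF; }

/* type 0x1: entry bits 25..24 hold the number of operands. */
unsigned operator_operand_count(uint32_t e) { return entry_metadata(e) & 0x3; }

/* type 0x2: entry bits 25..24 hold the width code (00b=8 ... 11b=64 bit). */
unsigned register_width_bits(uint32_t e) { return 8u << (entry_metadata(e) & 0x3); }

/* type 0x2: entry bits 28..26 hold the ModR/M reg value. */
unsigned register_reg_value(uint32_t e) { return (entry_metadata(e) >> 2) & 0x7; }

/* Builds a register entry; reserved bits are left as zero. */
uint32_t make_register_entry(uint16_t id, unsigned reg, unsigned width_code)
{
    return ((uint32_t)(reg & 0x7) << 26)
         | ((uint32_t)(width_code & 0x3) << 24)
         | ((uint32_t)TYPE_REGISTER << 16)
         | id;
}
```

for example, `make_register_entry(0x0003, 2, 3)` describes `rdx` as a 64-bit register with reg value `010b` and packs to `0x0B020003`.
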
#### `opcodes.by_id`

entries are as follows:

```
+----------+--------+----------+
| 31    24 | 23  16 | 15     0 |
+----------+--------+----------+
| reserved | opcode | token ID |
+----------+--------+----------+
```

note the lack of support for multiple-byte opcodes or multiple opcodes for one token ID; these features will likely be added at some point after the parser accumulates too much jank.

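a search over such entries could look like the sketch below, written in C for illustration only (Twasm itself is asm). `OPCODE_ENTRY`, `opcodes_by_id` as a C array, and `opcode_for` are hypothetical names. The sample rows use operand-less instructions whose single-byte opcodes are fixed (`hlt` = `0xF4`, `int3` = `0xCC`, `ret` = `0xC3`), since this layout only holds one opcode per token ID.

```c
#include <stdint.h>

/* Packs one entry: bits 23..16 = opcode, bits 15..0 = token ID. */
#define OPCODE_ENTRY(opcode, id) (((uint32_t)(opcode) << 16) | (uint32_t)(id))

/* Sample table; token IDs are taken from the token ID table. */
const uint32_t opcodes_by_id[] = {
    OPCODE_ENTRY(0xF4, 0x004F), /* hlt  */
    OPCODE_ENTRY(0xCC, 0x0050), /* int3 */
    OPCODE_ENTRY(0xC3, 0x005A), /* ret  */
};

/* Returns the opcode byte for token_id, or -1 if the table has no entry. */
int opcode_for(const uint32_t *table, int count, uint16_t token_id)
{
    for (int i = 0; i < count; i++) {
        if ((table[i] & 0xFFFF) == token_id)
            return (int)((table[i] >> 16) & 0xFF);
    }
    return -1;
}
```
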
### token IDs

supported tokens are listed below

| token | id     | notes |
|-------|--------|-|
| rax   | 0x0000 | |
| rbx   | 0x0001 | |
| rcx   | 0x0002 | |
| rdx   | 0x0003 | |
| rsi   | 0x0004 | |
| rdi   | 0x0005 | |
| rsp   | 0x0006 | |
| rbp   | 0x0007 | |
| r8    | 0x0008 | |
| r9    | 0x0009 | |
| r10   | 0x000A | |
| r11   | 0x000B | |
| r12   | 0x000C | |
| r13   | 0x000D | |
| r14   | 0x000E | |
| r15   | 0x000F | |
| eax   | 0x0010 | |
| ebx   | 0x0011 | |
| ecx   | 0x0012 | |
| edx   | 0x0013 | |
| esi   | 0x0014 | |
| edi   | 0x0015 | |
| esp   | 0x0016 | |
| ebp   | 0x0017 | |
| r8d   | 0x0018 | |
| r9d   | 0x0019 | |
| r10d  | 0x001A | |
| r11d  | 0x001B | |
| r12d  | 0x001C | |
| r13d  | 0x001D | |
| r14d  | 0x001E | |
| r15d  | 0x001F | |
| ax    | 0x0020 | |
| bx    | 0x0021 | |
| cx    | 0x0022 | |
| dx    | 0x0023 | |
| si    | 0x0024 | |
| di    | 0x0025 | |
| sp    | 0x0026 | |
| bp    | 0x0027 | |
| r8w   | 0x0028 | |
| r9w   | 0x0029 | |
| r10w  | 0x002A | |
| r11w  | 0x002B | |
| r12w  | 0x002C | |
| r13w  | 0x002D | |
| r14w  | 0x002E | |
| r15w  | 0x002F | |
| al    | 0x0030 | |
| bl    | 0x0031 | |
| cl    | 0x0032 | |
| dl    | 0x0033 | |
| sil   | 0x0034 | |
| dil   | 0x0035 | |
| spl   | 0x0036 | |
| bpl   | 0x0037 | |
| r8b   | 0x0038 | |
| r9b   | 0x0039 | |
| r10b  | 0x003A | |
| r11b  | 0x003B | |
| r12b  | 0x003C | |
| r13b  | 0x003D | |
| r14b  | 0x003E | |
| r15b  | 0x003F | |
| ah    | 0x0040 | |
| bh    | 0x0041 | |
| ch    | 0x0042 | |
| dh    | 0x0043 | |
| cs    | 0x0044 | |
| ds    | 0x0045 | |
| es    | 0x0046 | |
| fs    | 0x0047 | |
| gs    | 0x0048 | |
| ss    | 0x0049 | |
| cr0   | 0x004A | |
| cr2   | 0x004B | |
| cr3   | 0x004C | |
| cr4   | 0x004D | |
| cr8   | 0x004E | |
| hlt   | 0x004F | |
| int3  | 0x0050 | |
|       | 0x0051 | deprecated; formerly `[`. Now `0x10XX` is used. |
|       | 0x0052 | deprecated; formerly `]`. |
| xor   | 0x0053 | |
| inc   | 0x0054 | |
| dec   | 0x0055 | |
| mov   | 0x0056 | |
| add   | 0x0057 | |
| sub   | 0x0058 | |
| call  | 0x0059 | |
| ret   | 0x005A | |
| cmp   | 0x005B | |
| je    | 0x005C | |
| jne   | 0x005D | |
| jge   | 0x005E | |
| jg    | 0x005F | |
| jle   | 0x0060 | |
| jl    | 0x0061 | |
|       | 0x10XX | some memory address; `XX` is as specified below |
|       | 0xFFFF | unrecognised token |

values of `XX` in `0x10XX`:

| XX   | description |
|------|-------------|
| 0x00 | following byte is the token ID of some register |

### example program

#### program in assembly

this program doesn't do anything useful; it's just a test

```nasm
xor eax, eax
inc rax
mov [ rax ], rdx
hlt
```

#### tokenization

```nasm
0x0053 ; xor
0xFE20 ; space
0x0010 ; eax
0xFE2C ; comma
0xFE20 ; space
0x0010 ; eax
0xFE0A ; newline
0x0054 ; inc
0xFE20 ; space
0x0000 ; rax
0xFE0A ; newline
0x0056 ; mov
0xFE20 ; space
0x1004 ; open bracket (4)
0xFE20 ; space         |1
0x0000 ; rax           |2
0xFE20 ; space         |3
0x0052 ; close bracket |4
0xFE2C ; comma
0xFE20 ; space
0x0003 ; rdx
0xFE0A ; newline
0x004F ; hlt
0xFE0A ; newline
0xFE00 ; null terminator
```

#### nasm output with the above example program, bits 64

```nasm
0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
     ;   mod 11b  ; directly address the following:
     ;   reg 000b ; EAX
     ;   r/m 000b ; EAX

0x48 ; REX.W prefix (64 bit operand size)
0xFF ; with `reg` from ModR/M byte 000b:
     ;   INC r/m16/32/64
0xC0 ; ModR/M byte
     ;   mod 11b  ; direct addressing
     ;   reg 000b ; opcode extension (/0 selects INC)
     ;   r/m 000b ; RAX

0x48 ; REX.W prefix (64 bit operand size)
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
     ;   mod 00b  ; indirect addressing, no displacement
     ;   reg 010b ; RDX
     ;   r/m 000b ; [RAX]

0xF4 ; HLT
```

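the ModR/M bytes above follow from packing `mod`, `reg` and `r/m` into one byte. The sketch below reproduces that arithmetic in C for illustration only (Twasm itself is asm); `modrm` and `encode_mov_store64` are hypothetical helpers, and the register numbers are hardware encodings (RAX = 0, RDX = 2), not Twasm token IDs.

```c
#include <stdint.h>

#define REX_W      0x48  /* REX prefix with W set: 64-bit operand size */
#define OP_MOV_RM_R 0x89 /* MOV r/m16/32/64, r16/32/64 */

/* Packs the three ModR/M fields: mod in bits 7..6, reg in 5..3, r/m in 2..0. */
uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    return (uint8_t)(((mod & 0x3) << 6) | ((reg & 0x7) << 3) | (rm & 0x7));
}

/*
 * Encodes `mov [base], src` for 64-bit registers 0..7 with no displacement.
 * Returns the number of bytes written, or 0 for cases this sketch does not
 * handle: with mod 00b, r/m 100b needs a SIB byte and r/m 101b means
 * disp32/RIP-relative, and registers 8..15 need extra REX bits.
 */
int encode_mov_store64(uint8_t *out, unsigned base, unsigned src)
{
    if (base > 7 || src > 7 || base == 4 || base == 5)
        return 0;
    out[0] = REX_W;
    out[1] = OP_MOV_RM_R;
    out[2] = modrm(0, src, base);
    return 3;
}
```

`modrm(3, 0, 0)` gives `0xC0` as in the `xor` and `inc` lines, and `encode_mov_store64(buf, 0, 2)` writes `0x48 0x89 0x10`, matching the `mov [ rax ], rdx` bytes listed above.
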