Files
bootler/twasm/README.md
2026-03-21 21:42:50 +01:00

405 lines
8.4 KiB
Markdown

# twasm
this will be a self hosted, very minimal subset of nasm-style 64 bit asm
### goals
I want to compile Bootler and Twasm with the Twasm assembler
### reading
- [instructions](https://www.felixcloutier.com/x86/)
- [opcodes,ModR/M,SIB](http://ref.x86asm.net/coder64.html) (no secure site available)
- [calling conventions](https://wiki.osdev.org/Calling_Conventions); I try to use System V
### tokeniser
whitespace is ignored for the sake of readability; it can go between pretty much anything
```
------------------------
tokeniser
------------------------
byte(s) -> next byte(s)
------------------------
Newline -> Newline
-> Komment
-> Operator
-> Directive
Komment -> Newline
Operator -> Newline
-> Komment
-> Operand
Operand -> Newline
-> Komment
-> Comma
Comma -> Operand
Directive -> Newline
-> Komment
-> Operator
------------------------
```
not yet implemented:
```
------------------------
operand parser
------------------------
byte(s) -> next byte(s)
------------------------
START -> '['
-> Register
-> Constant
'[' -> Register
-> Constant
']' -> END
Register -> IF #[, ']'
-> Operator
Constant -> IF #[, ']'
-> Operator
Operator -> IF NOT #R, Register
-> Constant
------------------------
:R: = whether a register has been found
:[: = whether a '[' has been found
------------------------
```
### memory map
```
+------ 0x00100000 ------+
| hardware, bios stuff |
+------ 0x00080000 ------+
| output binary |
+------ 0x00070000 ------+
| token table |
+------ 0x00060000 ------+
| test arena |
+------ 0x00050000 ------+
| stack (rsp) |
+------------------------+
| input |
+------------------------+
| assembler |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+
```
each word represents a token on the token table.
#### token table (TT)
each token gets loaded into the token table with the following form:
```
+----------+
| 15 0 |
+----------+
| token id |
+----------+
```
### internal data structures
#### `tokens.[operators|registers]`
contains tokens by their type. Intended to be searched by token name to get the token's ID.
each entry is in the following form:
```
+----------+--------------------------------+
| 47 32 | 31 0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+
```
example implementation:
```nasm
tokens
.registers:
dd "r8"
dw 0x0008
.by_name3: ; this is required for futureproofness; the caller can use this to
; find the size of registers.by_name2
```
note that tokens longer than 4 bytes are problematic :/
#### `tokens.by_id`
contains some tokens with their metadata. Some tokens have embedded information (`0x10XX` for instance). Those will not have entries in this table, being handled instead inside the assemble function itself.
metadata about some tokens in the following form:
```
+----------------+----------+-------+----------+
| 31 24 | 23 20 | 19 16 | 15 0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type | token ID |
+----------------+----------+-------+----------+
```
the `type` hex digit is defined as the following:
| hex | meaning | examples |
|-----|----------|-|
| 0x0 | ignored | `; this entire comment is 1 token` |
| 0x1 | operator | `mov`, `hlt` |
| 0x2 | register | `rsp`, `al` |
| 0xF | unknown | any token ID not represented in the lookup table |
type metadata for the different types is as follows:
```
+----------+
| type 0x0 |
+----------+
| 31 24 |
+----------+
| reserved |
+----------+
```
```
+-------------------------------+
| type 0x1 |
+----------+--------------------+
| 31 26 | 25 24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+
```
```
+------------------------------+
| type 0x2 |
+----------+-----------+-------+
| 31 29 | 28 26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+
; reg is the value that cooresponds to the register in the ModR/M byte
; width:
00b ; 8 bit
01b ; 16 bit
10b ; 32 bit
11b ; 64 bit
```
#### `opcodes.by_id`
entries are as follows:
```
+----------+--------+----------+
| 31 24 | 23 16 | 15 0 |
+----------+--------+----------+
| reserved | opcode | token ID |
+----------+--------+----------+
```
note the lack of support for multiple-byte opcodes or multiple opcodes for one token ID; these features will likely be added at some point after the parser accumulates too much jank.
### token IDs
supported tokens are listed below
| token | id | notes |
|-------|--------|-|
| rax | 0x0000 | |
| rbx | 0x0001 | |
| rcx | 0x0002 | |
| rdx | 0x0003 | |
| rsi | 0x0004 | |
| rdi | 0x0005 | |
| rsp | 0x0006 | |
| rbp | 0x0007 | |
| r8 | 0x0008 | |
| r9 | 0x0009 | |
| r10 | 0x000A | |
| r11 | 0x000B | |
| r12 | 0x000C | |
| r13 | 0x000D | |
| r14 | 0x000E | |
| r15 | 0x000F | |
| eax | 0x0010 | |
| ebx | 0x0011 | |
| ecx | 0x0012 | |
| edx | 0x0013 | |
| esi | 0x0014 | |
| edi | 0x0015 | |
| esp | 0x0016 | |
| ebp | 0x0017 | |
| r8d | 0x0018 | |
| r9d | 0x0019 | |
| r10d | 0x001A | |
| r11d | 0x001B | |
| r12d | 0x001C | |
| r13d | 0x001D | |
| r14d | 0x001E | |
| r15d | 0x001F | |
| ax | 0x0020 | |
| bx | 0x0021 | |
| cx | 0x0022 | |
| dx | 0x0023 | |
| si | 0x0024 | |
| di | 0x0025 | |
| sp | 0x0026 | |
| bp | 0x0027 | |
| r8w | 0x0028 | |
| r9w | 0x0029 | |
| r10w | 0x002A | |
| r11w | 0x002B | |
| r12w | 0x002C | |
| r13w | 0x002D | |
| r14w | 0x002E | |
| r15w | 0x002F | |
| al | 0x0030 | |
| bl | 0x0031 | |
| cl | 0x0032 | |
| dl | 0x0033 | |
| sil | 0x0034 | |
| dil | 0x0035 | |
| spl | 0x0036 | |
| bpl | 0x0037 | |
| r8b | 0x0038 | |
| r9b | 0x0039 | |
| r10b | 0x003A | |
| r11b | 0x003B | |
| r12b | 0x003C | |
| r13b | 0x003D | |
| r14b | 0x003E | |
| r15b | 0x003F | |
| ah | 0x0040 | |
| bh | 0x0041 | |
| ch | 0x0042 | |
| dh | 0x0043 | |
| cs | 0x0044 | |
| ds | 0x0045 | |
| es | 0x0046 | |
| fs | 0x0047 | |
| gs | 0x0048 | |
| ss | 0x0049 | |
| cr0 | 0x004A | |
| cr2 | 0x004B | |
| cr3 | 0x004C | |
| cr4 | 0x004D | |
| cr8 | 0x004E | |
| hlt | 0x004F | |
| int3 | 0x0050 | |
| | 0x0051 | deprecated; formerly `[`. Now `0x10XX` is used. |
| | 0x0052 | deprecated; formerly `]`. |
| xor | 0x0053 | |
| inc | 0x0054 | |
| dec | 0x0055 | |
| mov | 0x0056 | |
| add | 0x0057 | |
| sub | 0x0058 | |
| call | 0x0059 | |
| ret | 0x005A | |
| cmp | 0x005B | |
| je | 0x005C | |
| jne | 0x005D | |
| jge | 0x005E | |
| jg | 0x005F | |
| jle | 0x0060 | |
| jl | 0x0061 | |
| | 0x10XX | some memory address; `XX` is as specified below |
| | 0xFFFF | unrecognised token |
values of `XX` in `0x10XX`:
| XX | description |
|------|-------------|
| 0x00 | following byte is the token ID of some register |
### example program
#### program in assembly
this program doesn't do anything useful, it's just a test
```nasm
xor eax, eax
inc rax
mov [ rax ], rdx
hlt
```
#### tokenization
```nasm
0x0053 ; xor
0xFE20 ; space
0x0010 ; eax
0xFE2C ; comma
0xFE20 ; space
0x0010 ; eax
0xFE0A ; newline
0x0054 ; inc
0xFE20 ; space
0x0000 ; rax
0xFE0A ; newline
0x0056 ; mov
0xFE20 ; space
0x1004 ; open bracket (4)
0xFE20 ; space |1
0x0000 ; rax |2
0xFE20 ; space |3
0x0052 ; close bracket |4
0xFE2C ; comma
0xFE20 ; space
0x0003 ; rdx
0xFE0A ; newline
0x004F ; hlt
0xFE0A ; newline
0xFE00 ; null terminator
```
#### nasm output with the above example program, bits 64
```nasm
0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
; mod 11b ; directly address the following:
; reg 000b ; EAX
; r/m 000b ; EAX
0x48 ; 64 Bit Operand Size prefix
0xFF ; with `reg` from ModR/M byte 000b:
; INC r/m16/32/64
0xC0 ; ModR/M byte
; mod 11b ; direct addressing
; reg 000b ; RAX
; r/m 000b ; RAX
0x48 ; 64 Bit Operand Size prefix
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
; mod 00b ; indirect addressing, no displacement
; reg 010b ; RDX
; r/m 000b ; [RAX]
0xF4 ; HLT
```