# twasm
this will be a self-hosted assembler for a very minimal subset of nasm-style 64-bit asm
### goals
I want to compile Bootler and Twasm with the Twasm assembler
### reading
- [instructions](https://www.felixcloutier.com/x86/)
- [opcodes,ModR/M,SIB](http://ref.x86asm.net/coder64.html) (no secure site available)
- [calling conventions](https://wiki.osdev.org/Calling_Conventions); I try to use System V
### tokeniser
whitespace is ignored for the sake of readability; it can go between pretty much anything
```
------------------------
tokeniser
------------------------
byte(s)   -> next byte(s)
------------------------
Newline   -> Label
          -> Newline
          -> Komment
          -> Operator
          -> Directive
Label     -> Newline
Komment   -> Newline
Operator  -> Newline
          -> Komment
          -> Operand
Operand   -> Newline
          -> Komment
          -> Comma
Comma     -> Operand
Directive -> Newline
          -> Komment
          -> Operator
------------------------
```
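the transition table above can be checked mechanically. The following is a minimal Python sketch, not the assembler's own code; `TRANSITIONS` and `is_valid_sequence` are hypothetical names, and the kinds are copied from the table as written.

```python
# Allowed successor kinds for each token kind, copied from the table above.
TRANSITIONS = {
    "Newline":   {"Label", "Newline", "Komment", "Operator", "Directive"},
    "Label":     {"Newline"},
    "Komment":   {"Newline"},
    "Operator":  {"Newline", "Komment", "Operand"},
    "Operand":   {"Newline", "Komment", "Comma"},
    "Comma":     {"Operand"},
    "Directive": {"Newline", "Komment", "Operator"},
}


def is_valid_sequence(kinds):
    """Return True if every adjacent pair of kinds is an allowed transition."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(kinds, kinds[1:]))
```

for example, a line such as `mov rdx, [rax]` corresponds to `Newline, Operator, Operand, Comma, Operand, Newline`, which this check accepts, while `Operator` followed directly by `Comma` is rejected.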
### memory map
```
+------ 0x00100000 ------+
|  hardware, bios stuff  |
+------ 0x00080000 ------+
|     output binary      |
+------ 0x00070000 ------+
|      token table       |
+------ 0x00060000 ------+
|       test arena       |
+------ 0x00050000 ------+
|      label table       |
+------ 0x00040000 ------+
|  awaiting label table  |
+------ 0x00030000 ------+
|      stack (rsp)       |
+------------------------+
|         input          |
+------------------------+
|       assembler        |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+
```
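as a rough aid for reading addresses in a debugger, the labelled boundaries can be turned into a lookup. This is a Python sketch under assumptions: `REGIONS` and `region_of` are hypothetical names, the bottom region is assumed to start at address 0, and the unlabelled boundaries between assembler, input and stack are collapsed into one region.

```python
import bisect

# (base address, description), sorted by base address.
# Only boundaries that carry an address in the map are used.
REGIONS = [
    (0x00000000, "bootloader, bios, etc."),   # assumed to start at 0
    (0x00010000, "assembler, input, stack"),  # inner boundaries are unlabelled
    (0x00030000, "awaiting label table"),
    (0x00040000, "label table"),
    (0x00050000, "test arena"),
    (0x00060000, "token table"),
    (0x00070000, "output binary"),
    (0x00080000, "hardware, bios stuff"),
]
LIMIT = 0x00100000  # top of the map


def region_of(address):
    """Return the description of the region that contains `address`."""
    if not 0 <= address < LIMIT:
        raise ValueError("address outside the mapped first MiB")
    bases = [base for base, _ in REGIONS]
    return REGIONS[bisect.bisect_right(bases, address) - 1][1]
```

with this table, `region_of(0x00070000)` lands in the output binary region.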
each 16-bit word in the token table represents one token.
#### token table (TT)
each token gets loaded into the token table with the following form:
```
  2 bytes
+----------+
| 15     0 |
+----------+
| token id |
+----------+
```
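a small Python sketch of how a run of token IDs could be laid out as 2-byte entries. The little-endian byte order is an assumption (it is the natural order on x86, but the layout above does not state it), and both function names are hypothetical.

```python
import struct


def pack_token_table(token_ids):
    """Pack token IDs as consecutive 16-bit words (little-endian assumed)."""
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def unpack_token_table(raw):
    """Inverse of pack_token_table: split raw bytes into 16-bit token IDs."""
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))
```

packing `[0x0053, 0x0010, 0x0010]` (the tokens of `xor eax, eax`) gives six bytes, two per token.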
#### label table (LT)
label definitions are stored in and recalled from this table. The memory addresses are relative to the start of the program
```
       16 bytes
+----------+---------+
| 127   96 | 95   64 |
+----------+---------+
| reserved | address |
+----------+---------+
| 63               0 |
+--------------------+
|        hash        |
+--------------------+
```
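the entry layout can be expressed as shifts on a 128-bit integer. A Python sketch, assuming little-endian storage; the hash function itself is not specified here, so the hash is taken as an already computed 64-bit value. `pack_label_entry` and `unpack_label_entry` are hypothetical names.

```python
def pack_label_entry(label_hash, address):
    """Build a 16-byte entry: hash in bits 63..0, address in bits 95..64."""
    assert 0 <= label_hash < 1 << 64
    assert 0 <= address < 1 << 32
    value = (address << 64) | label_hash   # bits 127..96 stay zero (reserved)
    return value.to_bytes(16, "little")    # byte order is an assumption


def unpack_label_entry(raw):
    """Return (hash, address) from a 16-byte label table entry."""
    value = int.from_bytes(raw, "little")
    return value & ((1 << 64) - 1), (value >> 64) & 0xFFFFFFFF
```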
#### awaiting label table (ALT)
forward references are stored in this table to be filled in after assembly is otherwise complete. The memory addresses are relative to the start of the program
```
                      16 bytes
+----------+----------+------------------+---------+
| 127  101 |   100    | 99            96 | 95   64 |
+----------+----------+------------------+---------+
| reserved | abs flag | # bytes reserved | address |
+----------+----------+------------------+---------+
| 63                                             0 |
+--------------------------------------------------+
|                       hash                       |
+--------------------------------------------------+
```
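the same shift-and-mask approach covers the extra fields of an awaiting label entry. A Python sketch with hypothetical names, again assuming little-endian storage and a precomputed 64-bit hash:

```python
def pack_awaiting_entry(label_hash, address, reserved_bytes, absolute):
    """Build a 16-byte awaiting label table entry from its fields."""
    value = label_hash & ((1 << 64) - 1)        # bits 63..0
    value |= (address & 0xFFFFFFFF) << 64       # bits 95..64
    value |= (reserved_bytes & 0xF) << 96       # bits 99..96
    value |= (1 if absolute else 0) << 100      # bit 100: abs flag
    return value.to_bytes(16, "little")         # byte order is an assumption


def unpack_awaiting_entry(raw):
    """Split a 16-byte awaiting label table entry into its fields."""
    value = int.from_bytes(raw, "little")
    return {
        "hash": value & ((1 << 64) - 1),
        "address": (value >> 64) & 0xFFFFFFFF,
        "reserved_bytes": (value >> 96) & 0xF,
        "absolute": bool((value >> 100) & 1),
    }
```

a forward `jmp` that left four placeholder bytes in the output would, under these assumptions, be recorded with `reserved_bytes=4`.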
### internal data structures
#### `tokens.[operators|registers]`
contains tokens by their type. Intended to be searched by token name to get the token's ID.
each entry is in the following form:
```
                   6 bytes
+----------+--------------------------------+
| 47    32 | 31                           0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+
```
note that tokens longer than 4 bytes are problematic :/
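a Python sketch of building and searching such a table. Several details are assumptions rather than facts from the source: names shorter than 4 bytes are zero-padded, the first character sits at the lowest address, and the token ID is stored little-endian. `pack_name_entry` and `find_token_id` are hypothetical names.

```python
def pack_name_entry(name, token_id):
    """Build one 6-byte entry: 4-byte name field followed by the token ID."""
    raw_name = name.encode("ascii")
    if len(raw_name) > 4:
        # The string field is only 4 bytes wide, so longer names do not fit.
        raise ValueError("name does not fit in the 4-byte string field")
    return raw_name.ljust(4, b"\x00") + token_id.to_bytes(2, "little")


def find_token_id(table, name):
    """Linear search by name; returns 0xFFFF (unrecognised token) on a miss."""
    key = name.encode("ascii").ljust(4, b"\x00")
    for offset in range(0, len(table), 6):
        if table[offset:offset + 4] == key:
            return int.from_bytes(table[offset + 4:offset + 6], "little")
    return 0xFFFF
```

the `ValueError` branch is where the 4-byte limit noted above shows up: a name like `r10dx` simply cannot be stored in this form.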
#### `tokens.by_id`
contains some tokens with their metadata. Some tokens have embedded information (`0x10XX` for instance). Those do not have entries in this table, being handled instead inside the assemble function itself.
each entry holds metadata about one token in the following form:
```
                    4 bytes
+----------------+----------+-------+----------+
| 31          24 | 23    20 | 19 16 | 15     0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type  | token ID |
+----------------+----------+-------+----------+
```
the `type` hex digit is defined as the following:
| hex | meaning | examples |
|-----|-----------------|-|
| 0x0 | ignored | |
| 0x1 | operator | `mov`, `hlt` |
| 0x2 | register | `rsp`, `al` |
| 0x3 | pseudo-operator | `db` |
| 0xF | unknown | any token ID not represented in the lookup table |
the typed metadata byte for each type is as follows:
```
   1 byte
+----------+
| type 0x0 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
```
```
             1 byte
+-------------------------------+
|           type 0x1            |
+----------+--------------------+
| 31    26 | 25              24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+
```
```
             1 byte
+------------------------------+
|           type 0x2           |
+----------+-----------+-------+
| 31    29 | 28     26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+
; reg is the value that corresponds to the register in the ModR/M byte
; width:
  00b ; 8 bit
  01b ; 16 bit
  10b ; 32 bit
  11b ; 64 bit
```
```
   1 byte
+----------+
| type 0x3 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
```
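decoding a `tokens.by_id` entry then comes down to a few shifts. A Python sketch; `decode_by_id_entry` and `WIDTH_BITS` are hypothetical names, and the sample values mentioned afterwards are constructed from the layouts above for illustration, not copied from the real table.

```python
# Width field (bits 25..24 of a register entry) to operand size in bits.
WIDTH_BITS = {0b00: 8, 0b01: 16, 0b10: 32, 0b11: 64}


def decode_by_id_entry(entry):
    """Split a 4-byte tokens.by_id entry into its fields."""
    info = {
        "token_id": entry & 0xFFFF,      # bits 15..0
        "type": (entry >> 16) & 0xF,     # bits 19..16
    }
    meta = (entry >> 24) & 0xFF          # bits 31..24: typed metadata
    if info["type"] == 0x1:              # operator
        info["operands"] = meta & 0b11
    elif info["type"] == 0x2:            # register
        info["width"] = WIDTH_BITS[meta & 0b11]
        info["reg"] = (meta >> 2) & 0b111
    return info
```

under this layout, `0x0B020003` would describe `rdx` (token ID `0x0003`) as a 64-bit register with reg value `010b`, and `0x02010056` would describe `mov` as an operator taking 2 operands.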
#### `opcodes.by_id`
entries are as follows:
```
            16 bytes
+------------------------------+
|     0 operand operators      |
+---------+--------------------+
| 127 120 | 119             96 |
+---------+--------------------+
|  flags  |      reserved      |
+---------+--------------------+
| 95                        64 |
+------------------------------+
|           reserved           |
+------------------------------+
| 63                        32 |
+------------------------------+
|           reserved           |
+----------+--------+----------+
| 31    24 | 23  16 | 15     0 |
+----------+--------+----------+
| reserved | opcode | token ID |
+----------+--------+----------+

                  16 bytes
+------------------------------------------+
|           1 operand operators            |
+----------+----------+----------+---------+
| 127  120 | 119  112 | 111  104 | 103  96 |
+----------+----------+----------+---------+
|  flags   | reserved |  flags5  | flags4  |
+----------+----------+----------+---------+
| 95    88 | 87    80 | 79    72 | 71   64 |
+----------+----------+----------+---------+
|  flags3  |  flags2  | reserved | flags0  |
+----------+----------+----------+---------+
| 63    56 | 55    48 | 47    40 | 39   32 |
+----------+----------+----------+---------+
|  opcode  |  opcode  |  opcode  | opcode  |
| dst=rel8 | dst=rel  | dst=imm8 | dst=imm |
+----------+----------+----------+---------+
| 31    24 | 23    16 | 15               0 |
+----------+----------+--------------------+
| reserved |  opcode  |      token ID      |
|          | dst=r/m  |                    |
+----------+----------+--------------------+

                    16 bytes
+-----------------------------------------------+
|              2 operand operators              |
+---------+-------------------------------------+
| 127 120 | 119                              96 |
+---------+-------------------------------------+
|  flags  |              reserved               |
+---------+----------+--------------------------+
| 95   88 | 87    80 | 79                    64 |
+---------+----------+--------------------------+
| flags3  |  flags2  |         reserved         |
+---------+----------+-------+-------+----------+
| 63              48 | 47         40 | 39    32 |
+--------------------+---------------+----------+
|      reserved      |    opcode     |  opcode  |
|                    |    dst=r/m    | dst=r/m  |
|                    |   src=imm8    | src=imm  |
+---------+----------+---------------+----------+
| 31   24 | 23    16 | 15                     0 |
+---------+----------+--------------------------+
| opcode  |  opcode  |         token ID         |
| dst=r   | dst=r/m  |                          |
| src=r/m | src=r    |                          |
+---------+----------+--------------------------+

      1 byte
+-----------------+
|   flags byte    |
+----------+------+
| 127  121 | 120  |
+----------+------+
| reserved | 8bit |
+----------+------+

                        1 byte
+----------------------------------------------------+
|                    flagsX byte                     |
+----------+-----------+-------------+---------------+
| 7      5 |     4     |      3      | 2           0 |
+----------+-----------+-------------+---------------+
| reserved | no ModR/M | 0x0F prefix | operator flag |
+----------+-----------+-------------+---------------+

; flags key:
  8bit          ; tte has opcodes for r/m8 and r8 instead of r/m and r respectively
; flagsX key:
  no ModR/M     ; there is no ModR/M byte for this opcode
  0x0F prefix   ; there is a 0x0F prefix for this opcode
  operator flag ; contents of `reg` if applicable
; key:
  r/m  ; r/m 16/32/64
  r/m8 ; r/m 8
  r    ; r 16/32/64
  r8   ; r 8
  imm  ; imm 16/32
  imm8 ; imm 8
  rel  ; rel 16/32
  rel8 ; rel 8
note that there is much room to expand. If an opcode doesn't exist, its field should be 0x00
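to make the 2 operand layout concrete, here is a Python sketch that splits an entry (treated as one 128-bit integer) into its fields. The function names are hypothetical, and it assumes the `8bit` flag is the lowest bit of the `flags` byte at bits 127..120. The sample entry used afterwards is built for illustration from opcodes `0x89`, `0x8B` and `0xC7`; it is not copied from the real table.

```python
def decode_flagsx(byte):
    """Split a flagsX byte into its three documented fields."""
    return {
        "operator_flag": byte & 0b111,       # bits 2..0: contents of `reg`
        "prefix_0f": bool(byte & 0b1000),    # bit 3: 0x0F prefix
        "no_modrm": bool(byte & 0b10000),    # bit 4: no ModR/M byte
    }


def split_two_operand_entry(entry):
    """Split a 2 operand opcodes.by_id entry into its fields."""
    return {
        "token_id": entry & 0xFFFF,                  # bits 15..0
        "dst_rm_src_r": (entry >> 16) & 0xFF,        # bits 23..16
        "dst_r_src_rm": (entry >> 24) & 0xFF,        # bits 31..24
        "dst_rm_src_imm": (entry >> 32) & 0xFF,      # bits 39..32
        "dst_rm_src_imm8": (entry >> 40) & 0xFF,     # bits 47..40
        "flags2": decode_flagsx((entry >> 80) & 0xFF),
        "flags3": decode_flagsx((entry >> 88) & 0xFF),
        "eight_bit": bool((entry >> 120) & 1),       # assumed position of 8bit
    }
```

an entry for `mov` could be assembled as `0x0056 | (0x89 << 16) | (0x8B << 24) | (0xC7 << 32)`; its `dst_rm_src_imm8` field stays `0x00`, which by the note above means that opcode doesn't exist.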
### token IDs
supported tokens are listed below
| token | id | notes |
|-------|--------|-|
| rax | 0x0000 | register |
| rbx | 0x0001 | register |
| rcx | 0x0002 | register |
| rdx | 0x0003 | register |
| rsi | 0x0004 | register |
| rdi | 0x0005 | register |
| rsp | 0x0006 | register |
| rbp | 0x0007 | register |
| r8 | 0x0008 | unimplemented |
| r9 | 0x0009 | unimplemented |
| r10 | 0x000A | unimplemented |
| r11 | 0x000B | unimplemented |
| r12 | 0x000C | unimplemented |
| r13 | 0x000D | unimplemented |
| r14 | 0x000E | unimplemented |
| r15 | 0x000F | unimplemented |
| eax | 0x0010 | register |
| ebx | 0x0011 | register |
| ecx | 0x0012 | register |
| edx | 0x0013 | register |
| esi | 0x0014 | register |
| edi | 0x0015 | register |
| esp | 0x0016 | register |
| ebp | 0x0017 | register |
| r8d | 0x0018 | unimplemented |
| r9d | 0x0019 | unimplemented |
| r10d | 0x001A | unimplemented |
| r11d | 0x001B | unimplemented |
| r12d | 0x001C | unimplemented |
| r13d | 0x001D | unimplemented |
| r14d | 0x001E | unimplemented |
| r15d | 0x001F | unimplemented |
| ax | 0x0020 | register |
| bx | 0x0021 | register |
| cx | 0x0022 | register |
| dx | 0x0023 | register |
| si | 0x0024 | register |
| di | 0x0025 | register |
| sp | 0x0026 | register |
| bp | 0x0027 | register |
| r8w | 0x0028 | unimplemented |
| r9w | 0x0029 | unimplemented |
| r10w | 0x002A | unimplemented |
| r11w | 0x002B | unimplemented |
| r12w | 0x002C | unimplemented |
| r13w | 0x002D | unimplemented |
| r14w | 0x002E | unimplemented |
| r15w | 0x002F | unimplemented |
| al | 0x0030 | register |
| bl | 0x0031 | register |
| cl | 0x0032 | register |
| dl | 0x0033 | register |
| sil | 0x0034 | register |
| dil | 0x0035 | register |
| spl | 0x0036 | register |
| bpl | 0x0037 | register |
| r8b | 0x0038 | unimplemented |
| r9b | 0x0039 | unimplemented |
| r10b | 0x003A | unimplemented |
| r11b | 0x003B | unimplemented |
| r12b | 0x003C | unimplemented |
| r13b | 0x003D | unimplemented |
| r14b | 0x003E | unimplemented |
| r15b | 0x003F | unimplemented |
| ah | 0x0040 | unimplemented |
| bh | 0x0041 | unimplemented |
| ch | 0x0042 | unimplemented |
| dh | 0x0043 | unimplemented |
| cs | 0x0044 | unimplemented |
| ds | 0x0045 | unimplemented |
| es | 0x0046 | unimplemented |
| fs | 0x0047 | unimplemented |
| gs | 0x0048 | unimplemented |
| ss | 0x0049 | unimplemented |
| cr0 | 0x004A | unimplemented |
| cr2 | 0x004B | unimplemented |
| cr3 | 0x004C | unimplemented |
| cr4 | 0x004D | unimplemented |
| cr8 | 0x004E | unimplemented |
| hlt | 0x004F | operator |
| int3 | 0x0050 | operator |
| | 0x0051 | deprecated; formerly `[`. Now `0x10XX` is used. |
| | 0x0052 | deprecated; formerly `]`. |
| xor | 0x0053 | operator |
| inc | 0x0054 | operator |
| dec | 0x0055 | operator |
| mov | 0x0056 | operator |
| add | 0x0057 | operator |
| sub | 0x0058 | operator |
| call | 0x0059 | operator |
| ret | 0x005A | operator |
| cmp | 0x005B | operator |
| jmp | 0x005C | operator |
| je | 0x005D | operator |
| jne | 0x005E | operator |
| push | 0x005F | operator |
| pop | 0x0060 | operator |
| out | 0x0061 | operator |
| db | 0x0100 | pseudo-operator |
| | 0x10XX | some memory address; `XX` is as specified below |
| | 0x20XX | some constant; `XX` is as specified below |
| | 0x3XXX | some label; `XXX` is its entry index in the label table |
| | 0xFEXX | used to pass some raw value `XX` in place of a token id to a couple of functions that mention this as a feature. If the function doesn't mention it, it will lead to undefined behaviour |
| | 0xFFFF | unrecognised token |
values of `XX` in `0x10XX`:
| XX | description |
|------|-------------|
| 0x00 | following word is the token ID of some register |
values of `XX` in `0x20XX`:
| XX | description |
|------|-------------|
| 0x00 | following 8 bytes are the constant's value |
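
the ID ranges above can be summarised as a classifier. This is a Python sketch with a hypothetical function name; it reports the range a token ID falls in and does not distinguish implemented from unimplemented registers.

```python
def classify_token(token_id):
    """Name the range of the token ID list that `token_id` falls in."""
    if token_id == 0xFFFF:
        return "unrecognised"
    if token_id >> 8 == 0x10:
        return "memory address"
    if token_id >> 8 == 0x20:
        return "constant"
    if token_id >> 8 == 0xFE:
        return "raw value"
    if token_id >> 12 == 0x3:
        return "label"
    if token_id == 0x0100:
        return "pseudo-operator"
    if token_id in (0x0051, 0x0052):
        return "deprecated"
    if 0x004F <= token_id <= 0x0061:
        return "operator"
    if token_id <= 0x004E:
        return "register"
    return "unknown"
```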
### example program
#### program in assembly
this program doesn't do anything useful; it's just a test
```nasm
xor eax, eax
inc rax ; inline comment
; one line comment
mov rdx, [rax]
mov [rax], rdx
hlt
```
#### tokenization
```nasm
0x0053 ; xor
0x0010 ; eax
0x0010 ; eax
0x0054 ; inc
0x0000 ; rax
0x0056 ; mov
0x0003 ; rdx
0x1000 ; memory address: register
0x0000 ; rax
0x0056 ; mov
0x1000 ; memory address: register
0x0000 ; rax
0x0003 ; rdx
0x004F ; hlt
```
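the token stream above can be reproduced with a few lines of Python. This is only a sketch of the mapping, not of the real tokeniser (which works byte by byte): `tokenise`, `REGISTERS` and `OPERATORS` are hypothetical names, only the tokens used by the example program are listed, and whitespace inside brackets is not handled.

```python
# Subset of the token ID list needed for the example program.
REGISTERS = {"rax": 0x0000, "rdx": 0x0003, "eax": 0x0010}
OPERATORS = {"hlt": 0x004F, "xor": 0x0053, "inc": 0x0054, "mov": 0x0056}


def tokenise(source):
    """Turn assembly text into a flat list of token IDs."""
    tokens = []
    for line in source.splitlines():
        code = line.split(";", 1)[0]                 # drop comments
        for word in code.replace(",", " ").split():
            if word.startswith("[") and word.endswith("]"):
                tokens.append(0x1000)                # memory address: register
                word = word[1:-1]
            tokens.append(OPERATORS.get(word, REGISTERS.get(word, 0xFFFF)))
    return tokens
```

feeding it the example program yields the same 14 token IDs as listed above, in the same order.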
#### nasm output with the above example program, bits 64
```nasm
0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; directly address the following:
     ; reg 000b ; EAX
     ; r/m 000b ; EAX
0x48 ; 64 Bit Operand Size prefix
0xFF ; with `reg` from ModR/M byte 000b:
     ; INC r/m16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; direct addressing
     ; reg 000b ; RAX
     ; r/m 000b ; RAX
0x48 ; 64 Bit Operand Size prefix
0x8B ; MOV r16/32/64 r/m16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]
0x48 ; 64 Bit Operand Size prefix
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]
0xF4 ; HLT
```
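the ModR/M bytes in this listing follow directly from the `mod`, `reg` and `r/m` fields. A Python sketch that rebuilds the 12 output bytes; `modrm`, `REX_W`, `RAX` and `RDX` are names chosen here for illustration.

```python
def modrm(mod, reg, rm):
    """Pack a ModR/M byte: mod in bits 7..6, reg in bits 5..3, r/m in bits 2..0."""
    return (mod << 6) | (reg << 3) | rm


REX_W = 0x48                # the 64 bit operand size prefix from the listing
RAX, RDX = 0b000, 0b010     # register numbers as used in ModR/M

program = bytes([
    0x31, modrm(0b11, RAX, RAX),             # xor eax, eax
    REX_W, 0xFF, modrm(0b11, 0b000, RAX),    # inc rax (reg field selects INC)
    REX_W, 0x8B, modrm(0b00, RDX, RAX),      # mov rdx, [rax]
    REX_W, 0x89, modrm(0b00, RDX, RAX),      # mov [rax], rdx
    0xF4,                                    # hlt
])
```

`modrm(0b11, RAX, RAX)` gives `0xC0` and `modrm(0b00, RDX, RAX)` gives `0x10`, matching the bytes in the listing.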
#### program output with the function `print`, each comma-separated `db` value put onto its own line
edited output of `x/512xb 0x00070000` in [gdb](https://www.sourceware.org/gdb/)
the following is somewhat correct! I just need to a) null-terminate 8-byte chars and b) define all the addresses currently represented as `0xff 0xff 0xff 0xff`
```
0x48 0xff 0xf2
0x48 0xff 0xf0
0x48 0xff 0xf6
0xc7 0xc2 0xf8 0x03 0x00 0x00
0x8a 0x06
0x80 0xf8 0x00
0x0f 0x84 0xff 0xff 0xff 0xff
0x66 0xee
0x48 0xff 0xc6
0xe9 0xff 0xff 0xff 0xff
0x48 0x8f 0xc6
0x48 0x8f 0xc0
0x48 0x8f 0xc2
0xc3
0x48 0xff 0xf6
0xc7 0xc6 0xff 0xff 0xff 0xff
0xe8 0xff 0xff 0xff 0xff
0x48 0x8f 0xc6
0xe9 0xff 0xff 0xff 0xff
0x48 0xff 0xf6
0xc7 0xc6 0xff 0xff 0xff 0xff
0xe8 0xff 0xff 0xff 0xff
0x48 0x8f 0xc6
0xe9 0xff 0xff 0xff 0xff
0x48 0xff 0xf6
0xc7 0xc6 0xff 0xff 0xff 0xff
0xe8 0xff 0xff 0xff 0xff
0x48 0x8f 0xc6
0xe9 0xff 0xff 0xff 0xff
0x48 0xff 0xf6
0xc7 0xc6 0xff 0xff 0xff 0xff
0xe8 0xff 0xff 0xff 0xff
0x48 0x8f 0xc6
0xe9 0xff 0xff 0xff 0xff
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x33 0x36 0x6d 0x00 0x00 0x00 0x00
0x7b 0x44 0x45 0x42 0x55 0x47 0x5d 0x3a
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x30 0x6d 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x31 0x3b 0x33 0x31 0x6d 0x00 0x00
0x7b 0x45 0x52 0x52 0x4f 0x52 0x5d 0x3a
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x30 0x6d 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x31 0x3b 0x33 0x33 0x6d 0x00 0x00
0x5b 0x54 0x45 0x53 0x54 0x5d 0x3a 0x20
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x30 0x6d 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x31 0x3b 0x33 0x35 0x6d 0x00 0x00
0x5b 0x57 0x41 0x52 0x4e 0x5d 0x3a 0x20
0x1b 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5b 0x30 0x6d 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
```
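the 8-byte rows at the end of the dump are string data. A small Python sketch for reading them back as text; `decode_chunk` is a hypothetical helper, and it simply drops trailing zero bytes as padding.

```python
def decode_chunk(line):
    """Turn one `0x.. 0x..` dump line into text, dropping zero padding."""
    raw = bytes(int(part, 16) for part in line.split())
    return raw.rstrip(b"\x00").decode("ascii")
```

for instance, `0x5b 0x54 0x45 0x53 0x54 0x5d 0x3a 0x20` decodes to `[TEST]: ` and `0x5b 0x33 0x36 0x6d 0x00 0x00 0x00 0x00` to `[36m`, the tail of an ANSI colour escape whose `0x1b` byte sits in the preceding row.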