bootler/twasm/README.md

# twasm

this will be a self-hosted assembler for a very minimal subset of nasm-style 64-bit asm

### goals

I want to compile Bootler and Twasm with the Twasm assembler

### reading

- [instructions](https://www.felixcloutier.com/x86/)
- [opcodes,ModR/M,SIB](http://ref.x86asm.net/coder64.html) (no secure site available)
- [calling conventions](https://wiki.osdev.org/Calling_Conventions); I try to use System V

### tokeniser

whitespace is ignored for the sake of readability; it can go between pretty much anything

```
------------------------
tokeniser
------------------------
byte(s) -> next byte(s)
------------------------
Newline   -> Newline
          -> Komment
          -> Operator
          -> Directive

Komment   -> Newline

Operator  -> Newline
          -> Komment
          -> Operand

Operand   -> Newline
          -> Komment
          -> Comma

Comma     -> Operand

Directive -> Newline
          -> Komment
          -> Operator
------------------------
```
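
one way the transition diagram above could be checked in code is with a bitmask per token class. This is only a sketch: the class numbers, `transition_allowed`, and `allowed_next` are hypothetical names, not part of twasm, and it assumes the System V convention mentioned under reading.

```nasm
bits 64

; hypothetical class numbers; the real tokeniser may number these differently
CLASS_NEWLINE   equ 0
CLASS_KOMMENT   equ 1
CLASS_OPERATOR  equ 2
CLASS_OPERAND   equ 3
CLASS_COMMA     equ 4
CLASS_DIRECTIVE equ 5

; transition_allowed (hypothetical name)
; in:  edi = class of the previous token
;      esi = class of the next token
; out: eax = 1 if the diagram permits the transition, 0 otherwise
transition_allowed:
  xor eax, eax
  cmp edi, CLASS_DIRECTIVE
  ja .done                      ; unknown previous class: reject
  cmp esi, CLASS_DIRECTIVE
  ja .done                      ; unknown next class: reject
  mov edi, edi                  ; zero-extend edi so rdi is a safe index
  lea rdx, [rel allowed_next]
  movzx edx, byte [rdx + rdi]   ; bitmask of classes allowed after edi
  bt edx, esi                   ; CF = bit number esi of the mask
  setc al
.done:
  ret

; one byte per previous class; bit n set means class n may follow
allowed_next:
  db (1 << CLASS_NEWLINE) | (1 << CLASS_KOMMENT) | (1 << CLASS_OPERATOR) | (1 << CLASS_DIRECTIVE) ; after Newline
  db (1 << CLASS_NEWLINE)                                                                        ; after Komment
  db (1 << CLASS_NEWLINE) | (1 << CLASS_KOMMENT) | (1 << CLASS_OPERAND)                          ; after Operator
  db (1 << CLASS_NEWLINE) | (1 << CLASS_KOMMENT) | (1 << CLASS_COMMA)                            ; after Operand
  db (1 << CLASS_OPERAND)                                                                        ; after Comma
  db (1 << CLASS_NEWLINE) | (1 << CLASS_KOMMENT) | (1 << CLASS_OPERATOR)                         ; after Directive
```

keeping the rules in a data table means a new token class only needs a new row and a new bit, not new branches.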

not yet implemented:

```
------------------------
operand parser
------------------------
byte(s) -> next byte(s)
------------------------
START    -> '['
         -> Register
         -> Constant

'['      -> Register
         -> Constant

']'      -> END

Register -> IF #[, ']'
         -> Operator

Constant -> IF #[, ']'
         -> Operator

Operator -> IF NOT #R, Register
         -> Constant
------------------------
#R = whether a register has been found
#[ = whether a '[' has been found
------------------------
```

### memory map

```
+------ 0x00100000 ------+
| hardware, bios stuff   |
+------ 0x00080000 ------+
| output binary          |
+------ 0x00070000 ------+
| token table            |
+------ 0x00060000 ------+
| test arena             |
+------ 0x00050000 ------+
| stack (rsp)            |
+------------------------+
| input                  |
+------------------------+
| assembler              |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+
```

each 16-bit word in the token table represents one token.

#### token table (TT)

each token gets loaded into the token table with the following form:

```
+----------+
| 15     0 |
+----------+
| token id |
+----------+
```
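
since every token is one 16-bit word, appending to the table is a single store. A minimal sketch, assuming the caller keeps a cursor into the table; `tt_append` and `example_append` are hypothetical names and there is no bounds check against the end of the region.

```nasm
bits 64

TOKEN_TABLE equ 0x00060000   ; start of the token table, from the memory map

; tt_append (hypothetical name)
; in:  rdi = address of the next free slot in the token table
;      si  = token ID
; out: rax = address of the slot after the one just written
tt_append:
  mov [rdi], si              ; one 16-bit word per token
  lea rax, [rdi + 2]
  ret

; usage: store the tokens for `hlt` followed by `int3`
example_append:
  mov edi, TOKEN_TABLE
  mov si, 0x004F             ; hlt
  call tt_append
  mov rdi, rax               ; continue from the returned cursor
  mov si, 0x0050             ; int3
  call tt_append
  ret
```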

### internal data structures

#### `tokens.[operators|registers]`

contains tokens by their type. Intended to be searched by token name to get the token's ID.

each entry is in the following form:

```
+----------+--------------------------------+
| 47    32 | 31                           0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+

```

example implementation:

```nasm
tokens:
  .registers:
    dd "r8"      ; name, zero-padded to 4 bytes
    dw 0x0008    ; token ID
  .by_name3: ; this is required for future-proofing; the caller can use this to
             ; find the size of tokens.registers
```

note that tokens longer than 4 bytes don't fit in the 32-bit string field, so they are problematic :/
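
a name lookup over this layout could be sketched as a linear search that compares the 4-byte name field and returns the 2-byte ID after it. `find_token_id` and the `example_*` labels are hypothetical, and the sketch assumes the caller has already packed the name into 4 zero-padded bytes.

```nasm
bits 64

ENTRY_SIZE equ 6               ; 4-byte name + 2-byte token ID

; find_token_id (hypothetical name)
; in:  edi = token name packed into 4 bytes, zero-padded (as `dd "r8"` stores it)
;      rsi = address of the first entry
;      rdx = address just past the last entry
; out: ax  = token ID, or 0xFFFF (unrecognised token) if no entry matches
find_token_id:
  mov eax, 0xFFFF
.next:
  cmp rsi, rdx
  jae .done                    ; ran off the end: keep 0xFFFF
  cmp [rsi], edi               ; compare the 4-byte name field
  je .found
  add rsi, ENTRY_SIZE
  jmp .next
.found:
  movzx eax, word [rsi + 4]    ; the token ID follows the name
.done:
  ret

; a tiny table in the documented layout, for illustration only
example_registers:
  dd "rax"
  dw 0x0000
  dd "r8"
  dw 0x0008
example_registers_end:

; usage: look up "r8"
example_lookup:
  mov edi, "r8"
  lea rsi, [rel example_registers]
  lea rdx, [rel example_registers_end]
  call find_token_id           ; ax = 0x0008
  ret
```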

#### `tokens.by_id`

contains some tokens with their metadata. Some tokens have embedded information (`0x10XX` for instance). Those will not have entries in this table, being handled instead inside the assemble function itself.

each entry is in the following form:

```
+----------------+----------+-------+----------+
| 31          24 | 23    20 | 19 16 | 15     0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type  | token ID |
+----------------+----------+-------+----------+
```

the `type` hex digit is defined as follows:

| hex | meaning  | examples |
|-----|----------|-|
| 0x0 | ignored  | `; this entire comment is 1 token` |
| 0x1 | operator | `mov`, `hlt` |
| 0x2 | register | `rsp`, `al` |
| 0xF | unknown  | any token ID not represented in the lookup table |

the typed metadata for each type is laid out as follows:

```
+----------+
| type 0x0 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
```

```
+-------------------------------+
| type 0x1                      |
+----------+--------------------+
| 31    26 | 25              24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+
```

```
+------------------------------+
| type 0x2                     |
+----------+-----------+-------+
| 31    29 | 28     26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+

; reg is the value that corresponds to the register in the ModR/M byte

; width:
00b ; 8 bit
01b ; 16 bit
10b ; 32 bit
11b ; 64 bit
```
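
to make the bit positions concrete, here is a sketch that packs a few entries with macros and pulls fields back out. The macro and routine names are hypothetical; the token IDs come from the token ID table and the `reg` values from the ModR/M breakdown in the example program.

```nasm
bits 64

TYPE_OPERATOR equ 0x1
TYPE_REGISTER equ 0x2

WIDTH_8  equ 00b
WIDTH_16 equ 01b
WIDTH_32 equ 10b
WIDTH_64 equ 11b

; hypothetical helper: OPERATOR_ENTRY token_id, number_of_operands
%macro OPERATOR_ENTRY 2
  dd ((%2) << 24) | (TYPE_OPERATOR << 16) | (%1)
%endmacro

; hypothetical helper: REGISTER_ENTRY token_id, reg_value, width
%macro REGISTER_ENTRY 3
  dd ((%2) << 26) | ((%3) << 24) | (TYPE_REGISTER << 16) | (%1)
%endmacro

example_by_id:
  REGISTER_ENTRY 0x0000, 000b, WIDTH_64   ; rax -> 0x03020000
  REGISTER_ENTRY 0x0003, 010b, WIDTH_64   ; rdx -> 0x0B020003
  REGISTER_ENTRY 0x0010, 000b, WIDTH_32   ; eax -> 0x02020010
  OPERATOR_ENTRY 0x0056, 2                ; mov -> 0x02010056
  OPERATOR_ENTRY 0x004F, 0                ; hlt -> 0x0001004F

; entry_type (hypothetical name)
; in:  edi = one entry
; out: eax = the type digit (bits 19..16)
entry_type:
  mov eax, edi
  shr eax, 16
  and eax, 0xF
  ret

; register_width_bits (hypothetical name)
; in:  edi = a type 0x2 entry
; out: eax = 8, 16, 32 or 64
register_width_bits:
  mov ecx, edi
  shr ecx, 24
  and ecx, 11b                 ; width field (bits 25..24)
  mov eax, 8
  shl eax, cl                  ; 8 << width
  ret
```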

#### `opcodes.by_id`

entries are as follows:

```
+----------+--------+----------+
| 31    24 | 23  16 | 15     0 |
+----------+--------+----------+
| reserved | opcode | token ID |
+----------+--------+----------+
```

note the lack of support for multiple-byte opcodes or multiple opcodes for one token ID; these features will likely be added at some point after the parser accumulates too much jank.
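
a lookup over these entries might look like the sketch below. `find_opcode` and the `example_*` labels are hypothetical; the opcode bytes are the ones that appear in the nasm output of the example program.

```nasm
bits 64

; find_opcode (hypothetical name)
; in:  di  = token ID
;      rsi = address of the first entry
;      rdx = address just past the last entry
; out: eax = opcode byte (0..255), or -1 if the token ID has no entry
find_opcode:
.next:
  cmp rsi, rdx
  jae .missing
  cmp [rsi], di                ; the low word of each entry is the token ID
  je .found
  add rsi, 4                   ; entries are 32 bits wide
  jmp .next
.found:
  movzx eax, byte [rsi + 2]    ; bits 23..16 hold the opcode
  ret
.missing:
  mov eax, -1
  ret

; a tiny table in the documented layout, for illustration only
example_opcodes:
  dd (0xF4 << 16) | 0x004F     ; hlt
  dd (0x31 << 16) | 0x0053     ; xor
  dd (0x89 << 16) | 0x0056     ; mov
example_opcodes_end:
```

with a single opcode byte per token ID, the table can only describe one encoding of `xor` or `mov`, which is the limitation noted above.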

### token IDs

supported tokens are listed below

| token | id     | notes |
|-------|--------|-|
| rax   | 0x0000 | |
| rbx   | 0x0001 | |
| rcx   | 0x0002 | |
| rdx   | 0x0003 | |
| rsi   | 0x0004 | |
| rdi   | 0x0005 | |
| rsp   | 0x0006 | |
| rbp   | 0x0007 | |
| r8    | 0x0008 | |
| r9    | 0x0009 | |
| r10   | 0x000A | |
| r11   | 0x000B | |
| r12   | 0x000C | |
| r13   | 0x000D | |
| r14   | 0x000E | |
| r15   | 0x000F | |
| eax   | 0x0010 | |
| ebx   | 0x0011 | |
| ecx   | 0x0012 | |
| edx   | 0x0013 | |
| esi   | 0x0014 | |
| edi   | 0x0015 | |
| esp   | 0x0016 | |
| ebp   | 0x0017 | |
| r8d   | 0x0018 | |
| r9d   | 0x0019 | |
| r10d  | 0x001A | |
| r11d  | 0x001B | |
| r12d  | 0x001C | |
| r13d  | 0x001D | |
| r14d  | 0x001E | |
| r15d  | 0x001F | |
| ax    | 0x0020 | |
| bx    | 0x0021 | |
| cx    | 0x0022 | |
| dx    | 0x0023 | |
| si    | 0x0024 | |
| di    | 0x0025 | |
| sp    | 0x0026 | |
| bp    | 0x0027 | |
| r8w   | 0x0028 | |
| r9w   | 0x0029 | |
| r10w  | 0x002A | |
| r11w  | 0x002B | |
| r12w  | 0x002C | |
| r13w  | 0x002D | |
| r14w  | 0x002E | |
| r15w  | 0x002F | |
| al    | 0x0030 | |
| bl    | 0x0031 | |
| cl    | 0x0032 | |
| dl    | 0x0033 | |
| sil   | 0x0034 | |
| dil   | 0x0035 | |
| spl   | 0x0036 | |
| bpl   | 0x0037 | |
| r8b   | 0x0038 | |
| r9b   | 0x0039 | |
| r10b  | 0x003A | |
| r11b  | 0x003B | |
| r12b  | 0x003C | |
| r13b  | 0x003D | |
| r14b  | 0x003E | |
| r15b  | 0x003F | |
| ah    | 0x0040 | |
| bh    | 0x0041 | |
| ch    | 0x0042 | |
| dh    | 0x0043 | |
| cs    | 0x0044 | |
| ds    | 0x0045 | |
| es    | 0x0046 | |
| fs    | 0x0047 | |
| gs    | 0x0048 | |
| ss    | 0x0049 | |
| cr0   | 0x004A | |
| cr2   | 0x004B | |
| cr3   | 0x004C | |
| cr4   | 0x004D | |
| cr8   | 0x004E | |
| hlt   | 0x004F | |
| int3  | 0x0050 | |
|       | 0x0051 | deprecated; formerly `[`. Now `0x10XX` is used. |
|       | 0x0052 | deprecated; formerly `]`. |
| xor   | 0x0053 | |
| inc   | 0x0054 | |
| dec   | 0x0055 | |
| mov   | 0x0056 | |
| add   | 0x0057 | |
| sub   | 0x0058 | |
| call  | 0x0059 | |
| ret   | 0x005A | |
| cmp   | 0x005B | |
| je    | 0x005C | |
| jne   | 0x005D | |
| jge   | 0x005E | |
| jg    | 0x005F | |
| jle   | 0x0060 | |
| jl    | 0x0061 | |
|       | 0x10XX | some memory address; `XX` is as specified below |
|       | 0xFFFF | unrecognised token |

values of `XX` in `0x10XX`:

|  XX  | description |
|------|-------------|
| 0x00 | following byte is the token ID of some register |

### example program

#### program in assembly

this program doesn't do anything useful; it's just a test

```nasm
xor eax, eax
inc rax
mov [ rax ], rdx
hlt

```

#### tokenisation

```nasm
0x0053 ; xor
0xFE20 ; space
0x0010 ; eax
0xFE2C ; comma
0xFE20 ; space
0x0010 ; eax
0xFE0A ; newline
0x0054 ; inc
0xFE20 ; space
0x0000 ; rax
0xFE0A ; newline
0x0056 ; mov
0xFE20 ; space
0x1004 ; open bracket (4)
0xFE20 ; space         |1
0x0000 ; rax           |2
0xFE20 ; space         |3
0x0052 ; close bracket |4
0xFE2C ; comma
0xFE20 ; space
0x0003 ; rdx
0xFE0A ; newline
0x004F ; hlt
0xFE0A ; newline
0xFE00 ; null terminator
```

#### nasm output with the above example program, bits 64

```nasm
0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; register-direct addressing
     ; reg 000b ; EAX
     ; r/m 000b ; EAX

0x48 ; REX.W prefix (64-bit operand size)
0xFF ; with `reg` from ModR/M byte 000b:
     ; INC r/m16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; register-direct addressing
     ; reg 000b ; opcode extension /0 (selects INC)
     ; r/m 000b ; RAX

0x48 ; REX.W prefix (64-bit operand size)
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0xF4 ; HLT
```
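
the ModR/M bytes above are just the three fields shifted into place. A small sketch that rebuilds the same byte sequence from its fields; the `MODRM` macro and the `REX_W` name are hypothetical helpers, not something twasm defines.

```nasm
bits 64

; hypothetical helper: packs the three ModR/M fields into one byte
%define MODRM(mode, reg, rm) (((mode) << 6) | ((reg) << 3) | (rm))

REX_W equ 0x48                 ; REX prefix with only the W bit set

example_program:
  db 0x31, MODRM(11b, 000b, 000b)          ; xor eax, eax   -> 31 C0
  db REX_W, 0xFF, MODRM(11b, 000b, 000b)   ; inc rax        -> 48 FF C0
  db REX_W, 0x89, MODRM(00b, 010b, 000b)   ; mov [rax], rdx -> 48 89 10
  db 0xF4                                  ; hlt            -> F4
```

assembling this with `nasm -f bin` should give the same ten bytes as the listing above.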