twasm

this will be a self-hosted assembler for a very minimal subset of nasm-style 64-bit asm

goals

I want to compile Bootler and Twasm with the Twasm assembler

reading

tokeniser

whitespace is ignored for the sake of readability; it can go between pretty much anything

------------------------
tokeniser
------------------------
byte(s) -> next byte(s)
------------------------
Newline   -> Newline
          -> Komment
          -> Operator
          -> Directive

Komment   -> Newline

Operator  -> Newline
          -> Komment
          -> Operand

Operand   -> Newline
          -> Komment
          -> Comma

Comma     -> Operand

Directive -> Newline
          -> Komment
          -> Operator
------------------------
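
the table above can be modelled as a lookup from each token class to the classes allowed to follow it. This is only a Python sketch for checking the table by hand, not the assembler's code; `TRANSITIONS` and `first_bad_transition` are made-up names, and treating the start of the input like a Newline is an assumption.

```python
# Allowed successors for each token class, copied from the table above.
TRANSITIONS = {
    "Newline":   {"Newline", "Komment", "Operator", "Directive"},
    "Komment":   {"Newline"},
    "Operator":  {"Newline", "Komment", "Operand"},
    "Operand":   {"Newline", "Komment", "Comma"},
    "Comma":     {"Operand"},
    "Directive": {"Newline", "Komment", "Operator"},
}

def first_bad_transition(classes, start="Newline"):
    """Return the index of the first class that may not follow its
    predecessor, or None if the whole sequence is allowed.

    Assumption: the start of the input behaves as if a Newline preceded it.
    """
    previous = start
    for index, current in enumerate(classes):
        if current not in TRANSITIONS[previous]:
            return index
        previous = current
    return None

# xor eax, eax
line = ["Operator", "Operand", "Comma", "Operand", "Newline"]
print(first_bad_transition(line))                   # None
print(first_bad_transition(["Operator", "Comma"]))  # 1
```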

not yet implemented:

------------------------
operand parser
------------------------
byte(s) -> next byte(s)
------------------------
START    -> '['
         -> Register
         -> Constant

'['      -> Register
         -> Constant

']'      -> END

Register -> IF #[, ']'
         -> Operator

Constant -> IF #[, ']'
         -> Operator

Operator -> IF NOT #R, Register
         -> Constant
------------------------
#R = whether a register has been found
#[ = whether a '[' has been found
------------------------
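
since the operand parser is not implemented yet, here is one possible reading of the table as a Python sketch. `operand_is_valid` is a hypothetical name, and one rule is an assumption on top of the table: without a '[' the operand is allowed to end right after a Register or Constant (the table only spells out END after ']').

```python
def operand_is_valid(classes):
    """Check a sequence of classes against the operand parser table.

    classes: list of "[", "]", "Register", "Constant", "Operator".
    Assumption: without a '[' the operand may end after a Register or
    Constant; with a '[' it must end on ']'.
    """
    seen_register = False  # the #R flag
    seen_bracket = False   # the #[ flag
    state = "START"
    for current in classes:
        if state == "START":
            allowed = {"[", "Register", "Constant"}
        elif state == "[":
            allowed = {"Register", "Constant"}
        elif state in ("Register", "Constant"):
            allowed = {"Operator"}
            if seen_bracket:
                allowed.add("]")
        elif state == "Operator":
            allowed = {"Constant"}
            if not seen_register:
                allowed.add("Register")
        else:  # state is "]": only END may follow
            allowed = set()
        if current not in allowed:
            return False
        if current == "Register":
            seen_register = True
        elif current == "[":
            seen_bracket = True
        state = current
    if seen_bracket:
        return state == "]"
    return state in ("Register", "Constant")

print(operand_is_valid(["[", "Register", "]"]))                          # True
print(operand_is_valid(["[", "Register", "Operator", "Register", "]"]))  # False
```

the second call fails because #R is already set when the second Register arrives.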

memory map

+------ 0x00100000 ------+
| hardware, bios stuff   |
+------ 0x00080000 ------+
| output binary          |
+------ 0x00070000 ------+
| token table            |
+------ 0x00060000 ------+
| test arena             |
+------ 0x00050000 ------+
| stack (rsp)            |
+------------------------+
| input                  |
+------------------------+
| assembler              |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+

each 16-bit word in the token table represents one token.

token table (TT)

each token gets loaded into the token table with the following form:

+----------+
| 15     0 |
+----------+
| token id |
+----------+

internal data structures

tokens.[operators|registers]

contains tokens by their type. Intended to be searched by token name to get the token's ID.

each entry is in the following form:

+----------+--------------------------------+
| 47    32 | 31                           0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+

example implementation:

tokens:
  .registers:
    dd "r8"   ; name, zero-padded to 4 bytes (bits 31..0)
    dw 0x0008 ; token ID (bits 47..32)
  .by_name3: ; this is required for futureproofness; the caller can use this to
             ; find the size of registers.by_name2

note that token names longer than 4 bytes do not fit in the 32-bit string field, so they are problematic :/
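
to make the entry layout and the 4-byte limit concrete, here is a small Python model of the table (a sketch, not the real lookup routine). `pack_entry` and `find_token_id` are hypothetical names; returning 0xFFFF on a miss follows the "unrecognised token" ID, and the linear search is an assumption.

```python
import struct

# One entry: 4-byte name (zero-padded, like `dd "r8"`), then a 16-bit token ID.
ENTRY = struct.Struct("<4sH")

def pack_entry(name, token_id):
    raw = name.encode("ascii")
    if len(raw) > 4:
        raise ValueError(f"{name!r} does not fit in the 4-byte name field")
    return ENTRY.pack(raw, token_id)  # struct pads short names with zero bytes

def find_token_id(table, name):
    """Linear search by name; returns 0xFFFF (unrecognised token) on a miss."""
    wanted = name.encode("ascii").ljust(4, b"\0")
    for offset in range(0, len(table), ENTRY.size):
        entry_name, token_id = ENTRY.unpack_from(table, offset)
        if entry_name == wanted:
            return token_id
    return 0xFFFF

table = (pack_entry("rax", 0x0000)
         + pack_entry("r8", 0x0008)
         + pack_entry("r10d", 0x001A))

print(pack_entry("r8", 0x0008).hex(" "))  # 72 38 00 00 08 00
print(hex(find_token_id(table, "r10d")))  # 0x1a
```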

tokens.by_id

contains some tokens with their metadata. Some tokens have embedded information (0x10XX for instance). Those will not have entries in this table, being handled instead inside the assemble function itself.

metadata about some tokens in the following form:

+----------------+----------+-------+----------+
| 31          24 | 23    20 | 19 16 | 15     0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type  | token ID |
+----------------+----------+-------+----------+

the type hex digit is defined as the following:

hex  meaning   examples
0x0  ignored   ; this entire comment is 1 token
0x1  operator  mov, hlt
0x2  register  rsp, al
0xF  unknown   any token ID not represented in the lookup table

typed metadata for the different types is as follows:

+----------+
| type 0x0 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+

+-------------------------------+
| type 0x1                      |
+----------+--------------------+
| 31    26 | 25              24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+

+------------------------------+
| type 0x2                     |
+----------+-----------+-------+
| 31    29 | 28     26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+

; reg is the value that corresponds to the register in the ModR/M byte

; width:
00b ; 8 bit
01b ; 16 bit
10b ; 32 bit
11b ; 64 bit
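
the bit positions above can be checked with a few shifts and masks. This is a Python sketch of the layout only; the function names are invented here, and the example values (rdx with reg 010b and width 11b, mov with 2 operands) are taken from the tables and the example program in this README.

```python
TYPE_IGNORED, TYPE_OPERATOR, TYPE_REGISTER, TYPE_UNKNOWN = 0x0, 0x1, 0x2, 0xF

def pack_operator(token_id, operand_count):
    # bits 25..24: number of operands, bits 19..16: type, bits 15..0: token ID
    return ((operand_count & 0x3) << 24) | (TYPE_OPERATOR << 16) | token_id

def pack_register(token_id, reg, width):
    # bits 28..26: reg value, bits 25..24: width, bits 19..16: type
    return ((reg & 0x7) << 26) | ((width & 0x3) << 24) | (TYPE_REGISTER << 16) | token_id

def unpack(entry):
    """Split a 32-bit tokens.by_id entry into its named fields."""
    fields = {"token_id": entry & 0xFFFF, "type": (entry >> 16) & 0xF}
    if fields["type"] == TYPE_OPERATOR:
        fields["operands"] = (entry >> 24) & 0x3
    elif fields["type"] == TYPE_REGISTER:
        fields["reg"] = (entry >> 26) & 0x7
        fields["width"] = (entry >> 24) & 0x3
    return fields

rdx = pack_register(0x0003, reg=0b010, width=0b11)
mov = pack_operator(0x0056, operand_count=2)
print(hex(rdx))  # 0xb020003
print(hex(mov))  # 0x2010056
```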

opcodes.by_id

entries are as follows:

+-----------------+-----------------+----------+
| 31           24 | 23           16 | 15     0 |
+-----------------+-----------------+----------+
| dest=reg opcode | dest=r/m opcode | token ID |
+-----------------+-----------------+----------+

note the lack of support for multiple-byte opcodes or multiple opcodes for one token ID; these features will likely be added at some point after the parser accumulates too much jank.
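
a Python sketch of how such entries could be packed and searched. It assumes "dest=r/m" means the destination is the ModR/M r/m field and "dest=reg" means it is the reg field; under that reading xor uses 0x31 / 0x33 and mov uses 0x89 / 0x8B (the standard x86 opcodes for those forms). `pack_opcode_entry`, `OPCODES` and `find_opcode` are hypothetical names.

```python
def pack_opcode_entry(token_id, dest_rm_opcode, dest_reg_opcode):
    # bits 31..24: dest=reg opcode, bits 23..16: dest=r/m opcode,
    # bits 15..0: token ID
    return (dest_reg_opcode << 24) | (dest_rm_opcode << 16) | token_id

OPCODES = [
    pack_opcode_entry(0x0053, 0x31, 0x33),  # xor
    pack_opcode_entry(0x0056, 0x89, 0x8B),  # mov
]

def find_opcode(token_id, dest_is_reg):
    """Return the opcode byte for token_id, or None if it has no entry."""
    for entry in OPCODES:
        if (entry & 0xFFFF) == token_id:
            shift = 24 if dest_is_reg else 16
            return (entry >> shift) & 0xFF
    return None

print(hex(OPCODES[0]))                               # 0x33310053
print(hex(find_opcode(0x0056, dest_is_reg=False)))   # 0x89
```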

token IDs

supported tokens are listed below

token  id      notes
rax    0x0000
rbx    0x0001
rcx    0x0002
rdx    0x0003
rsi    0x0004
rdi    0x0005
rsp    0x0006
rbp    0x0007
r8     0x0008
r9     0x0009
r10    0x000A
r11    0x000B
r12    0x000C
r13    0x000D
r14    0x000E
r15    0x000F
eax    0x0010
ebx    0x0011
ecx    0x0012
edx    0x0013
esi    0x0014
edi    0x0015
esp    0x0016
ebp    0x0017
r8d    0x0018
r9d    0x0019
r10d   0x001A
r11d   0x001B
r12d   0x001C
r13d   0x001D
r14d   0x001E
r15d   0x001F
ax     0x0020
bx     0x0021
cx     0x0022
dx     0x0023
si     0x0024
di     0x0025
sp     0x0026
bp     0x0027
r8w    0x0028
r9w    0x0029
r10w   0x002A
r11w   0x002B
r12w   0x002C
r13w   0x002D
r14w   0x002E
r15w   0x002F
al     0x0030
bl     0x0031
cl     0x0032
dl     0x0033
sil    0x0034
dil    0x0035
spl    0x0036
bpl    0x0037
r8b    0x0038
r9b    0x0039
r10b   0x003A
r11b   0x003B
r12b   0x003C
r13b   0x003D
r14b   0x003E
r15b   0x003F
ah     0x0040
bh     0x0041
ch     0x0042
dh     0x0043
cs     0x0044
ds     0x0045
es     0x0046
fs     0x0047
gs     0x0048
ss     0x0049
cr0    0x004A
cr2    0x004B
cr3    0x004C
cr4    0x004D
cr8    0x004E
hlt    0x004F
int3   0x0050
       0x0051  deprecated; formerly [. Now 0x10XX is used.
       0x0052  deprecated; formerly ].
xor    0x0053
inc    0x0054
dec    0x0055
mov    0x0056
add    0x0057
sub    0x0058
call   0x0059
ret    0x005A
cmp    0x005B
je     0x005C
jne    0x005D
jge    0x005E
jg     0x005F
jle    0x0060
jl     0x0061
       0x10XX  some memory address; XX is as specified below
       0xFFFF  unrecognised token

values of XX in 0x10XX:

XX    description
0x00  following byte is the token ID of some register

example program

program in assembly

this program doesn't do anything useful; it's just a test

xor eax, eax
inc rax
mov [ rax ], rdx
hlt

tokenization

0x0053 ; xor
0xFE20 ; space
0x0010 ; eax
0xFE2C ; comma
0xFE20 ; space
0x0010 ; eax
0xFE0A ; newline
0x0054 ; inc
0xFE20 ; space
0x0000 ; rax
0xFE0A ; newline
0x0056 ; mov
0xFE20 ; space
0x1004 ; open bracket (4)
0xFE20 ; space         |1
0x0000 ; rax           |2
0xFE20 ; space         |3
0x0052 ; close bracket |4
0xFE2C ; comma
0xFE20 ; space
0x0003 ; rdx
0xFE0A ; newline
0x004F ; hlt
0xFE0A ; newline
0xFE00 ; null terminator
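
the first two lines of the listing above can be reproduced with a toy tokeniser. This Python sketch is not the real one: it assumes that 0xFEXX carries the raw separator byte XX (inferred from 0xFE20, 0xFE2C, 0xFE0A and 0xFE00 in the listing), it only knows the handful of names in `TOKEN_IDS`, and it leaves brackets out entirely.

```python
# Subset of the token ID table, enough for the first two example lines.
TOKEN_IDS = {"xor": 0x0053, "inc": 0x0054, "hlt": 0x004F,
             "eax": 0x0010, "rax": 0x0000}
SEPARATORS = " ,\n"

def tokenise(source):
    """Turn source text into a list of 16-bit token IDs.

    Assumption: each separator byte XX is emitted as 0xFEXX, and the list
    ends with 0xFE00 as a null terminator.
    """
    tokens, word = [], ""
    for char in source:
        if char in SEPARATORS:
            if word:
                tokens.append(TOKEN_IDS.get(word, 0xFFFF))
                word = ""
            tokens.append(0xFE00 | ord(char))
        else:
            word += char
    if word:
        tokens.append(TOKEN_IDS.get(word, 0xFFFF))
    tokens.append(0xFE00)  # null terminator
    return tokens

for token in tokenise("xor eax, eax\ninc rax\n"):
    print(f"0x{token:04X}")
```

names missing from `TOKEN_IDS` come out as 0xFFFF, the unrecognised token.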

nasm output for the above example program, assembled with bits 64

0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; directly address the following:
     ; reg 000b ; EAX
     ; r/m 000b ; EAX

0x48 ; REX.W prefix (64-bit operand size)
0xFF ; with `reg` from ModR/M byte 000b:
     ; INC r/m16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; direct addressing
     ; reg 000b ; RAX
     ; r/m 000b ; RAX

0x48 ; REX.W prefix (64-bit operand size)
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0xF4 ; HLT
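
the ModR/M bytes above are just three fields packed into one byte (mod in bits 7..6, reg in bits 5..3, r/m in bits 2..0). A Python sketch that rebuilds the listed bytes from those fields; `modrm` and the constant names are invented for illustration, and the register numbers are the ones shown in the listing.

```python
REX_W = 0x48  # REX prefix with the W bit set: 64-bit operand size

def modrm(mod, reg, rm):
    """Pack the three ModR/M fields into one byte."""
    return ((mod & 0x3) << 6) | ((reg & 0x7) << 3) | (rm & 0x7)

RAX, RDX = 0b000, 0b010  # register numbers as used in the ModR/M byte

program = bytes([
    0x31, modrm(0b11, RAX, RAX),         # xor eax, eax
    REX_W, 0xFF, modrm(0b11, 0, RAX),    # inc rax (reg field 000b selects INC)
    REX_W, 0x89, modrm(0b00, RDX, RAX),  # mov [rax], rdx
    0xF4,                                # hlt
])
print(program.hex(" "))  # 31 c0 48 ff c0 48 89 10 f4
```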