twasm

this will be a self-hosted assembler for a very minimal subset of nasm-style 64 bit asm

goals

I want to compile Bootler and Twasm with the Twasm assembler

reading

tokeniser

whitespace is ignored for the sake of readability; it can go between pretty much anything

------------------------
tokeniser
------------------------
byte(s) -> next byte(s)
------------------------
Newline   -> Newline
          -> Komment
          -> Operator
          -> Directive

Komment   -> Newline

Operator  -> Newline
          -> Komment
          -> Operand

Operand   -> Newline
          -> Komment
          -> Comma

Comma     -> Operand

Directive -> Newline
          -> Komment
          -> Operator
------------------------
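
the transition table above can be modelled as a lookup from each token kind to the kinds allowed to follow it. The Python sketch below is only an illustration: `ALLOWED_NEXT` and `first_invalid_transition` are hypothetical names, and treating the start of the input like the state after a Newline is an assumption.

```python
# Hypothetical model of the tokeniser transition table; not taken from the Twasm source.
ALLOWED_NEXT = {
    "Newline":   {"Newline", "Komment", "Operator", "Directive"},
    "Komment":   {"Newline"},
    "Operator":  {"Newline", "Komment", "Operand"},
    "Operand":   {"Newline", "Komment", "Comma"},
    "Comma":     {"Operand"},
    "Directive": {"Newline", "Komment", "Operator"},
}


def first_invalid_transition(kinds, start="Newline"):
    """Return the index of the first kind that may not follow its predecessor, or None."""
    previous = start  # assumption: the start of input behaves like a fresh line
    for index, kind in enumerate(kinds):
        if kind not in ALLOWED_NEXT[previous]:
            return index
        previous = kind
    return None


# `mov rdx, [rax]` followed by a newline
ok = ["Operator", "Operand", "Comma", "Operand", "Newline"]
# two operands with no comma between them
bad = ["Operator", "Operand", "Operand", "Newline"]
```

keeping the table as data rather than as branches makes it easy to compare against the diagram line by line.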

not yet implemented:

------------------------
operand parser
------------------------
byte(s) -> next byte(s)
------------------------
START    -> '['
         -> Register
         -> Constant

'['      -> Register
         -> Constant

']'      -> END

Register -> IF #[, ']'
         -> Operator

Constant -> IF #[, ']'
         -> Operator

Operator -> IF NOT #R, Register
         -> Constant
------------------------
#R = whether a register has been found
#[ = whether a '[' has been found
------------------------
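
since the operand parser is not implemented yet, the following Python sketch is only a literal reading of the table above, with the two flags passed in as booleans. `allowed_next` and its parameter names are hypothetical, and the sketch does not add any transitions the table leaves out.

```python
def allowed_next(state, seen_register=False, seen_bracket=False):
    """Return the set of items the table permits after `state`.

    seen_register models the R flag, seen_bracket models the [ flag.
    """
    if state == "START":
        return {"[", "Register", "Constant"}
    if state == "[":
        return {"Register", "Constant"}
    if state == "]":
        return {"END"}
    if state in ("Register", "Constant"):
        following = {"Operator"}
        if seen_bracket:          # IF [ has been found, ']' may close the operand
            following.add("]")
        return following
    if state == "Operator":
        following = {"Constant"}
        if not seen_register:     # IF NOT R, a register is still allowed
            following.add("Register")
        return following
    raise ValueError("unknown state: %r" % (state,))
```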

memory map

+------ 0x00100000 ------+
| hardware, bios stuff   |
+------ 0x00080000 ------+
| output binary          |
+------ 0x00070000 ------+
| token table            |
+------ 0x00060000 ------+
| test arena             |
+------ 0x00050000 ------+
| stack (rsp)            |
+------------------------+
| input                  |
+------------------------+
| assembler              |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+

each 16 bit word in the token table represents one token.

token table (TT)

each token gets loaded into the token table with the following form:

+----------+
| 15     0 |
+----------+
| token id |
+----------+
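
as a rough illustration of this layout, the Python sketch below serialises a list of token IDs as consecutive 16 bit words. Little-endian byte order is assumed because the target is x86; the region bounds are read off the memory map. `TT_BASE`, `TT_SIZE`, `MAX_TOKENS` and `pack_tokens` are hypothetical names.

```python
import struct

TT_BASE = 0x00060000            # bottom of the token table region in the memory map
TT_SIZE = 0x00070000 - TT_BASE  # the region ends where the output binary begins
MAX_TOKENS = TT_SIZE // 2       # one 16 bit word per token


def pack_tokens(token_ids):
    """Pack token IDs as consecutive little-endian 16 bit words."""
    if len(token_ids) > MAX_TOKENS:
        raise ValueError("token table region would overflow")
    return struct.pack("<%dH" % len(token_ids), *token_ids)


# xor eax, eax
image = pack_tokens([0x0053, 0x0010, 0x0010])
```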

internal data structures

tokens.[operators|registers]

contains tokens by their type. Intended to be searched by token name to get the token's ID.

each entry is in the following form:

+----------+--------------------------------+
| 47    32 | 31                           0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+

example implementation:

tokens:
  .registers:
    dd "r8"
    dw 0x0008
  .by_name3: ; this is required for futureproofness; the caller can use this to
             ; find the size of registers.by_name2

note that tokens longer than 4 bytes are problematic :/
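
to make the entry layout concrete, the Python sketch below packs and searches entries of this form: 4 name bytes followed by a 16 bit token ID, little-endian. It is a host-side model, not the assembler's own lookup routine; `ENTRY`, `pack_entry` and `find_token_id` are hypothetical names. Returning 0xFFFF for a miss follows the "unrecognised token" ID in the token ID list.

```python
import struct

# bits 31..0: name (zero padded, no terminator), bits 47..32: token ID
ENTRY = struct.Struct("<4sH")


def pack_entry(name, token_id):
    raw = name.encode("ascii")
    if len(raw) > 4:
        raise ValueError("name does not fit in the 4 byte string field")
    return ENTRY.pack(raw, token_id)  # struct pads short names with zero bytes


def find_token_id(table, name):
    """Linear search by name; returns 0xFFFF when the name is not in the table."""
    key = name.encode("ascii").ljust(4, b"\0")
    for offset in range(0, len(table), ENTRY.size):
        entry_name, token_id = ENTRY.unpack_from(table, offset)
        if entry_name == key:
            return token_id
    return 0xFFFF


table = pack_entry("r8", 0x0008) + pack_entry("rax", 0x0000)
```

the first entry comes out as the same six bytes that `dd "r8"` followed by `dw 0x0008` would emit.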

tokens.by_id

contains some tokens with their metadata. Some tokens have embedded information (0x10XX for instance). Those will not have entries in this table, being handled instead inside the assemble function itself.

metadata about some tokens in the following form:

+----------------+----------+-------+----------+
| 31          24 | 23    20 | 19 16 | 15     0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type  | token ID |
+----------------+----------+-------+----------+

the type hex digit is defined as follows:

hex  meaning   examples
0x0  ignored
0x1  operator  mov, hlt
0x2  register  rsp, al
0xF  unknown   any token ID not represented in the lookup table

the typed metadata for each type is as follows:

+----------+
| type 0x0 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
+-------------------------------+
| type 0x1                      |
+----------+--------------------+
| 31    26 | 25              24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+
+------------------------------+
| type 0x2                     |
+----------+-----------+-------+
| 31    29 | 28     26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+

; reg is the value that corresponds to the register in the ModR/M byte

; width:
00b ; 8 bit
01b ; 16 bit
10b ; 32 bit
11b ; 64 bit
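
the bit fields above can be unpacked with shifts and masks. The Python sketch below decodes one 32 bit `tokens.by_id` entry; `decode_by_id` is a hypothetical name, and the two example entries are built by hand from the layout rather than copied from the actual table.

```python
def decode_by_id(entry):
    """Split a 32 bit tokens.by_id entry into its fields."""
    fields = {
        "token_id": entry & 0xFFFF,   # bits 15..0
        "type": (entry >> 16) & 0xF,  # bits 19..16
    }
    if fields["type"] == 0x1:         # operator
        fields["operands"] = (entry >> 24) & 0b11          # bits 25..24
    elif fields["type"] == 0x2:       # register
        fields["width_bits"] = 8 << ((entry >> 24) & 0b11)  # 00b -> 8 ... 11b -> 64
        fields["reg"] = (entry >> 26) & 0b111               # bits 28..26
    return fields


# hand-built examples (assumptions): rdx as a 64 bit register with reg 010b,
# and mov as an operator with 2 operands
RDX_ENTRY = (0b010 << 26) | (0b11 << 24) | (0x2 << 16) | 0x0003
MOV_ENTRY = (2 << 24) | (0x1 << 16) | 0x0056
```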

opcodes.by_id

entries are as follows:

+------------------------------+
| 0 operand operators          |
+------------------------------+
| 127                       96 |
+------------------------------+
| reserved                     |
+------------------------------+
| 95                        64 |
+------------------------------+
| reserved                     |
+------------------------------+
| 63                        32 |
+------------------------------+
| reserved                     |
+----------+--------+----------+
| 31    24 | 23  16 | 15     0 |
+----------+--------+----------+
| reserved | opcode | token ID |
+----------+--------+----------+

+-------------------------------------------------------------+
| 1 operand operators                                         |
+-------------------------------------------------------------+
| 127                                                      96 |
+-------------------------------------------------------------+
| reserved                                                    |
+----------+-------+-------+-------+-------+----------+-------+
| 95    88 | 87 84 | 83 80 | 79 76 | 75 72 | 71    68 | 67 64 |
+----------+-------+-------+-------+-------+----------+-------+
| reserved | op5&8 | op4&8 | op3&8 | op2&8 | reserved | op0&8 |
+----------+-------+-------+-------+-------+----------+-------+
| 63    56 | 55         48 | 47         40 | 39            32 |
+----------+---------------+---------------+------------------+
| opcode   | opcode        | opcode        | opcode           |
| dst=rel8 | dst=rel       | dst=imm8      | dst=imm          |
+----------+---------------+---------------+------------------+
| 31    24 | 23         16 | 15                             0 |
+----------+---------------+----------------------------------+
| reserved | opcode        | token ID                         |
|          | dst=r/m       |                                  |
+----------+---------------+----------------------------------+

+----------------------------------------------+
| 2 operand operators                          |
+----------------------------------------------+
| 127                                       96 |
+----------------------------------------------+
| reserved                                     |
+-------------------+-------+-------+----------+
| 95             80 | 79 76 | 75 72 | 71    64 |
+-------------------+-------+-------+----------+
| reserved          | op3&8 | op2&8 | reserved |
+-------------------+-------+-------+----------+
| 63             48 | 47         40 | 39    32 |
+-------------------+---------------+----------+
| reserved          | opcode        | opcode   |
|                   | dst=r/m       | dst=r/m  |
|                   | src=imm8      | src=imm  |
+---------+---------+---------------+----------+
| 31   24 | 23   16 | 15                     0 |
+---------+---------+--------------------------+
| opcode  | opcode  | token ID                 |
| dst=r   | dst=r/m |                          |
| src=r/m | src=r   |                          |
+---------+---------+--------------------------+

; key:
r/m  ; r/m 16/32/64
r    ; r   16/32/64
imm  ; imm 16/32
imm8 ; imm 8
rel  ; rel 16/32
rel8 ; rel 8

opX&8 ; low 8 bits are the operator flag that goes with opcode at offset X from
      ; the first opcode in the table entry

note that there is much room to expand. If an opcode doesn't exist, its field should be 0x00
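
in all three layouts the opcode at offset X sits in bits 16+8X to 23+8X, and its opX&8 field sits in bits 64+4X to 67+4X. The Python sketch below reads both from a 128 bit entry held as an integer. `opcode_at`, `flag_at` and `pack_two_operand` are hypothetical names; the example `mov` entry fills in only the two opcodes that appear in the example program (0x89 and 0x8B) and leaves the rest at 0x00 purely to keep the sketch short. The sketch does not interpret the flag value.

```python
def opcode_at(entry, offset):
    """Opcode byte at `offset` from the first opcode in the entry."""
    return (entry >> (16 + 8 * offset)) & 0xFF


def flag_at(entry, offset):
    """The 4 bit opX&8 field that goes with the opcode at `offset`."""
    return (entry >> (64 + 4 * offset)) & 0xF


def pack_two_operand(token_id, opcodes, flags=None):
    """Build a 2 operand entry; `flags` maps opcode offset -> 4 bit field value."""
    entry = token_id & 0xFFFF
    for offset, opcode in enumerate(opcodes):
        entry |= (opcode & 0xFF) << (16 + 8 * offset)
    for offset, flag in (flags or {}).items():
        entry |= (flag & 0xF) << (64 + 4 * offset)
    return entry


# offset 0: dst=r/m src=r, offset 1: dst=r src=r/m
MOV_ENTRY = pack_two_operand(0x0056, [0x89, 0x8B])
```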

token IDs

supported tokens are listed below

token   id      notes
rax     0x0000
rbx     0x0001
rcx     0x0002
rdx     0x0003
rsi     0x0004
rdi     0x0005
rsp     0x0006
rbp     0x0007
r8      0x0008  unimplemented
r9      0x0009  unimplemented
r10     0x000A  unimplemented
r11     0x000B  unimplemented
r12     0x000C  unimplemented
r13     0x000D  unimplemented
r14     0x000E  unimplemented
r15     0x000F  unimplemented
eax     0x0010
ebx     0x0011
ecx     0x0012
edx     0x0013
esi     0x0014
edi     0x0015
esp     0x0016
ebp     0x0017
r8d     0x0018  unimplemented
r9d     0x0019  unimplemented
r10d    0x001A  unimplemented
r11d    0x001B  unimplemented
r12d    0x001C  unimplemented
r13d    0x001D  unimplemented
r14d    0x001E  unimplemented
r15d    0x001F  unimplemented
ax      0x0020  unimplemented
bx      0x0021  unimplemented
cx      0x0022  unimplemented
dx      0x0023  unimplemented
si      0x0024  unimplemented
di      0x0025  unimplemented
sp      0x0026  unimplemented
bp      0x0027  unimplemented
r8w     0x0028  unimplemented
r9w     0x0029  unimplemented
r10w    0x002A  unimplemented
r11w    0x002B  unimplemented
r12w    0x002C  unimplemented
r13w    0x002D  unimplemented
r14w    0x002E  unimplemented
r15w    0x002F  unimplemented
al      0x0030  unimplemented
bl      0x0031  unimplemented
cl      0x0032  unimplemented
dl      0x0033  unimplemented
sil     0x0034  unimplemented
dil     0x0035  unimplemented
spl     0x0036  unimplemented
bpl     0x0037  unimplemented
r8b     0x0038  unimplemented
r9b     0x0039  unimplemented
r10b    0x003A  unimplemented
r11b    0x003B  unimplemented
r12b    0x003C  unimplemented
r13b    0x003D  unimplemented
r14b    0x003E  unimplemented
r15b    0x003F  unimplemented
ah      0x0040  unimplemented
bh      0x0041  unimplemented
ch      0x0042  unimplemented
dh      0x0043  unimplemented
cs      0x0044  unimplemented
ds      0x0045  unimplemented
es      0x0046  unimplemented
fs      0x0047  unimplemented
gs      0x0048  unimplemented
ss      0x0049  unimplemented
cr0     0x004A  unimplemented
cr2     0x004B  unimplemented
cr3     0x004C  unimplemented
cr4     0x004D  unimplemented
cr8     0x004E  unimplemented
hlt     0x004F
int3    0x0050
(none)  0x0051  deprecated; formerly '['. Now 0x10XX is used.
(none)  0x0052  deprecated; formerly ']'.
xor     0x0053
inc     0x0054
dec     0x0055
mov     0x0056
add     0x0057
sub     0x0058
call    0x0059
ret     0x005A
cmp     0x005B
(none)  0x10XX  some memory address; XX is as specified below
(none)  0xFEXX  used to pass some raw value XX in place of a token id
(none)  0xFFFF  unrecognised token

values of XX in 0x10XX:

XX    description
0x00  the following token is the token ID of some register
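
the special ID ranges can be told apart by the high byte of the token ID. The Python sketch below shows one way to do that; `classify` and the kind names it returns are hypothetical.

```python
def classify(token_id):
    """Return (kind, payload) for a 16 bit token ID."""
    if token_id == 0xFFFF:
        return ("unrecognised", None)
    high, low = token_id >> 8, token_id & 0xFF
    if high == 0x10:
        return ("memory", low)   # low byte is XX, e.g. 0x00 = register follows
    if high == 0xFE:
        return ("raw", low)      # low byte is the raw value being passed
    return ("plain", token_id)   # ordinary register or operator ID
```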

example program

program in assembly

this program doesn't do anything useful; it's just a test

xor eax, eax
inc rax ; inline comment
; one line comment
mov rdx, [rax]
mov [rax], rdx
hlt

tokenisation

0x0053 ; xor
0x0010 ; eax
0x0010 ; eax
0x0054 ; inc
0x0000 ; rax
0x0056 ; mov
0x0003 ; rdx
0x1000 ; memory address: register
0x0000 ; rax
0x0056 ; mov
0x1000 ; memory address: register
0x0000 ; rax
0x0003 ; rdx
0x004F ; hlt
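
the token list above can be reproduced with a very small host-side model. The Python sketch below is not the Twasm tokeniser, which is written in assembly. It handles only what the example program needs, and it assumes a bracketed operand contains no whitespace. `TOKEN_IDS` is a hand-copied subset of the token ID list, and `tokenise` is a hypothetical name.

```python
TOKEN_IDS = {
    "rax": 0x0000, "rdx": 0x0003, "eax": 0x0010,
    "hlt": 0x004F, "xor": 0x0053, "inc": 0x0054, "mov": 0x0056,
}


def tokenise(source):
    tokens = []
    for line in source.splitlines():
        code = line.split(";", 1)[0]             # drop the comment, if any
        for word in code.replace(",", " ").split():
            if word.startswith("[") and word.endswith("]"):
                tokens.append(0x1000)            # memory address: register follows
                word = word[1:-1]
            tokens.append(TOKEN_IDS.get(word, 0xFFFF))
    return tokens


PROGRAM = """\
xor eax, eax
inc rax ; inline comment
; one line comment
mov rdx, [rax]
mov [rax], rdx
hlt
"""
```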

nasm output for the above example program, with bits 64

0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; directly address the following:
     ; reg 000b ; EAX
     ; r/m 000b ; EAX

0x48 ; 64 Bit Operand Size prefix
0xFF ; with `reg` from ModR/M byte 000b:
     ; INC r/m16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; direct addressing
     ; reg 000b ; RAX
     ; r/m 000b ; RAX

0x48 ; 64 Bit Operand Size prefix
0x8B ; MOV r16/32/64 r/m16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0x48 ; 64 Bit Operand Size prefix
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0xF4 ; HLT
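
the ModR/M bytes in the listing follow the standard x86 layout: mod in bits 7..6, reg in bits 5..3, r/m in bits 2..0. The Python sketch below rebuilds them and joins the whole listing into one byte string as a cross-check; `modrm` and `PROGRAM_BYTES` are hypothetical names.

```python
def modrm(mod, reg, rm):
    """Compose a ModR/M byte from its three fields."""
    return (mod << 6) | (reg << 3) | rm


RAX, RDX = 0b000, 0b010  # reg values as used in the listing above

PROGRAM_BYTES = bytes([
    0x31, modrm(0b11, RAX, RAX),        # xor eax, eax
    0x48, 0xFF, modrm(0b11, 0, RAX),    # inc rax (reg field 000b selects INC)
    0x48, 0x8B, modrm(0b00, RDX, RAX),  # mov rdx, [rax]
    0x48, 0x89, modrm(0b00, RDX, RAX),  # mov [rax], rdx
    0xF4,                               # hlt
])
```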