twasm

this will be a self-hosted assembler for a very minimal subset of nasm-style 64 bit asm

goals

I want to compile Bootler and Twasm with the Twasm assembler

reading

tokeniser

whitespace is ignored for the sake of readability; it can go between pretty much anything

------------------------
tokeniser
------------------------
byte(s) -> next byte(s)
------------------------
Newline   -> Label
          -> Newline
          -> Komment
          -> Operator
          -> Directive

Label     -> Newline

Komment   -> Newline

Operator  -> Newline
          -> Komment
          -> Operand

Operand   -> Newline
          -> Komment
          -> Comma

Comma     -> Operand

Directive -> Newline
          -> Komment
          -> Operator
------------------------
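
the transition table above can be read as a small state machine. A minimal Python sketch, assuming token kinds have already been classified and that the start of input behaves like the state after a Newline (`FOLLOWS` and `is_valid_sequence` are illustrative names, not part of twasm):

```python
# Allowed successors for each token kind, copied from the table above.
FOLLOWS = {
    "Newline":   {"Label", "Newline", "Komment", "Operator", "Directive"},
    "Label":     {"Newline"},
    "Komment":   {"Newline"},
    "Operator":  {"Newline", "Komment", "Operand"},
    "Operand":   {"Newline", "Komment", "Comma"},
    "Comma":     {"Operand"},
    "Directive": {"Newline", "Komment", "Operator"},
}


def is_valid_sequence(kinds):
    """Return True if every adjacent pair of kinds is allowed by FOLLOWS.

    Assumption: the start of input is treated like the state after a Newline.
    """
    previous = "Newline"
    for kind in kinds:
        if kind not in FOLLOWS[previous]:
            return False
        previous = kind
    return True
```

for instance, `Operator, Operand, Comma, Operand, Newline` is accepted, while `Operator, Comma` is rejected because a Comma may only follow an Operand.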

memory map

+------ 0x00100000 ------+
| hardware, bios stuff   |
+------ 0x00080000 ------+
| output binary          |
+------ 0x00070000 ------+
| token table            |
+------ 0x00060000 ------+
| test arena             |
+------ 0x00050000 ------+
| label table            |
+------ 0x00040000 ------+
| awaiting label table   |
+------ 0x00030000 ------+
| stack (rsp)            |
+------------------------+
| input                  |
+------------------------+
| assembler              |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+

each 16 bit word in the token table represents one token.
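
the table capacities follow from the boundaries in the map. A small Python sketch of the arithmetic, assuming each table may fill its whole region (the map does not say so explicitly) and using the entry sizes given in the table descriptions (2 bytes per token, 16 bytes per label entry); `REGIONS`, `region_size` and `capacity` are illustrative names:

```python
# Region boundaries copied from the memory map above: (start, end).
REGIONS = {
    "awaiting label table": (0x00030000, 0x00040000),
    "label table":          (0x00040000, 0x00050000),
    "test arena":           (0x00050000, 0x00060000),
    "token table":          (0x00060000, 0x00070000),
    "output binary":        (0x00070000, 0x00080000),
}


def region_size(name):
    """Size of a region in bytes."""
    start, end = REGIONS[name]
    return end - start


def capacity(name, entry_bytes):
    """How many fixed-size entries fit in a region."""
    return region_size(name) // entry_bytes


print(capacity("token table", 2))    # 2-byte token IDs
print(capacity("label table", 16))   # 16-byte label entries
```

under these assumptions the label table holds 4096 entries, which matches the 12 bit index available in a `0x3XXX` label token.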

token table (TT)

each token gets loaded into the token table with the following form:

2 bytes
+----------+
| 15     0 |
+----------+
| token id |
+----------+

label table (LT)

label definitions are stored in and recalled from this table. The memory addresses are relative to the start of the program

16 bytes
+----------+---------+
| 127   96 | 95   64 |
+----------+---------+
| reserved | address |
+----------+---------+
| 63               0 |
+--------------------+
| hash               |
+--------------------+
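
as a rough illustration of the layout above, here is a Python sketch that packs and unpacks one entry. It assumes little-endian byte order (so the hash occupies the first 8 bytes); the hash function itself is not specified here, so the hash value is just an arbitrary placeholder. `pack_label_entry` and `unpack_label_entry` are hypothetical names:

```python
import struct


def pack_label_entry(name_hash, address):
    """Pack one 16-byte label table entry.

    Layout (little-endian, an assumption): hash in bits 63..0,
    address in bits 95..64, reserved (zero) in bits 127..96.
    """
    return struct.pack("<QII", name_hash, address, 0)


def unpack_label_entry(entry):
    """Return (hash, address) from a 16-byte label table entry."""
    name_hash, address, _reserved = struct.unpack("<QII", entry)
    return name_hash, address


entry = pack_label_entry(0x1122334455667788, 0x40)  # placeholder hash
print(entry.hex())
```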

awaiting label table (ALT)

forward references are stored in this table to be filled in after assembly is otherwise complete. The memory addresses are relative to the start of the program

16 bytes
+----------+----------+------------------+---------+
| 127  101 |      100 | 99            96 | 95   64 |
+----------+----------+------------------+---------+
| reserved | abs flag | # bytes reserved | address |
+----------+----------+------------------+---------+
| 63                                             0 |
+--------------------------------------------------+
| hash                                             |
+--------------------------------------------------+
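
the extra fields in the top 32 bits can be sketched the same way. This Python example assumes little-endian byte order and treats `# bytes reserved` as a 4 bit count; the function names and the hash value are placeholders, not twasm's own:

```python
import struct


def pack_alt_entry(name_hash, address, reserved_bytes, absolute):
    """Pack one 16-byte awaiting label table entry (little-endian assumed)."""
    if not 0 <= reserved_bytes <= 0xF:
        raise ValueError("reserved_bytes must fit in 4 bits")
    # Bit 100 is the abs flag, bits 99..96 the number of bytes reserved.
    # Within the top 32-bit field these are bit 4 and bits 3..0.
    top = (int(bool(absolute)) << 4) | reserved_bytes
    return struct.pack("<QII", name_hash, address, top)


def unpack_alt_entry(entry):
    """Return (hash, address, reserved_bytes, absolute)."""
    name_hash, address, top = struct.unpack("<QII", entry)
    return name_hash, address, top & 0xF, bool((top >> 4) & 1)
```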

internal data structures

tokens.[operators|registers]

contains tokens by their type. Intended to be searched by token name to get the token's ID.

each entry is in the following form:

6 bytes
+----------+--------------------------------+
| 47    32 | 31                           0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+

note that tokens longer than 4 bytes are problematic, since the string field is only 32 bits wide :/
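
a minimal Python sketch of searching such a table by name. It assumes little-endian byte order (string in the first 4 bytes, token ID in the next 2) and that names shorter than 4 bytes are padded with zero bytes; neither is stated above, and the function names are hypothetical:

```python
import struct


def pack_token_entry(name, token_id):
    """Pack one 6-byte entry: 4-byte name, then the 16-bit token ID."""
    raw = name.encode("ascii")
    if len(raw) > 4:
        raise ValueError("token names longer than 4 bytes do not fit")
    return struct.pack("<4sH", raw, token_id)  # "4s" pads with zero bytes


def find_token_id(table, name):
    """Linear search by name; 0xFFFF is the unrecognised token ID."""
    key = name.encode("ascii").ljust(4, b"\0")
    for offset in range(0, len(table), 6):
        text, token_id = struct.unpack_from("<4sH", table, offset)
        if text == key:
            return token_id
    return 0xFFFF


table = pack_token_entry("mov", 0x0056) + pack_token_entry("int3", 0x0050)
print(hex(find_token_id(table, "mov")))
```

the length check in `pack_token_entry` is where the 4 byte limit shows up: a longer name cannot be stored without truncation.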

tokens.by_id

contains some tokens with their metadata. Some tokens have embedded information (0x10XX for instance). Those do not have entries in this table, being handled instead inside the assemble function itself.

metadata about some tokens in the following form:

4 bytes
+----------------+----------+-------+----------+
| 31          24 | 23    20 | 19 16 | 15     0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type  | token ID |
+----------------+----------+-------+----------+

the type hex digit is defined as follows:

hex   meaning           examples
0x0   ignored
0x1   operator          mov, hlt
0x2   register          rsp, al
0x3   pseudo-operator   db
0xF   unknown           any token ID not represented in the lookup table

type metadata for the different types is as follows:

1 byte
+----------+
| type 0x0 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+

1 byte
+-------------------------------+
| type 0x1                      |
+----------+--------------------+
| 31    26 | 25              24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+

1 byte
+------------------------------+
| type 0x2                     |
+----------+-----------+-------+
| 31    29 | 28     26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+

; reg is the value that corresponds to the register in the ModR/M byte

; width:
00b ; 8 bit
01b ; 16 bit
10b ; 32 bit
11b ; 64 bit

1 byte
+----------+
| type 0x3 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
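
to make the bit positions concrete, here is a Python sketch that splits one 32 bit `tokens.by_id` entry into its fields. The example values for `rdx` and `mov` are built by hand from the layouts above and are not copied from twasm's actual table; `decode_token_meta` is a hypothetical name:

```python
def decode_token_meta(entry):
    """Split a 32-bit tokens.by_id entry into its fields."""
    token_id = entry & 0xFFFF          # bits 15..0
    kind = (entry >> 16) & 0xF         # bits 19..16
    meta = (entry >> 24) & 0xFF        # bits 31..24
    fields = {"token_id": token_id, "type": kind}
    if kind == 0x1:                    # operator: bits 25..24
        fields["operands"] = meta & 0b11
    elif kind == 0x2:                  # register
        fields["width_bits"] = 8 << (meta & 0b11)  # 00b=8 ... 11b=64
        fields["reg"] = (meta >> 2) & 0b111        # bits 28..26
    return fields


# rdx: token ID 0x0003, type 0x2, reg value 010b, width 11b (64 bit)
RDX_ENTRY = (((0b010 << 2) | 0b11) << 24) | (0x2 << 16) | 0x0003
print(decode_token_meta(RDX_ENTRY))
```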

opcodes.by_id

entries are as follows:

16 bytes
+------------------------------+
| 0 operand operators          |
+------------------------------+
| 127                       96 |
+------------------------------+
| reserved                     |
+------------------------------+
| 95                        64 |
+------------------------------+
| reserved                     |
+------------------------------+
| 63                        32 |
+------------------------------+
| reserved                     |
+----------+--------+----------+
| 31    24 | 23  16 | 15     0 |
+----------+--------+----------+
| reserved | opcode | token ID |
+----------+--------+----------+

16 bytes
+-------------------------------------------------------------+
| 1 operand operators                                         |
+-------------------------------------------------------------+
| 127                                                      96 |
+-------------------------------------------------------------+
| reserved                                                    |
+----------+-------+-------+-------+-------+----------+-------+
| 95    88 | 87 84 | 83 80 | 79 76 | 75 72 | 71    68 | 67 64 |
+----------+-------+-------+-------+-------+----------+-------+
| reserved | op5&8 | op4&8 | op3&8 | op2&8 | reserved | op0&8 |
+----------+-------+-------+-------+-------+----------+-------+
| 63    56 | 55         48 | 47         40 | 39            32 |
+----------+---------------+---------------+------------------+
| opcode   | opcode        | opcode        | opcode           |
| dst=rel8 | dst=rel       | dst=imm8      | dst=imm          |
+----------+---------------+---------------+------------------+
| 31    24 | 23         16 | 15                             0 |
+----------+---------------+----------------------------------+
| reserved | opcode        | token ID                         |
|          | dst=r/m       |                                  |
+----------+---------------+----------------------------------+

16 bytes
+-----------------------------------------------+
| 2 operand operators                           |
+-----------------------------------------------+
| 127                                        96 |
+-----------------------------------------------+
| reserved                                      |
+---------+----------+-------+-------+----------+
| 95   88 | 87    80 | 79 76 | 75 72 | 71    64 |
+---------+----------+-------+-------+----------+
| flags   | reserved | op3&8 | op2&8 | reserved |
+---------+----------+-------+-------+----------+
| 63              48 | 47         40 | 39    32 |
+--------------------+---------------+----------+
| reserved           | opcode        | opcode   |
|                    | dst=r/m       | dst=r/m  |
|                    | src=imm8      | src=imm  |
+---------+----------+---------------+----------+
| 31   24 | 23    16 | 15                     0 |
+---------+----------+--------------------------+
| opcode  | opcode   | token ID                 |
| dst=r   | dst=r/m  |                          |
| src=r/m | src=r    |                          |
+---------+----------+--------------------------+

1 byte
+-----------------+
| flags byte      |
+----------+------+
| 95    89 |  88  |
+----------+------+
| reserved | 8bit |
+----------+------+

; flags key:
8bit ; tte has opcodes for r/m8 and r8 instead of r/m and r respectively

; key:
r/m  ; r/m 16/32/64
r/m8 ; r/m 8
r    ; r   16/32/64
r8   ; r   8
imm  ; imm 16/32
imm8 ; imm 8
rel  ; rel 16/32
rel8 ; rel 8

opX&8 ; low 3 bits are the operator flag that goes with the opcode at offset X
      ; from the first opcode in the table entry. High bit is (somewhat
      ; confusingly) a flag for whether or not the operator comes with an `0F`
      ; prefix

note: there is much room to expand. If an opcode doesn't exist, it should be 0x00
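
a Python sketch of decoding a 2 operand entry, treating the 16 bytes as one 128 bit integer. It assumes each `opX&8` field is a 4 bit nibble whose low three bits hold the operator flag and whose high bit is the `0F` prefix flag. The `mov` entry is illustrative only: 0x89 and 0x8B appear in the nasm listing in this document and 0xC7 is the usual x86 opcode for `mov r/m, imm`, but whether twasm's table holds exactly these values is an assumption:

```python
def decode_two_operand_entry(entry):
    """Decode a 2 operand opcodes.by_id entry given as a 128-bit integer."""
    def byte_at(bit):
        return (entry >> bit) & 0xFF

    def nibble_at(bit):
        nibble = (entry >> bit) & 0xF
        # low 3 bits: operator flag; high bit: 0F prefix flag
        return {"flag": nibble & 0b111, "prefix_0f": bool(nibble & 0b1000)}

    return {
        "token_id": entry & 0xFFFF,   # bits 15..0
        "rm_r": byte_at(16),          # dst=r/m, src=r
        "r_rm": byte_at(24),          # dst=r,   src=r/m
        "rm_imm": byte_at(32),        # dst=r/m, src=imm
        "rm_imm8": byte_at(40),       # dst=r/m, src=imm8
        "op2": nibble_at(72),         # goes with the src=imm opcode
        "op3": nibble_at(76),         # goes with the src=imm8 opcode
        "8bit": bool((entry >> 88) & 1),
    }


# Illustrative mov entry; 0x00 in the imm8 slot means "no such opcode".
MOV_ENTRY = 0x0056 | (0x89 << 16) | (0x8B << 24) | (0xC7 << 32)
print(decode_two_operand_entry(MOV_ENTRY))
```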

token IDs

supported tokens are listed below

token   id       notes
rax     0x0000   register
rbx     0x0001   register
rcx     0x0002   register
rdx     0x0003   register
rsi     0x0004   register
rdi     0x0005   register
rsp     0x0006   register
rbp     0x0007   register
r8      0x0008   unimplemented
r9      0x0009   unimplemented
r10     0x000A   unimplemented
r11     0x000B   unimplemented
r12     0x000C   unimplemented
r13     0x000D   unimplemented
r14     0x000E   unimplemented
r15     0x000F   unimplemented
eax     0x0010   register
ebx     0x0011   register
ecx     0x0012   register
edx     0x0013   register
esi     0x0014   register
edi     0x0015   register
esp     0x0016   register
ebp     0x0017   register
r8d     0x0018   unimplemented
r9d     0x0019   unimplemented
r10d    0x001A   unimplemented
r11d    0x001B   unimplemented
r12d    0x001C   unimplemented
r13d    0x001D   unimplemented
r14d    0x001E   unimplemented
r15d    0x001F   unimplemented
ax      0x0020   register
bx      0x0021   register
cx      0x0022   register
dx      0x0023   register
si      0x0024   register
di      0x0025   register
sp      0x0026   register
bp      0x0027   register
r8w     0x0028   unimplemented
r9w     0x0029   unimplemented
r10w    0x002A   unimplemented
r11w    0x002B   unimplemented
r12w    0x002C   unimplemented
r13w    0x002D   unimplemented
r14w    0x002E   unimplemented
r15w    0x002F   unimplemented
al      0x0030   register
bl      0x0031   register
cl      0x0032   register
dl      0x0033   register
sil     0x0034   register
dil     0x0035   register
spl     0x0036   register
bpl     0x0037   register
r8b     0x0038   unimplemented
r9b     0x0039   unimplemented
r10b    0x003A   unimplemented
r11b    0x003B   unimplemented
r12b    0x003C   unimplemented
r13b    0x003D   unimplemented
r14b    0x003E   unimplemented
r15b    0x003F   unimplemented
ah      0x0040   unimplemented
bh      0x0041   unimplemented
ch      0x0042   unimplemented
dh      0x0043   unimplemented
cs      0x0044   unimplemented
ds      0x0045   unimplemented
es      0x0046   unimplemented
fs      0x0047   unimplemented
gs      0x0048   unimplemented
ss      0x0049   unimplemented
cr0     0x004A   unimplemented
cr2     0x004B   unimplemented
cr3     0x004C   unimplemented
cr4     0x004D   unimplemented
cr8     0x004E   unimplemented
hlt     0x004F   operator
int3    0x0050   operator
-       0x0051   deprecated; formerly [. Now 0x10XX is used.
-       0x0052   deprecated; formerly ].
xor     0x0053   operator
inc     0x0054   operator
dec     0x0055   operator
mov     0x0056   operator
add     0x0057   operator
sub     0x0058   operator
call    0x0059   operator
ret     0x005A   operator
cmp     0x005B   operator
jmp     0x005C   operator
je      0x005D   operator
jne     0x005E   operator
push    0x005F   operator
pop     0x0060   operator
out     0x0061   operator
db      0x0100   pseudo-operator
-       0x10XX   some memory address; XX is as specified below
-       0x20XX   some constant; XX is as specified below
-       0x3XXX   some label; XXX is its entry index in the label table
-       0xFEXX   used to pass some raw value XX in place of a token id to a couple of functions that mention this as a feature. If the function doesn't mention it, it will lead to undefined behaviour
-       0xFFFF   unrecognised token

values of XX in 0x10XX:

XX     description
0x00   following word is the token ID of some register

values of XX in 0x20XX:

XX     description
0x00   following 8 bytes are the constant's value
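
the ID ranges above can be summarised as a small classifier. This is only a Python sketch of the ranges as listed; `classify_token` and its category strings are illustrative, and unimplemented registers are still reported as registers:

```python
def classify_token(token_id):
    """Classify a 16-bit token ID by the ranges listed above."""
    if token_id == 0xFFFF:
        return "unrecognised"
    high = token_id >> 8
    if high == 0xFE:
        return "raw value"            # 0xFEXX
    if token_id >> 12 == 0x3:
        return "label"                # 0x3XXX
    if high == 0x20:
        return "constant"             # 0x20XX
    if high == 0x10:
        return "memory address"       # 0x10XX
    if token_id == 0x0100:
        return "pseudo-operator"      # db
    if 0x004F <= token_id <= 0x0061:
        return "deprecated" if token_id in (0x0051, 0x0052) else "operator"
    if token_id <= 0x004E:
        return "register"
    return "unknown"
```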

example program

program in assembly

this program doesn't do anything useful; it's just a test

xor eax, eax
inc rax ; inline comment
; one line comment
mov rdx, [rax]
mov [rax], rdx
hlt

tokenization

0x0053 ; xor
0x0010 ; eax
0x0010 ; eax
0x0054 ; inc
0x0000 ; rax
0x0056 ; mov
0x0003 ; rdx
0x1000 ; memory address: register
0x0000 ; rax
0x0056 ; mov
0x1000 ; memory address: register
0x0000 ; rax
0x0003 ; rdx
0x004F ; hlt
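
the listing above can be reproduced with a very small tokeniser. This Python sketch is not how twasm itself works (twasm is written in asm); it only knows the handful of tokens the example uses, drops comments and newlines, and turns `[` into `0x1000` followed by the register's ID. All names in it are illustrative:

```python
import re

# Subset of the token ID table, just enough for the example program.
TOKEN_IDS = {
    "rax": 0x0000, "rdx": 0x0003, "eax": 0x0010,
    "hlt": 0x004F, "xor": 0x0053, "inc": 0x0054, "mov": 0x0056,
}
MEMORY_REGISTER = 0x1000  # 0x10XX with XX = 0x00: next word is a register ID
UNRECOGNISED = 0xFFFF

PROGRAM = """\
xor eax, eax
inc rax ; inline comment
; one line comment
mov rdx, [rax]
mov [rax], rdx
hlt
"""


def tokenise(source):
    """Return the list of token IDs for a tiny subset of the syntax."""
    tokens = []
    for line in source.splitlines():
        code = line.split(";", 1)[0]  # drop everything after a comment
        for word in re.findall(r"\[|\]|[^\s,\[\]]+", code):
            if word == "[":
                tokens.append(MEMORY_REGISTER)
            elif word != "]":
                tokens.append(TOKEN_IDS.get(word, UNRECOGNISED))
    return tokens


print([f"0x{t:04X}" for t in tokenise(PROGRAM)])
```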

nasm output with the above example program, bits 64

0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; directly address the following:
     ; reg 000b ; EAX
     ; r/m 000b ; EAX

0x48 ; 64 Bit Operand Size prefix
0xFF ; with `reg` from ModR/M byte 000b:
     ; INC r/m16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; direct addressing
     ; reg 000b ; RAX
     ; r/m 000b ; RAX

0x48 ; 64 Bit Operand Size prefix
0x8B ; MOV r16/32/64 r/m16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0x48 ; 64 Bit Operand Size prefix
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0xF4 ; HLT
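
the ModR/M bytes in the listing can be checked by hand: mod sits in bits 7..6, reg in bits 5..3 and r/m in bits 2..0. A short Python sketch that rebuilds the same bytes (`modrm`, `REX_W` and the register constants are names chosen for this example):

```python
def modrm(mod, reg, rm):
    """Build a ModR/M byte: mod in bits 7..6, reg in 5..3, r/m in 2..0."""
    return (mod << 6) | (reg << 3) | rm


REX_W = 0x48             # the 64 bit operand size prefix from the listing
RAX, RDX = 0b000, 0b010  # register values used in the ModR/M byte

program = bytes([
    0x31, modrm(0b11, RAX, RAX),          # xor eax, eax
    REX_W, 0xFF, modrm(0b11, 0b000, RAX), # inc rax (reg field 000b selects INC)
    REX_W, 0x8B, modrm(0b00, RDX, RAX),   # mov rdx, [rax]
    REX_W, 0x89, modrm(0b00, RDX, RAX),   # mov [rax], rdx
    0xF4,                                 # hlt
])

print(program.hex(" "))
```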