Andromeda/bootler


twasm

this will be a self-hosted assembler for a very minimal subset of nasm-style 64-bit asm

goals

I want to compile Bootler and Twasm with the Twasm assembler

reading

instructions
opcodes, ModR/M, SIB (no secure site available)
calling conventions; I try to use System V

tokeniser

whitespace is ignored for the sake of readability; it can go between pretty much anything

------------------------
tokeniser
------------------------
byte(s) -> next byte(s)
------------------------
Newline   -> Newline
          -> Komment
          -> Operator
          -> Directive

Komment   -> Newline

Operator  -> Newline
          -> Komment
          -> Operand

Operand   -> Newline
          -> Komment
          -> Comma

Comma     -> Operand

Directive -> Newline
          -> Komment
          -> Operator
------------------------
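
the transition table above can be modelled as a plain lookup. This is a minimal sketch in Python (not twasm's own code, which is asm); the names `TRANSITIONS` and `is_valid_sequence` are made up for illustration, and it assumes whitespace has already been dropped and that input starts in the Newline state.

```python
# Allowed successors for each token class, copied from the table above.
TRANSITIONS = {
    "Newline":   {"Newline", "Komment", "Operator", "Directive"},
    "Komment":   {"Newline"},
    "Operator":  {"Newline", "Komment", "Operand"},
    "Operand":   {"Newline", "Komment", "Comma"},
    "Comma":     {"Operand"},
    "Directive": {"Newline", "Komment", "Operator"},
}


def is_valid_sequence(classes, start="Newline"):
    """Return True if every step in `classes` is an allowed transition."""
    state = start
    for cls in classes:
        if cls not in TRANSITIONS.get(state, set()):
            return False
        state = cls
    return True


# `xor eax, eax` followed by a newline
print(is_valid_sequence(["Operator", "Operand", "Comma", "Operand", "Newline"]))  # True
# an operand may not directly follow a newline
print(is_valid_sequence(["Operand", "Newline"]))  # False
```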

not yet implemented:

------------------------
operand parser
------------------------
byte(s) -> next byte(s)
------------------------
START    -> '['
         -> Register
         -> Constant

'['      -> Register
         -> Constant

']'      -> END

Register -> IF #[, ']'
         -> Operator

Constant -> IF #[, ']'
         -> Operator

Operator -> IF NOT #R, Register
         -> Constant
------------------------
#R = whether a register has been found
#[ = whether a '[' has been found
------------------------

memory map

+------ 0x00100000 ------+
| hardware, bios stuff   |
+------ 0x00080000 ------+
| output binary          |
+------ 0x00070000 ------+
| token table            |
+------ 0x00060000 ------+
| test arena             |
+------ 0x00050000 ------+
| stack (rsp)            |
+------------------------+
| input                  |
+------------------------+
| assembler              |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+

each 16-bit word in the token table represents one token.
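
as a quick sanity check on the map, here is a small Python sketch that records only the regions whose boundaries are both labelled with an address and works out how many tokens fit in the token table. `REGIONS`, `region_of` and `token_capacity` are hypothetical names, not part of twasm.

```python
# Regions read off the memory map as [start, end). Only regions with both
# boundaries labelled in the map are listed here.
REGIONS = {
    "test arena":          (0x00050000, 0x00060000),
    "token table":         (0x00060000, 0x00070000),
    "output binary":       (0x00070000, 0x00080000),
    "hardware, bios stuff": (0x00080000, 0x00100000),
}

TOKEN_SIZE = 2  # bytes; every token is one 16-bit word


def region_of(address):
    """Return the name of the labelled region containing `address`, or None."""
    for name, (start, end) in REGIONS.items():
        if start <= address < end:
            return name
    return None


def token_capacity():
    """Number of 16-bit tokens that fit in the token table region."""
    start, end = REGIONS["token table"]
    return (end - start) // TOKEN_SIZE


print(token_capacity())       # 32768
print(region_of(0x00061234))  # token table
```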

token table (TT)

each token gets loaded into the token table with the following form:

+----------+
| 15     0 |
+----------+
| token id |
+----------+

internal data structures

`tokens.[operators|registers]`

contains tokens by their type. Intended to be searched by token name to get the token's ID.

each entry is in the following form:

+----------+--------------------------------+
| 47    32 | 31                           0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+

example implementation:

tokens
  .registers:
    dd "r8"
    dw 0x0008
  .by_name3: ; this is required for future-proofing; the caller can use this
             ; label to find the size of registers.by_name2

note that the string field is only 4 bytes wide, so token names longer than 4 bytes are problematic :/
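
to make the 6-byte entry layout concrete, here is a Python sketch that builds the same bytes as the `dd`/`dw` pair above and searches a table by name. It assumes little-endian storage and zero padding of short names (which is what nasm's `dd "r8"` produces); `pack_entry` and `find_id` are illustrative names only.

```python
import struct

ENTRY_SIZE = 6  # 4-byte string followed by a 2-byte token ID


def pack_entry(name, token_id):
    """Build one table entry: zero-padded 4-byte name, then 16-bit ID."""
    raw = name.encode("ascii")
    if len(raw) > 4:
        raise ValueError("name does not fit the 4-byte string field")
    return struct.pack("<4sH", raw, token_id)


def find_id(table, name):
    """Linear search by name; returns 0xFFFF (unrecognised token) on a miss."""
    key = name.encode("ascii").ljust(4, b"\x00")
    for offset in range(0, len(table), ENTRY_SIZE):
        text, token_id = struct.unpack_from("<4sH", table, offset)
        if text == key:
            return token_id
    return 0xFFFF


table = pack_entry("rax", 0x0000) + pack_entry("r8", 0x0008)
print(pack_entry("r8", 0x0008))   # b'r8\x00\x00\x08\x00'
print(hex(find_id(table, "r8")))  # 0x8
```

a name longer than 4 bytes raises `ValueError` in `pack_entry`, which is the limitation noted above.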

`tokens.by_id`

contains some tokens with their metadata. Some tokens have embedded information (0x10XX for instance). Those will not have entries in this table, being handled instead inside the assemble function itself.

metadata about some tokens in the following form:

+----------------+----------+-------+----------+
| 31          24 | 23    20 | 19 16 | 15     0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type  | token ID |
+----------------+----------+-------+----------+

the type hex digit is defined as follows:

hex	meaning	examples
0x0	ignored	`; this entire comment is 1 token`
0x1	operator	`mov`, `hlt`
0x2	register	`rsp`, `al`
0xF	unknown	any token ID not represented in the lookup table

type metadata for the different types is as follows:

+----------+
| type 0x0 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+

+-------------------------------+
| type 0x1                      |
+----------+--------------------+
| 31    26 | 25              24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+

+------------------------------+
| type 0x2                     |
+----------+-----------+-------+
| 31    29 | 28     26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+

; reg is the value that corresponds to the register in the ModR/M byte

; width:
00b ; 8 bit
01b ; 16 bit
10b ; 32 bit
11b ; 64 bit
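
the bit fields above can be packed and unpacked with shifts and masks. The following Python sketch shows one way to do it; the function names are hypothetical, and the sample values (`rdx` having reg value 010b, `mov` taking 2 operands) are filled in from the x86-64 encoding rather than from twasm's actual tables.

```python
TYPE_IGNORED, TYPE_OPERATOR, TYPE_REGISTER, TYPE_UNKNOWN = 0x0, 0x1, 0x2, 0xF
WIDTH_BITS = {0b00: 8, 0b01: 16, 0b10: 32, 0b11: 64}


def pack_operator(token_id, operand_count):
    """Type 0x1 entry: bits 25..24 hold the number of operands."""
    return ((operand_count & 0x3) << 24) | (TYPE_OPERATOR << 16) | (token_id & 0xFFFF)


def pack_register(token_id, reg, width):
    """Type 0x2 entry: bits 28..26 hold the reg value, bits 25..24 the width."""
    return ((reg & 0x7) << 26) | ((width & 0x3) << 24) | (TYPE_REGISTER << 16) | (token_id & 0xFFFF)


def unpack(entry):
    """Split a 32-bit entry back into its fields."""
    fields = {"id": entry & 0xFFFF, "type": (entry >> 16) & 0xF}
    meta = (entry >> 24) & 0xFF
    if fields["type"] == TYPE_OPERATOR:
        fields["operands"] = meta & 0x3
    elif fields["type"] == TYPE_REGISTER:
        fields["reg"] = (meta >> 2) & 0x7
        fields["bits"] = WIDTH_BITS[meta & 0x3]
    return fields


rdx = pack_register(0x0003, 0b010, 0b11)
mov = pack_operator(0x0056, 2)
print(hex(rdx))     # 0xb020003
print(hex(mov))     # 0x2010056
print(unpack(rdx))  # {'id': 3, 'type': 2, 'reg': 2, 'bits': 64}
```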

`opcodes.by_id`

entries are as follows:

+-----------------+-----------------+----------+
| 31           24 | 23           16 | 15     0 |
+-----------------+-----------------+----------+
| dest=reg opcode | dest=r/m opcode | token ID |
+-----------------+-----------------+----------+

note the lack of support for multiple-byte opcodes or multiple opcodes for one token ID; these features will likely be added at some point after the parser accumulates too much jank.
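
a sketch of how such an entry could be built and queried, in Python. It assumes that "dest=r/m" means the destination operand is encoded in the ModR/M r/m field and "dest=reg" means it is encoded in the reg field; the `mov` and `xor` opcode bytes are the standard x86-64 ones, and `pack_opcode` / `pick_opcode` are made-up names.

```python
def pack_opcode(token_id, rm_dest, reg_dest):
    """Build a 32-bit entry: dest=reg opcode, dest=r/m opcode, token ID."""
    return ((reg_dest & 0xFF) << 24) | ((rm_dest & 0xFF) << 16) | (token_id & 0xFFFF)


def pick_opcode(entry, dest_is_reg):
    """Select the opcode byte for the requested operand direction."""
    if dest_is_reg:
        return (entry >> 24) & 0xFF
    return (entry >> 16) & 0xFF


MOV = pack_opcode(0x0056, rm_dest=0x89, reg_dest=0x8B)
XOR = pack_opcode(0x0053, rm_dest=0x31, reg_dest=0x33)

print(hex(MOV))                                  # 0x8b890056
print(hex(pick_opcode(MOV, dest_is_reg=False)))  # 0x89, as in `mov [rax], rdx`
print(hex(pick_opcode(XOR, dest_is_reg=True)))   # 0x33
```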

token IDs

supported tokens are listed below

token	id	notes
rax	0x0000
rbx	0x0001
rcx	0x0002
rdx	0x0003
rsi	0x0004
rdi	0x0005
rsp	0x0006
rbp	0x0007
r8	0x0008
r9	0x0009
r10	0x000A
r11	0x000B
r12	0x000C
r13	0x000D
r14	0x000E
r15	0x000F
eax	0x0010
ebx	0x0011
ecx	0x0012
edx	0x0013
esi	0x0014
edi	0x0015
esp	0x0016
ebp	0x0017
r8d	0x0018
r9d	0x0019
r10d	0x001A
r11d	0x001B
r12d	0x001C
r13d	0x001D
r14d	0x001E
r15d	0x001F
ax	0x0020
bx	0x0021
cx	0x0022
dx	0x0023
si	0x0024
di	0x0025
sp	0x0026
bp	0x0027
r8w	0x0028
r9w	0x0029
r10w	0x002A
r11w	0x002B
r12w	0x002C
r13w	0x002D
r14w	0x002E
r15w	0x002F
al	0x0030
bl	0x0031
cl	0x0032
dl	0x0033
sil	0x0034
dil	0x0035
spl	0x0036
bpl	0x0037
r8b	0x0038
r9b	0x0039
r10b	0x003A
r11b	0x003B
r12b	0x003C
r13b	0x003D
r14b	0x003E
r15b	0x003F
ah	0x0040
bh	0x0041
ch	0x0042
dh	0x0043
cs	0x0044
ds	0x0045
es	0x0046
fs	0x0047
gs	0x0048
ss	0x0049
cr0	0x004A
cr2	0x004B
cr3	0x004C
cr4	0x004D
cr8	0x004E
hlt	0x004F
int3	0x0050
	0x0051	deprecated; formerly `[`. Now `0x10XX` is used.
	0x0052	deprecated; formerly `]`.
xor	0x0053
inc	0x0054
dec	0x0055
mov	0x0056
add	0x0057
sub	0x0058
call	0x0059
ret	0x005A
cmp	0x005B
je	0x005C
jne	0x005D
jge	0x005E
jg	0x005F
jle	0x0060
jl	0x0061
	0x10XX	some memory address; `XX` is as specified below
	0xFFFF	unrecognised token

values of XX in 0x10XX:

XX	description
0x00	following byte is the token ID of some register
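
one pattern worth noting in the register IDs `0x0000` to `0x003F`: the high nibble selects the size group (64, 32, 16, 8 bit) and the low nibble the position within the group. The position is not the hardware register number, because the table lists `rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp` while x86-64 encodes them in the order `rax, rcx, rdx, rbx, rsp, rbp, rsi, rdi`. The Python sketch below derives both; it is an observation about the table, not a description of how twasm does it, and all names are hypothetical.

```python
# x86-64 hardware register number, indexed by the low nibble of the token ID.
HW_NUMBER = [0, 3, 1, 2, 6, 7, 4, 5, 8, 9, 10, 11, 12, 13, 14, 15]
# Operand size in bits, indexed by the high nibble of the token ID.
GROUP_BITS = [64, 32, 16, 8]


def describe_register(token_id):
    """Decode a general-purpose register ID, or return None for other tokens."""
    if not 0x0000 <= token_id <= 0x003F:
        return None
    number = HW_NUMBER[token_id & 0xF]
    return {
        "bits": GROUP_BITS[token_id >> 4],
        "reg": number & 0x7,      # 3-bit value for the ModR/M byte
        "extended": number >= 8,  # r8..r15 need an extra REX bit
    }


print(describe_register(0x0003))  # rdx: {'bits': 64, 'reg': 2, 'extended': False}
print(describe_register(0x0018))  # r8d: {'bits': 32, 'reg': 0, 'extended': True}
print(describe_register(0x0040))  # ah:  None
```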

example program

program in assembly

this program doesn't do anything useful; it's just a test

xor eax, eax
inc rax
mov [ rax ], rdx
hlt

tokenisation

0x0053 ; xor
0xFE20 ; space
0x0010 ; eax
0xFE2C ; comma
0xFE20 ; space
0x0010 ; eax
0xFE0A ; newline
0x0054 ; inc
0xFE20 ; space
0x0000 ; rax
0xFE0A ; newline
0x0056 ; mov
0xFE20 ; space
0x1004 ; open bracket (4)
0xFE20 ; space         |1
0x0000 ; rax           |2
0xFE20 ; space         |3
0x0052 ; close bracket |4
0xFE2C ; comma
0xFE20 ; space
0x0003 ; rdx
0xFE0A ; newline
0x004F ; hlt
0xFE0A ; newline
0xFE00 ; null terminator

nasm output for the above example program, assembled with `bits 64`

0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; directly address the following:
     ; reg 000b ; EAX
     ; r/m 000b ; EAX

0x48 ; REX.W prefix (64-bit operand size)
0xFF ; with `reg` from ModR/M byte 000b:
     ; INC r/m16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; direct addressing
     ; reg 000b ; RAX
     ; r/m 000b ; RAX

0x48 ; REX.W prefix (64-bit operand size)
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0xF4 ; HLT
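
the ModR/M bytes in the listing above can be rebuilt from their three fields. This Python sketch does that and reassembles the nine bytes by hand; `modrm` and the constants are illustrative names, and nothing here goes through twasm itself.

```python
def modrm(mod, reg, rm):
    """Combine mod (2 bits), reg (3 bits) and r/m (3 bits) into one byte."""
    return ((mod & 0b11) << 6) | ((reg & 0b111) << 3) | (rm & 0b111)


REX_W = 0x48             # REX prefix with the W bit set
RAX, RDX = 0b000, 0b010  # hardware register numbers

program = bytes([
    0x31, modrm(0b11, RAX, RAX),         # xor eax, eax
    REX_W, 0xFF, modrm(0b11, 0, RAX),    # inc rax (reg field is the /0 extension)
    REX_W, 0x89, modrm(0b00, RDX, RAX),  # mov [rax], rdx
    0xF4,                                # hlt
])

print(program.hex(" "))  # 31 c0 48 ff c0 48 89 10 f4
```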