# twasm

this will be a self-hosted assembler for a very minimal subset of nasm-style 64 bit asm

### goals

I want to compile Bootler and Twasm with the Twasm assembler

### reading

- [instructions](https://www.felixcloutier.com/x86/)
- [opcodes,ModR/M,SIB](http://ref.x86asm.net/coder64.html) (no secure site available)
- [calling conventions](https://wiki.osdev.org/Calling_Conventions); I try to use System V

### tokeniser

whitespace is ignored for the sake of readability; it can go between pretty much anything

```
------------------------
tokeniser
------------------------
byte(s)   -> next byte(s)
------------------------
Newline   -> Newline
          -> Komment
          -> Operator
          -> Directive
Komment   -> Newline
Operator  -> Newline
          -> Komment
          -> Operand
Operand   -> Newline
          -> Komment
          -> Comma
Comma     -> Operand
Directive -> Newline
          -> Komment
          -> Operator
------------------------
```

not yet implemented:

```
------------------------
operand parser
------------------------
byte(s)  -> next byte(s)
------------------------
START    -> '['
         -> Register
         -> Constant
'['      -> Register
         -> Constant
']'      -> END
Register -> IF #[, ']'
         -> Operator
Constant -> IF #[, ']'
         -> Operator
Operator -> IF NOT #R, Register
         -> Constant
------------------------
#R = whether a register has been found
#[ = whether a '[' has been found
------------------------
```

### memory map

```
+------ 0x00100000 ------+
| hardware, bios stuff   |
+------ 0x00080000 ------+
| output binary          |
+------ 0x00070000 ------+
| token table            |
+------ 0x00060000 ------+
| test arena             |
+------ 0x00050000 ------+
| stack (rsp)            |
+------------------------+
| input                  |
+------------------------+
| assembler              |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+
```

each 16 bit word in the token table represents one token.
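
since each token takes one 16 bit word, the token table region in the map above has a fixed capacity. The Python sketch below is only a model of that layout, not part of twasm. It reads the map as placing the token table at 0x00060000 up to (not including) 0x00070000, and it assumes little-endian words because that is how x86 stores them; `pack_tokens` is a hypothetical name.

```python
import struct

# Region bounds as read from the memory map (an interpretation, see above).
TOKEN_TABLE_START = 0x00060000
TOKEN_TABLE_END = 0x00070000   # exclusive; the output binary starts here
TOKEN_SIZE = 2                 # one 16 bit word per token

# Number of tokens that fit before the table runs into the output binary.
TOKEN_TABLE_CAPACITY = (TOKEN_TABLE_END - TOKEN_TABLE_START) // TOKEN_SIZE


def pack_tokens(token_ids):
    """Lay token IDs out as consecutive little-endian 16 bit words (assumed order)."""
    if len(token_ids) > TOKEN_TABLE_CAPACITY:
        raise ValueError("token table overflow")
    return struct.pack("<%dH" % len(token_ids), *token_ids)


# `xor eax, eax` as token IDs: xor = 0x0053, eax = 0x0010
image = pack_tokens([0x0053, 0x0010, 0x0010])
print(image.hex())                # 530010001000
print(hex(TOKEN_TABLE_CAPACITY))  # 0x8000
```

so under this reading the table has room for 0x8000 tokens before it would overwrite the output binary.
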
#### token table (TT)

each token gets loaded into the token table with the following form:

```
+----------+
| 15     0 |
+----------+
| token id |
+----------+
```

### internal data structures

#### `tokens.[operators|registers]`

contains tokens by their type. Intended to be searched by token name to get the token's ID.

each entry is in the following form:

```
+----------+--------------------------------+
| 47    32 | 31                           0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+
```

example implementation:

```nasm
tokens
.registers:
    dd "r8"         ; bits 31..0: name, zero padded to 4 bytes
    dw 0x0008       ; bits 47..32: token ID
.by_name3:          ; this is required for future-proofing; the caller can use this to
                    ; find the size of registers.by_name2
```

note that token names longer than 4 bytes don't fit in the 32 bit string field :/

#### `tokens.by_id`

contains some tokens with their metadata. Some tokens have embedded information (`0x10XX` for instance). Those will not have entries in this table, being handled instead inside the assemble function itself.
each entry holds metadata about a token in the following form:

```
+----------------+----------+-------+----------+
| 31          24 | 23    20 | 19 16 | 15     0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type  | token ID |
+----------------+----------+-------+----------+
```

the `type` hex digit is defined as the following:

| hex | meaning  | examples |
|-----|----------|-|
| 0x0 | ignored  | `; this entire comment is 1 token` |
| 0x1 | operator | `mov`, `hlt` |
| 0x2 | register | `rsp`, `al` |
| 0xF | unknown  | any token ID not represented in the lookup table |

typed metadata for the different types is as follows:

```
+----------+
| type 0x0 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
```

```
+-------------------------------+
| type 0x1                      |
+----------+--------------------+
| 31    26 | 25              24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+
```

```
+------------------------------+
| type 0x2                     |
+----------+-----------+-------+
| 31    29 | 28     26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+

; reg is the value that corresponds to the register in the ModR/M byte

; width:
00b ; 8 bit
01b ; 16 bit
10b ; 32 bit
11b ; 64 bit
```

#### `opcodes.by_id`

entries are as follows:

```
+-----------------+-----------------+----------+
| 31           24 | 23           16 | 15     0 |
+-----------------+-----------------+----------+
| dest=reg opcode | dest=r/m opcode | token ID |
+-----------------+-----------------+----------+
```

note the lack of support for multiple-byte opcodes or multiple opcodes for one token ID; these features will likely be added at some point after the parser accumulates too much jank.
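
to make the bit positions in the `tokens.by_id` and `opcodes.by_id` diagrams concrete, here is a small Python model that packs and unpacks entries. It is a sketch for checking the layouts, not twasm code; every function name is hypothetical. The example values (`mov` = 0x0056 with opcodes 0x89 and 0x8B, `rdx` = 0x0003 with ModR/M value 010b) are taken from the token ID table and the example program further down.

```python
TYPE_IGNORED, TYPE_OPERATOR, TYPE_REGISTER, TYPE_UNKNOWN = 0x0, 0x1, 0x2, 0xF
WIDTH_BITS = {0b00: 8, 0b01: 16, 0b10: 32, 0b11: 64}


def operator_entry(token_id, operand_count):
    """Type 0x1: number of operands in bits 25..24."""
    return ((operand_count & 0x3) << 24) | (TYPE_OPERATOR << 16) | token_id


def register_entry(token_id, reg_value, width):
    """Type 0x2: ModR/M reg value in bits 28..26, width code in bits 25..24."""
    return (((reg_value & 0x7) << 26) | ((width & 0x3) << 24)
            | (TYPE_REGISTER << 16) | token_id)


def decode_entry(entry):
    """Split a 32 bit `tokens.by_id` entry back into its fields."""
    fields = {"token_id": entry & 0xFFFF, "type": (entry >> 16) & 0xF}
    if fields["type"] == TYPE_OPERATOR:
        fields["operands"] = (entry >> 24) & 0x3
    elif fields["type"] == TYPE_REGISTER:
        fields["reg"] = (entry >> 26) & 0x7
        fields["width_bits"] = WIDTH_BITS[(entry >> 24) & 0x3]
    return fields


def opcode_entry(token_id, rm_dest_opcode, reg_dest_opcode):
    """`opcodes.by_id`: dest=reg opcode in 31..24, dest=r/m opcode in 23..16."""
    return (reg_dest_opcode << 24) | (rm_dest_opcode << 16) | token_id


mov = operator_entry(0x0056, 2)            # mov takes two operands
rdx = register_entry(0x0003, 0b010, 0b11)  # rdx: reg value 010b, 64 bit
print(hex(mov))                                 # 0x2010056
print(hex(rdx))                                 # 0xb020003
print(hex(opcode_entry(0x0056, 0x89, 0x8B)))    # 0x8b890056
print(decode_entry(rdx))
```

note that the `reg value` field is what goes into the ModR/M byte; it is stored separately because it differs from the token ID (`rdx` is token 0x0003 but 010b in ModR/M).
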
### token IDs

supported tokens are listed below

| token | id     | notes |
|-------|--------|-|
| rax   | 0x0000 | |
| rbx   | 0x0001 | |
| rcx   | 0x0002 | |
| rdx   | 0x0003 | |
| rsi   | 0x0004 | |
| rdi   | 0x0005 | |
| rsp   | 0x0006 | |
| rbp   | 0x0007 | |
| r8    | 0x0008 | |
| r9    | 0x0009 | |
| r10   | 0x000A | |
| r11   | 0x000B | |
| r12   | 0x000C | |
| r13   | 0x000D | |
| r14   | 0x000E | |
| r15   | 0x000F | |
| eax   | 0x0010 | |
| ebx   | 0x0011 | |
| ecx   | 0x0012 | |
| edx   | 0x0013 | |
| esi   | 0x0014 | |
| edi   | 0x0015 | |
| esp   | 0x0016 | |
| ebp   | 0x0017 | |
| r8d   | 0x0018 | |
| r9d   | 0x0019 | |
| r10d  | 0x001A | |
| r11d  | 0x001B | |
| r12d  | 0x001C | |
| r13d  | 0x001D | |
| r14d  | 0x001E | |
| r15d  | 0x001F | |
| ax    | 0x0020 | |
| bx    | 0x0021 | |
| cx    | 0x0022 | |
| dx    | 0x0023 | |
| si    | 0x0024 | |
| di    | 0x0025 | |
| sp    | 0x0026 | |
| bp    | 0x0027 | |
| r8w   | 0x0028 | |
| r9w   | 0x0029 | |
| r10w  | 0x002A | |
| r11w  | 0x002B | |
| r12w  | 0x002C | |
| r13w  | 0x002D | |
| r14w  | 0x002E | |
| r15w  | 0x002F | |
| al    | 0x0030 | |
| bl    | 0x0031 | |
| cl    | 0x0032 | |
| dl    | 0x0033 | |
| sil   | 0x0034 | |
| dil   | 0x0035 | |
| spl   | 0x0036 | |
| bpl   | 0x0037 | |
| r8b   | 0x0038 | |
| r9b   | 0x0039 | |
| r10b  | 0x003A | |
| r11b  | 0x003B | |
| r12b  | 0x003C | |
| r13b  | 0x003D | |
| r14b  | 0x003E | |
| r15b  | 0x003F | |
| ah    | 0x0040 | |
| bh    | 0x0041 | |
| ch    | 0x0042 | |
| dh    | 0x0043 | |
| cs    | 0x0044 | |
| ds    | 0x0045 | |
| es    | 0x0046 | |
| fs    | 0x0047 | |
| gs    | 0x0048 | |
| ss    | 0x0049 | |
| cr0   | 0x004A | |
| cr2   | 0x004B | |
| cr3   | 0x004C | |
| cr4   | 0x004D | |
| cr8   | 0x004E | |
| hlt   | 0x004F | |
| int3  | 0x0050 | |
|       | 0x0051 | deprecated; formerly `[`. Now `0x10XX` is used. |
|       | 0x0052 | deprecated; formerly `]`. |
| xor   | 0x0053 | |
| inc   | 0x0054 | |
| dec   | 0x0055 | |
| mov   | 0x0056 | |
| add   | 0x0057 | |
| sub   | 0x0058 | |
| call  | 0x0059 | |
| ret   | 0x005A | |
| cmp   | 0x005B | |
| je    | 0x005C | |
| jne   | 0x005D | |
| jge   | 0x005E | |
| jg    | 0x005F | |
| jle   | 0x0060 | |
| jl    | 0x0061 | |
|       | 0x10XX | some memory address; `XX` is as specified below |
|       | 0xFFFF | unrecognised token |

values of `XX` in `0x10XX`:

| XX   | description |
|------|-------------|
| 0x00 | following token is the token ID of some register |

### example program

#### program in assembly

this program doesn't do anything useful, it's just a test

```nasm
xor eax, eax
inc rax ; inline comment
; one line comment
mov rdx, [rax]
mov [rax], rdx
hlt
```

#### tokenisation

```nasm
0x0053 ; xor
0x0010 ; eax
0x0010 ; eax
0x0054 ; inc
0x0000 ; rax
0x0056 ; mov
0x0003 ; rdx
0x1000 ; memory address: register
0x0000 ; rax
0x0056 ; mov
0x1000 ; memory address: register
0x0000 ; rax
0x0003 ; rdx
0x004F ; hlt
```

#### nasm output

with the above example program, bits 64

```nasm
0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
     ;   mod 11b  ; directly address the following:
     ;   reg 000b ; EAX
     ;   r/m 000b ; EAX

0x48 ; 64 bit operand size prefix (REX.W)
0xFF ; with `reg` from ModR/M byte 000b:
     ;   INC r/m16/32/64
0xC0 ; ModR/M byte
     ;   mod 11b  ; direct addressing
     ;   reg 000b ; RAX
     ;   r/m 000b ; RAX

0x48 ; 64 bit operand size prefix (REX.W)
0x8B ; MOV r16/32/64 r/m16/32/64
0x10 ; ModR/M byte
     ;   mod 00b  ; indirect addressing, no displacement
     ;   reg 010b ; RDX
     ;   r/m 000b ; [RAX]

0x48 ; 64 bit operand size prefix (REX.W)
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
     ;   mod 00b  ; indirect addressing, no displacement
     ;   reg 010b ; RDX
     ;   r/m 000b ; [RAX]

0xF4 ; HLT
```
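
the ModR/M bytes in the listing above are just three packed fields, so the whole output can be reproduced by hand. The Python sketch below is a cross-check of the listing, not how twasm itself emits code; `modrm` and the constant names are hypothetical. The register numbers are the ModR/M encodings from the listing, which are not the same as the token IDs.

```python
REX_W = 0x48  # the 64 bit operand size prefix in the listing above


def modrm(mod, reg, rm):
    """Pack a ModR/M byte: mod in bits 7..6, reg in bits 5..3, r/m in bits 2..0."""
    return ((mod & 0b11) << 6) | ((reg & 0b111) << 3) | (rm & 0b111)


# ModR/M register numbers (hardware encoding, not twasm token IDs)
RAX, RDX = 0b000, 0b010
# mod values: register operand / [register] with no displacement
DIRECT, INDIRECT = 0b11, 0b00

program = bytes([
    0x31, modrm(DIRECT, RAX, RAX),           # xor eax, eax
    REX_W, 0xFF, modrm(DIRECT, 0b000, RAX),  # inc rax (reg field 000b selects INC)
    REX_W, 0x8B, modrm(INDIRECT, RDX, RAX),  # mov rdx, [rax]
    REX_W, 0x89, modrm(INDIRECT, RDX, RAX),  # mov [rax], rdx
    0xF4,                                    # hlt
])
print(" ".join("%02x" % b for b in program))
# 31 c0 48 ff c0 48 8b 10 48 89 10 f4
```

the two `mov` lines differ only in the opcode byte (0x8B vs 0x89), which is the dest=reg / dest=r/m split that `opcodes.by_id` stores per token.
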