# twasm

this will be a self hosted, very minimal subset of nasm-style 64 bit asm

### goals

I want to compile Bootler and Twasm with the Twasm assembler

### reading

- [instructions](https://www.felixcloutier.com/x86/)
- [opcodes,ModR/M,SIB](http://ref.x86asm.net/coder64.html) (no secure site available)
- [calling conventions](https://wiki.osdev.org/Calling_Conventions); I try to use System V

### tokeniser

whitespace is ignored for the sake of readability; it can go between pretty much anything

```
------------------------
tokeniser
------------------------
byte(s)    -> next byte(s)
------------------------
Newline    -> Label
           -> Newline
           -> Komment
           -> Operator
           -> Directive
Label      -> Newline
Komment    -> Newline
Operator   -> Newline
           -> Komment
           -> Operand
Operand    -> Newline
           -> Komment
           -> Comma
Comma      -> Operand
Directive  -> Newline
           -> Komment
           -> Operator
------------------------
```

### memory map

```
+------ 0x00100000 ------+
| hardware, bios stuff   |
+------ 0x00080000 ------+
| output binary          |
+------ 0x00070000 ------+
| token table            |
+------ 0x00060000 ------+
| test arena             |
+------ 0x00050000 ------+
| label table            |
+------ 0x00040000 ------+
| awaiting label table   |
+------ 0x00030000 ------+
| stack (rsp)            |
+------------------------+
| input                  |
+------------------------+
| assembler              |
+------ 0x00010000 ------+
| bootloader, bios, etc. |
+------------------------+
```

each word represents a token on the token table.

#### token table (TT)

each token gets loaded into the token table with the following form:

```
  2 bytes
+----------+
| 15     0 |
+----------+
| token id |
+----------+
```

#### label table (LT)

label definitions are stored and recalled from this table. The memory addresses are relative to the start of the program

```
 16 bytes
+---------+
| 127  64 |
+---------+
| address |
+---------+
| 63    0 |
+---------+
| hash    |
+---------+
```

#### awaiting label table (ALT)

forward references are stored in this table to be filled in after assembly is otherwise complete. The memory addresses are relative to the start of the program

```
                      16 bytes
+----------+----------+------------------+---------+
| 127  105 | 104  104 | 103           96 | 95   64 |
+----------+----------+------------------+---------+
| reserved | abs flag | # bytes reserved | address |
+----------+----------+------------------+---------+
| 63                                             0 |
+--------------------------------------------------+
| hash                                             |
+--------------------------------------------------+
```

### internal data structures

#### `tokens.[operators|registers]`

contains tokens by their type. Intended to be searched by token name to get the token's ID.

each entry is in the following form:

```
                  6 bytes
+----------+--------------------------------+
| 47    32 | 31                           0 |
+----------+--------------------------------+
| token ID | string without null terminator |
+----------+--------------------------------+
```

note that tokens longer than 4 bytes are problematic :/

#### `tokens.by_id`

contains some tokens with their metadata. Some tokens have embedded information (`0x10XX` for instance). Those do not have entries in this table, being handled instead inside the assemble function itself.
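
The split between table-backed tokens and tokens with embedded information can be sketched as a small classifier. This is an illustrative Python model, not Twasm's own code: the function name `classify` and the returned labels are assumptions made for the example, while the ID ranges are the ones listed in the token IDs table of this README.

```python
def classify(token_id: int) -> str:
    """Return a coarse class for a 16-bit token ID (labels are illustrative)."""
    if token_id == 0xFFFF:
        return "unrecognised"
    high_byte = token_id >> 8     # selects 0x10XX, 0x20XX, 0xFEXX
    high_nibble = token_id >> 12  # selects 0x3XXX, 0x4XXX
    if high_byte == 0x10:
        return "memory address"
    if high_byte == 0x20:
        return "constant"
    if high_byte == 0xFE:
        return "raw value"
    if high_nibble == 0x3:
        return "label definition"
    if high_nibble == 0x4:
        return "label reference"
    # Everything else is assumed to be looked up in `tokens.by_id`.
    return "table lookup"


for tid in (0x0056, 0x1000, 0x2000, 0x3001, 0x4001):
    print(f"0x{tid:04X}: {classify(tid)}")
```

Only the last case needs a table entry; the others carry their meaning in the ID itself.
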

each `tokens.by_id` entry stores its metadata in the following form:

```
                    4 bytes
+----------------+----------+-------+----------+
| 31          24 | 23    20 | 19 16 | 15     0 |
+----------------+----------+-------+----------+
| typed metadata | reserved | type  | token ID |
+----------------+----------+-------+----------+
```

the `type` hex digit is defined as the following:

| hex | meaning         | examples |
|-----|-----------------|-|
| 0x0 | ignored         | |
| 0x1 | operator        | `mov`, `hlt` |
| 0x2 | register        | `rsp`, `al` |
| 0x3 | pseudo-operator | `db` |
| 0xF | unknown         | any token ID not represented in the lookup table |

typed metadata for the different types is as follows:

```
  1 byte
+----------+
| type 0x0 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
```

```
             1 byte
+-------------------------------+
| type 0x1                      |
+----------+--------------------+
| 31    26 | 25              24 |
+----------+--------------------+
| reserved | number of operands |
+----------+--------------------+
```

```
            1 byte
+------------------------------+
| type 0x2                     |
+----------+-----------+-------+
| 31    29 | 28     26 | 25 24 |
+----------+-----------+-------+
| reserved | reg value | width |
+----------+-----------+-------+

; reg is the value that corresponds to the register in the ModR/M byte
; width:
    00b ; 8 bit
    01b ; 16 bit
    10b ; 32 bit
    11b ; 64 bit
```

```
  1 byte
+----------+
| type 0x3 |
+----------+
| 31    24 |
+----------+
| reserved |
+----------+
```

#### `opcodes.by_id`

entries are as follows:

```
            16 bytes
+------------------------------+
| 0 operand operators          |
+------------------------------+
| 127                       96 |
+------------------------------+
| reserved                     |
+------------------------------+
| 95                        64 |
+------------------------------+
| reserved                     |
+------------------------------+
| 63                        32 |
+------------------------------+
| reserved                     |
+----------+--------+----------+
| 31    24 | 23  16 | 15     0 |
+----------+--------+----------+
| reserved | opcode | token ID |
+----------+--------+----------+

                           16 bytes
+-------------------------------------------------------------+
| 1 operand operators                                         |
+-------------------------------------------------------------+
| 127                                                      96 |
+-------------------------------------------------------------+
| reserved                                                    |
+----------+-------+-------+-------+-------+----------+-------+
| 95    88 | 87 84 | 83 80 | 79 76 | 75 72 | 71    68 | 67 64 |
+----------+-------+-------+-------+-------+----------+-------+
| reserved | op5&8 | op4&8 | op3&8 | op2&8 | reserved | op0&8 |
+----------+-------+-------+-------+-------+----------+-------+
| 63    56 | 55         48 | 47         40 | 39            32 |
+----------+---------------+---------------+------------------+
| opcode   | opcode        | opcode        | opcode           |
| dst=rel8 | dst=rel       | dst=imm8      | dst=imm          |
+----------+---------------+---------------+------------------+
| 31    24 | 23         16 | 15                             0 |
+----------+---------------+----------------------------------+
| reserved | opcode        | token ID                         |
|          | dst=r/m       |                                  |
+----------+---------------+----------------------------------+

                    16 bytes
+----------------------------------------------+
| 2 operand operators                          |
+----------------------------------------------+
| 127                                       96 |
+----------------------------------------------+
| reserved                                     |
+-------------------+-------+-------+----------+
| 95             80 | 79 76 | 75 72 | 71    64 |
+-------------------+-------+-------+----------+
| reserved          | op3&8 | op2&8 | reserved |
+-------------------+-------+-------+----------+
| 63             48 | 47         40 | 39    32 |
+-------------------+---------------+----------+
| reserved          | opcode        | opcode   |
|                   | dst=r/m       | dst=r/m  |
|                   | src=imm8      | src=imm  |
+---------+---------+---------------+----------+
| 31   24 | 23   16 | 15                     0 |
+---------+---------+--------------------------+
| opcode  | opcode  | token ID                 |
| dst=r   | dst=r/m |                          |
| src=r/m | src=r   |                          |
+---------+---------+--------------------------+

; key:
r/m   ; r/m 16/32/64
r     ; r 16/32/64
imm   ; imm 16/32
imm8  ; imm 8
rel   ; rel 16/32
rel8  ; rel 8
opX&8 ; low 8 bits are the operator flag that goes with opcode at offset X from
      ; the first opcode in the table entry
```

note that there is much room to expand. If an opcode doesn't exist, its field should be 0x00

### token IDs

supported tokens are listed below

| token | id     | notes |
|-------|--------|-|
| rax   | 0x0000 | |
| rbx   | 0x0001 | |
| rcx   | 0x0002 | |
| rdx   | 0x0003 | |
| rsi   | 0x0004 | |
| rdi   | 0x0005 | |
| rsp   | 0x0006 | |
| rbp   | 0x0007 | |
| r8    | 0x0008 | unimplemented |
| r9    | 0x0009 | unimplemented |
| r10   | 0x000A | unimplemented |
| r11   | 0x000B | unimplemented |
| r12   | 0x000C | unimplemented |
| r13   | 0x000D | unimplemented |
| r14   | 0x000E | unimplemented |
| r15   | 0x000F | unimplemented |
| eax   | 0x0010 | |
| ebx   | 0x0011 | |
| ecx   | 0x0012 | |
| edx   | 0x0013 | |
| esi   | 0x0014 | |
| edi   | 0x0015 | |
| esp   | 0x0016 | |
| ebp   | 0x0017 | |
| r8d   | 0x0018 | unimplemented |
| r9d   | 0x0019 | unimplemented |
| r10d  | 0x001A | unimplemented |
| r11d  | 0x001B | unimplemented |
| r12d  | 0x001C | unimplemented |
| r13d  | 0x001D | unimplemented |
| r14d  | 0x001E | unimplemented |
| r15d  | 0x001F | unimplemented |
| ax    | 0x0020 | |
| bx    | 0x0021 | |
| cx    | 0x0022 | |
| dx    | 0x0023 | |
| si    | 0x0024 | |
| di    | 0x0025 | |
| sp    | 0x0026 | |
| bp    | 0x0027 | |
| r8w   | 0x0028 | unimplemented |
| r9w   | 0x0029 | unimplemented |
| r10w  | 0x002A | unimplemented |
| r11w  | 0x002B | unimplemented |
| r12w  | 0x002C | unimplemented |
| r13w  | 0x002D | unimplemented |
| r14w  | 0x002E | unimplemented |
| r15w  | 0x002F | unimplemented |
| al    | 0x0030 | |
| bl    | 0x0031 | |
| cl    | 0x0032 | |
| dl    | 0x0033 | |
| sil   | 0x0034 | |
| dil   | 0x0035 | |
| spl   | 0x0036 | |
| bpl   | 0x0037 | |
| r8b   | 0x0038 | unimplemented |
| r9b   | 0x0039 | unimplemented |
| r10b  | 0x003A | unimplemented |
| r11b  | 0x003B | unimplemented |
| r12b  | 0x003C | unimplemented |
| r13b  | 0x003D | unimplemented |
| r14b  | 0x003E | unimplemented |
| r15b  | 0x003F | unimplemented |
| ah    | 0x0040 | unimplemented |
| bh    | 0x0041 | unimplemented |
| ch    | 0x0042 | unimplemented |
| dh    | 0x0043 | unimplemented |
| cs    | 0x0044 | unimplemented |
| ds    | 0x0045 | unimplemented |
| es    | 0x0046 | unimplemented |
| fs    | 0x0047 | unimplemented |
| gs    | 0x0048 | unimplemented |
| ss    | 0x0049 | unimplemented |
| cr0   | 0x004A | unimplemented |
| cr2   | 0x004B | unimplemented |
| cr3   | 0x004C | unimplemented |
| cr4   | 0x004D | unimplemented |
| cr8   | 0x004E | unimplemented |
| hlt   | 0x004F | |
| int3  | 0x0050 | |
|       | 0x0051 | deprecated; formerly `[`. Now `0x10XX` is used. |
|       | 0x0052 | deprecated; formerly `]`. |
| xor   | 0x0053 | |
| inc   | 0x0054 | |
| dec   | 0x0055 | |
| mov   | 0x0056 | |
| add   | 0x0057 | |
| sub   | 0x0058 | |
| call  | 0x0059 | |
| ret   | 0x005A | |
| cmp   | 0x005B | |
| jmp   | 0x005C | |
| je    | 0x005D | |
| jne   | 0x005E | |
| push  | 0x005F | |
| pop   | 0x0060 | |
| out   | 0x0061 | |
| db    | 0x0100 | pseudo-operator |
|       | 0x10XX | some memory address; `XX` is as specified below |
|       | 0x20XX | some constant; `XX` is as specified below |
|       | 0x3XXX | some label definition; `XXX` is its entry index in the label table |
|       | 0x4XXX | some label reference; `XXX` is its entry index in the label table |
|       | 0xFEXX | used to pass some raw value `XX` in place of a token id to a couple of functions that mention this as a feature. If the function doesn't mention it, it will lead to undefined behaviour |
|       | 0xFFFF | unrecognised token |

values of `XX` in `0x10XX`:

| XX   | description |
|------|-------------|
| 0x00 | following word is the token ID of some register |

values of `XX` in `0x20XX`:

| XX   | description |
|------|-------------|
| 0x00 | following 8 bytes are the constant's value |

### example program

#### program in assembly

this program doesn't do anything useful, it's just a test

```nasm
xor eax, eax
inc rax ; inline comment
; one line comment
mov rdx, [rax]
mov [rax], rdx
hlt
```

#### tokenization

```nasm
0x0053 ; xor
0x0010 ; eax
0x0010 ; eax
0x0054 ; inc
0x0000 ; rax
0x0056 ; mov
0x0003 ; rdx
0x1000 ; memory address: register
0x0000 ; rax
0x0056 ; mov
0x1000 ; memory address: register
0x0000 ; rax
0x0003 ; rdx
0x004F ; hlt
```

#### nasm output

with the above example program, bits 64

```nasm
0x31 ; XOR r/m16/32/64 r16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; directly address the following:
     ; reg 000b ; EAX
     ; r/m 000b ; EAX

0x48 ; 64 Bit Operand Size prefix
0xFF ; with `reg` from ModR/M byte 000b:
     ; INC r/m16/32/64
0xC0 ; ModR/M byte
     ; mod 11b  ; direct addressing
     ; reg 000b ; opcode extension /0, selects INC
     ; r/m 000b ; RAX

0x48 ; 64 Bit Operand Size prefix
0x8B ; MOV r16/32/64 r/m16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0x48 ; 64 Bit Operand Size prefix
0x89 ; MOV r/m16/32/64 r16/32/64
0x10 ; ModR/M byte
     ; mod 00b  ; indirect addressing, no displacement
     ; reg 010b ; RDX
     ; r/m 000b ; [RAX]

0xF4 ; HLT
```
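
The ModR/M bytes in the listing above can be re-derived by packing the three fields by hand. The following is a minimal Python sketch for checking the arithmetic, not part of Twasm itself; the helper `modrm` and the constant names are assumptions made for the example, and the opcode bytes are copied from the listing.

```python
REX_W = 0x48  # the "64 Bit Operand Size prefix" from the listing

# Register numbers as they appear in the ModR/M fields of the listing.
RAX = 0b000
RDX = 0b010


def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack mod (2 bits), reg (3 bits) and r/m (3 bits) into one byte."""
    return (mod << 6) | (reg << 3) | rm


program = bytes([
    0x31, modrm(0b11, RAX, RAX),           # xor eax, eax
    REX_W, 0xFF, modrm(0b11, 0b000, RAX),  # inc rax (reg field is the /0 extension)
    REX_W, 0x8B, modrm(0b00, RDX, RAX),    # mov rdx, [rax]
    REX_W, 0x89, modrm(0b00, RDX, RAX),    # mov [rax], rdx
    0xF4,                                  # hlt
])

print(program.hex(" "))
```

Both `mov` forms share the ModR/M byte `0x10`; only the opcode (`0x8B` versus `0x89`) says which operand is the destination, which is why `opcodes.by_id` keeps a separate opcode per operand combination.
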