some major architecture changes

2026-03-12 23:03:29 +01:00
parent bc19d760b9
commit 0ee8ff7914
3 changed files with 459 additions and 655 deletions
--- a/twasm/README.md
+++ b/twasm/README.md
@@ -12,6 +12,70 @@ I want to compile Bootler and Twasm with the Twasm assembler
 - [opcodes,ModR/M,SIB](http://ref.x86asm.net/coder64.html) (no secure site available)
 - [calling conventions](https://wiki.osdev.org/Calling_Conventions); I try to use System V

+### tokeniser
+
+whitespace is ignored for the sake of readability; it can go between pretty much anything
+
+```
+------------------------
+tokeniser
+------------------------
+byte(s) -> next byte(s)
+------------------------
+Newline   -> Newline
+          -> Komment
+          -> Operator
+          -> Directive
+
+Komment   -> Newline
+
+Operator  -> Newline
+          -> Komment
+          -> Operand
+
+Operand   -> Newline
+          -> Komment
+          -> Comma
+
+Comma     -> Operand
+
+Directive -> Newline
+          -> Komment
+          -> Operator
+------------------------
+```
+
+not yet implemented:
+
+```
+------------------------
+operand parser
+------------------------
+byte(s) -> next byte(s)
+------------------------
+START    -> '['
+         -> Register
+         -> Constant
+
+'['      -> Register
+         -> Constant
+
+']'      -> END
+
+Register -> IF #[, ']'
+         -> Operator
+
+Constant -> IF #[, ']'
+         -> Operator
+
+Operator -> IF NOT #R, Register
+         -> Constant
+------------------------
+#R = whether a register has been found
+#[ = whether a '[' has been found
+------------------------
+```
+
 ### memory map

 ```
@@ -50,15 +114,15 @@ each token gets loaded into the token table with the following form:

 ### internal data structures

-#### `tokens.by_nameX`
+#### `tokens.[operators|registers]`

-contains all tokens of that length followed by their ID. For some non-empty `tokens.by_nameX`, it is true that `tokens.by_name<X+1> - tokens.by_nameX` is the size in bytes of `tokens.by_nameX`.
+contains tokens grouped by type; intended to be searched by token name to get the token's ID.

 each entry is in the following form:

 ```
 +----------+--------------------------------+
-|[2 bytes] | 8 * token_length - 1         0 |
+| 47    32 | 31                           0 |
 +----------+--------------------------------+
 | token ID | string without null terminator |
 +----------+--------------------------------+
@@ -68,19 +132,16 @@ each entry is in the following form:
 example implementation:

 ```nasm
-tokens:
-  .by_name1:
-    db "+"
-    dw 0x0062
-    db "-"
-    dw 0x0063
-  .by_name2:
-    db "r8"
+tokens:
+  .registers:
+    dd "r8"
    dw 0x0008
  .by_name3: ; this is required for futureproofness; the caller can use this to
-             ; find the size of tokens.by_name2
+             ; find the size of tokens.registers
 ```

+note that the string field is only 4 bytes wide, so tokens longer than 4 bytes do not fit in an entry :/
+
 #### `tokens.by_id`

 contains some tokens with their metadata. Some tokens have embedded information (`0x10XX` for instance). Those will not have entries in this table, being handled instead inside the assemble function itself.
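
To illustrate the `tokens.[operators|registers]` entry layout shown in the diff above (a 4-byte name field in bits 31..0, followed by a 2-byte token ID in bits 47..32), here is a minimal sketch in Python rather than assembly, for brevity. It is not the Twasm implementation: the names `ENTRY`, `pack_entry`, `find_id`, and `REGISTERS` are hypothetical, the `r9` entry and its ID are made up for the example, and only `r8` with ID `0x0008` comes from the README. It assumes little-endian storage and zero padding of short names, which matches what `dd "r8"` followed by `dw 0x0008` emits in NASM.

```python
import struct

# One table entry: 4 name bytes (zero-padded, no terminator), then a 16-bit ID.
# "<" selects little-endian with no alignment padding, so an entry is 6 bytes.
ENTRY = struct.Struct("<4sH")


def pack_entry(name: bytes, token_id: int) -> bytes:
    """Build one 6-byte entry; names longer than 4 bytes do not fit."""
    if len(name) > 4:
        raise ValueError("name does not fit the 4-byte field")
    return ENTRY.pack(name, token_id)  # struct zero-pads the "4s" field


def find_id(table: bytes, name: bytes):
    """Linear search by name; returns the token ID, or None if absent."""
    key = name.ljust(4, b"\x00")  # pad the key the same way entries are padded
    for entry_name, token_id in ENTRY.iter_unpack(table):
        if entry_name == key:
            return token_id
    return None


# Hypothetical table: the r8 entry mirrors the README example, r9 is invented.
REGISTERS = pack_entry(b"r8", 0x0008) + pack_entry(b"r9", 0x0009)
```

Because every entry has the same 6-byte size, a search only needs the table's start and end addresses, and a name longer than 4 bytes simply never matches in this sketch.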