tokenise labels and constants! Now assembly highkey fails but ok

andromeda
2026-03-30 16:09:25 +02:00
parent b1e7d2e3d5
commit f789d49e8a
3 changed files with 342 additions and 112 deletions


@@ -22,11 +22,14 @@ tokeniser
------------------------
byte(s) -> next byte(s)
------------------------
Newline -> Newline
Newline -> Label
-> Newline
-> Komment
-> Operator
-> Directive
Label -> Newline
Komment -> Newline
Operator -> Newline
@@ -45,37 +48,6 @@ Directive -> Newline
------------------------
```
not yet implemented:
```
------------------------
operand parser
------------------------
byte(s) -> next byte(s)
------------------------
START -> '['
-> Register
-> Constant
'[' -> Register
-> Constant
']' -> END
Register -> IF #[, ']'
-> Operator
Constant -> IF #[, ']'
-> Operator
Operator -> IF NOT #R, Register
-> Constant
------------------------
:R: = whether a register has been found
:[: = whether a '[' has been found
------------------------
```
### memory map
```
@@ -88,6 +60,10 @@ Operator -> IF NOT #R, Register
+------ 0x00060000 ------+
| test arena |
+------ 0x00050000 ------+
| label table |
+------ 0x00040000 ------+
| awaiting label table |
+------ 0x00030000 ------+
| stack (rsp) |
+------------------------+
| input |
@@ -105,6 +81,7 @@ each word represents a token on the token table.
each token gets loaded into the token table with the following form:
```
2 bytes
+----------+
| 15 0 |
+----------+
@@ -112,6 +89,40 @@ each token gets loaded into the token table with the following form:
+----------+
```
#### label table (LT)
label definitions are stored in and recalled from this table. The memory addresses are relative to the start of the program
```
16 bytes
+---------+
| 127 64 |
+---------+
| address |
+---------+
| 63 0 |
+---------+
| hash |
+---------+
```
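as a rough illustration of the layout above, a single LT entry could be written out by hand as below. This is only a sketch: it assumes the 16 bytes are stored as one little-endian 128-bit value (so the hash comes first in memory), and the symbol names and both values are made up for the example.

```nasm
; hypothetical LT entry; assumes little-endian storage of the 128-bit value
lt_example_entry:
    dq 0x1122334455667788   ; bits 63..0:   hash of the label name (made-up value)
    dq 0x0000000000000040   ; bits 127..64: address relative to the start of the program (made-up value)
lt_example_entry_size equ $ - lt_example_entry  ; evaluates to 16
```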
#### awaiting label table (ALT)
forward references are stored in this table to be filled in after assembly is otherwise complete. The memory addresses are relative to the start of the program
```
16 bytes
+----------+----------+------------------+---------+
| 127 105 | 104 104 | 103 96 | 95 64 |
+----------+----------+------------------+---------+
| reserved | abs flag | # bytes reserved | address |
+----------+----------+------------------+---------+
| 63 0 |
+--------------------------------------------------+
| hash |
+--------------------------------------------------+
```
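the same kind of hand-written sketch for one ALT entry is shown below. It again assumes little-endian storage of the whole 128-bit value; the symbol names and all field values are made up, and the meaning attached to each field is only what the diagram above states.

```nasm
; hypothetical ALT entry; assumes little-endian storage of the 128-bit value
alt_example_entry:
    dq 0x1122334455667788   ; bits 63..0:    hash of the referenced label (made-up value)
    dd 0x00000010           ; bits 95..64:   address, relative to the start of the program (made-up value)
    db 4                    ; bits 103..96:  number of bytes reserved
    db 0x01                 ; bit 104:       abs flag set; bits 111..105 are reserved and left zero
    dw 0                    ; bits 127..112: reserved, left zero
alt_example_entry_size equ $ - alt_example_entry  ; evaluates to 16
```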
### internal data structures
#### `tokens.[operators|registers]`
@@ -121,6 +132,7 @@ contains tokens by their type. Intended to be searched by token name to get the
each entry is in the following form:
```
6 bytes
+----------+--------------------------------+
| 47 32 | 31 0 |
+----------+--------------------------------+
@@ -129,26 +141,16 @@ each entry is in the following form:
```
example implementation:
```nasm
tokens
.registers:
dd "r8"
dw 0x0008
.by_name3: ; this is required for futureproofness; the caller can use this to
; find the size of registers.by_name2
```
note that tokens longer than 4 bytes are problematic :/
#### `tokens.by_id`
contains some tokens with their metadata. Some tokens have embedded information (`0x10XX` for instance). Those will not have entries in this table, being handled instead inside the assemble function itself.
contains some tokens with their metadata. Some tokens have embedded information (`0x10XX` for instance). Those do not have entries in this table, being handled instead inside the assemble function itself.
metadata about some tokens in the following form:
```
4 bytes
+----------------+----------+-------+----------+
| 31 24 | 23 20 | 19 16 | 15 0 |
+----------------+----------+-------+----------+
@@ -168,6 +170,7 @@ the `type` hex digit is defined as the following:
type metadata for the different types is as follows:
```
1 byte
+----------+
| type 0x0 |
+----------+
@@ -178,6 +181,7 @@ type metadata for the different types is as follows:
```
```
1 byte
+-------------------------------+
| type 0x1 |
+----------+--------------------+
@@ -188,6 +192,7 @@ type metadata for the different types is as follows:
```
```
1 byte
+------------------------------+
| type 0x2 |
+----------+-----------+-------+
@@ -210,6 +215,7 @@ type metadata for the different types is as follows:
entries are as follows:
```
16 bytes
+------------------------------+
| 0 operand operators |
+------------------------------+
@@ -230,6 +236,7 @@ entries are as follows:
| reserved | opcode | token ID |
+----------+--------+----------+
16 bytes
+-------------------------------------------------------------+
| 1 operand operators |
+-------------------------------------------------------------+
@@ -252,6 +259,7 @@ entries are as follows:
| | dst=r/m | |
+----------+---------------+----------------------------------+
16 bytes
+----------------------------------------------+
| 2 operand operators |
+----------------------------------------------+
@@ -389,14 +397,23 @@ supported tokens are listed below
| ret | 0x005A | |
| cmp | 0x005B | |
| | 0x10XX | some memory address; `XX` is as specified below |
| | 0xFEXX | used to pass some raw value `XX` in place of a token id |
| | 0x20XX | some constant; `XX` is as specified below |
| | 0x3XXX | some label definition; `XXX` is its entry index in the label table |
| | 0x4XXX | some label reference; `XXX` is its entry index in the label table |
| | 0xFEXX | used to pass some raw value `XX` in place of a token ID to functions that document this as a feature. Passing it to any other function is undefined behaviour |
| | 0xFFFF | unrecognised token |
values of `XX` in `0x10XX`:
| XX | description |
|------|-------------|
| 0x00 | following byte is the token ID of some register |
| 0x00 | following word is the token ID of some register |
values of `XX` in `0x20XX`:
| XX | description |
|------|-------------|
| 0x00 | following 8 bytes are the constant's value |
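
to make the embedded-information token IDs above more concrete, here is a sketch of how a few of them might appear as 16-bit words on the token table. The entry index `2` and the constant `42` are made up, and the register ID `0x0008` for `r8` is taken from the `tokens.registers` example rather than from the full token list, so treat it as an assumption.

```nasm
; hypothetical token table fragments; each token is one 16-bit word
    dw 0x3002               ; label definition, label table entry index 2
    dw 0x4002               ; label reference, label table entry index 2

    dw 0x2000               ; constant, XX = 0x00: the following 8 bytes are its value
    dq 42                   ; the constant's value (made-up)

    dw 0x1000               ; memory address, XX = 0x00: the following word is a register token ID
    dw 0x0008               ; r8 (ID assumed from the tokens.registers example)
```

the 12-bit `XXX` index allows 4096 entries, which at 16 bytes each comes to 0x10000 bytes. That appears to match the size of the label table region in the memory map.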
### example program