Synced from main@9490864

#Lexical Structure

This document describes the currently implemented Masterbelt lexical structure.

The lexical structure is intentionally minimal at this stage. Future token additions must extend this document before or together with implementation changes.

Masterbelt source text is UTF-8. Invalid UTF-8 source text is a syntax error.
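
The UTF-8 requirement above can be illustrated with a minimal Python sketch. The names `MasterbeltSyntaxError` and `decode_source` are hypothetical and exist only for this illustration; they do not describe the actual implementation.

```python
class MasterbeltSyntaxError(Exception):
    """Hypothetical error type standing in for a Masterbelt syntax error."""


def decode_source(raw: bytes) -> str:
    """Decode source bytes as strict UTF-8, rejecting invalid sequences."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset of the first invalid sequence.
        raise MasterbeltSyntaxError(f"invalid UTF-8 at byte {exc.start}") from exc
```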

Lexical EBNF specifications are normative.

#Identifiers

Identifiers name declarations and type references.

EBNF
identifier = identifier_start { identifier_continue } ;
identifier_start = "A".."Z" | "a".."z" | "_" ;
identifier_continue = identifier_start | "0".."9" ;
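
As a rough illustration, the identifier grammar above corresponds to a single regular expression. The following Python sketch checks only the shape of an identifier; it does not check reserved words, and the names `IDENTIFIER_RE` and `is_identifier_shape` are hypothetical.

```python
import re

# identifier_start: ASCII letter or underscore.
# identifier_continue: identifier_start or an ASCII digit.
IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier_shape(text: str) -> bool:
    """Return True if the whole text matches the identifier production."""
    return IDENTIFIER_RE.fullmatch(text) is not None
```

Because the grammar is ASCII-only, a name such as `名前` does not match, even though source text as a whole is UTF-8.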

#Keywords

The implemented keywords are const, pub, type, use, from, as, readonly, writable, master, record, source, filter, include, exclude, validation, each, all, validate, assert, primary, static, and select.

These keywords, together with the literal words null, true, and false, are reserved and cannot be used as identifiers.

The master validation section keywords validation, each, all, and validate, together with the assert statement keyword, are fully reserved words, consistent with the other section keywords (record, source, filter, static, select). The implicit validation bindings row and table are not reserved: they are ordinary identifiers that the validation evaluator binds inside a rule body, so a program may still use those names elsewhere.

Additional keywords reserved by specific declaration or statement forms (enum, fn, asyncable, failable, cancellable, return, self, if, else, let, match, for, in, break, continue, fail) are listed in syntax.md.

The master scope section keywords scope and indexed are context keywords: they are matched only at the scope-declaration position inside a master body and remain usable as ordinary identifiers everywhere else. The identifier self, by contrast, is fully reserved (it appears in the list above). It is the implicit relation receiver inside a scope body and the implicit record or collection binding inside a validation rule, and a program may not declare it as a binding in any context.
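
The reservation rules in this section can be sketched as a set lookup. This Python sketch models only what this document states directly: the implemented keywords, the three literal words, and self. The other form-specific keywords are defined in syntax.md and are deliberately not modelled here. All names in the code are hypothetical.

```python
import re

IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

IMPLEMENTED_KEYWORDS = frozenset({
    "const", "pub", "type", "use", "from", "as", "readonly", "writable",
    "master", "record", "source", "filter", "include", "exclude",
    "validation", "each", "all", "validate", "assert", "primary",
    "static", "select",
})
LITERAL_WORDS = frozenset({"null", "true", "false"})

# Assumption: only `self` is added from the form-specific list, because this
# document states explicitly that it is fully reserved.
RESERVED = IMPLEMENTED_KEYWORDS | LITERAL_WORDS | {"self"}


def can_be_identifier(name: str) -> bool:
    """True if name has identifier shape and is not reserved in this sketch."""
    return IDENTIFIER_RE.fullmatch(name) is not None and name not in RESERVED
```

Under this sketch, `row`, `table`, `scope`, and `indexed` are accepted as identifiers, while `validate`, `null`, and `self` are rejected.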

#Operator Tokens

Masterbelt recognises the following operator tokens in expression positions. Their grammar productions are defined in syntax.md; their evaluation surface is defined in builtins.md.

EBNF
unary_operator  = "!" | "+" | "-" ;
binary_operator =
    "+" | "-" | "*" | "/" | "%"
  | "==" | "!=" | "<" | "<=" | ">" | ">="
  | "&" | "|" | "^" | "<<" | ">>" ;

Operator tokens never appear inside identifiers. The lexer matches the longest applicable operator token at each position, so << and <= are single tokens and not the concatenation of < with < or =.
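
The longest-match rule can be sketched in Python by trying longer operator tokens before shorter ones. The function name `match_operator` is hypothetical, and the sketch assumes that comments (which also begin with `/`) have already been handled by an earlier lexer step.

```python
from typing import Optional

# Union of the unary and binary operator tokens from the grammar above.
OPERATOR_TOKENS = {
    "!", "+", "-", "*", "/", "%",
    "==", "!=", "<", "<=", ">", ">=",
    "&", "|", "^", "<<", ">>",
}

# Longest tokens first, so "<<" and "<=" win over "<".
_BY_LENGTH = sorted(OPERATOR_TOKENS, key=len, reverse=True)


def match_operator(source: str, pos: int) -> Optional[str]:
    """Return the longest operator token starting at pos, or None."""
    for token in _BY_LENGTH:
        if source.startswith(token, pos):
            return token
    return None
```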

The reserved literal word null may appear as a built-in type name in type expression positions.

The identifiers list and map are not reserved. They name built-in generic type constructors only when written with type arguments in a type expression position.

#Whitespace

Whitespace is ignored between tokens.

Whitespace is limited to the ASCII characters space, tab, line feed, carriage return, and form feed.

Both line feed (LF) and carriage return followed by line feed (CRLF) are accepted as line endings.
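
A minimal Python sketch of whitespace skipping under these rules follows. It uses an explicit character set rather than `str.isspace()`, because `str.isspace()` also accepts characters outside the list above, such as vertical tab and no-break space. The name `skip_whitespace` is hypothetical.

```python
# Exactly the five ASCII whitespace characters named above.
WHITESPACE = frozenset(" \t\n\r\f")


def skip_whitespace(source: str, pos: int) -> int:
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(source) and source[pos] in WHITESPACE:
        pos += 1
    return pos
```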

#Null Literal

The null literal is written as null.

EBNF
null_literal = "null" ;

#Bool Literals

Boolean literals are written as true and false.

EBNF
bool_literal = "true" | "false" ;

#Integer Literals

Integer literals support decimal, binary, octal, and hexadecimal notation.

EBNF
integer_literal =
    decimal_integer_literal
  | binary_integer_literal
  | octal_integer_literal
  | hexadecimal_integer_literal ;

decimal_integer_literal     = decimal_digit { digit_separator decimal_digit } ;
binary_integer_literal      = "0" ( "b" | "B" ) binary_digit { digit_separator binary_digit } ;
octal_integer_literal       = "0" ( "o" | "O" ) octal_digit { digit_separator octal_digit } ;
hexadecimal_integer_literal = "0" ( "x" | "X" ) hexadecimal_digit { digit_separator hexadecimal_digit } ;

digit_separator   = { "_" } ;
decimal_digit     = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
binary_digit      = "0" | "1" ;
octal_digit       = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" ;
hexadecimal_digit = decimal_digit | "a" | "b" | "c" | "d" | "e" | "f" | "A" | "B" | "C" | "D" | "E" | "F" ;

Radix prefixes are case-insensitive for b, o, and x.

Decimal literals may be zero-padded. For example, 000000 is a decimal integer literal.

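Digits may be separated by one or more underscores. Consecutive underscores are allowed between digits. For example, 100_000__00 is a decimal integer literal. Per the grammar above, an underscore may appear only between two digits, so a literal cannot begin or end with an underscore, and an underscore cannot directly follow a radix prefix.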

Examples:

Masterbelt
0
000000
100_000__00
0b00000
0B0101__10
0o00000
0O0123__70
0x00000
0Xdead__BEEF
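
The integer grammar can be sketched in Python as a regular expression plus a conversion step. This is an illustration under assumptions: the names `INTEGER_RE` and `parse_integer_literal` are hypothetical, and this document does not specify a value range, so the sketch relies on Python's unbounded integers.

```python
import re

# One alternative per notation; "_*" mirrors digit_separator = { "_" }.
INTEGER_RE = re.compile(
    r"0[bB][01](?:_*[01])*"
    r"|0[oO][0-7](?:_*[0-7])*"
    r"|0[xX][0-9a-fA-F](?:_*[0-9a-fA-F])*"
    r"|[0-9](?:_*[0-9])*"
)

_RADIX = {"b": 2, "o": 8, "x": 16}


def parse_integer_literal(text: str) -> int:
    """Validate text against the grammar and return its integer value."""
    if INTEGER_RE.fullmatch(text) is None:
        raise ValueError(f"not an integer literal: {text!r}")
    digits = text.replace("_", "")
    # A decimal literal contains only 0-9, so a letter at index 1 is a prefix.
    if len(digits) > 1 and digits[1] in "bBoOxX":
        return int(digits[2:], _RADIX[digits[1].lower()])
    return int(digits, 10)
```

With this sketch, `parse_integer_literal("100_000__00")` yields 10000000 and `parse_integer_literal("0Xdead__BEEF")` yields 3735928559, while inputs such as `1_` or `0x_FF` are rejected.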

#String Literals

A string literal is written between double quotes.

An empty string literal is valid and evaluates to the empty string.

String literals may not contain unescaped line endings.

String literal contents are interpreted as UTF-8. Non-ASCII characters may appear unescaped between the surrounding double quotes.

The supported escape sequences are:

  • \": double quote
  • \\: backslash
  • \n: line feed
  • \r: carriage return
  • \t: tab
  • \0: null character

EBNF
string_literal  = '"' { string_char } '"' ;
string_char     = unescaped_char | escape_sequence ;
unescaped_char  = ? any UTF-8 character except '"', '\', LF, and CR ? ;
escape_sequence = '\"' | '\\' | '\n' | '\r' | '\t' | '\0' ;

An unterminated string literal is a syntax error.

Examples:

Masterbelt
""
"hello"
"quote: \""
"line\nbreak"
"こんにちは"
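
The string rules above can be sketched as a small decoder in Python. It takes the full literal text, including the surrounding double quotes, and returns the decoded value. The name `decode_string_literal` and the use of `ValueError` for syntax errors are assumptions made for this illustration.

```python
# Character after the backslash -> decoded character.
ESCAPES = {'"': '"', "\\": "\\", "n": "\n", "r": "\r", "t": "\t", "0": "\0"}


def decode_string_literal(text: str) -> str:
    """Decode a complete string literal, quotes included, into its value."""
    if not text.startswith('"'):
        raise ValueError("string literal must start with a double quote")
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == '"':
            if i != len(text) - 1:
                raise ValueError("characters after the closing quote")
            return "".join(out)
        if ch in "\n\r":
            raise ValueError("unescaped line ending in string literal")
        if ch == "\\":
            if i + 1 >= len(text) or text[i + 1] not in ESCAPES:
                raise ValueError("unsupported escape sequence")
            out.append(ESCAPES[text[i + 1]])
            i += 2
            continue
        out.append(ch)
        i += 1
    # Reached end of input without seeing an unescaped closing quote.
    raise ValueError("unterminated string literal")
```

Note that a literal ending in `\"` without a further closing quote is unterminated, because the final quote is consumed as part of the escape sequence.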

#Comments

Masterbelt has line comments, block comments, and documentation comments. These are distinct lexical categories.

#Line Comments

A line comment starts with // and continues until the end of the line.

Masterbelt
// line comment
null

Line comments are named syntax nodes. They are ignored as separators but remain available to syntax queries and tooling.

A token starting with /// is a documentation comment, not a line comment.
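
The distinction between line comments and documentation comments can be sketched in Python by checking for the longer `///` prefix first. The name `scan_line_comment` and the returned kind labels are hypothetical. Stopping at a lone carriage return is an assumption of this sketch; the specification only names LF and CRLF as line endings.

```python
from typing import Optional, Tuple


def scan_line_comment(source: str, pos: int) -> Optional[Tuple[str, str]]:
    """Return (kind, text) for a comment starting at pos, or None."""
    if not source.startswith("//", pos):
        return None
    # "///" must be tested before falling back to an ordinary line comment.
    kind = "doc_comment" if source.startswith("///", pos) else "line_comment"
    end = pos
    while end < len(source) and source[end] not in "\r\n":
        end += 1
    return kind, source[pos:end]
```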

#Block Comments

A block comment starts with /* and ends with */.

Masterbelt
/* block comment */
true

Block comments may span multiple lines.

Masterbelt
/* block
   comment */
0xFF

Block comments are named syntax nodes. They are ignored as separators but remain available to syntax queries and tooling.

Block comments do not nest.

An unterminated block comment is a syntax error.
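
Because block comments do not nest, a scanner only needs to find the first `*/` after the opener. The Python sketch below illustrates this; the name `scan_block_comment` is hypothetical, and the sketch assumes that the `*` of the opening `/*` cannot also serve as the `*` of the closing `*/`.

```python
from typing import Optional


def scan_block_comment(source: str, pos: int) -> Optional[int]:
    """Return the index just past the block comment starting at pos.

    Returns None if no block comment starts at pos and raises ValueError
    if the comment is never closed.
    """
    if not source.startswith("/*", pos):
        return None
    # Search after the opener; the first "*/" closes the comment (no nesting).
    close = source.find("*/", pos + 2)
    if close == -1:
        raise ValueError("unterminated block comment")
    return close + 2
```

For the input `/* a /* b */ c */`, the comment ends at the first `*/`, so ` c */` is left over as ordinary source text.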

#Documentation Comments

A documentation comment starts with /// and continues until the end of the line.

Documentation comments attach to the next declaration or statement. Multiple consecutive documentation comments may attach to the same declaration or statement.

Masterbelt
/// docs for null
null

/// docs for true
/// second doc line
true

Documentation comments are part of the syntax tree and are not treated as ordinary ignored comments.

Documentation comments may only appear immediately before the declaration, statement, or grouped const item they document.

A documentation comment with no following declaration, statement, or grouped const item is a syntax error.

A documentation comment after another token on the same line is a syntax error.
