Overview
C squared
Caution
This spec is deprecated, as C² is being redesigned and rewritten.
You can ignore it entirely for now: apart from the "What this book IS NOT" and "How and when to read this book" sections, most of its information is obsolete.
Overview
C² is an enhanced version of the [C language](https://en.wikipedia.org/wiki/C_(programming_language)), made to:
- Be easier.
- Be more flexible.
- Include more features for convenience.
- Include more complex structures and types overall.
- Give better error reports and help overall.
- And, most importantly, make low-level programming simpler.
What this book IS NOT
This book is the specification of C². It contains all information about the behavior and definitions of C². This book is NOT a manual or documentation for C²; it assumes the reader already has background knowledge of C².
If C² declares a magic singleton, you will find it here; if it contains a primitive built-in type, you will find it here.
You can’t learn C² just by reading this book. It does not contain explanatory paragraphs intended for teaching, only objective information dedicated to showing how the language works internally.
How and when to read this book
There are two main reasons why you would read this book:
- Answer a specific question: search for the answer by pressing s, and see if it can be solved.
- Learn about the language’s internals: use the table of contents on the left.
Lexical analysis
In this section you will find:
- What sources the C² compiler allows.
- What reserved keywords C² contains.
- What other members (such as, but not limited to, identifiers) C² includes.
Source
C² sources must be:
- Files with the .c2p extension.
- Encoded with UTF-8/ANSI.
Reserved keywords
What is a reserved keyword?
Reserved keywords are special identifiers, built into the compiler, which can’t be used by the user as a variable name, as they instead serve a special function (for example, starting a specific statement).
Reserved keywords in C²
Reserved keywords in C² are:
if in or and for nil new try int bool case else enum func self true break catch defer false float spawn throw until while import repeat return string struct switch default private continue function optional
These can’t be used as:
- Variable names.
- Function names.
- Structure names.
- Field names.
- Argument names.
Other lexical members
Identifiers
Identifiers consist of:
- A letter or an underscore as the first character.
- Any letter, digit or underscore for every following character, repeated zero or more times.
Identifier examples:
myvar _default c9
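The identifier rule above can be sketched as a small C check. This is only an illustration: is_identifier is a hypothetical helper, not a function of the C² compiler, it relies on the default "C" locale so that only ASCII letters count, and it does not reject reserved keywords.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of the C² compiler): returns true when `s`
 * follows the identifier rule. Reserved keywords are NOT rejected here. */
bool is_identifier(const char *s)
{
    assert(s != NULL);

    /* First character: a letter or an underscore. */
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return false;

    /* Every following character: a letter, a digit or an underscore. */
    for (size_t i = 1; s[i] != '\0'; i++) {
        if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
            return false;
    }
    return true;
}
```

With this sketch, the three examples above (myvar, _default and c9) are accepted, while a name such as 9c is rejected because it starts with a digit.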
Strings
Strings consist of:
- Either a single quote (') or a double quote ("), called the delimiter.
- Any character except the delimiter (the single or double quote that opened the string), repeated zero or more times.
- The delimiter once again.
Characters inside strings can be escaped with a backslash (\), meaning you can insert the delimiter of the string if you use a backslash (\) to escape it.
List of escaped characters:
- \t: tab.
- \n: newline.
- \r: carriage return.
- \", \': double/single quote.
- \\: backslash.
- \0: null character.
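As a rough sketch of how the escape list could be applied, the C code below decodes the body of a string (the part between the delimiters). Both decode_escape and unescape are hypothetical names chosen for this example. How the real lexer reports an unknown escape is not stated in this spec, so returning -1 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: maps the character after a backslash to the
 * character it stands for, or returns -1 for an unknown escape. */
int decode_escape(char c)
{
    switch (c) {
    case 't':  return '\t';
    case 'n':  return '\n';
    case 'r':  return '\r';
    case '"':  return '"';
    case '\'': return '\'';
    case '\\': return '\\';
    case '0':  return '\0';
    default:   return -1;
    }
}

/* Decodes `body` into `out`, which must hold at least strlen(body) + 1
 * bytes. Returns the decoded length, or -1 on an unknown or truncated
 * escape (an assumption of this sketch). */
long unescape(const char *body, char *out)
{
    assert(body != NULL && out != NULL);

    long n = 0;
    for (size_t i = 0; body[i] != '\0'; i++) {
        if (body[i] != '\\') {          /* ordinary character: copy it */
            out[n++] = body[i];
            continue;
        }
        i++;                            /* look at the escaped character */
        if (body[i] == '\0')
            return -1;                  /* backslash at the very end */
        int decoded = decode_escape(body[i]);
        if (decoded < 0)
            return -1;
        out[n++] = (char)decoded;
    }
    out[n] = '\0';
    return n;
}
```

Because \0 decodes to a null character in the middle of the result, callers should rely on the returned length rather than on the terminator.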
String examples:
"this is a string" 'this is a string too!' "this string has \"escaped\"\tcharacters!"
Digits
There are three types of digits:
- Decimal digits.
- Hexadecimal digits.
- Binary digits.
Decimal digits
They consist of:
- Any character from 0 to 9, repeated one or more times.
- Optionally, a period (.) followed by any character from 0 to 9, repeated one or more times.
Hexadecimal digits
They consist of:
- The prefix 0x, which can’t change.
- Any character from 0 to 9 and a to f (case insensitive), repeated one or more times.
Binary digits
They consist of:
- The prefix 0b, which can’t change.
- Either 0 or 1, repeated one or more times.
Digit examples:
5 3.14 0x56F 0b1011
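The three forms can be told apart with a short classifier. The sketch below follows the rules of this section literally (lowercase 0x and 0b prefixes only). The num_kind enum and classify_number are hypothetical names used for illustration, not compiler API.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum { NUM_INVALID, NUM_DECIMAL, NUM_HEX, NUM_BINARY } num_kind;

/* Classifies a whole string according to the digit rules above. */
num_kind classify_number(const char *s)
{
    assert(s != NULL);
    size_t i = 0;

    if (s[0] == '0' && s[1] == 'x') {           /* hexadecimal */
        i = 2;
        while (isxdigit((unsigned char)s[i]))
            i++;
        return (i > 2 && s[i] == '\0') ? NUM_HEX : NUM_INVALID;
    }
    if (s[0] == '0' && s[1] == 'b') {           /* binary */
        i = 2;
        while (s[i] == '0' || s[i] == '1')
            i++;
        return (i > 2 && s[i] == '\0') ? NUM_BINARY : NUM_INVALID;
    }

    /* Decimal: one or more digits... */
    while (isdigit((unsigned char)s[i]))
        i++;
    if (i == 0)
        return NUM_INVALID;

    /* ...optionally followed by a period and one or more digits. */
    if (s[i] == '.') {
        size_t frac_start = ++i;
        while (isdigit((unsigned char)s[i]))
            i++;
        if (i == frac_start)
            return NUM_INVALID;
    }
    return s[i] == '\0' ? NUM_DECIMAL : NUM_INVALID;
}
```

Under these rules 3. and 0x are rejected, since both the fractional part and the digits after a prefix must contain at least one character.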
Operators, punctuation
Operators can’t be matched by a single expression, as there is a static list of possible operators/punctuation.
All operators supported in C²:
+ += ++ - -= -> * *= / // /= % ^ = == ! != < <= > >= & && | || ?
All punctuation supported in C²:
. .. ... ( ) { } [ ] : ; , #
Syntax
C²’s syntax determines how a C squared program must look.
Concrete syntax
The abstract syntax might correspond to various concrete ones, even if they are different in style. That’s why the concrete syntax exists.
The concrete syntax dictates how a C² program MUST be written syntactically. It is formed by various concrete rules which, if not followed, cause a syntax error.
Lexer
The lexer is a program which generates a list of tokens [1]. It is the first step in C²’s compiler: it runs before the parser and lets the parser work with a list of words rather than with the whole source directly. You COULD consider the lexer a “crutch” for the parser, or even a part of it.
Further reading of the lexer.
To see what input the lexer accepts, read Lexical information.
Note
The lexer’s source can be found here.
Parser
The parser is a program which generates an AST (Abstract Syntax Tree).
The abstract syntax tree lets the compiler read the original source in a friendlier format: a syntax tree held in memory. The source itself is only human readable; compiling it directly from its text form would technically be possible, but inefficient and too complex.
Further reading of the parser.
Note
The parser’s source can be found here.
[1] A token is a word, digit or character generated by the lexer and used by the parser to generate the AST. Further reading.
Grammar
Also known as concrete syntax.
| Usage | Notation |
|---|---|
| definition | = |
| concatenation | , |
| termination | ; |
| alternation | \| |
| optional | [ … ] |
| repetition | { … } |
| grouping | ( … ) |
| terminal string | " … " |
| terminal string | ' … ' |
| comment | (* … *) |
| special sequence | ? … ? |
| exception | - |
Note
Table from this GitHub gist.
Warning
This grammar might be outdated compared to the compiler.
Program = { Ignorable | Declaration | Statement } ;
Ignorable = WS | Comment ;
Comment = "//", { anychar - "\n" } ;
(* ===== Lexical ===== *)
boolean = "true" | "false" ;
ident = ( ( letter | "_" ), { letter | digit | "_" } ) - reserved ;
number = digit, { digit }, [ ".", digit, { digit } ] ;
string =
'"', { ( anychar - '"' ) | interpolated }, '"'
| "'", { ( anychar - "'" ) | interpolated }, "'" ;
interpolated = "$", "{", { anychar - "}" }, "}" ;
tag = "@", string ;
letter =
"a" | "b" | "c" | "d" | "e" | "f" | "g"
| "h" | "i" | "j" | "k" | "l" | "m" | "n"
| "o" | "p" | "q" | "r" | "s" | "t" | "u"
| "v" | "w" | "x" | "y" | "z"
| "A" | "B" | "C" | "D" | "E" | "F" | "G"
| "H" | "I" | "J" | "K" | "L" | "M" | "N"
| "O" | "P" | "Q" | "R" | "S" | "T" | "U"
| "V" | "W" | "X" | "Y" | "Z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
WS = " " | "\t" | "\n" | "\r" ;
anychar = ? any character ? ;
bit_depth = "8" | "16" | "32" | "64" ;
(* ===== Literals ===== *)
Literal =
number
| string
| boolean
| tag
| ArrayLiteral
| MapLiteral ;
ArrayLiteral = "[", [ Expr, { ",", Expr } ], "]" ;
KeyValue = "[", Expr, "]", ":", Expr ;
MapLiteral = "{", [ KeyValue, { ",", KeyValue } ], "}" ;
(* ===== Types ===== *)
Type =
["const"], ident - "void"
| ArrayType
| MapType
| FunctionType ;
ArrayType = "[", Type, "]" ;
MapType = "{", Type, ",", Type, "}" ;
FunctionType = "(", [ Type, { ",", Type } ], ")", Type ;
TypeAnnot = ":", Type ;
(* ===== Declarations ===== *)
Declaration =
Variable
| func
| Structure
| Enumeration ;
Variable =
Type,
ident,
"=",
Expr ;
func =
{ Decorator },
"func",
ident,
"(",
[ Parameter, { ",", Parameter } ],
")",
"->", Type,
"{", Block, "}" ;
Decorator = "#", ident ;
Parameter =
"self"
| "...", ident, TypeAnnot
| ident, TypeAnnot, [ "=", Expr ] ;
StructParameter = ident, TypeAnnot, [ "=", Expr ] ;
Structure =
"struct",
ident,
[ "(", StructParameter, { ",", StructParameter } , ")" ],
"{",
{ ( StructureField | func ), ";" },
"}" ;
StructureField =
[ "private" ],
ident,
[ TypeAnnot ],
"=",
Expr ;
Enumeration =
"enum",
ident,
"{",
ident, { ",", ident },
"}" ;
(* ===== Statements ===== *)
Statement =
If
| Switch
| Defer
| Return
| Throw
| Try
| Loop
| Declaration
| Expr ;
If =
"if", "(", Expr, ")", "{", Block, "}",
{ "else", "if", "(", Expr, ")", "{", Block, "}" },
[ "else", "{", Block, "}" ] ;
Switch =
"switch", "(", Expr, ")", "{",
{ Case },
[ "default", "{", Block, "}" ],
"}" ;
Case =
"case",
"(",
Expr, { ",", Expr },
")",
"{", Block, "}" ;
Defer =
"defer", Expr
| "defer", "{", Block, "}" ;
Return = "return", [ Expr ] ;
Throw = "throw", Expr ;
Try =
"try", "{", Block, "}",
"catch", "(", ident, ")", "{", Block, "}" ;
Loop =
For
| While
| Until
| Repeat ;
For =
"for",
"(",
ident, { ",", ident },
"in",
Expr,
")",
"{", Block, "}" ;
While =
"while",
[ "(", Expr, ")" ],
"{", Block, "}" ;
Until =
"until",
"(", Expr, ")",
"{", Block, "}" ;
Repeat =
"repeat",
Expr,
"{", Block, "}" ;
Block = { Ignorable | Statement, ( ";" | "}" ) } ;
(* ===== Expressions ===== *)
Expr = Assignment ;
Assignment =
LogicOr,
[ ( "=" | "+=" | "-=" | "*=" | "/=" ), Assignment ] ;
LogicOr =
LogicAnd, { "or", LogicAnd } ;
LogicAnd =
Equality, { "and", Equality } ;
Equality =
Relational, { ( "==" | "!=" ), Relational } ;
Relational =
Additive, { ( "<" | "<=" | ">" | ">=" ), Additive } ;
Additive =
Multiplicative, { ( "+" | "-" ), Multiplicative } ;
Multiplicative =
Exponent, { ( "*" | "/" | "%" ), Exponent } ;
Exponent =
Unary, [ "^", Exponent ] ;
Unary =
( "-" | "#" ), Unary
| Range ;
Primary =
Literal
| ident
| "(", Expr, ")"
| FunctionCall
| AnonFunction
| New
| Spawn ;
Postfix =
Primary,
{ ".", ident
| "[", Expr, "]"
},
[ "++" | "--" ] ;
Range =
Postfix, [ "..", Postfix ] ;
(* ===== Reserved keywords ===== *)
func = "func" ;
if = "if" ;
else = "else" ;
switch = "switch" ;
case = "case" ;
default = "default" ;
while = "while" ;
for = "for" ;
return = "return" ;
throw = "throw" ;
struct = "struct" ;
enum = "enum" ;
const = "const" ;
let = "let" ;
in = "in" ;
import = "import" ;
new = "new" ;
repeat = "repeat" ;
until = "until" ;
defer = "defer" ;
try = "try" ;
catch = "catch" ;
spawn = "spawn" ;
private = "private" ;
self = "self" ;
or = "or" ;
and = "and" ;
reserved =
if | else | switch | case | default | while | for
| return | throw | struct | enum | const | let
| in | import | new | repeat | until | defer | try
| catch | spawn | private | self | or | and ;
Lexer
The lexer has the job of turning your source into a list of tokens.
Given this source:
func main() -> int
{
puts("Hello, World!")
return 0
}
The lexer will generate this token list:
format:
TOKEN_TYPE,LEXEME
TOKEN_KEYWORD_FUNC, "func";
TOKEN_IDENTIFIER, "main";
TOKEN_OPEN_PAREN, "(";
TOKEN_CLOSE_PAREN, ")";
TOKEN_ARROW, "->";
TOKEN_KEYWORD_INT, "int";
TOKEN_OPEN_BRACE, "{";
TOKEN_IDENTIFIER, "puts";
TOKEN_OPEN_PAREN, "(";
TOKEN_STRING, "Hello, World!";
TOKEN_CLOSE_PAREN, ")";
TOKEN_KEYWORD_RETURN, "return";
TOKEN_NUMBER, "0";
TOKEN_CLOSE_BRACE, "}";
This output will then be fed to the parser.
The csq_lexer type
The csq_lexer type represents the lexer internally. It serves as storage for the lexer state and for position tracking (line and column, for error reports).
It contains:
- buffer: entire source file buffer.
- start: start of the current token being scanned.
- current: current position in buffer.
- line: current line number (1-indexed).
- column: current column number (1-indexed).
- path: path to source file for error messages.
- diag: diagnostic reporter for error handling.
Full definition:
typedef struct csq_lexer {
    const char* buffer;
    const char* start;
    const char* current;
    size_t line;
    size_t column;
    const char* path;
    DiagReporter* diag;
} csq_lexer;
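The position-tracking role of line and column can be illustrated with a small helper. This is a sketch under assumptions: lexer_advance is a hypothetical name, DiagReporter is declared as an opaque type because its definition is not part of this section, and the real lexer may update its position differently.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque here: the real DiagReporter definition is not shown in this spec. */
typedef struct DiagReporter DiagReporter;

typedef struct csq_lexer {
    const char *buffer;
    const char *start;
    const char *current;
    size_t line;
    size_t column;
    const char *path;
    DiagReporter *diag;
} csq_lexer;

/* Hypothetical helper: consumes one character and keeps the 1-indexed
 * line and column in sync with `current`. */
char lexer_advance(csq_lexer *lexer)
{
    assert(lexer != NULL && *lexer->current != '\0');

    char c = *lexer->current++;
    if (c == '\n') {
        lexer->line++;          /* a newline starts a new line... */
        lexer->column = 1;      /* ...whose first column is 1 */
    } else {
        lexer->column++;
    }
    return c;
}
```

Keeping this bookkeeping in one place means every token can later be reported with the line and column at which it starts.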
Main lexing process
To lex, C² first creates the lexer using the lexer_create function.
C² uses a state table to lex. Lexing starts with a call to lexer_next.
lexer_next:
- Checks for whitespace, which is then skipped.
- Updates lexer->start to lexer->current.
- If the end of the buffer was reached, an EOF (End of file) token is returned.
- The lexer gets the current lex state. The lex state is of type csq_lexstate, defined as csq_token (*csq_lexstate)(csq_lexer*), which acts as a function pointer for lexer state handlers.
- state(lexer) is returned, as state is a function pointer.
  - If there is no state (the state is NULL), then an error token is returned.
state(lexer) will then return a token. The state is retrieved through the get_lex_state function, which checks the state table and returns the correct lexer state handler depending on the lexer’s current character.
For example, if the character is a digit (0 to 9), then lex_number, the handler for TOKEN_NUMBER tokens, is returned.
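The dispatch described above can be sketched in C as follows. The types are deliberately reduced (no line, column or diagnostics), only a lex_number handler is registered, and the signature of get_lex_state is an assumption. Consuming one character before returning an error token is a choice of this sketch so that lexing always makes progress; the real lexer may behave differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Reduced versions of the real types, for illustration only. */
typedef enum { TOKEN_EOF, TOKEN_ERROR, TOKEN_NUMBER } csq_tktype;
typedef struct { csq_tktype type; const char *start; size_t length; } csq_token;
typedef struct { const char *start; const char *current; } csq_lexer;

/* A lex state is a pointer to a handler function. */
typedef csq_token (*csq_lexstate)(csq_lexer *);

/* Static storage: every entry starts out as NULL (no handler). */
static csq_lexstate state_table[256];

static csq_token make_token(csq_lexer *lx, csq_tktype type)
{
    csq_token t = { type, lx->start, (size_t)(lx->current - lx->start) };
    return t;
}

static csq_token lex_number(csq_lexer *lx)
{
    while (isdigit((unsigned char)*lx->current))
        lx->current++;
    return make_token(lx, TOKEN_NUMBER);
}

void initialize_state_table(void)
{
    for (char c = '0'; c <= '9'; c++)
        state_table[(unsigned char)c] = lex_number;
}

/* Assumed signature: looks up the handler for the current character. */
static csq_lexstate get_lex_state(const csq_lexer *lx)
{
    return state_table[(unsigned char)*lx->current];
}

csq_token lexer_next(csq_lexer *lx)
{
    assert(lx != NULL);

    while (*lx->current == ' ' || *lx->current == '\t' ||
           *lx->current == '\n' || *lx->current == '\r')
        lx->current++;                       /* skip whitespace */

    lx->start = lx->current;                 /* a new token starts here */

    if (*lx->current == '\0')                /* end of buffer */
        return make_token(lx, TOKEN_EOF);

    csq_lexstate state = get_lex_state(lx);  /* look up the handler */
    if (state == NULL) {                     /* no handler: error token */
        lx->current++;
        return make_token(lx, TOKEN_ERROR);
    }
    return state(lx);                        /* the handler lexes the token */
}
```

One appeal of a table like this is that adding a new token family only requires registering another handler; lexer_next itself stays unchanged.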
Lexer state handlers
There are 6 lexer state handlers: lex_whitespace, lex_identifier, lex_number, lex_string, lex_tag and lex_operator.
lex_whitespace just advances to the next token.
lex_identifier first advances while the current character is alphanumeric (a-z, A-Z or 0-9) or an underscore. It then sets the type variable through the check_keyword function, which returns either a TOKEN_KEYWORD_* type or, if the identifier isn’t a keyword, TOKEN_IDENTIFIER. Finally, the function checks that the identifier is correctly formatted and, if everything is valid, returns a new token of that type.
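A keyword lookup in the spirit of check_keyword might look like the sketch below. The signature (a start pointer plus a length) and the linear table scan are assumptions, since the real function is not shown here, and the enum and keyword table are cut down to three entries for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced token type list, for illustration only. */
typedef enum {
    TOKEN_IDENTIFIER,
    TOKEN_KEYWORD_FUNC,
    TOKEN_KEYWORD_IF,
    TOKEN_KEYWORD_RETURN
} csq_tktype;

static const struct {
    const char *text;
    csq_tktype type;
} keywords[] = {
    { "func",   TOKEN_KEYWORD_FUNC },
    { "if",     TOKEN_KEYWORD_IF },
    { "return", TOKEN_KEYWORD_RETURN },
};

/* Assumed signature: the lexeme is given as start + length because it
 * points into the source buffer and is not null-terminated. */
csq_tktype check_keyword(const char *start, size_t length)
{
    assert(start != NULL);

    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        /* Both the length and the bytes must match exactly. */
        if (strlen(keywords[i].text) == length &&
            memcmp(keywords[i].text, start, length) == 0)
            return keywords[i].type;
    }
    return TOKEN_IDENTIFIER;    /* not a keyword */
}
```

Comparing lengths first matters: without it, an identifier such as iff would match the keyword if on its first two characters.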
lex_number first checks the base of the number: the default is 10, but if a prefix that modifies it (0x/0X, 0b/0B, 0o/0O) is found, the base changes accordingly (16, 2 and 8, respectively).
After retrieving the base, it lexes the digits, checks that the format is correct and finally returns a new TOKEN_NUMBER token.
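The base detection step could be written roughly as below. detect_base is a hypothetical helper and not the compiler's actual code; it only inspects the prefix and reports how many characters the prefix occupies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the base implied by the prefix of `s`
 * (16, 2, 8, or 10 when there is no prefix) and stores the prefix
 * length (0 or 2) in `prefix_len`. */
int detect_base(const char *s, size_t *prefix_len)
{
    assert(s != NULL && prefix_len != NULL);

    *prefix_len = 0;
    if (s[0] != '0')
        return 10;

    int base;
    switch (s[1]) {
    case 'x': case 'X': base = 16; break;
    case 'b': case 'B': base = 2;  break;
    case 'o': case 'O': base = 8;  break;
    default:            return 10;  /* a plain leading zero */
    }
    *prefix_len = 2;
    return base;
}
```

The digits that follow the prefix would then be validated against the returned base, which is the "check correct format" step mentioned above.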
lex_string first sets the variable quote to either a double or a single quote, depending on how the string starts. It then enters a while loop: while the current character isn’t the quote and there are still characters left, it adds new characters to the string.
To add a character, it first checks for an escape sequence; if it doesn’t find one, it simply adds the current character. It finally returns a new TOKEN_STRING token.
lex_tag is very similar to lex_string, but it will first check for the @ prefix.
lex_operator uses a switch statement to check which operator was detected: it first checks the first character and, if several two- or three-character operators/punctuation start with that character, it then checks how the token continues. It finally returns a new TOKEN_* token (every operator/punctuation has its own csq_tktype).
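The "check the first character, then see how it continues" approach can be sketched for two operator families. scan_operator is a hypothetical function, the enum is reduced to the tokens used here, and returning TOKEN_ERROR with length 0 for any other character is a simplification of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced token type list, for illustration only. */
typedef enum {
    TOKEN_ERROR,
    TOKEN_DOT, TOKEN_DOUBLE_DOT, TOKEN_TRIPLE_DOT,
    TOKEN_MINUS, TOKEN_MINUS_ASSIGN, TOKEN_DECREMENT, TOKEN_ARROW
} csq_tktype;

/* Hypothetical helper: picks the longest operator starting at `s` and
 * stores its length in `length`. */
csq_tktype scan_operator(const char *s, size_t *length)
{
    assert(s != NULL && length != NULL);

    switch (s[0]) {
    case '.':
        /* Try the longest form first: "...", then "..", then ".". */
        if (s[1] == '.' && s[2] == '.') { *length = 3; return TOKEN_TRIPLE_DOT; }
        if (s[1] == '.')                { *length = 2; return TOKEN_DOUBLE_DOT; }
        *length = 1;
        return TOKEN_DOT;
    case '-':
        *length = 2;
        if (s[1] == '=') return TOKEN_MINUS_ASSIGN;
        if (s[1] == '>') return TOKEN_ARROW;
        if (s[1] == '-') return TOKEN_DECREMENT;
        *length = 1;                /* nothing longer matched */
        return TOKEN_MINUS;
    default:
        *length = 0;                /* not handled by this sketch */
        return TOKEN_ERROR;
    }
}
```

Trying the longest candidate first is what keeps ... from being lexed as .. followed by a single dot.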
The lex state table
This table is used to determine what handler is used for the current token.
void initialize_state_table(void) {
state_table[' '] = lex_whitespace;
state_table['\t'] = lex_whitespace;
state_table['\n'] = lex_whitespace;
state_table['\r'] = lex_whitespace;
state_table['\v'] = lex_whitespace;
state_table['\f'] = lex_whitespace;
for (char c = 'a'; c <= 'z'; c++)
state_table[(unsigned char)c] = lex_identifier;
for (char c = 'A'; c <= 'Z'; c++)
state_table[(unsigned char)c] = lex_identifier;
state_table['_'] = lex_identifier;
for (char c = '0'; c <= '9'; c++)
state_table[(unsigned char)c] = lex_number;
state_table['"'] = lex_string;
state_table['\''] = lex_string;
state_table['@'] = lex_tag;
state_table['+'] = lex_operator;
state_table['-'] = lex_operator;
state_table['*'] = lex_operator;
state_table['/'] = lex_operator;
state_table['%'] = lex_operator;
state_table['^'] = lex_operator;
state_table['='] = lex_operator;
state_table['!'] = lex_operator;
state_table['<'] = lex_operator;
state_table['>'] = lex_operator;
state_table['&'] = lex_operator;
state_table['|'] = lex_operator;
state_table['.'] = lex_operator;
state_table['('] = lex_operator;
state_table[')'] = lex_operator;
state_table['{'] = lex_operator;
state_table['}'] = lex_operator;
state_table['['] = lex_operator;
state_table[']'] = lex_operator;
state_table[':'] = lex_operator;
state_table[';'] = lex_operator;
state_table[','] = lex_operator;
state_table['#'] = lex_operator;
state_table['?'] = lex_operator;
state_table['\0'] = NULL;
}
Tokens
The csq_token type
The csq_token type represents a token generated by the lexer.
It contains:
- type: the token’s type.
- start: the start of the token in source.
- length: the length of the token.
- line: line where the token is found (1-indexed).
- column: column where the token starts (1-indexed).
Full definition:
typedef struct {
    csq_tktype type;
    const char *start;
    size_t length;
    size_t line;
    size_t column;
} csq_token;
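Since start points into the source buffer, a token's text is not null-terminated and has to be read together with length. The sketch below shows one way to do that when formatting a token for a message. token_describe is a hypothetical helper, and the enum is reduced to two entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Reduced token type list, for illustration only. */
typedef enum { TOKEN_IDENTIFIER, TOKEN_NUMBER } csq_tktype;

typedef struct {
    csq_tktype type;
    const char *start;
    size_t length;
    size_t line;
    size_t column;
} csq_token;

/* Hypothetical helper: writes `line:column "lexeme"` into `out` and
 * returns the number of characters snprintf would have written. */
int token_describe(const csq_token *tok, char *out, size_t cap)
{
    assert(tok != NULL && out != NULL);

    /* "%.*s" prints exactly `length` bytes starting at `start`. */
    return snprintf(out, cap, "%zu:%zu \"%.*s\"",
                    tok->line, tok->column, (int)tok->length, tok->start);
}
```

Storing a pointer and a length instead of a copied string means the lexer does not need to allocate memory for each token.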
The csq_tktype enum
This enum contains every kind of token the lexer can generate.
This is the full definition:
typedef enum {
TOKEN_EOF,
TOKEN_ERROR,
TOKEN_IDENTIFIER,
TOKEN_NUMBER,
TOKEN_STRING,
TOKEN_TAG,
TOKEN_BOOLEAN,
TOKEN_OPERATOR,
TOKEN_PLUS,
TOKEN_MINUS,
TOKEN_STAR,
TOKEN_SLASH,
TOKEN_PERCENT,
TOKEN_CARET,
TOKEN_AMPERSAND,
TOKEN_PIPE,
TOKEN_BANG,
TOKEN_ASSIGN,
TOKEN_PLUS_ASSIGN,
TOKEN_MINUS_ASSIGN,
TOKEN_STAR_ASSIGN,
TOKEN_SLASH_ASSIGN,
TOKEN_EQUAL,
TOKEN_NOT_EQUAL,
TOKEN_LESS,
TOKEN_GREATER,
TOKEN_LESS_EQUAL,
TOKEN_GREATER_EQUAL,
TOKEN_LOGICAL_AND,
TOKEN_LOGICAL_OR,
TOKEN_INCREMENT,
TOKEN_DECREMENT,
TOKEN_DOUBLE_DOT,
TOKEN_TRIPLE_DOT,
TOKEN_RANGE,
TOKEN_ARROW,
TOKEN_OPEN_PAREN,
TOKEN_CLOSE_PAREN,
TOKEN_OPEN_BRACE,
TOKEN_CLOSE_BRACE,
TOKEN_OPEN_BRACKET,
TOKEN_CLOSE_BRACKET,
TOKEN_COLON,
TOKEN_SEMICOLON,
TOKEN_COMMA,
TOKEN_DOT,
TOKEN_HASH,
TOKEN_AT,
TOKEN_QUESTION_MARK,
TOKEN_KEYWORD_FUNCTION,
TOKEN_KEYWORD_FUNC,
TOKEN_KEYWORD_IF,
TOKEN_KEYWORD_ELSE,
TOKEN_KEYWORD_SWITCH,
TOKEN_KEYWORD_CASE,
TOKEN_KEYWORD_DEFAULT,
TOKEN_KEYWORD_WHILE,
TOKEN_KEYWORD_FOR,
TOKEN_KEYWORD_IN,
TOKEN_KEYWORD_RETURN,
TOKEN_KEYWORD_THROW,
TOKEN_KEYWORD_STRUCT,
TOKEN_KEYWORD_ENUM,
TOKEN_KEYWORD_IMPORT,
TOKEN_KEYWORD_NEW,
TOKEN_KEYWORD_REPEAT,
TOKEN_KEYWORD_UNTIL,
TOKEN_KEYWORD_DEFER,
TOKEN_KEYWORD_TRY,
TOKEN_KEYWORD_CATCH,
TOKEN_KEYWORD_SPAWN,
TOKEN_KEYWORD_PRIVATE,
TOKEN_KEYWORD_SELF,
TOKEN_KEYWORD_OR,
TOKEN_KEYWORD_AND,
TOKEN_KEYWORD_TRUE,
TOKEN_KEYWORD_FALSE,
TOKEN_KEYWORD_BOOL,
TOKEN_KEYWORD_INT,
TOKEN_KEYWORD_STRING,
TOKEN_KEYWORD_FLOAT,
TOKEN_KEYWORD_BREAK,
TOKEN_KEYWORD_CONTINUE,
TOKEN_KEYWORD_OPTIONAL,
TOKEN_KEYWORD_NIL,
} csq_tktype;