Overview

C squared

Caution

This spec is deprecated, as C² is being redesigned and rewritten.

You can safely ignore it in its entirety for now: apart from the "What this book IS NOT" and "How and when to read this book" sections, most of the information is obsolete.

Overview

C² is an enhanced version of the [C language](https://en.wikipedia.org/wiki/C_(programming_language)), made to:

Be easier.
Be more flexible.
Include more features for convenience.
Include more complex structures and types overall.
Give better error reports and help overall.
And, most importantly, make low-level programming simpler.

What this book IS NOT

This book is the specification of C². It contains all information about the behavior and definitions of C². This book is NOT a manual or documentation for C²; it assumes the reader already has background knowledge of C².

If C² declares a magic singleton, you will find it here; if it contains a primitive built-in type, you will find it here.

You can’t learn C² just by reading this book. It does not contain explanatory paragraphs intended for teaching, only objective information describing how the language works internally.

How and when to read this book

There are two main reasons why you would read this book:

To answer a specific question: press s to search for it.
To learn about the language’s internals: use the table of contents on the left.

Lexical analysis

In this section you will find:

What sources the C² compiler allows.
What reserved keywords C² contains.
What other members (such as, but not limited to, identifiers) C² includes.

Source

C² sources must be:

Files with the .c2p extension.
Encoded as UTF-8 or ANSI.

Reserved keywords

What is a reserved keyword?

Reserved keywords are special identifiers, built into the compiler, which can’t be used by the user as a variable name, as they instead serve a special function (for example, starting a specific statement).

Reserved keywords in C²

Reserved keywords in C² are:

if
in
or
and
for
nil
new
try
int
bool
case
else
enum
func
self
true
break
catch
defer
false
float
spawn
throw
until
while
import
repeat
return
string
struct
switch
default
private
continue
function
optional

These can’t be used as:

Variable names.
Function names.
Structure names.
Field names.
Argument names.

Other lexical members

Identifiers

Identifiers consist of:

A letter or an underscore, exactly once.
Any letter, digit or underscore, repeated zero or more times.

Identifier examples:
myvar
_default
c9
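
The identifier rule above can be sketched as a small C check. This is only an illustration: the name is_identifier_shape is hypothetical, the check assumes ASCII letters (the default "C" locale), and it deliberately ignores the reserved keyword list.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper (not part of the C² compiler): returns true if `s`
 * has the shape of an identifier, i.e. one letter or underscore followed
 * by zero or more letters, digits or underscores.
 * It does NOT reject reserved keywords. */
static bool is_identifier_shape(const char *s) {
    unsigned char first = (unsigned char)s[0];
    if (!(isalpha(first) || first == '_'))
        return false; /* also rejects the empty string */
    for (const char *p = s + 1; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (!(isalnum(c) || c == '_'))
            return false;
    }
    return true;
}
```

With this sketch, myvar, _default and c9 are accepted, while 9c is rejected because it starts with a digit.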

Strings

Strings consist of:

Either a single quote (') or double quote ("), called the delimiter.
Any character except the delimiter, repeated zero or more times.
The delimiter once again.

Characters inside strings can be escaped with a backslash (\). This means the delimiter itself can appear inside the string if it is preceded by a backslash.

List of escaped characters:

\t: tab
\n: newline
\r: carriage return
\", \': double/single quote.
\\: backslash.
\0: null character.
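
As an illustration of the list above, the mapping from the character after the backslash to the character it stands for could look like the following. The function name decode_escape and its signature are assumptions made for this sketch, not the compiler's actual code.

```c
#include <stdbool.h>

/* Hypothetical helper: given the character that follows a backslash,
 * stores the character it stands for in *out and returns true.
 * Returns false for sequences that are not in the list above. */
static bool decode_escape(char after_backslash, char *out) {
    switch (after_backslash) {
    case 't':  *out = '\t'; return true; /* tab */
    case 'n':  *out = '\n'; return true; /* newline */
    case 'r':  *out = '\r'; return true; /* carriage return */
    case '"':  *out = '"';  return true; /* double quote */
    case '\'': *out = '\''; return true; /* single quote */
    case '\\': *out = '\\'; return true; /* backslash */
    case '0':  *out = '\0'; return true; /* null character */
    default:   return false;             /* unknown escape */
    }
}
```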

String examples:

"this is a string"
'this is a string too!'
"this string has \"escaped\"\tcharacters!"

Digits

There are three types of digits:

Decimal digits.
Hexadecimal digits.
Binary digits.

Decimal digits

They consist of:

Any character from 0 to 9 repeated one or more times.
Optionally, a period (.) followed by any character from 0 to 9 repeated one or more times.

Hexadecimal digits

They consist of:

The prefix 0x, which can’t change.
Any character from 0 to 9 and a to f (case insensitive) repeated one or more times.

Binary digits

They consist of:

The prefix 0b, which can’t change.
Either 0 or 1 repeated one or more times.

Digit examples:
5
3.14
0x56F
0b1011
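
The three shapes described above can be checked with a short classifier. This is a sketch under the rules of this section only (lowercase 0x and 0b prefixes, optional fractional part for decimals); num_kind and classify_number are hypothetical names.

```c
#include <stddef.h>

typedef enum { NUM_INVALID, NUM_DECIMAL, NUM_HEX, NUM_BINARY } num_kind; /* hypothetical */

static int is_bin_digit(int c) { return c == '0' || c == '1'; }
static int is_dec_digit(int c) { return c >= '0' && c <= '9'; }
static int is_hex_digit(int c) {
    return is_dec_digit(c) || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}

/* Length of the longest prefix of s whose characters all satisfy pred. */
static size_t run_length(const char *s, int (*pred)(int)) {
    size_t n = 0;
    while (s[n] != '\0' && pred(s[n])) n++;
    return n;
}

/* Classifies a whole string as a decimal, hexadecimal or binary literal. */
static num_kind classify_number(const char *s) {
    if (s[0] == '0' && s[1] == 'x') {
        size_t n = run_length(s + 2, is_hex_digit);
        return (n > 0 && s[2 + n] == '\0') ? NUM_HEX : NUM_INVALID;
    }
    if (s[0] == '0' && s[1] == 'b') {
        size_t n = run_length(s + 2, is_bin_digit);
        return (n > 0 && s[2 + n] == '\0') ? NUM_BINARY : NUM_INVALID;
    }
    size_t whole = run_length(s, is_dec_digit);
    if (whole == 0) return NUM_INVALID;
    if (s[whole] == '\0') return NUM_DECIMAL;       /* no fractional part */
    if (s[whole] != '.') return NUM_INVALID;
    size_t frac = run_length(s + whole + 1, is_dec_digit);
    return (frac > 0 && s[whole + 1 + frac] == '\0') ? NUM_DECIMAL : NUM_INVALID;
}
```

Under these rules a trailing period (3.) or an empty prefix body (0x) is not a valid literal.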

Operators, punctuation

Operators and punctuation can’t be described by a single pattern; instead, there is a static list of the possible operators and punctuation marks.

All operators supported in C²:

+
+=
++
-
-=
--
->
*
*=
/
//
/=
%
^
=
==
!
!=
<
<=
>
>=
&
&&
|
||
?

All punctuation supported in C²:

.
..
...
(
)
{
}
[
]
:
;
,
#

Syntax

C²’s syntax determines how a C squared program must look.

Concrete syntax

A single abstract syntax can correspond to several concrete syntaxes that differ in style. That’s why the concrete syntax has to be specified.

The concrete syntax dictates what a C² program MUST look like syntactically. It is made up of concrete rules; breaking any of them causes a syntax error.

C²’s grammar.

Lexer

The lexer is a program which generates a list of tokens. It is the first stage of C²’s compiler and runs before the parser, so that the parser can work with a list of tokens rather than with the whole source directly. You COULD consider the lexer a “crutch” for the parser, or even a part of it.

Parser

The parser is a program which generates an AST (Abstract Syntax Tree).

The abstract syntax tree lets the compiler read the original source, which was written to be human readable, in a friendlier format: a syntax tree held in memory. Compiling directly from the source text is technically possible, but it would be inefficient and overly complex.

Grammar

Also known as concrete syntax.

| Usage | Notation |
|---|---|
| definition | `=` |
| concatenation | `,` |
| termination | `;` |
| alternation | `\|` |
| optional | `[ … ]` |
| repetition | `{ … }` |
| grouping | `( … )` |
| terminal string | `" … "` |
| terminal string | `' … '` |
| comment | `(* … *)` |
| special sequence | `? … ?` |
| exception | `-` |

Note

Table from this GitHub gist.

Warning

This grammar might be outdated compared to the compiler.

Program = { Ignorable | Declaration | Statement } ;

Ignorable = WS | Comment ;

Comment = "//", { anychar - "\n" } ;

(* ===== Lexical ===== *)

boolean = "true" | "false" ;

ident = ( letter, { letter | digit | "_" } ) - reserved ;

number = digit, { digit }, [ ".", digit, { digit } ] ;

string =
      '"', { ( anychar - '"' ) | interpolated }, '"'
    | "'", { ( anychar - "'" ) | interpolated }, "'" ;

interpolated = "$", "{", { anychar - "}" }, "}" ;

tag = "@", string ;

letter =
      "a" | "b" | "c" | "d" | "e" | "f" | "g"
    | "h" | "i" | "j" | "k" | "l" | "m" | "n"
    | "o" | "p" | "q" | "r" | "s" | "t" | "u"
    | "v" | "w" | "x" | "y" | "z"
    | "A" | "B" | "C" | "D" | "E" | "F" | "G"
    | "H" | "I" | "J" | "K" | "L" | "M" | "N"
    | "O" | "P" | "Q" | "R" | "S" | "T" | "U"
    | "V" | "W" | "X" | "Y" | "Z" ;

digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

WS = " " | "\t" | "\n" | "\r" ;

anychar = ? any character ? ;

bit_depth = "8" | "16" | "32" | "64" ;

(* ===== Literals ===== *)

Literal =
      number
    | string
    | boolean
    | tag
    | ArrayLiteral
    | MapLiteral ;

ArrayLiteral = "[", [ Expr, { ",", Expr } ], "]" ;

KeyValue = "[", Expr, "]", ":", Expr ;

MapLiteral = "{", [ KeyValue, { ",", KeyValue } ], "}" ;

(* ===== Types ===== *)

Type =
    ["const"], ident - "void"
    | ArrayType
    | MapType
    | FunctionType ;

ArrayType = "[", Type, "]" ;

MapType = "{", Type, ",", Type, "}" ;

FunctionType = "(", [ Type, { ",", Type } ], ")", Type ;

TypeAnnot = ":", Type ;

(* ===== Declarations ===== *)

Declaration =
      Variable
    | func
    | Structure
    | Enumeration ;

Variable =
    Type,
    ident,
    "=",
    Expr ;

func =
    { Decorator },
    "func",
    ident,
    "(",
        [ Parameter, [ { ",", Parameter } ] ],
    ")",
    "->", Type,
    "{", Block, "}" ;

Decorator = "#", ident ;

Parameter =
      "self"
    | "...", ident, TypeAnnot
    | ident, TypeAnnot, [ "=", Expr ] ;

StructParameter = ident, TypeAnnot, [ "=", Expr ] ;

Structure =
    "struct",
    ident,
    [ "(", StructParameter, { ",", StructParameter } , ")" ],
    "{",
        { StructureField | func }, ";",
    "}" ;

StructureField =
    [ "private" ],
    ident,
    [ TypeAnnot ],
    "=",
    Expr ;

Enumeration =
    "enum",
    ident,
    "{",
        ident, { ",", ident },
    "}" ;

(* ===== Statements ===== *)

Statement =
      If
    | Switch
    | Defer
    | Return
    | Throw
    | Try
    | Loop
    | Declaration
    | Expr ;

If =
    "if", "(", Expr, ")", "{", Block, "}",
    { "else", "if", "(", Expr, ")", "{", Block, "}" },
    [ "else", "{", Block, "}" ] ;

Switch =
    "switch", "(", Expr, ")", "{",
        { Case },
        [ "default", "{", Block, "}" ],
    "}" ;

Case =
    "case",
    "(",
        Expr, { ",", Expr },
    ")",
    "{", Block, "}" ;

Defer =
      "defer", Expr
    | "defer", "{", Block, "}" ;

Return = "return", [ Expr ] ;

Throw = "throw", Expr ;

Try =
    "try", "{", Block, "}",
    "catch", "(", ident, ")", "{", Block, "}" ;

Loop =
      For
    | While
    | Until
    | Repeat ;

For =
    "for",
    "(",
        ident, { ",", ident },
        "in",
        Expr,
    ")",
    "{", Block, "}" ;

While =
    "while",
    [ "(", Expr, ")" ],
    "{", Block, "}" ;

Until =
    "until",
    "(", Expr, ")",
    "{", Block, "}" ;

Repeat =
    "repeat",
    Expr,
    "{", Block, "}" ;

Block = { Ignorable | Statement, ( ";" | "}" ) } ;

(* ===== Expressions ===== *)

Expr = Assignment ;

Assignment =
    LogicOr,
    [ ( "=" | "+=" | "-=" | "*=" | "/=" ), Assignment ] ;

LogicOr =
    LogicAnd, { "or", LogicAnd } ;

LogicAnd =
    Equality, { "and", Equality } ;

Equality =
    Relational, { ( "==" | "!=" ), Relational } ;

Relational =
    Additive, { ( "<" | "<=" | ">" | ">=" ), Additive } ;

Additive =
    Multiplicative, { ( "+" | "-" ), Multiplicative } ;

Multiplicative =
    Exponent, { ( "*" | "/" | "%" ), Exponent } ;

Exponent =
    Unary, [ "^", Exponent ] ;


Unary =
      ( "-" | "#" ), Unary
    | Range ;

Primary =
      Literal
    | ident
    | "(", Expr, ")"
    | FunctionCall
    | AnonFunction
    | New
    | Spawn ;

Postfix =
    Primary,
    { ".", ident
    | "[", Expr, "]"
    },
    [ "++" | "--" ] ;

Range =
    Postfix, [ "..", Postfix ] ;

(* Reserved keywords *)
func = "func" ;
if = "if" ;
else = "else" ;
switch = "switch" ;
case = "case" ;
default = "default" ;
while = "while" ;
for = "for" ;
return = "return" ;
throw = "throw" ;
struct = "struct" ;
enum = "enum" ;
const = "const" ;
let = "let" ;
in = "in" ;
import = "import" ;
new = "new" ;
repeat = "repeat" ;
until = "until" ;
defer = "defer" ;
try = "try" ;
catch = "catch" ;
spawn = "spawn" ;
private = "private" ;
self = "self" ;
or = "or" ;
and = "and" ;
reserved =  
        if | else | switch | case | default | while | for 
        | return  | throw | struct | enum | const | let 
        | in | import | new | repeat | until | defer | try 
        | catch | spawn | private | self | or | and ;

Lexer

The lexer has the job of turning your source into a list of tokens.

Given this source:

func main() -> int
{
  puts("Hello, World!")
  return 0
}

The lexer will generate this token list:

format: TOKEN_TYPE, LEXEME

TOKEN_KEYWORD_FUNC, "func";
TOKEN_IDENTIFIER, "main";
TOKEN_OPEN_PAREN, "(";
TOKEN_CLOSE_PAREN, ")";
TOKEN_ARROW, "->";
TOKEN_KEYWORD_INT, "int";
TOKEN_OPEN_BRACE, "{";
TOKEN_IDENTIFIER, "puts";
TOKEN_OPEN_PAREN, "(";
TOKEN_STRING, "Hello, World!";
TOKEN_CLOSE_PAREN, ")";
TOKEN_KEYWORD_RETURN, "return";
TOKEN_NUMBER, "0";
TOKEN_CLOSE_BRACE, "}";

This output will then be fed to the parser.

The `csq_lexer` type

The csq_lexer type represents the lexer internally. It serves as storage for the lexer state and for position tracking (line and column, for error reports).

It contains:

buffer: entire source file buffer.
start: start of the current token being scanned.
current: current position in buffer.
line: current line number (1-indexed).
column: current column number (1-indexed).
path: path to source file for error messages.
diag: diagnostic reporter for error handling.

Full definition:

typedef struct csq_lexer {
   const char* buffer;
   const char* start;
   const char* current;
   size_t line;
   size_t column;
   const char* path;
   DiagReporter* diag;
} csq_lexer;
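
To show how the line and column fields might be kept in sync with current, here is a minimal sketch. The helper lexer_advance is hypothetical (the compiler's real function may differ), and DiagReporter is left as an opaque type because its definition is not part of this section.

```c
#include <stddef.h>

typedef struct DiagReporter DiagReporter; /* opaque here; defined elsewhere */

typedef struct csq_lexer {
    const char *buffer;
    const char *start;
    const char *current;
    size_t line;
    size_t column;
    const char *path;
    DiagReporter *diag;
} csq_lexer;

/* Hypothetical helper: consumes one character and updates the 1-indexed
 * line and column. Never steps past the terminating '\0'. */
static char lexer_advance(csq_lexer *lexer) {
    char c = *lexer->current;
    if (c == '\0')
        return c;              /* end of buffer: nothing to consume */
    lexer->current++;
    if (c == '\n') {
        lexer->line++;         /* next character is on a new line... */
        lexer->column = 1;     /* ...at its first column */
    } else {
        lexer->column++;
    }
    return c;
}
```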

Main lexing process

To lex, C² first creates the lexer using the lexer_create function.

C² lexes with a state table; each token is produced by a call to lexer_next.

lexer_next:

Skips any whitespace.
Sets lexer->start to lexer->current.
If the end of the buffer was reached, returns an EOF (end of file) token.
Gets the current lex state. The lex state is of type csq_lexstate, defined as csq_token (*csq_lexstate)(csq_lexer*): a function pointer to a lexer state handler.
Returns state(lexer), as state is a function pointer.
- If there is no state (the state is NULL), an error token is returned instead.

state(lexer) then returns a token. The state is retrieved through the get_lex_state function, which checks the state table and returns the lexer state handler that matches the lexer’s current character.

For example, if the character is a digit (0 to 9), then lex_number, the handler for TOKEN_NUMBER tokens, is returned.
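
The dispatch described in this section can be sketched with a reduced, self-contained model. Every name below (token, lexer, lexstate, table, next_token and so on) is a stand-in chosen for the illustration; the real csq_lexer, csq_token and lexer_next carry more state and may be structured differently.

```c
#include <stddef.h>

typedef enum { TOK_EOF, TOK_ERROR, TOK_NUMBER } tok_type;   /* reduced set */
typedef struct { tok_type type; const char *start; size_t length; } token;
typedef struct { const char *start; const char *current; } lexer;
typedef token (*lexstate)(lexer *);      /* function pointer to a handler */

static lexstate table[256]; /* zero-initialized: unknown characters are NULL */

static token make_token(lexer *lx, tok_type t) {
    token tk = { t, lx->start, (size_t)(lx->current - lx->start) };
    return tk;
}

static token lex_number_sketch(lexer *lx) {
    while (*lx->current >= '0' && *lx->current <= '9') lx->current++;
    return make_token(lx, TOK_NUMBER);
}

static void init_table(void) {
    for (int c = '0'; c <= '9'; c++) table[c] = lex_number_sketch;
}

static token next_token(lexer *lx) {
    while (*lx->current == ' ' || *lx->current == '\t' ||
           *lx->current == '\n' || *lx->current == '\r')
        lx->current++;                           /* skip whitespace */
    lx->start = lx->current;                     /* mark the token start */
    if (*lx->current == '\0')
        return make_token(lx, TOK_EOF);          /* end of buffer */
    lexstate state = table[(unsigned char)*lx->current]; /* look up handler */
    if (state == NULL) {
        lx->current++;                           /* consume the bad character */
        return make_token(lx, TOK_ERROR);        /* no handler: error token */
    }
    return state(lx);                            /* run the handler */
}
```

Indexing the table with an unsigned char keeps the lookup in range for every possible byte, which is why the cast appears before the lookup.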

Lexer state handlers

There are 6 lexer state handlers: lex_whitespace, lex_identifier, lex_number, lex_string, lex_tag and lex_operator.

lex_whitespace just advances to the next token.

lex_identifier first advances while the current character is alphanumeric (a-z, A-Z or 0-9) or an underscore. It then determines the token type through the check_keyword function, which returns either a TOKEN_KEYWORD_* type or, if the identifier isn’t a keyword, TOKEN_IDENTIFIER. Finally, it checks that the identifier is well formed and, if everything is valid, returns a new token of that type.
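
A keyword lookup of this kind can be sketched as a linear search over a table. The real check_keyword signature is not given in this document, so the function below uses a different, hypothetical name (classify_word), a reduced keyword set and stand-in enum values.

```c
#include <stddef.h>
#include <string.h>

typedef enum {          /* stand-ins for TOKEN_IDENTIFIER / TOKEN_KEYWORD_* */
    KIND_IDENTIFIER,
    KIND_KEYWORD_IF,
    KIND_KEYWORD_FUNC,
    KIND_KEYWORD_RETURN
} kind;

static const struct { const char *text; kind k; } keywords[] = {
    { "if",     KIND_KEYWORD_IF },
    { "func",   KIND_KEYWORD_FUNC },
    { "return", KIND_KEYWORD_RETURN },
};

/* The lexeme is given as start + length because tokens point into the
 * source buffer and are not NUL-terminated. */
static kind classify_word(const char *start, size_t length) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strlen(keywords[i].text) == length &&
            memcmp(keywords[i].text, start, length) == 0)
            return keywords[i].k;
    }
    return KIND_IDENTIFIER; /* not a keyword */
}
```

Comparing lengths before comparing bytes ensures that a prefix match such as iffy is not mistaken for the keyword if.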

lex_number first determines the base of the number: the default is 10, but a prefix (0x/0X, 0b/0B, 0o/0O) changes it to 16, 2 or 8, respectively. After determining the base, it lexes the digits, checks that the number is well formed and finally returns a new TOKEN_NUMBER token.
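
The base detection step might look roughly like this. detect_base is a hypothetical name and the sketch covers only the prefix check, not the digit scanning or the format validation that follow it.

```c
#include <stddef.h>

/* Hypothetical helper: returns the numeric base implied by the start of s
 * and stores the number of prefix characters to skip in *prefix_len. */
static int detect_base(const char *s, size_t *prefix_len) {
    *prefix_len = 0;
    if (s[0] != '0')
        return 10;                                   /* no prefix possible */
    switch (s[1]) {
    case 'x': case 'X': *prefix_len = 2; return 16;  /* hexadecimal */
    case 'b': case 'B': *prefix_len = 2; return 2;   /* binary */
    case 'o': case 'O': *prefix_len = 2; return 8;   /* octal */
    default:            return 10;                   /* plain decimal */
    }
}
```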

lex_string first sets the variable quote to either a double or single quote, depending on how the string starts. It then loops while the current character isn’t the quote and the end of the buffer hasn’t been reached, adding characters to the string. For each character, it first checks for an escape sequence; if it doesn’t find one, it simply adds the current character. Finally, it returns a new TOKEN_STRING token.

lex_tag is very similar to lex_string, but it will first check for the @ prefix.

lex_operator uses a switch statement on the first character. If two- or three-character operators or punctuation marks start with that character, it then checks whether the following characters continue the match. Finally, it returns a new TOKEN_* token (every operator and punctuation mark has its own csq_tktype).
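
As a sketch of this "check whether it continues" logic, the following handles the three punctuation marks that start with a period. The enum values and the function match_dots are hypothetical and stand in for the corresponding TOKEN_* types.

```c
#include <stddef.h>

typedef enum { P_DOT, P_DOUBLE_DOT, P_TRIPLE_DOT } dot_kind; /* stand-ins */

/* s must point at a '.' character. Picks the longest of ".", ".." and
 * "..." that matches and stores its length in *length. */
static dot_kind match_dots(const char *s, size_t *length) {
    if (s[1] == '.') {
        if (s[2] == '.') {
            *length = 3;
            return P_TRIPLE_DOT;   /* "..." */
        }
        *length = 2;
        return P_DOUBLE_DOT;       /* ".." */
    }
    *length = 1;
    return P_DOT;                  /* "." */
}
```

Checking the longer forms first is what prevents ... from being split into three separate . tokens.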

The lex state table

This table is used to determine what handler is used for the current token.

void initialize_state_table(void) {
  state_table[' '] = lex_whitespace;
  state_table['\t'] = lex_whitespace;
  state_table['\n'] = lex_whitespace;
  state_table['\r'] = lex_whitespace;
  state_table['\v'] = lex_whitespace;
  state_table['\f'] = lex_whitespace;

  for (char c = 'a'; c <= 'z'; c++)
    state_table[(unsigned char)c] = lex_identifier;

  for (char c = 'A'; c <= 'Z'; c++)
    state_table[(unsigned char)c] = lex_identifier;

  state_table['_'] = lex_identifier;

  for (char c = '0'; c <= '9'; c++)
    state_table[(unsigned char)c] = lex_number;

  state_table['"'] = lex_string;
  state_table['\''] = lex_string;

  state_table['@'] = lex_tag;

  state_table['+'] = lex_operator;
  state_table['-'] = lex_operator;
  state_table['*'] = lex_operator;
  state_table['/'] = lex_operator;
  state_table['%'] = lex_operator;
  state_table['^'] = lex_operator;
  state_table['='] = lex_operator;
  state_table['!'] = lex_operator;
  state_table['<'] = lex_operator;
  state_table['>'] = lex_operator;
  state_table['&'] = lex_operator;
  state_table['|'] = lex_operator;
  state_table['.'] = lex_operator;
  state_table['('] = lex_operator;
  state_table[')'] = lex_operator;
  state_table['{'] = lex_operator;
  state_table['}'] = lex_operator;
  state_table['['] = lex_operator;
  state_table[']'] = lex_operator;
  state_table[':'] = lex_operator;
  state_table[';'] = lex_operator;
  state_table[','] = lex_operator;
  state_table['#'] = lex_operator;
  state_table['?'] = lex_operator;

  state_table['\0'] = NULL;
}

Tokens

The `csq_token` type

The csq_token type represents a token generated by the lexer.

It contains:

type: the token’s type.
start: the start of the token in source.
length: the length of the token.
line: line where the token is found (1-indexed).
column: column where the token starts (1-indexed).

Full definition:

typedef struct {
 csq_tktype type;
 const char *start;
 size_t length;
 size_t line;
 size_t column;
} csq_token;
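
Because a token stores a pointer into the source plus a length, its lexeme is not NUL-terminated. The sketch below shows one way to copy it into a buffer; token_view mirrors the fields of csq_token with a plain int for the type, and token_lexeme is a hypothetical helper.

```c
#include <stddef.h>
#include <string.h>

typedef struct {            /* mirrors csq_token; type simplified to int */
    int type;
    const char *start;
    size_t length;
    size_t line;
    size_t column;
} token_view;

/* Copies the lexeme into out (capacity cap, terminator included) and
 * returns the number of characters copied. Truncates if out is too small. */
static size_t token_lexeme(const token_view *tk, char *out, size_t cap) {
    if (cap == 0)
        return 0;
    size_t n = tk->length < cap - 1 ? tk->length : cap - 1;
    memcpy(out, tk->start, n);
    out[n] = '\0';
    return n;
}
```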

The `csq_tktype` enum

This enum contains every kind of token the lexer can generate.

This is the full definition:

typedef enum {
  TOKEN_EOF,
  TOKEN_ERROR,
  TOKEN_IDENTIFIER,
  TOKEN_NUMBER,
  TOKEN_STRING,
  TOKEN_TAG,
  TOKEN_BOOLEAN,
  TOKEN_OPERATOR,
  TOKEN_PLUS,
  TOKEN_MINUS,
  TOKEN_STAR,
  TOKEN_SLASH,
  TOKEN_PERCENT,
  TOKEN_CARET,
  TOKEN_AMPERSAND,
  TOKEN_PIPE,
  TOKEN_BANG,
  TOKEN_ASSIGN,
  TOKEN_PLUS_ASSIGN,
  TOKEN_MINUS_ASSIGN,
  TOKEN_STAR_ASSIGN,
  TOKEN_SLASH_ASSIGN,
  TOKEN_EQUAL,
  TOKEN_NOT_EQUAL,
  TOKEN_LESS,
  TOKEN_GREATER,
  TOKEN_LESS_EQUAL,
  TOKEN_GREATER_EQUAL,
  TOKEN_LOGICAL_AND,
  TOKEN_LOGICAL_OR,
  TOKEN_INCREMENT,
  TOKEN_DECREMENT,
  TOKEN_DOUBLE_DOT,
  TOKEN_TRIPLE_DOT,
  TOKEN_RANGE,
  TOKEN_ARROW,
  TOKEN_OPEN_PAREN,
  TOKEN_CLOSE_PAREN,
  TOKEN_OPEN_BRACE,
  TOKEN_CLOSE_BRACE,
  TOKEN_OPEN_BRACKET,
  TOKEN_CLOSE_BRACKET,
  TOKEN_COLON,
  TOKEN_SEMICOLON,
  TOKEN_COMMA,
  TOKEN_DOT,
  TOKEN_HASH,
  TOKEN_AT,
  TOKEN_QUESTION_MARK,
  TOKEN_KEYWORD_FUNCTION,
  TOKEN_KEYWORD_FUNC,
  TOKEN_KEYWORD_IF,
  TOKEN_KEYWORD_ELSE,
  TOKEN_KEYWORD_SWITCH,
  TOKEN_KEYWORD_CASE,
  TOKEN_KEYWORD_DEFAULT,
  TOKEN_KEYWORD_WHILE,
  TOKEN_KEYWORD_FOR,
  TOKEN_KEYWORD_IN,
  TOKEN_KEYWORD_RETURN,
  TOKEN_KEYWORD_THROW,
  TOKEN_KEYWORD_STRUCT,
  TOKEN_KEYWORD_ENUM,
  TOKEN_KEYWORD_IMPORT,
  TOKEN_KEYWORD_NEW,
  TOKEN_KEYWORD_REPEAT,
  TOKEN_KEYWORD_UNTIL,
  TOKEN_KEYWORD_DEFER,
  TOKEN_KEYWORD_TRY,
  TOKEN_KEYWORD_CATCH,
  TOKEN_KEYWORD_SPAWN,
  TOKEN_KEYWORD_PRIVATE,
  TOKEN_KEYWORD_SELF,
  TOKEN_KEYWORD_OR,
  TOKEN_KEYWORD_AND,
  TOKEN_KEYWORD_TRUE,
  TOKEN_KEYWORD_FALSE,
  TOKEN_KEYWORD_BOOL,
  TOKEN_KEYWORD_INT,
  TOKEN_KEYWORD_STRING,
  TOKEN_KEYWORD_FLOAT,
  TOKEN_KEYWORD_BREAK,
  TOKEN_KEYWORD_CONTINUE,
  TOKEN_KEYWORD_OPTIONAL,
  TOKEN_KEYWORD_NIL,
} csq_tktype;
