Kawa: Lexical syntax

The lexical syntax determines how a character sequence is split into a sequence of lexemes, omitting non-significant portions such as comments and whitespace. The character sequence is assumed to be text according to the Unicode standard. Some lexemes of the lexical syntax, such as identifiers, representations of number objects, and strings, are syntactic data in the datum syntax and thus represent objects. Besides the formal account of the syntax, this section also describes what datum values these syntactic data represent.

The lexical syntax, in the description of comments, contains a forward reference to datum, which is described as part of the datum syntax. Being comments, however, these datums do not play a significant role in the syntax.

Case is significant except in representations of booleans, number objects, and in hexadecimal numbers specifying Unicode scalar values. For example, #x1A and #X1a are equivalent. The identifier Foo is, however, distinct from the identifier FOO.
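
The case rules above can be checked directly at the REPL. This is a minimal sketch; the variable names same-number? and same-symbol? are chosen here for illustration and are not part of Kawa.

```scheme
;; Number representations ignore case, both in the radix prefix
;; and in the hexadecimal digits.
(define same-number? (= #x1A #X1a))

;; Identifiers, and therefore the symbols they denote, are
;; case-sensitive by default.
(define same-symbol? (eq? 'Foo 'FOO))

(display same-number?) (newline)   ; prints #t
(display same-symbol?) (newline)   ; prints #f
```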

Formal account

Interlexeme-space may occur on either side of any lexeme, but not within a lexeme.

Identifiers, ., numbers, characters, and booleans must be terminated by a delimiter or by the end of the input.

lexeme ::= identifier | boolean | number
         | character | string
         | ( |  ) |  [ |  ] |  #(
         | ' | ` | , | ,@ | .
         | #' |  #` |  #, |  #,@
delimiter ::= ( |  ) |  [ | ] | " | ; | #
         | whitespace

((UNFINISHED))

Line endings

Line endings are significant in Scheme in single-line comments and within string literals. In Scheme source code, any of the line endings in line-ending marks the end of a line. Moreover, the two-character line endings carriage-return linefeed and carriage-return next-line each count as a single line ending.

In a string literal, a line-ending not preceded by a \ stands for a linefeed character, which is the standard line-ending character of Scheme.
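
As a small illustration of the rule for string literals, the sketch below defines a string that spans two source lines. The name two-lines is made up for this example. According to the rule above, the result should be the same whether the source file uses linefeed or carriage-return linefeed line endings.

```scheme
;; The string literal below spans two source lines; the line ending
;; between "a" and "b" is read as a single linefeed character.
(define two-lines "a
b")

(display (string-length two-lines))                     ; prints 3
(newline)
(display (char=? (string-ref two-lines 1) #\newline))   ; prints #t
(newline)
```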

Whitespace and comments

intraline-whitespace ::= space | character-tabulation
whitespace ::=  intraline-whitespace
         | linefeed | line-tabulation | form-feed
         | carriage-return | next-line
         | any character whose category is Zs, Zl, or Zp
line-ending ::= linefeed | carriage-return
         | carriage-return linefeed | next-line
         | carriage-return next-line | line-separator
comment ::=  ; all subsequent characters up to a line-ending
                or paragraph-separator
         | nested-comment
         | #; interlexeme-space datum
         | shebang-comment
nested-comment ::=  #| comment-text comment-cont* |#
comment-text ::= character sequence not containing #| or |#
comment-cont ::= nested-comment comment-text
atmosphere ::= whitespace | comment
interlexeme-space ::= atmosphere*

As a special case, the characters #!/ are treated as starting a comment, but only at the beginning of a file. These characters are used on Unix systems as a shebang interpreter directive. The Kawa reader skips the entire line. If the last non-whitespace character on that line is \ (backslash), then the following line is also skipped, and so on.

shebang-comment ::= #! absolute-filename text up to non-escaped line-ending

Whitespace characters are spaces, linefeeds, carriage returns, character tabulations, form feeds, line tabulations, and any other character whose category is Zs, Zl, or Zp. Whitespace is used for improved readability and as necessary to separate lexemes from each other. Whitespace may occur between any two lexemes, but not within a lexeme. Whitespace may also occur inside a string, where it is significant.

The lexical syntax includes several comment forms. In all cases, comments are invisible to Scheme, except that they act as delimiters, so, for example, a comment cannot appear in the middle of an identifier or representation of a number object.

A semicolon (;) indicates the start of a line comment. The comment continues to the end of the line on which the semicolon appears.

Another way to indicate a comment is to prefix a datum with #;, possibly with interlexeme-space before the datum. The comment consists of the comment prefix #; and the datum together. This notation is useful for “commenting out” sections of code.

Block comments may be indicated with properly nested #| and |# pairs.

#|
   The FACT procedure computes the factorial of a
   non-negative integer.
|#
(define fact
  (lambda (n)
    ;; base case
    (if (= n 0)
        #;(= n 1)
        1       ; identity of *
        (* n (fact (- n 1))))))
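
Block comments nest, and a datum comment removes exactly one datum. The following sketch shows both; the variable total is introduced only for this example.

```scheme
#| outer comment
   #| inner comment |#
   this line is still part of the outer comment
|#
;; The datum (+ 10 20) is skipped, so the sum has only two operands.
(define total (+ 1 #;(+ 10 20) 2))

(display total)   ; prints 3
(newline)
```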

Identifiers

identifier ::= initial subsequent*
         | peculiar-identifier
initial ::= constituent | special-initial
         | inline-hex-escape
letter ::= a | b | c | ... | z
         | A | B | C | ... | Z
constituent ::= letter
         | any character whose Unicode scalar value is greater than
             127, and whose category is Lu, Ll, Lt, Lm, Lo, Mn,
             Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co
special-initial ::= ! | $ | % | & | * | / | < | =
         | > | ? | ^ | _ | ~
subsequent ::= initial | digit
         | any character whose category is Nd, Mc, or Me
         | special-subsequent
digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
oct-digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
hex-digit ::= digit
         | a | A | b | B | c | C | d | D | e | E | f | F
special-subsequent ::= + | - | . | @
escape-sequence ::= inline-hex-escape
         | \character-except-x
         | multi-escape-sequence
inline-hex-escape ::= \xhex-scalar-value;
hex-scalar-value ::= hex-digit+
multi-escape-sequence ::= |symbol-element*|
symbol-element ::=  any character except | or \
         | inline-hex-escape | mnemonic-escape | \|

character-except-x ::= any character except x
peculiar-identifier ::= + | - | ... | -> subsequent*

Most identifiers allowed by other programming languages are also acceptable to Scheme. In general, a sequence of letters, digits, and “extended alphabetic characters” is an identifier when it begins with a character that cannot begin a representation of a number object. In addition, +, -, and ... are identifiers, as is a sequence of letters, digits, and extended alphabetic characters that begins with the two-character sequence ->. Here are some examples of identifiers:

lambda         q                soup
list->vector   +                V17a
<=             a34kTMNs         ->-
the-word-recursion-has-many-meanings

Extended alphabetic characters may be used within identifiers as if they were letters. The following are extended alphabetic characters:

! $ % & * + - . / < = > ? @ ^ _ ~

Moreover, all characters whose Unicode scalar values are greater than 127 and whose Unicode category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co can be used within identifiers. In addition, any character can be used within an identifier when specified using an escape-sequence. For example, the identifier H\x65;llo is the same as the identifier Hello.

Kawa supports two additional non-R6RS ways of making identifiers using special characters, both taken from Common Lisp: any character (except x) following a backslash is treated as if it were a letter, as is any character between a pair of vertical bars.
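
The three escape mechanisms can be compared side by side. This is a sketch based on the descriptions above; the names hex-same?, bar-same?, spaced, and escaped are hypothetical. The last line shows no expected output because the result of the backslash form is assumed rather than verified here.

```scheme
;; Inline hex escape: \x65; denotes the character e.
(define hex-same? (eq? 'H\x65;llo 'Hello))

;; Vertical bars: every character between them belongs to the name.
(define bar-same? (eq? '|Hello| 'Hello))
(define spaced (symbol->string '|two words|))

;; Backslash: the following character is treated as if it were a letter,
;; so the space below is assumed to become part of the identifier.
(define escaped (symbol->string 'two\ words))

(display hex-same?) (newline)   ; prints #t
(display bar-same?) (newline)   ; prints #t
(display spaced) (newline)      ; prints two words
(display (string=? spaced escaped)) (newline)
```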

Identifiers have two uses within Scheme programs:

Any identifier may be used as a variable or as a syntactic keyword.
When an identifier appears as a literal or within a literal, it is being used to denote a symbol.

In contrast with older versions of Scheme, the syntax distinguishes between upper and lower case in identifiers and in characters specified via their names, but not in numbers, nor in inline hex escapes used in the syntax of identifiers, characters, or strings. The following directives give explicit control over case folding.

Syntax: #!fold-case

Syntax: #!no-fold-case

These directives may appear anywhere comments are permitted and are treated as comments, except that they affect the reading of subsequent data. The #!fold-case directive causes the read procedure to case-fold (as if by string-foldcase) each identifier and character name subsequently read from the same port. The #!no-fold-case directive causes the read procedure to return to the default, non-folding behavior.
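
A short sketch of the two directives in a source file follows. The variable names are illustrative only, and the expected results follow from the description above rather than from a tested Kawa session.

```scheme
#!fold-case
;; From here on, identifiers are case-folded as they are read,
;; so Foo and FOO are both read as foo.
(define folded-same? (eq? 'Foo 'FOO))

#!no-fold-case
;; Back to the default: case is significant again.
(define unfolded-same? (eq? 'Foo 'FOO))

(display folded-same?) (newline)     ; prints #t
(display unfolded-same?) (newline)   ; prints #f
```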

Note that colon : is treated specially for colon notation in Kawa Scheme, though it is a special-initial in standard Scheme (R6RS).

Numbers

((INCOMPLETE))

number ::= ((TODO))
  | quantity
decimal ::= digit+ optional-exponent
  | . digit+ optional-exponent
  | digit+ . digit+ optional-exponent

optional-exponent ::= empty
  | exponent-marker optional-sign digit+
exponent-marker ::= e | s | f | d | l

The letter used for the exponent in a floating-point literal determines its type:

e: Returns a gnu.math.DFloat - for example 12e2. Note this matches the default when there is no exponent-marker.
s or f: Returns a primitive float (or java.lang.Float when boxed as an object) - for example 12s2 or 12f2.
d: Returns a primitive double (or java.lang.Double when boxed) - for example 12d2.
l: Returns a java.math.BigDecimal - for example 12l2.
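
One way to observe the effect of the exponent marker is to print the runtime class of each literal. This is a sketch: class-name-of is a hypothetical helper defined here, built on Kawa's invoke procedure, and passing a literal to it boxes any primitive value. No expected output is shown; the classes printed should correspond to the list above.

```scheme
;; Hypothetical helper: return the Java class name of a value.
;; Primitive values are boxed when passed as an argument.
(define (class-name-of x)
  (invoke (invoke x 'getClass) 'getName))

(display (class-name-of 12e2)) (newline)   ; exponent marker e
(display (class-name-of 12s2)) (newline)   ; exponent marker s
(display (class-name-of 12f2)) (newline)   ; exponent marker f
(display (class-name-of 12d2)) (newline)   ; exponent marker d
(display (class-name-of 12l2)) (newline)   ; exponent marker l
```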

optional-sign ::= empty | + | -
digit-2 ::= 0 | 1
digit-8 ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
digit-10 ::= digit
digit-16 ::= digit-10 | a | b | c | d | e | f