Lexer Error Format

This document describes the error format returned by the Cure lexer, including location information for all errors.

Error Format

All lexer errors are returned in the following format:

{error, {Reason, Line, Column}}

Where:

  • Reason - Error reason (atom or tuple describing the error)
  • Line - Line number where error occurred (1-based integer)
  • Column - Column number where error occurred (1-based integer)
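
Because every error shares this shape, a caller can handle all of them with one pattern match. The following is a minimal sketch of that idea; the module name `lexer_error_sketch` and the function `describe/1` are hypothetical helpers and not part of `cure_lexer`.

```erlang
-module(lexer_error_sketch).
-export([describe/1]).

%% Hypothetical helper, not part of cure_lexer: turns the documented
%% {error, {Reason, Line, Column}} tuple into a "Line:Column: reason" binary.
describe({error, {Reason, Line, Column}})
  when is_integer(Line), is_integer(Column) ->
    Text = io_lib:format("~B:~B: ~ts", [Line, Column, reason_text(Reason)]),
    unicode:characters_to_binary(Text).

%% A reason is either a bare atom or a {Tag, Detail} tuple.
reason_text({Tag, Detail}) when is_atom(Tag) ->
    io_lib:format("~s (~p)", [Tag, Detail]);
reason_text(Tag) when is_atom(Tag) ->
    atom_to_list(Tag).
```

For instance, `lexer_error_sketch:describe({error, {unterminated_string, 1, 13}})` would return `<<"1:13: unterminated_string">>`.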

Error Types

Unexpected Character

When an unrecognized character is encountered:

{error, {{unexpected_character, CodePoint}, Line, Column}}
  • CodePoint - Unicode codepoint of the unexpected character (integer)

Example:

cure_lexer:tokenize(<<"test … fail"/utf8>>).
% => {error, {{unexpected_character, 8230}, 1, 6}}
% 8230 = U+2026 (horizontal ellipsis)

Invalid UTF-8

When malformed UTF-8 is encountered:

{error, {{invalid_utf8, FirstByte}, Line, Column}}
  • FirstByte - The first byte of the invalid sequence
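
One way a decoder can arrive at `FirstByte` is to try a `utf8` binary match and fall back to a plain byte match when it fails. This is an illustrative sketch only; the module `utf8_step_sketch` and its `next/1` function are assumptions, and the real lexer may decode differently.

```erlang
-module(utf8_step_sketch).
-export([next/1]).

%% Illustrative only. Returns {ok, CodePoint, Rest}, eof,
%% or {invalid_utf8, FirstByte} for a malformed sequence.
next(<<CodePoint/utf8, Rest/binary>>) ->
    {ok, CodePoint, Rest};
next(<<>>) ->
    eof;
next(<<FirstByte, _/binary>>) ->
    %% The utf8 match above failed, so the sequence starting at
    %% this byte is malformed or truncated.
    {invalid_utf8, FirstByte}.
```

A truncated sequence such as `<<226, 128>>` (the first two bytes of U+2026) falls through to the last clause and reports byte 226.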

Unterminated String

When a string literal is not closed:

{error, {unterminated_string, Line, Column}}

Example:

cure_lexer:tokenize(<<"x = \"hello">>).
% => {error, {unterminated_string, 1, 13}}

Unterminated Quoted Atom

When a quoted atom is not closed:

{error, {unterminated_quoted_atom, Line, Column}}

Unterminated Interpolation

When string interpolation #{...} is not properly closed:

{error, {unterminated_interpolation, Line, Column}}

Unterminated Charlist

When a charlist literal (using Unicode quotes) is not closed:

{error, {unterminated_charlist, Line, Column}}

Location Tracking

Line Numbers

  • 1-based indexing
  • Incremented on every newline character (\n)
  • Reset column to 1 after newline

Column Numbers

  • 1-based indexing
  • Incremented for each character processed
  • Multi-byte UTF-8 characters count as 1 column
  • Reset to 1 at the start of each line
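
The line and column rules above can be sketched as a small scan over a UTF-8 binary. The module `position_sketch` and its `locate/2` function are hypothetical and only illustrate the counting rules; they are not the Cure lexer's implementation, and the sketch assumes the input is valid UTF-8.

```erlang
-module(position_sketch).
-export([locate/2]).

%% Illustrative only: returns the 1-based {Line, Column} of the first
%% code point for which Pred returns true, or not_found.
%% Assumes Bin is valid UTF-8.
locate(Bin, Pred) when is_binary(Bin), is_function(Pred, 1) ->
    locate(Bin, Pred, 1, 1).

locate(<<>>, _Pred, _Line, _Col) ->
    not_found;
locate(<<C/utf8, Rest/binary>>, Pred, Line, Col) ->
    case Pred(C) of
        true ->
            {Line, Col};
        false when C =:= $\n ->
            %% Newline: next line, column resets to 1.
            locate(Rest, Pred, Line + 1, 1);
        false ->
            %% One decoded code point advances one column,
            %% however many bytes it occupied.
            locate(Rest, Pred, Line, Col + 1)
    end.
```

Matching with `/utf8` consumes a whole code point per step, which is what keeps multi-byte characters at one column each.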

Example with Multi-line Code

Code = <<"line 1
line 2
x … y
line 4"/utf8>>.

cure_lexer:tokenize(Code).
% => {error, {{unexpected_character, 8230}, 3, 3}}
%    Line 3 (third line)
%    Column 3 (after "x ")

UTF-8 Support

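The lexer decodes UTF-8 before reporting an error, so CodePoint is the full code point and the column is the same regardless of how many bytes the character occupies: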

1-byte (ASCII)

cure_lexer:tokenize(<<"x $ y">>).
% => {error, {{unexpected_character, 36}, 1, 3}}
% 36 = U+0024 ($)

2-byte UTF-8

cure_lexer:tokenize(<<"x ¢ y"/utf8>>).
% => {error, {{unexpected_character, 162}, 1, 3}}
% 162 = U+00A2 (¢)

3-byte UTF-8

cure_lexer:tokenize(<<"x … y"/utf8>>).
% => {error, {{unexpected_character, 8230}, 1, 3}}
% 8230 = U+2026 (…)

4-byte UTF-8 (Emoji)

cure_lexer:tokenize(<<"x 😀 y"/utf8>>).
% => {error, {{unexpected_character, 128512}, 1, 3}}
% 128512 = U+1F600 (😀)
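
The byte widths in the headings above can be checked directly in Erlang by encoding a code point and measuring the result. The module `utf8_width_sketch` and its `byte_width/1` function are hypothetical names used only for this illustration.

```erlang
-module(utf8_width_sketch).
-export([byte_width/1]).

%% Number of bytes the UTF-8 encoding of CodePoint occupies.
%% Raises badarg for values that are not valid code points.
byte_width(CodePoint) when is_integer(CodePoint) ->
    byte_size(<<CodePoint/utf8>>).
```

The four code points used in this section (36, 162, 8230, 128512) give widths of 1, 2, 3 and 4 bytes respectively, while the reported column stays at 3 in every case.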

Integration with LSP

The LSP server uses the location information to create diagnostics with proper ranges:

% Lexer error
{error, {{unexpected_character, 8230}, 3, 6}}

% Converted to LSP diagnostic
#{
    range => #{
        start => #{line => 2, character => 5},  % 0-based in LSP
        'end' => #{line => 2, character => 6}  % 'end' is quoted: reserved word
    },
    severity => 1,  % Error
    source => <<"cure">>,
    message => <<"Unexpected character: … (U+2026)"/utf8>>
}
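
The conversion shown above could look roughly like the following. This is a hedged sketch, not the Cure LSP server's actual code: the module `lsp_diag_sketch`, the function `to_diagnostic/1`, and the choice of a one-character-wide range for every error are assumptions made for illustration.

```erlang
-module(lsp_diag_sketch).
-export([to_diagnostic/1]).

%% Hypothetical conversion from a lexer error to an LSP diagnostic map.
%% Assumes a one-character-wide range, as in the example above.
to_diagnostic({error, {Reason, Line, Column}}) ->
    StartChar = Column - 1,                  % LSP positions are 0-based
    #{
        range => #{
            start => #{line => Line - 1, character => StartChar},
            %% end is a reserved word in Erlang, so the key is quoted.
            'end' => #{line => Line - 1, character => StartChar + 1}
        },
        severity => 1,                       % 1 = Error in LSP
        source => <<"cure">>,
        message => message(Reason)
    }.

%% Message text modelled on the example above; other reasons are
%% simply printed as Erlang terms.
message({unexpected_character, CP}) ->
    Hex = lists:flatten(string:pad(integer_to_list(CP, 16), 4, leading, $0)),
    unicode:characters_to_binary(
        io_lib:format("Unexpected character: ~ts (U+~s)", [[CP], Hex]));
message(Reason) ->
    unicode:characters_to_binary(io_lib:format("~p", [Reason])).
```

Note that LSP's default position encoding counts UTF-16 code units, so a real conversion would need to adjust `character` when characters outside the Basic Multilingual Plane (such as emoji) appear earlier on the same line.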

Testing Location Tracking

To verify location tracking:

% Test line tracking
cure_lexer:tokenize(<<"a\nb\nc … d"/utf8>>).
% => {error, {{unexpected_character, 8230}, 3, 3}}

% Test column tracking  
cure_lexer:tokenize(<<"hello … world"/utf8>>).
% => {error, {{unexpected_character, 8230}, 1, 7}}

% Test multi-byte at column 1
cure_lexer:tokenize(<<"…test"/utf8>>).
% => {error, {{unexpected_character, 8230}, 1, 1}}

Summary

  • All lexer errors include location information
  • Line and column are 1-based
  • UTF-8 characters are decoded to codepoints
  • Location tracking is accurate across newlines
  • Multi-byte UTF-8 characters count as a single column
  • LSP integration uses the location for diagnostics

Every error returned by the Cure lexer carries a line and column, enabling precise error reporting in both command-line and IDE environments.