Fix C codegen for interpolated strings (f-strings) #18
Conversation
- Add builtins for self-hosting: args(), exec(), exec_output(), mkdir(), append_file(), getenv(), setenv()
- Fix char comparison to compare values instead of pointers
- Fix escape sequence processing in char and string literals
- Add bootstrap lexer in Carv (examples/bootstrap/lexer.carv)
- Add test coverage for pkg/ast, pkg/codegen, pkg/module
- Add parse_int() and parse_float() builtins for string-to-number conversion
- Add index assignment support for arrays and maps (arr[i] = val, map[key] = val)
- Add bootstrap parser written in Carv (examples/bootstrap/parser.carv)
- Uses Pratt parsing (recursive descent with operator precedence); a minimal sketch follows below
- Parses all Carv constructs: functions, classes, let/const, if/else, for/for-in/while, match, pipe operator, etc.
- AST nodes represented as maps for simplicity
- Add test file (examples/bootstrap/test_simple.carv)
This is an experimental branch for self-hosting exploration.
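For context on the technique, here is a minimal, runnable Go sketch of Pratt parsing. Unlike parser.carv it evaluates arithmetic directly instead of building map-based AST nodes, and every name in it is illustrative rather than taken from the bootstrap parser:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Operator precedences; higher binds tighter.
var prec = map[string]int{"+": 1, "-": 1, "*": 2, "/": 2}

type parser struct {
	toks []string
	pos  int
}

func (p *parser) peek() string {
	if p.pos < len(p.toks) {
		return p.toks[p.pos]
	}
	return ""
}

func (p *parser) next() string { t := p.peek(); p.pos++; return t }

// parseExpr is the Pratt core: parse a prefix operand, then fold in infix
// operators whose precedence exceeds minPrec, recursing with the operator's
// own precedence so tighter operators group first.
func (p *parser) parseExpr(minPrec int) float64 {
	left, _ := strconv.ParseFloat(p.next(), 64)
	for prec[p.peek()] > minPrec {
		op := p.next()
		right := p.parseExpr(prec[op])
		switch op {
		case "+":
			left += right
		case "-":
			left -= right
		case "*":
			left *= right
		case "/":
			left /= right
		}
	}
	return left
}

func main() {
	p := &parser{toks: strings.Fields("2 + 3 * 4")}
	fmt.Println(p.parseExpr(0)) // 14: * binds tighter than +
}
```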
Add support for generating C code from interpolated strings:
- Add carv_int_to_string, carv_float_to_string, carv_bool_to_string helpers
- Add carv_concat for string concatenation
- Add generateInterpolatedString to properly convert f-strings to C
- Add InterpolatedString case to inferExprType
Previously, f-strings like `println(f"x = {x}")` would generate
invalid C code with empty printf arguments. Now they correctly
generate string concatenation code.
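To make the shape of the fix concrete, here is a hedged, runnable Go sketch of the general approach: fold an f-string's parts into nested carv_concat calls, wrapping typed expressions in the matching conversion helper. The Part type and generate function are illustrative stand-ins, not the PR's actual generateInterpolatedString:

```go
package main

import "fmt"

// Part is an illustrative stand-in for an interpolated-string segment:
// either literal text, or an expression already rendered as C source with
// a known Carv type. These are not the actual pkg/codegen types.
type Part struct {
	IsExpr bool
	Code   string // C source for the expression, or the literal text
	Type   string // "int", "float", or "bool" (expressions only)
}

// toCString wraps an expression in the matching runtime conversion helper.
func toCString(p Part) string {
	if !p.IsExpr {
		return fmt.Sprintf("%q", p.Code) // emit the segment as a C string literal
	}
	switch p.Type {
	case "int":
		return "carv_int_to_string(" + p.Code + ")"
	case "float":
		return "carv_float_to_string(" + p.Code + ")"
	case "bool":
		return "carv_bool_to_string(" + p.Code + ")"
	}
	return p.Code // already a string
}

// generate folds the parts left to right into nested carv_concat calls.
func generate(parts []Part) string {
	if len(parts) == 0 {
		return `""`
	}
	out := toCString(parts[0])
	for _, p := range parts[1:] {
		out = "carv_concat(" + out + ", " + toCString(p) + ")"
	}
	return out
}

func main() {
	// f"x = {x}" with x inferred as int:
	parts := []Part{{Code: "x = "}, {IsExpr: true, Code: "x", Type: "int"}}
	fmt.Println(generate(parts)) // carv_concat("x = ", carv_int_to_string(x))
}
```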
📝 Walkthrough

This PR introduces CLI argument passing to Carv programs, adds a complete bootstrap lexer and parser implementation, extends the runtime with system-level builtins (parse_int, parse_float, args, exec, mkdir, etc.), implements escape sequence handling for string/char literals, adds interpolated string support in code generation, and provides comprehensive test coverage for AST, code generation, and module loading.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as Command Line
    participant Main as cmd/carv/main.go
    participant Interpreter as pkg/eval
    participant Program as Carv Program
    CLI->>Main: carv run script.carv arg1 arg2
    Main->>Main: Extract os.Args[3:] as programArgs
    Main->>Interpreter: runFile(filename, programArgs)
    Interpreter->>Interpreter: SetArgs(programArgs)
    Interpreter->>Interpreter: eval.cliArgs = programArgs
    Interpreter->>Program: Execute program
    Program->>Interpreter: Call args() builtin
    Interpreter->>Program: Return stored CLI arguments
    Program->>Program: Access and process arguments
```
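A minimal runnable Go sketch of the plumbing the diagram describes; the types and method names below are simplified stand-ins for the real ones in pkg/eval:

```go
package main

import (
	"fmt"
	"os"
)

// Evaluator stands in for the interpreter; the real pkg/eval types differ.
type Evaluator struct{ cliArgs []string }

// SetArgs stores the program arguments before evaluation starts.
func (e *Evaluator) SetArgs(args []string) { e.cliArgs = args }

// argsBuiltin is what a Carv program's args() call would return.
func (e *Evaluator) argsBuiltin() []string { return e.cliArgs }

func main() {
	// carv run script.carv arg1 arg2  ->  os.Args[3:] are the program args.
	programArgs := []string{"arg1", "arg2"} // illustrative fallback
	if len(os.Args) > 3 {
		programArgs = os.Args[3:]
	}
	ev := &Evaluator{}
	ev.SetArgs(programArgs)
	fmt.Println(ev.argsBuiltin())
}
```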
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Closing due to merge conflicts. Replaced by new PR from clean branch.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/eval/eval.go (1)

290-299: Prevent non‑numeric operands from slipping into the float path.

With CHAR support, expressions like `'a' + 1.0` will hit the float branch (`left.Type() == FLOAT_OBJ || right.Type() == FLOAT_OBJ`) and `evalFloatInfixExpression` will silently treat CHAR as 0, producing incorrect results instead of an error. Consider guarding the float branch or rejecting non‑int/float operands in `evalFloatInfixExpression`.

🔧 Suggested guard in `evalFloatInfixExpression`:

```diff
 func evalFloatInfixExpression(operator string, left, right Object) Object {
 	var leftVal, rightVal float64
 	switch l := left.(type) {
 	case *Float:
 		leftVal = l.Value
 	case *Integer:
 		leftVal = float64(l.Value)
+	default:
+		return &Error{Message: "unknown operator: " + string(left.Type()) + " " + operator + " " + string(right.Type())}
 	}
 	switch r := right.(type) {
 	case *Float:
 		rightVal = r.Value
 	case *Integer:
 		rightVal = float64(r.Value)
+	default:
+		return &Error{Message: "unknown operator: " + string(left.Type()) + " " + operator + " " + string(right.Type())}
 	}
 	// ...
 }
```
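For reference, a self-contained version of the suggested guard, with simplified stand-ins for pkg/eval's object types; running it shows `'a' + 1.0` surfacing an error instead of silently treating CHAR as 0:

```go
package main

import "fmt"

// Simplified stand-ins for pkg/eval's Object types.
type Object interface{ Type() string }
type Integer struct{ Value int64 }
type Float struct{ Value float64 }
type Char struct{ Value rune }
type Error struct{ Message string }

func (Integer) Type() string { return "INTEGER" }
func (Float) Type() string   { return "FLOAT" }
func (Char) Type() string    { return "CHAR" }
func (Error) Type() string   { return "ERROR" }

// asFloat returns the numeric value, or false for non-int/float operands,
// mirroring the guard the review suggests.
func asFloat(o Object) (float64, bool) {
	switch v := o.(type) {
	case Integer:
		return float64(v.Value), true
	case Float:
		return v.Value, true
	}
	return 0, false
}

func evalFloatAdd(left, right Object) Object {
	l, ok := asFloat(left)
	if !ok {
		return Error{"unknown operator: " + left.Type() + " + " + right.Type()}
	}
	r, ok := asFloat(right)
	if !ok {
		return Error{"unknown operator: " + left.Type() + " + " + right.Type()}
	}
	return Float{l + r}
}

func main() {
	fmt.Println(evalFloatAdd(Char{'a'}, Float{1.0})) // now an error, not 1.0
}
```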
🧹 Nitpick comments (3)

pkg/eval/builtins.go (1)

620-676: Consider adding cancellation/timeout for `exec`/`exec_output` to avoid hangs.

Both builtins call external processes via `exec.Command` and will block indefinitely if the command stalls. If this runtime can execute untrusted or long‑running commands, add context/timeout (or a cancellation mechanism) to bound execution; a runnable sketch follows this list.

♻️ Suggested approach (apply similarly in `exec_output`):

```diff
-	cmd := exec.Command(cmdStr.Value, cmdArgs...)
+	ctx, cancel := context.WithTimeout(context.Background(), defaultExecTimeout)
+	defer cancel()
+	cmd := exec.CommandContext(ctx, cmdStr.Value, cmdArgs...)
```
pkg/types/checker.go (1)

83-89: Align `exec`/`exec_output` signatures with runtime string-only args.

The evaluator requires string arguments, but the checker currently advertises `Any`. Consider modeling these as `String` (and variadic if the type system supports it) or clearly documenting dynamic typing here to avoid misleading type hints.

examples/bootstrap/parser.carv (1)

6-122: Consider sharing token/keyword definitions with the lexer.

The token constants and keyword map are duplicated across lexer.carv and parser.carv, which can drift over time. A shared module (or `require`‑ing the lexer's tables) would keep them in sync.
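As promised in the builtins.go nitpick above, here is a runnable sketch of the context-based timeout. `defaultExecTimeout` from the suggested diff is not an existing constant in this codebase, so an arbitrary 10-second bound stands in for it; the command and error handling are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Bound the external process so a stalled command cannot hang the runtime.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	out, err := exec.CommandContext(ctx, "echo", "hello").CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		fmt.Println("command timed out")
		return
	}
	if err != nil {
		fmt.Println("exec failed:", err)
		return
	}
	fmt.Print(string(out))
}
```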
In examples/bootstrap/lexer.carv:

```
// Type keywords
const TOKEN_INT_TYPE = 80;
const TOKEN_FLOAT_TYPE = 81;
const TOKEN_BOOL_TYPE = 82;
const TOKEN_STRING_TYPE = 83;
const TOKEN_CHAR_TYPE = 84;
const TOKEN_VOID_TYPE = 85;

// Keywords map
let keywords = {
    "fn": TOKEN_FN,
    "let": TOKEN_LET,
    "const": TOKEN_CONST,
    "if": TOKEN_IF,
    "else": TOKEN_ELSE,
    "match": TOKEN_MATCH,
    "for": TOKEN_FOR,
    "while": TOKEN_WHILE,
    "break": TOKEN_BREAK,
    "continue": TOKEN_CONTINUE,
    "return": TOKEN_RETURN,
    "class": TOKEN_CLASS,
    "pub": TOKEN_PUB,
    "new": TOKEN_NEW,
    "self": TOKEN_SELF,
    "true": TOKEN_TRUE,
    "false": TOKEN_FALSE,
    "nil": TOKEN_NIL,
    "Ok": TOKEN_OK,
    "Err": TOKEN_ERR,
    "require": TOKEN_REQUIRE,
    "from": TOKEN_FROM,
    "in": TOKEN_IN,
    "mut": TOKEN_MUT,
    "int": TOKEN_INT_TYPE,
    "float": TOKEN_FLOAT_TYPE,
    "bool": TOKEN_BOOL_TYPE,
    "string": TOKEN_STRING_TYPE,
    "char": TOKEN_CHAR_TYPE,
    "void": TOKEN_VOID_TYPE
```
Add `any` type token to keep lexer and parser aligned.

parser.carv recognizes `any` as a type keyword, but this lexer will tokenize it as IDENT. Add TOKEN_ANY_TYPE, the keyword map entry, and token name to avoid mismatches.

✅ Suggested update:

```diff
 const TOKEN_VOID_TYPE = 85;
+const TOKEN_ANY_TYPE = 86;

 let keywords = {
@@
-    "void": TOKEN_VOID_TYPE
+    "void": TOKEN_VOID_TYPE,
+    "any": TOKEN_ANY_TYPE
 };

 let token_names = {
@@
-    85: "void"
+    85: "void",
+    86: "any"
 };
```

Also applies to: 184-189
🤖 Prompt for AI Agents
In `@examples/bootstrap/lexer.carv` around lines 76 - 115, The lexer is missing
the `any` type token so it gets tokenized as IDENT; add a new constant (e.g.,
TOKEN_ANY_TYPE with a unique value alongside TOKEN_INT_TYPE..TOKEN_VOID_TYPE),
add the mapping "any": TOKEN_ANY_TYPE into the keywords object, and update the
token name listing (the token names array referenced around lines 184-189) to
include the human-readable name for TOKEN_ANY_TYPE so the lexer and parser
(parser.carv) stay aligned.
In examples/bootstrap/lexer.carv:

```
// Comments
if ch == '/' && i + 1 < input_len && char_at(input, i + 1) == '/' {
    while i < input_len && char_at(input, i) != '\n' {
        i = i + 1;
    }
    continue;
}
```
Comment skipping doesn't advance column tracking.

If the file ends with a `//` comment, the EOF column will point to the comment start. Increment `col` while consuming comment characters.

✅ Suggested fix:

```diff
 if ch == '/' && i + 1 < input_len && char_at(input, i + 1) == '/' {
     while i < input_len && char_at(input, i) != '\n' {
         i = i + 1;
+        col = col + 1;
     }
     continue;
 }
```
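The same lockstep pattern, extracted into a runnable Go form for clarity; the function name and starting positions are illustrative, not taken from the lexer:

```go
package main

import "fmt"

// skipLineComment consumes characters up to (not including) the newline,
// advancing the column counter in lockstep with the index so the EOF
// position after a trailing comment stays accurate.
func skipLineComment(input string, i, col int) (int, int) {
	for i < len(input) && input[i] != '\n' {
		i++
		col++
	}
	return i, col
}

func main() {
	src := "let x = 1; // trailing comment"
	i, col := 11, 12 // position of the "//" (illustrative 1-based column)
	i, col = skipLineComment(src, i, col)
	fmt.Println(i, col) // both now reflect the consumed comment characters
}
```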
🤖 Prompt for AI Agents
In `@examples/bootstrap/lexer.carv` around lines 281 - 287, The
comment-consumption branch that checks if ch == '/' and char_at(..., i+1) == '/'
advances i but doesn't update the column counter, so EOF after a line comment
leaves col at the comment start; inside that branch (the while loop that does
while i < input_len && char_at(input, i) != '\n') increment the col variable in
lockstep with i (e.g., col = col + 1 each time you do i = i + 1) so that both i
and col reflect consumed comment characters; adjust only within the
comment-consumption code path (the block using ch, i, col, input_len, char_at)
and leave newline handling elsewhere unchanged.
In examples/bootstrap/parser.carv:

```
if mode == "prefix" {
    let t = cur_tok()["type"];

    if t == TOKEN_IDENT { return parse_ident(); }
    if t == TOKEN_INT { return parse_int_lit(); }
    if t == TOKEN_FLOAT { return parse_float_lit(); }
    if t == TOKEN_STRING { return parse_string_lit(); }
    if t == TOKEN_TRUE || t == TOKEN_FALSE { return parse_bool_lit(); }
    if t == TOKEN_NIL { return parse_nil_lit(); }
    if t == TOKEN_SELF { return parse_self_expr(); }
```
Char literals aren't parsed yet.

TOKEN_CHAR exists, but parse("prefix") has no branch for it, so `'a'` yields a parse error. Add a char literal parser and AST printing support.

✅ Suggested implementation:

```diff
 fn parse_string_lit() {
     let n = node("string", cur_tok());
     n["value"] = cur_tok()["literal"];
     return n;
 }
+fn parse_char_lit() {
+    let n = node("char", cur_tok());
+    n["value"] = cur_tok()["literal"];
+    return n;
+}
@@
-    if t == TOKEN_STRING { return parse_string_lit(); }
+    if t == TOKEN_STRING { return parse_string_lit(); }
+    if t == TOKEN_CHAR { return parse_char_lit(); }
@@
     if kind == "string" {
         let val = n["value"];
         println(f"String: \"{val}\"");
         return;
     }
+    if kind == "char" {
+        let val = n["value"];
+        println(f"Char: '{val}'");
+        return;
+    }
```
+ }🤖 Prompt for AI Agents
In `@examples/bootstrap/parser.carv` around lines 477 - 487, Add handling for
character literals: in the prefix parse branch (the code checking mode ==
"prefix" and calling parse_ident, parse_int_lit, parse_string_lit, etc.) add a
branch for TOKEN_CHAR that calls parse_char_lit(). Implement parse_char_lit() to
consume the TOKEN_CHAR token, build the appropriate AST node (e.g., CharLit or
CharLiteral) containing the character value, and return it (mirroring how
parse_int_lit()/parse_string_lit() work). Finally update the AST
printer/pretty-printer to recognize and render the CharLit/CharLiteral node (add
a case in the expression printing function) so char literals like 'a' are
printed correctly.
| g.writeln("carv_string carv_concat(carv_string a, carv_string b) {") | ||
| g.writeln(" size_t len_a = strlen(a);") | ||
| g.writeln(" size_t len_b = strlen(b);") | ||
| g.writeln(" char* result = (char*)malloc(len_a + len_b + 1);") | ||
| g.writeln(" memcpy(result, a, len_a);") | ||
| g.writeln(" memcpy(result + len_a, b, len_b + 1);") | ||
| g.writeln(" return result;") |
🧩 Analysis chain

🏁 Scripts executed (against dev-dami/carv):

```sh
# First, locate and examine the specific lines in cgen.go
cd /tmp/repo 2>/dev/null || cd .
fd cgen.go -x cat -n {} | head -300

# Get line count to understand file size
wc -l pkg/codegen/cgen.go

# View lines around 269-275 with context
sed -n '250,290p' pkg/codegen/cgen.go

# Search for carv_string type definition
rg "type carv_string" -A 3

# Search for carv_concat usage/calls
rg "carv_concat" -B 2 -A 2

# Search for how NULL is handled in related functions
rg "strlen|NULL|if.*=" pkg/codegen/cgen.go | head -30
```
Guard carv_concat against NULL inputs to prevent crashes.

Without NULL checks, `strlen(NULL)` will crash if either input is NULL. This can occur when `carv_read_file` fails and returns NULL, especially in string concatenation operations. Other runtime helpers like `carv_trim` and `carv_substr` already guard against NULL; `carv_concat` should follow the same pattern.

Proposed fix:

```diff
 carv_string carv_concat(carv_string a, carv_string b) {
+    if (!a) a = "";
+    if (!b) b = "";
     size_t len_a = strlen(a);
     size_t len_b = strlen(b);
     char* result = (char*)malloc(len_a + len_b + 1);
     memcpy(result, a, len_a);
     memcpy(result + len_a, b, len_b + 1);
     return result;
 }
```
}🤖 Prompt for AI Agents
In `@pkg/codegen/cgen.go` around lines 269 - 275, carv_concat currently calls
strlen on both inputs and will crash if a or b is NULL; update carv_concat to
guard inputs similar to carv_trim/carv_substr by treating NULL as an empty
string (e.g., use local const char* a_ptr = a ? a : ""; const char* b_ptr = b ?
b : ""; compute len with strlen(a_ptr)/strlen(b_ptr), allocate len_a+len_b+1,
and copy from a_ptr/b_ptr) so no strlen/memcpy is ever called on NULL and the
function returns a valid allocated string.
In pkg/parser/parser.go:

```diff
 func (p *Parser) parseStringLiteral() ast.Expression {
-	return &ast.StringLiteral{Token: p.curToken, Value: p.curToken.Literal}
+	return &ast.StringLiteral{Token: p.curToken, Value: unescapeString(p.curToken.Literal)}
 }
```

```go
func unescapeString(s string) string {
	var result []byte
	for i := 0; i < len(s); i++ {
		if s[i] == '\\' && i+1 < len(s) {
			switch s[i+1] {
			case 'n':
				result = append(result, '\n')
			case 't':
				result = append(result, '\t')
			case 'r':
				result = append(result, '\r')
			case '\\':
				result = append(result, '\\')
			case '"':
				result = append(result, '"')
			case '0':
				result = append(result, '\x00')
			default:
				result = append(result, s[i+1])
			}
			i++
		} else {
			result = append(result, s[i])
		}
	}
	return string(result)
}
```
🧩 Analysis chain

🏁 Scripts executed (against dev-dami/carv):

```sh
# First, let's find the parseInterpolatedString function in the parser
rg -A 30 "func.*parseInterpolatedString" pkg/parser/parser.go

# Also search for InterpolatedString type definition to understand the structure
rg -B 2 -A 10 "type InterpolatedString" pkg/

# Let's also check how the Parts are constructed in interpolated strings
rg -B 5 -A 15 "InterpolatedString\s*{" pkg/parser/parser.go | head -60

# Inspect the full function body in increasing windows
rg -A 50 "func \(p \*Parser\) parseInterpolatedString" pkg/parser/parser.go
rg -A 60 "func \(p \*Parser\) parseInterpolatedString" pkg/parser/parser.go | tail -20
rg -A 65 "func \(p \*Parser\) parseInterpolatedString" pkg/parser/parser.go | tail -10
```
Static segments in interpolated strings bypass escape processing, causing inconsistent behavior.

`parseStringLiteral` properly unescapes escape sequences, but `parseInterpolatedString` constructs static string parts using raw token literal values without unescaping. This means `f"line\n{expr}"` will render `\n` literally instead of a newline, diverging from regular string behavior where `"line\n"` produces an actual newline.

Apply `unescapeString` to static segments when constructing `InterpolatedString.Parts` to keep behavior consistent; a concrete demonstration follows.
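This runnable snippet applies the PR's unescapeString (copied verbatim from the diff above) to a static segment as the lexer would deliver it, showing what the fix would recover:

```go
package main

import "fmt"

// unescapeString is copied from the PR's addition to pkg/parser/parser.go.
func unescapeString(s string) string {
	var result []byte
	for i := 0; i < len(s); i++ {
		if s[i] == '\\' && i+1 < len(s) {
			switch s[i+1] {
			case 'n':
				result = append(result, '\n')
			case 't':
				result = append(result, '\t')
			case 'r':
				result = append(result, '\r')
			case '\\':
				result = append(result, '\\')
			case '"':
				result = append(result, '"')
			case '0':
				result = append(result, '\x00')
			default:
				result = append(result, s[i+1])
			}
			i++
		} else {
			result = append(result, s[i])
		}
	}
	return string(result)
}

func main() {
	raw := `line\n` // static f-string segment with the escape still raw
	fmt.Printf("raw:       %q\n", raw)
	fmt.Printf("unescaped: %q\n", unescapeString(raw)) // now a real newline
}
```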
🤖 Prompt for AI Agents
In `@pkg/parser/parser.go` around lines 741 - 770, The static text segments in
interpolated strings are being added using raw token literals and thus bypass
escape processing; update the code in parseInterpolatedString so that when you
create InterpolatedString.Parts for static segments you pass the token's literal
through unescapeString (the same routine used by parseStringLiteral) rather than
using the raw literal, taking care to provide the inner string content expected
by unescapeString so escaped sequences like "\n" become actual newlines.
Summary

Fixes string interpolation (`f"..."`) not compiling to valid C code.

Problem

Previously, code like `println(f"x = {x}")` would generate invalid C with empty printf arguments.

Solution

Added proper code generation for `InterpolatedString` AST nodes.

New runtime helpers (a hypothetical sketch of their implementations follows this list):

- `carv_int_to_string()` - converts int to string
- `carv_float_to_string()` - converts float to string
- `carv_bool_to_string()` - converts bool to string
- `carv_concat()` - concatenates two strings
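The helper bodies aren't shown in the PR description; the following runnable Go sketch mimics cgen.go's writeln style to emit plausible snprintf-based versions. The signatures, buffer sizes, and ownership handling are assumptions, not the PR's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// emitter mimics the writeln-style generator used in pkg/codegen/cgen.go.
type emitter struct{ b strings.Builder }

func (e *emitter) writeln(s string) { e.b.WriteString(s); e.b.WriteByte('\n') }

func main() {
	g := &emitter{}
	// Hypothetical snprintf-based helpers; the PR's real bodies may differ.
	g.writeln("carv_string carv_int_to_string(long long v) {")
	g.writeln("    char* buf = (char*)malloc(32);")
	g.writeln("    snprintf(buf, 32, \"%lld\", v);")
	g.writeln("    return buf;")
	g.writeln("}")
	g.writeln("carv_string carv_float_to_string(double v) {")
	g.writeln("    char* buf = (char*)malloc(64);")
	g.writeln("    snprintf(buf, 64, \"%g\", v);")
	g.writeln("    return buf;")
	g.writeln("}")
	// NOTE: returning literals mixes ownership with malloc'd buffers; a real
	// implementation would need one consistent allocation story.
	g.writeln("carv_string carv_bool_to_string(int b) {")
	g.writeln("    return b ? \"true\" : \"false\";")
	g.writeln("}")
	fmt.Print(g.b.String())
}
```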
New codegen functions:

- `generateInterpolatedString()` - generates C code for f-strings
- `convertToString()` - converts any expression to string based on type

Type inference: Added an `InterpolatedString` case to `inferExprType`, returning `carv_string`.

Result

Now generates valid C string-concatenation code in place of the previously empty printf arguments.

Testing
Summary by CodeRabbit
Release Notes
New Features
- New builtins: `parse_int`, `parse_float`, `args`, `exec`, `exec_output`, `mkdir`, `append_file`, `getenv`, `setenv`

Improvements