Relexing logical tokens and unclosed interpolated strings #21896

@MichaReiser

Description

Our logical-line error recovery doesn't work as expected for unclosed interpolated strings. I'm also not sure it works as expected in other cases, but more on that below.

We call re_lex_logical_token during list-parsing recovery if the next token is a valid list element or list terminator of any active recovery context, and we use list parsing (or its comma-separated variant) for interpolated strings, lists, dictionaries, statements, etc.
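
Roughly, and with stand-in names rather than the actual ruff_python_parser types, that check amounts to something like the following sketch:

// Simplified sketch of the recovery check; TokenKind, RecoveryContext, and
// the is_* helpers are stand-ins, not the real parser API.
#[derive(Clone, Copy)]
enum TokenKind {
    Name,
    Rsqb,
    NonLogicalNewline,
    Newline,
}

#[derive(Clone, Copy)]
enum RecoveryContext {
    ListElements,
}

impl RecoveryContext {
    fn is_list_element(&self, token: TokenKind) -> bool {
        matches!(token, TokenKind::Name)
    }

    fn is_list_terminator(&self, token: TokenKind) -> bool {
        matches!(token, TokenKind::Rsqb | TokenKind::Newline)
    }
}

// Re-lexing is only attempted when some *active* recovery context could make
// progress with the current token; otherwise the token is simply skipped.
fn should_re_lex(token: TokenKind, active_contexts: &[RecoveryContext]) -> bool {
    active_contexts
        .iter()
        .any(|ctx| ctx.is_list_element(token) || ctx.is_list_terminator(token))
}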

Here are a few thoughts after trying to "quickly" make it work for interpolated strings:

  • re_lex_logical_token always decrements the nesting, but this can lead to inconsistencies where the interpolated string on top of the stack now has a deeper nesting than the Lexer's nesting. I think re_lex_logical_token should pop the interpolated string in that case (see the sketch after the code example below).
  • Similarly, an interpolated string can have a format specifier that we track based on the nesting. It should probably end those as well.
  • Always decrementing self.nesting by one feels incorrect. We should only do so if we've seen a (, [, or {, because that's the only time we increment the nesting. E.g., tuple parsing shouldn't decrement if the tuple has no parentheses of its own.
  • The other part that feels off about always decrementing nesting, regardless of the current parsing function (error context), is that we only want to decrement nesting if the parentheses belong to the current context.
  • It's very easy to forget syncing the lexer's self.nesting with the parser's understanding during manual error recovery: e.g., after a manual self.expect(']').
  • The basic promise of re_lex_logical_token is that it only re-lexes NonLogicalNewline tokens to Newline, but that's not guaranteed, because changing the nesting can move the lexer into or out of interpolated strings.
  • If the only purpose of re_lex_logical_token is to change NonLogicalNewline to Newline, then it should return early while self.nesting > 0, because the newline token will never be lexed differently as long as nesting isn't zero.
  • A base assumption of the list error recovery is that re_lex_logical_token is called N times while unwinding the recovery contexts so that self.nesting ultimately reaches 0 and the re-lexing happens. This has two issues: a) a more deeply nested context doesn't see the re-lexed token, even when it's pretty clear that we should reset nesting to 0 when we see a def statement, and b) I don't think we call the recovery function for every nesting level. Take the following rather broken code:
# Regression test for https://github.com/astral-sh/ty/issues/1828
(c: int = 1,f"""{d=[
def a(
class A:
    pass
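
Before walking through what actually happens with this snippet, here is a rough sketch of how re_lex_logical_token could address the bullets above. This is a hypothetical variant, not the current implementation; the context_owns_bracket parameter and the nesting_at_start/in_format_spec fields are assumptions on my part:

// Hypothetical re-lex variant; all names below are assumptions, not the
// actual lexer fields.
struct InterpolatedStringContext {
    // Lexer nesting at the point this string's replacement field was entered.
    nesting_at_start: u32,
    // Whether we're currently inside this string's format specifier.
    in_format_spec: bool,
}

struct Lexer {
    nesting: u32,
    fstrings: Vec<InterpolatedStringContext>,
}

impl Lexer {
    // `context_owns_bracket` would be passed by the recovering parse function
    // to say whether it opened a (, [, or { itself. Returns whether the
    // current token should be re-lexed.
    fn re_lex_logical_token(&mut self, context_owns_bracket: bool) -> bool {
        // Only undo a nesting level if the recovering context opened a
        // bracket; un-parenthesized tuple parsing, for example, never
        // incremented it in the first place.
        if self.nesting == 0 || !context_owns_bracket {
            return false;
        }
        self.nesting -= 1;

        // Any interpolated string entered at a deeper nesting than we now
        // have can never be closed; pop it (which also ends the format
        // specifier tracked by `in_format_spec`).
        while self
            .fstrings
            .last()
            .is_some_and(|top| top.nesting_at_start > self.nesting)
        {
            self.fstrings.pop();
        }

        // A NonLogicalNewline can only become a Newline once nesting is back
        // at zero, so only then is re-lexing worthwhile.
        self.nesting == 0
    }
}

Whether the parser can reliably tell the lexer that it owns the bracket is exactly the open question from the bullets above, so treat this purely as a sketch of the intent.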

I don't fully understand what's happening when parsing this snippet, but this is what I'm seeing:

  • We hit error recovery the first time after the Lsqb. Nesting is already 3 at this point (because of the [). The lexer decrements nesting, but it doesn't re-lex because there's no preceding NonLogicalNewline (the next token is Lsqb).
  • The next recovery is already Arguments from def a(. There's no recovery from the unclosed (.

A few things seem to go wrong here:

  • The parser never pushes an error recovery context for ( because we parse parenthesized expressions manually.
  • The lexer bumps nesting for (, {, and [, but the parser never enters the list parsing for [.

This inevitably leads the parser to call re_lex_logical_token too few times (once instead of three), so the error recovery doesn't kick in.
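
To make that count concrete, here is a tiny standalone illustration; it only counts bracket characters in the first line of the snippet and is not parser code:

// Counts how many brackets are left open; a stand-in for the lexer's
// `nesting` counter, not the real lexer.
fn open_bracket_count(source: &str) -> u32 {
    let mut nesting: u32 = 0;
    for ch in source.chars() {
        match ch {
            '(' | '[' | '{' => nesting += 1,
            ')' | ']' | '}' => nesting = nesting.saturating_sub(1),
            _ => {}
        }
    }
    nesting
}

fn main() {
    // The first line of the regression test opens '(', the '{' of the
    // replacement field, and '[' without closing any of them.
    let nesting = open_bracket_count(r#"(c: int = 1,f"""{d=["#);
    assert_eq!(nesting, 3);
    // A single re_lex_logical_token call only decrements this once, so the
    // lexer never gets back to 0 and the following NonLogicalNewline is
    // never re-lexed to a Newline.
    println!("open brackets: {nesting}");
}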

This makes me wonder whether we need more carefully placed re_lex calls rather than a catch-all during error recovery (similar to what we do with unclosed interpolated strings).

For example, we could accept that the lexer and parser state diverge during expression parsing, and instead have a manual relex call that resets the lexer to a "start of a logical line" state (no interpolated strings, no nesting, nothing) before parsing any statement or other logical line.
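
A minimal sketch of what such a reset could look like; the method and field names are made up and not the existing lexer API:

// Hypothetical reset the parser would call before starting a new statement
// or other logical line; all names are assumptions.
struct InterpolatedStringContext;

struct Lexer {
    nesting: u32,
    fstrings: Vec<InterpolatedStringContext>,
}

impl Lexer {
    // Forget any unterminated brackets and interpolated strings so that the
    // next newline is lexed as a logical Newline again. Returns whether
    // anything was reset, in which case the caller re-lexes the current
    // token.
    fn reset_to_logical_line_start(&mut self) -> bool {
        let was_dirty = self.nesting != 0 || !self.fstrings.is_empty();
        self.nesting = 0;
        self.fstrings.clear();
        was_dirty
    }
}

The appeal is that the reset lives in a single, well-defined place instead of being spread across every recovery path that might have to decrement nesting.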

Syncing interpolated strings seems tricky because the parser and lexer state are so intertwined, perhaps so much so that there's an argument that the interpolated-string state tracking should live within the parser.
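
For illustration, one possible shape if the parser owned that state; all names here are hypothetical:

// The parser owns the interpolated-string stack and tells the lexer which
// mode to lex the next token in, so there is no second copy of the state
// that can drift out of sync.
#[derive(Clone, Copy)]
enum LexMode {
    Regular,
    InterpolatedString { triple_quoted: bool },
    FormatSpec,
}

struct Parser {
    interpolation_stack: Vec<LexMode>,
}

impl Parser {
    fn current_lex_mode(&self) -> LexMode {
        self.interpolation_stack
            .last()
            .copied()
            .unwrap_or(LexMode::Regular)
    }
}

That's a bigger refactor, of course, and it only works if the lexer can be driven token by token with a mode supplied by the parser.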

I don't have a concrete solution here. What we have sort of works, but it's also very easy for it to get out of sync, and it sometimes never recovers from that.
