Relexing logical tokens and unclosed interpolated strings #21896

@MichaReiser

Description

Our logical-line error recovery doesn't work as expected for unclosed interpolated strings. I'm also not sure it works as expected in other cases, but more on that below.

We call re_lex_logical_token during list-parsing recovery if the next token is a valid list element or list terminator of any active recovery context, and we use list parsing (or its comma-separated variant) for interpolated strings, lists, dictionaries, statements, etc.
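
Roughly, and with stand-in names rather than the actual ruff_python_parser types, that check amounts to something like the following sketch:

// Simplified sketch of the recovery check; TokenKind, RecoveryContext, and
// the is_* helpers are stand-ins, not the real parser API.
#[derive(Clone, Copy)]
enum TokenKind {
    Name,
    Rsqb,
    NonLogicalNewline,
    Newline,
}

#[derive(Clone, Copy)]
enum RecoveryContext {
    ListElements,
}

impl RecoveryContext {
    fn is_list_element(&self, token: TokenKind) -> bool {
        matches!(token, TokenKind::Name)
    }

    fn is_list_terminator(&self, token: TokenKind) -> bool {
        matches!(token, TokenKind::Rsqb | TokenKind::Newline)
    }
}

// Re-lexing is only attempted when some *active* recovery context could make
// progress with the current token; otherwise the token is simply skipped.
fn should_re_lex(token: TokenKind, active_contexts: &[RecoveryContext]) -> bool {
    active_contexts
        .iter()
        .any(|ctx| ctx.is_list_element(token) || ctx.is_list_terminator(token))
}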

Here are a few thoughts after trying to "quickly" make it work for interpolated strings:

  • re_lex_logical_token always decrements the nesting, but this can lead to inconsistencies where the interpolated string on top of the stack now has a deeper nesting than the Lexer's nesting. I think re_lex_logical_token should pop the interpolated string in that case (see the sketch after the code example below).
  • Similarly, an interpolated string can have a format specifier that we track based on the nesting. It should probably end those as well.
  • Always decrementing self.nesting by one feels incorrect. We should only do so if we've seen a (, [, or {, because that's the only time we increment the nesting. E.g., tuple parsing shouldn't decrement if the tuple has no parentheses of its own.
  • The other part that feels off about always decrementing nesting, regardless of the current parsing function (error context), is that we only want to decrement nesting if the parentheses belong to the current context.
  • It's very easy to forget syncing the lexer's self.nesting with the parser's understanding during manual error recovery: e.g., after a manual self.expect(']').
  • The basic promise of re_lex_logical_token is that it only re-lexes NonLogicalNewline tokens to Newline, but that's not guaranteed, because changing the nesting can move the lexer into or out of interpolated strings.
  • If the only purpose of re_lex_logical_token is to change NonLogicalNewline to Newline, then it should return early while self.nesting > 0, because the newline token will never be lexed differently as long as nesting isn't zero.
  • A base assumption of the list error recovery is that re_lex_logical_token is called N times while unwinding the recovery contexts so that self.nesting ultimately reaches 0 and the re-lexing happens. This has two issues: a) a more deeply nested context doesn't see the re-lexed token, even when it's pretty clear that we should reset nesting to 0 when we see a def statement, and b) I don't think we call the recovery function for every nesting level. Take the following rather broken code:
# Regression test for https://github.com/astral-sh/ty/issues/1828
(c: int = 1,f"""{d=[
def a(
class A:
    pass
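
Before walking through what actually happens with this snippet, here is a rough sketch of how re_lex_logical_token could address the bullets above. This is a hypothetical variant, not the current implementation; the context_owns_bracket parameter and the nesting_at_start/in_format_spec fields are assumptions on my part:

// Hypothetical re-lex variant; all names below are assumptions, not the
// actual lexer fields.
struct InterpolatedStringContext {
    // Lexer nesting at the point this string's replacement field was entered.
    nesting_at_start: u32,
    // Whether we're currently inside this string's format specifier.
    in_format_spec: bool,
}

struct Lexer {
    nesting: u32,
    fstrings: Vec<InterpolatedStringContext>,
}

impl Lexer {
    // `context_owns_bracket` would be passed by the recovering parse function
    // to say whether it opened a (, [, or { itself. Returns whether the
    // current token should be re-lexed.
    fn re_lex_logical_token(&mut self, context_owns_bracket: bool) -> bool {
        // Only undo a nesting level if the recovering context opened a
        // bracket; un-parenthesized tuple parsing, for example, never
        // incremented it in the first place.
        if self.nesting == 0 || !context_owns_bracket {
            return false;
        }
        self.nesting -= 1;

        // Any interpolated string entered at a deeper nesting than we now
        // have can never be closed; pop it (which also ends the format
        // specifier tracked by `in_format_spec`).
        while self
            .fstrings
            .last()
            .is_some_and(|top| top.nesting_at_start > self.nesting)
        {
            self.fstrings.pop();
        }

        // A NonLogicalNewline can only become a Newline once nesting is back
        // at zero, so only then is re-lexing worthwhile.
        self.nesting == 0
    }
}

Whether the parser can reliably tell the lexer that it owns the bracket is exactly the open question from the bullets above, so treat this purely as a sketch of the intent.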

I don't fully understand what's happening when parsing this snippet, but this is what I'm seeing:

  • We hit error recovery the first time after the Lsqb. Nesting is already 3 at this point (because of the [). The lexer decrements nesting, but it doesn't re-lex because there's no preceding NonLogicalNewline (the next token is Lsqb).
  • The next recovery is already Arguments from def a(. There's no recovery from the unclosed (.

A few things seem to go wrong here:

  • The parser never pushes an error recovery context for ( because we parse parenthesized expressions manually.
  • The lexer bumps nesting for (, {, and [, but the parser never enters the list parsing for [.

This inevitably leads the parser to call re_lex_logical_token too few times (once instead of three), so the error recovery doesn't kick in.
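
To make that count concrete, here is a tiny standalone illustration; it only counts bracket characters in the first line of the snippet and is not parser code:

// Counts how many brackets are left open; a stand-in for the lexer's
// `nesting` counter, not the real lexer.
fn open_bracket_count(source: &str) -> u32 {
    let mut nesting: u32 = 0;
    for ch in source.chars() {
        match ch {
            '(' | '[' | '{' => nesting += 1,
            ')' | ']' | '}' => nesting = nesting.saturating_sub(1),
            _ => {}
        }
    }
    nesting
}

fn main() {
    // The first line of the regression test opens '(', the '{' of the
    // replacement field, and '[' without closing any of them.
    let nesting = open_bracket_count(r#"(c: int = 1,f"""{d=["#);
    assert_eq!(nesting, 3);
    // A single re_lex_logical_token call only decrements this once, so the
    // lexer never gets back to 0 and the following NonLogicalNewline is
    // never re-lexed to a Newline.
    println!("open brackets: {nesting}");
}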

This makes me wonder whether we need more carefully placed re_lex calls rather than a catch-all during error recovery (similar to what we do with unclosed interpolated strings).

For example, we could accept that the lexer and parser state diverge during expression parsing, and instead have a manual relex call that resets the lexer to a "start of a logical line" state (no interpolated strings, no nesting, nothing) before parsing any statement or other logical line.
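
A minimal sketch of what such a reset could look like; the method and field names are made up and not the existing lexer API:

// Hypothetical reset the parser would call before starting a new statement
// or other logical line; all names are assumptions.
struct InterpolatedStringContext;

struct Lexer {
    nesting: u32,
    fstrings: Vec<InterpolatedStringContext>,
}

impl Lexer {
    // Forget any unterminated brackets and interpolated strings so that the
    // next newline is lexed as a logical Newline again. Returns whether
    // anything was reset, in which case the caller re-lexes the current
    // token.
    fn reset_to_logical_line_start(&mut self) -> bool {
        let was_dirty = self.nesting != 0 || !self.fstrings.is_empty();
        self.nesting = 0;
        self.fstrings.clear();
        was_dirty
    }
}

The appeal is that the reset lives in a single, well-defined place instead of being spread across every recovery path that might have to decrement nesting.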

Syncing interpolated strings seems tricky because the parser and lexer state are so intertwined, perhaps so much so that there's an argument that the interpolated-string state tracking should live within the parser.
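
For illustration, one possible shape if the parser owned that state; all names here are hypothetical:

// The parser owns the interpolated-string stack and tells the lexer which
// mode to lex the next token in, so there is no second copy of the state
// that can drift out of sync.
#[derive(Clone, Copy)]
enum LexMode {
    Regular,
    InterpolatedString { triple_quoted: bool },
    FormatSpec,
}

struct Parser {
    interpolation_stack: Vec<LexMode>,
}

impl Parser {
    fn current_lex_mode(&self) -> LexMode {
        self.interpolation_stack
            .last()
            .copied()
            .unwrap_or(LexMode::Regular)
    }
}

That's a bigger refactor, of course, and it only works if the lexer can be driven token by token with a mode supplied by the parser.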

I don't have a concrete solution here. What we have sort of works, but it's also very easy for it to get out of sync, and it sometimes never recovers from that.
