Migrate from SQLAlchemy to Ibis #62
base: ll/local_time2
Conversation
Codecov Report

```
@@           Coverage Diff            @@
##       ll/local_time2     #62   +/- ##
=========================================
  Coverage        ?      91.85%
=========================================
  Files           ?          55
  Lines           ?        4962
  Branches        ?           0
=========================================
  Hits            ?        4558
  Misses          ?         404
  Partials        ?           0
=========================================
```

☔ View full report in Codecov by Sentry.
For rollback behavior, can't we just make a copy of the original dataframe before running the next operation, so we have something to fall back on?
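A minimal sketch of the copy-based fallback being suggested, assuming operations are plain functions over a pandas DataFrame (`apply_with_rollback` is a hypothetical helper, not part of this diff):

```python
import pandas as pd


def apply_with_rollback(df: pd.DataFrame, operation) -> pd.DataFrame:
    # Run the operation on a deep copy so the caller's frame is never
    # mutated; if the operation fails, the original df is still intact.
    working = df.copy(deep=True)
    try:
        return operation(working)
    except Exception:
        return df  # fall back to the untouched original
```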
lixiliu left a comment
I am fine with switching to Ibis for its ability to avoid the ways we currently have to handle different backends. There are fewer changes to the code structure than I expected.
I think Claude is overcomplicating the mapping logic a bit, and it could use some more iteration there.
| """Convert time columns with from_schema to to_schema configuration.""" | ||
|
|
||
|
|
||
| def _ensure_mapping_types_match_source( |
Good call, but it is redundant with the data type handling in the mapping process later.
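The helper isn't shown in full here; for context, a minimal Ibis sketch of what this kind of type conformance could look like (only the idea and the function's role come from the diff; the body is an assumption):

```python
import ibis


def ensure_mapping_types_match_source(
    mapping: ibis.Table, source: ibis.Table
) -> ibis.Table:
    # Cast mapping-table columns that are shared with the source table to
    # the source's dtypes, so later join predicates compare identical types.
    source_schema = source.schema()
    casts = {
        name: mapping[name].cast(source_schema[name])
        for name in mapping.columns
        if name in source_schema.names
        and mapping[name].type() != source_schema[name]
    }
    return mapping.mutate(**casts) if casts else mapping
```

If the mapping process later casts these columns again during the join, that is exactly the redundancy noted above.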
```python
df_mapping = _ensure_mapping_types_match_source(df_mapping, from_schema, backend)

# Debug: Print mapping DF around target time
try:
```
What is this??
```python
if left_type is None or right_type is None:
    return left_col == right_col

left_is_unknown = not hasattr(left_type, "is_timestamp") or str(left_type).startswith(
```
Feels incomplete and inconsistent.
Generally we only want to change the right_table key, because left_table is the input data and right_table is the mapping table. IMO it's safer to have the right key conform to the left key only.
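A sketch of that policy, assuming Ibis tables and a string key name (`conform_right_key` is hypothetical):

```python
import ibis


def conform_right_key(left: ibis.Table, right: ibis.Table, key: str) -> ibis.Table:
    # Cast the mapping table's (right) key to the input data's (left) key
    # type; the input data itself is never modified.
    left_type = left[key].type()
    if right[key].type() != left_type:
        right = right.mutate(**{key: right[key].cast(left_type)})
    return right
```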
```python
if resampling_operation:
    query = query.group_by(*groupby_stmt)
predicates = _build_join_predicates(left_table, right_table, keys)
joined = left_table.join(right_table, predicates)
```
These join operations are clean, but the rest seems unnecessarily complicated.
I think Claude tried to preserve the existing way of handling potential column conflicts, which is to split the final columns into left_columns, right_columns, and joined_columns, but it also wrote additional code to handle potential conflicts all over again. That's why _build_select_columns and _build_query take so many input variables and are so complicated.
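One way to avoid re-solving conflicts everywhere is to rename colliding right-table columns once, up front. A hedged sketch (`join_without_conflicts` is hypothetical, not the PR's code):

```python
import ibis


def join_without_conflicts(
    left: ibis.Table, right: ibis.Table, keys: list[str]
) -> ibis.Table:
    # Rename non-key right columns that collide with left columns before
    # joining, so downstream select logic never needs conflict handling.
    collisions = (set(left.columns) & set(right.columns)) - set(keys)
    right = right.rename({f"{name}_right": name for name in collisions})
    predicates = [left[k] == right[k] for k in keys]
    return left.join(right, predicates)
```

Note that recent Ibis versions already disambiguate clashing join outputs via the `lname`/`rname` parameters of `Table.join` (`rname` defaults to `"{name}_right"`), which may make even this helper unnecessary.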
This is a prototype, mostly generated by Claude. The goal is to see whether we can simplify interaction with dsgrid when running Spark jobs. The current SQLAlchemy-based code requires a separate Spark session: dsgrid creates a session with pyspark, and chronify relies on an Apache Thrift Server (Hive).
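As a sketch of the simplification: Ibis's PySpark backend can attach directly to an existing SparkSession, so no Thrift server sits between the two tools (the table name below is hypothetical):

```python
import ibis
from pyspark.sql import SparkSession

# dsgrid already creates a session with pyspark...
spark = SparkSession.builder.appName("dsgrid").getOrCreate()

# ...and chronify could attach Ibis to that same session directly,
# instead of connecting through an Apache Thrift Server (Hive).
con = ibis.pyspark.connect(session=spark)
tbl = con.table("load_data")  # hypothetical table name
```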
We would get the following benefits by migrating:
- No separate Spark session or Thrift server: chronify can operate directly on the pyspark session that dsgrid creates.
We would lose this functionality in SQLAlchemy:
Outstanding work: