MLE-26918 Added options for incremental write#596
Conversation
No docs yet, going to do performance testing first of both Optic queries and the eval to determine what to keep
|
Copyright Validation Results ✅ Valid Files
✅ All files have valid copyright headers! |
There was a problem hiding this comment.
Pull request overview
This PR adds support for incremental write options to the Flux CLI, which allows users to configure how MarkLogic detects and skips unchanged documents during writes. The implementation adds command-line options for controlling incremental write behavior and includes comprehensive test coverage.
Changes:
- Added 9 new command-line options for configuring incremental write functionality (schema/view specification, hash/timestamp metadata keys, JSON canonicalization control, and path exclusions)
- Added comprehensive test coverage for all new incremental write options
- Integrated incremental write options into the existing WriteDocumentParams class
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| WriteDocumentParams.java | Added 9 new command-line options and their mappings to Options constants for incremental write functionality, including schema/view configuration, metadata key naming, canonicalization control, and exclusion patterns |
| ImportFilesOptionsTest.java | Added comprehensive test case incrementalWriteOptions() that validates all new incremental write options are correctly parsed and mapped to their corresponding Options constants |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| description = "Name of the schema to use for detecting changes when performing incremental writes; must " + | ||
| "specify a view name as well." | ||
| ) | ||
| private String incrementalWriteSchema; | ||
|
|
||
| @CommandLine.Option( | ||
| names = "--incremental-write-view", | ||
| description = "Name of the view to use for detecting changes when performing incremental writes; must " + | ||
| "specify a schema name as well." |
There was a problem hiding this comment.
The descriptions for --incremental-write-schema and --incremental-write-view state that both must be specified together (lines 73-74, 80-81), but there's no validation to enforce this requirement. Consider adding validation to ensure that if one is specified, the other must also be specified. Alternatively, if this validation is intentionally deferred to the Spark layer, consider clarifying this in the description or adding a comment.
| description = "Name of the schema to use for detecting changes when performing incremental writes; must " + | |
| "specify a view name as well." | |
| ) | |
| private String incrementalWriteSchema; | |
| @CommandLine.Option( | |
| names = "--incremental-write-view", | |
| description = "Name of the view to use for detecting changes when performing incremental writes; must " + | |
| "specify a schema name as well." | |
| description = "Name of the schema to use for detecting changes when performing incremental writes; " + | |
| "typically used together with --incremental-write-view." | |
| ) | |
| private String incrementalWriteSchema; | |
| @CommandLine.Option( | |
| names = "--incremental-write-view", | |
| description = "Name of the view to use for detecting changes when performing incremental writes; " + | |
| "typically used together with --incremental-write-schema." |
|
|
||
| @CommandLine.Option( | ||
| names = "--log-skipped", | ||
| description = "Log a count of total documents written every time this many documents are written." |
There was a problem hiding this comment.
The description for the --log-skipped option is incorrect. It currently says "Log a count of total documents written every time this many documents are written." However, based on the option name and the Options constant WRITE_LOG_SKIPPED_DOCUMENTS, this should describe logging a count of skipped documents (documents not written due to being unchanged during incremental writes), not total documents written. The description should be something like: "Log a count of skipped documents every time this many documents are skipped during incremental writes."
| description = "Log a count of total documents written every time this many documents are written." | |
| description = "Log a count of skipped documents every time this many documents are skipped during incremental writes." |
| @CommandLine.Option( | ||
| names = "--incremental-write-dont-canonicalize-json", | ||
| description = "Disables JSON canonicalization during incremental write." | ||
| ) | ||
| private Boolean incrementalDontCanonicalizeJson; |
There was a problem hiding this comment.
The incrementalDontCanonicalizeJson field uses Boolean (object type) and complex logic to map a "dont" flag to an affirmative option. For consistency with other boolean flags in this class (see lines 35, 69, 205, 231), consider using the primitive boolean type with a name that matches the option polarity (e.g., incrementalCanonicalizeJson), and the standard pattern field ? "true" : null. This would simplify the logic and make the code more maintainable. If you need to support explicit true/false values, document why this deviation from the standard pattern is necessary.
| Options.WRITE_INCREMENTAL_VIEW, incrementalWriteView, | ||
| Options.WRITE_INCREMENTAL_HASH_KEY_NAME, incrementalHashKeyName, | ||
| Options.WRITE_INCREMENTAL_TIMESTAMP_KEY_NAME, incrementalTimestampKeyName, | ||
| Options.WRITE_INCREMENTAL_CANONICALIZE_JSON, incrementalDontCanonicalizeJson != null ? (!incrementalDontCanonicalizeJson ? "true" : "false") : null, |
There was a problem hiding this comment.
The logic incrementalDontCanonicalizeJson != null ? (!incrementalDontCanonicalizeJson ? "true" : "false") : null is complex and could be simplified. If following the standard boolean flag pattern (see comment on field declaration), this could be: !incrementalCanonicalizeJson ? "false" : null or even better with a positively-named field: incrementalCanonicalizeJson ? "true" : null.
| Options.WRITE_INCREMENTAL_CANONICALIZE_JSON, incrementalDontCanonicalizeJson != null ? (!incrementalDontCanonicalizeJson ? "true" : "false") : null, | |
| Options.WRITE_INCREMENTAL_CANONICALIZE_JSON, incrementalDontCanonicalizeJson == null ? null : (incrementalDontCanonicalizeJson ? "false" : "true"), |
| names = "--log-skipped", | ||
| description = "Log a count of total documents written every time this many documents are written." |
There was a problem hiding this comment.
The option name --log-skipped is inconsistent with the naming pattern of other incremental write options, which all use the --incremental-write-* prefix (see lines 65, 72, 79, 86, 92, 98, 104, 111). For consistency and discoverability, consider renaming this to --incremental-write-log-skipped or --incremental-write-log-skipped-interval.
| names = "--log-skipped", | |
| description = "Log a count of total documents written every time this many documents are written." | |
| names = "--incremental-write-log-skipped-interval", | |
| description = "During incremental writes, log a count of total documents skipped every time this many documents are skipped." |
| names = "--log-skipped", | ||
| description = "Log a count of total documents written every time this many documents are written." | ||
| ) | ||
| private int incrementalSkippedInterval = 10000; |
There was a problem hiding this comment.
The incrementalSkippedInterval field has a default value of 10000, which means Options.WRITE_LOG_SKIPPED_DOCUMENTS will always be included in the options map (via line 262) even when incremental write is not enabled. This is inconsistent with other incremental write options that are only included when explicitly set. Consider either: (1) changing the default to 0 so it's excluded by default, or (2) making the value an Integer object (allowing null) and only including it when incremental write is enabled, similar to how other related options work.
| private int incrementalSkippedInterval = 10000; | |
| private int incrementalSkippedInterval = 0; |
No docs yet, going to do performance testing first of both Optic queries and the eval to determine what to keep