Add support for video (MOT + CVAT Video) and COCO image (BBOX and Segmentation) import and export#652
Add support for video (MOT + CVAT Video) and COCO image (BBOX and Segmentation) import and export#652
Conversation
…before opening a PR
…o query_result.py
Review Summary by QodoAdd video and COCO annotation import/export support
WalkthroughsDescription• Add support for importing and exporting video annotations (MOT, CVAT Video) • Add support for importing and exporting COCO image annotations (bbox, segmentation) • Implement video annotation flattening and Label Studio video task conversion • Fix error handling in blob download with try-catch moved to query_result.py • Add annotation field resolution utility for automatic field detection Diagramflowchart LR
A["Video Formats<br/>MOT, CVAT Video"] --> B["AnnotationImporter"]
C["Image Formats<br/>COCO"] --> B
B --> D["Flatten Video<br/>Annotations"]
B --> E["Convert to<br/>Label Studio"]
D --> F["QueryResult<br/>Export Methods"]
E --> F
F --> G["export_as_mot<br/>export_as_coco<br/>export_as_cvat_video"]
File Changes1. dagshub/data_engine/annotation/importer.py
|
📝 WalkthroughWalkthroughThis pull request extends annotation format support in the data engine by adding COCO, MOT, and CVAT video formats alongside existing YOLO and CVAT support. Changes include new import/export loaders, video annotation flattening logic, Label Studio conversion for video annotations, and corresponding test coverage. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Importer as AnnotationImporter
participant Loader as Format Loader
participant Converter as IR Converter
participant Output as Export/LS
User->>Importer: import_annotations(format=coco/mot/cvat_video)
Importer->>Loader: load_coco_from_file/load_mot_from_dir/load_cvat_from_zip
Loader->>Converter: Parse format to raw data
Converter->>Converter: Convert to IRAnnotation objects
alt is_video_format
Converter->>Importer: Return frame-wise annotations
Importer->>Importer: _flatten_video_annotations()
Importer->>Output: Return video-keyed dict
else image-based format
Converter->>Output: Return filename-keyed dict
end
Output-->>User: Annotations ready for metadata/export
sequenceDiagram
participant User
participant QueryResult as QueryResult
participant Importer as AnnotationImporter
participant Exporter as Format Exporter
participant FileSystem as Output Files
User->>QueryResult: export_as_coco/mot/cvat_video(download_dir)
QueryResult->>QueryResult: _resolve_annotation_field()
QueryResult->>Importer: _get_all_video_annotations() or gather annotations
QueryResult->>Exporter: Create context (CocoContext/MOTContext)
alt Video Format (MOT/CVAT Video)
Exporter->>FileSystem: Write video tracks/XML
else COCO Format
Exporter->>FileSystem: Write COCO JSON with image/annotation records
end
FileSystem-->>User: Return path to exported file/directory
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 1✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
CI Feedback 🧐A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (6)
dagshub/data_engine/model/query_result.py (1)
410-416: Consider catching more specific exceptions for blob download failures.The broad
Exceptioncatch is flagged by static analysis (BLE001). While this prevents crashes from download errors, consider catching more specific exceptions (e.g.,requests.RequestException,IOError) to avoid masking unexpected errors.The current approach of logging and removing the field from metadata is reasonable for resilience.
♻️ Optional: More specific exception handling
def _get_blob_fn(dp: Datapoint, field: str, url: str, blob_path: Path): try: blob_or_path = _get_blob(url, blob_path, auth, cache_on_disk, load_into_memory, path_format) - except Exception as e: + except (IOError, OSError) as e: logger.warning(f"Error while downloading blob for field {field} in datapoint {dp.path}: {e}") dp.metadata.pop(field, None) return dp.metadata[field] = blob_or_pathdagshub/data_engine/annotation/importer.py (2)
97-131: Consider extracting CVAT video detection logic to reduce duplication.The pattern of calling loader, then checking
_is_video_annotation_dict, then optionally flattening is duplicated betweencvat(lines 98-102) andcvat_video(lines 124-131) handlers.♻️ Optional: Extract common CVAT processing logic
def _process_cvat_result(self, result) -> Mapping[str, Sequence[IRAnnotationBase]]: if self._is_video_annotation_dict(result): return self._flatten_video_annotations(result) return resultThen use in both handlers:
elif self.annotations_type == "cvat": result = load_cvat_from_zip(annotations_file) annotation_dict = self._process_cvat_result(result)
222-226: Assertion may be too strict for edge cases.The
assert self.is_video_formatwill raiseAssertionErrorif a non-video annotation somehow hasNonefilename. Consider using a more informative exception for production code.♻️ Suggested improvement
for ann in anns: if ann.filename is not None: ann.filename = remap_func(ann.filename) else: - assert self.is_video_format, f"Non-video annotation has no filename: {ann}" + if not self.is_video_format: + raise ValueError(f"Non-video annotation unexpectedly has no filename: {ann}") ann.filename = new_filenametests/data_engine/annotation_import/test_mot.py (1)
108-108: Minor: Usenext(iter())for single-element access.Static analysis suggests using
next(iter(result.values()))instead oflist(result.values())[0]for slightly better performance and clarity when accessing the single expected element.♻️ Suggested fix
- anns = list(result.values())[0] + anns = next(iter(result.values()))Also applies to: 122-122
tests/data_engine/annotation_import/test_cvat_video.py (2)
35-35: Minor: Usenext(iter())for single-element access.Same suggestion as in test_mot.py - use
next(iter(result.values()))for cleaner single-element access.♻️ Suggested fix
- anns = list(result.values())[0] + anns = next(iter(result.values()))
151-183: Consider extracting shared test helpers to conftest.py.The
_make_video_bboxand_make_qrhelper functions are duplicated across test_mot.py, test_cvat_video.py, and test_coco.py. Consider moving these to a sharedconftest.pyor test utilities module to reduce duplication.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
.gitignoredagshub/data_engine/annotation/importer.pydagshub/data_engine/annotation/metadata.pydagshub/data_engine/model/datapoint.pydagshub/data_engine/model/query_result.pytests/data_engine/annotation_import/test_coco.pytests/data_engine/annotation_import/test_cvat_video.pytests/data_engine/annotation_import/test_mot.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/data_engine/annotation_import/test_mot.py (3)
dagshub/data_engine/annotation/importer.py (3)
_is_video_annotation_dict(138-143)is_video_format(68-69)_flatten_video_annotations(145-154)dagshub/data_engine/client/models.py (1)
MetadataSelectFieldSchema(82-101)dagshub/data_engine/dtypes.py (1)
MetadataFieldType(20-36)
tests/data_engine/annotation_import/test_coco.py (3)
dagshub/data_engine/annotation/importer.py (2)
AnnotationsNotFoundError(30-32)import_annotations(71-135)dagshub/data_engine/annotation/metadata.py (2)
add_coco_annotation(277-297)value(107-117)dagshub/data_engine/client/models.py (1)
MetadataSelectFieldSchema(82-101)
dagshub/data_engine/annotation/importer.py (1)
dagshub/data_engine/model/datasource_state.py (1)
raw_path(99-104)
🪛 Ruff (0.14.14)
tests/data_engine/annotation_import/test_cvat_video.py
[warning] 35-35: Prefer next(iter(result.values())) over single element slice
Replace with next(iter(result.values()))
(RUF015)
tests/data_engine/annotation_import/test_mot.py
[warning] 108-108: Prefer next(iter(result.values())) over single element slice
Replace with next(iter(result.values()))
(RUF015)
[warning] 122-122: Prefer next(iter(result.values())) over single element slice
Replace with next(iter(result.values()))
(RUF015)
tests/data_engine/annotation_import/test_coco.py
[warning] 45-45: Unused function argument: mock_dagshub_auth
(ARG001)
[warning] 64-64: Unused function argument: mock_dagshub_auth
(ARG001)
dagshub/data_engine/model/query_result.py
[warning] 412-412: Do not catch blind exception: Exception
(BLE001)
[warning] 781-781: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 892-892: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 947-947: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 993-993: Avoid specifying long messages outside the exception class
(TRY003)
dagshub/data_engine/annotation/importer.py
[warning] 133-133: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
- GitHub Check: build (3.9)
- GitHub Check: build (3.12)
- GitHub Check: build (3.13)
🔇 Additional comments (20)
dagshub/data_engine/annotation/metadata.py (2)
23-26: LGTM - VideoRectangle task type registration.Module-level registration of
videorectanglein_task_lookupenables Label Studio video annotation parsing. This follows the established pattern for task type registration.
277-297: LGTM - COCO annotation import with filename normalization.The implementation correctly loads COCO annotations, normalizes filenames to the datapoint's path, and updates the datapoint state. The local import avoids circular dependency issues.
One observation: all annotations from the COCO JSON are assigned to the current datapoint's path regardless of their original image grouping. This appears intentional for the use case of adding annotations to a single datapoint.
dagshub/data_engine/model/query_result.py (4)
772-784: LGTM - Well-implemented helper methods.
_get_all_video_annotationscorrectly filters for video-specific annotation types._resolve_annotation_fieldprovides deterministic field selection via sorted order with proper error handling when no annotation fields exist.
864-910: LGTM - COCO export follows established patterns.The implementation correctly:
- Resolves annotation field with fallback
- Validates non-empty annotations before export
- Prefixes filenames with source prefix
- Creates appropriate directory structure
949-957: Dimension inference from first annotation requires verification of external library constraints.The empty list case is properly guarded at lines 946-947. However, the concern about
Nonedimensions deserves clarification: the code assumesvideo_annotations[0].image_widthandvideo_annotations[0].image_heightare non-Nonewhen falling back from the provided parameters. All test cases constructIRVideoBBoxAnnotationwith explicit non-Nonedimensions (e.g.,image_width=1920, image_height=1080), but sinceIRVideoBBoxAnnotationis from the externaldagshub_annotation_converterlibrary, its type constraints cannot be verified within this repository. If the external library allowsNonefor these fields, they could propagate toMOTContext.
964-1005: No action needed. The function properly handlesNonedimensions by inferring them from the video annotation objects, which include width and height metadata. Tests confirm this behavior works as expected.dagshub/data_engine/annotation/importer.py (3)
67-69: LGTM - Clean video format detection.The
is_video_formatproperty provides a clear way to check if the annotation type is video-oriented.
137-143: LGTM - Robust video annotation detection.The method correctly handles edge cases: non-dict inputs, empty dicts, and determines video vs image annotations by checking the type of the first key.
371-385: LGTM - Video task conversion with proper filtering.The method correctly filters for
IRVideoBBoxAnnotationinstances and skips entries without video annotations, preventing empty tasks from being created.tests/data_engine/annotation_import/test_mot.py (4)
1-24: LGTM - Well-structured test setup.Good use of
autousefixture to mocksource_prefixfor all tests in the module, ensuring consistent test isolation.
29-46: LGTM - Comprehensive edge case coverage for _is_video_annotation_dict.Tests cover all key scenarios: int keys, str keys, empty dict, non-dict input, and mixed keys (checking first key behavior).
153-180: LGTM - Thorough export tests.Good coverage of MOT export including directory structure validation, explicit dimension handling, and error cases for missing annotations.
186-236: LGTM - Clean helper functions.The helper functions are well-designed for test reusability, with sensible defaults and clear parameter names.
tests/data_engine/annotation_import/test_cvat_video.py (2)
43-81: LGTM - Good coverage of video annotation filtering and aggregation.Tests properly verify that
_get_all_video_annotationsfilters out image annotations and correctly aggregates across multiple datapoints.
87-145: LGTM - Comprehensive CVAT video export tests.Good coverage of success cases, error handling (no annotations, image-only annotations), custom parameters, and multi-datapoint scenarios.
tests/data_engine/annotation_import/test_coco.py (5)
29-42: LGTM - Solid COCO import tests.Tests cover successful import and proper error handling for non-existent files.
45-58: LGTM - The unused mock_dagshub_auth parameter is intentional.The static analysis warning (ARG001) is a false positive. The
mock_dagshub_authfixture is included to ensure authentication is mocked during these tests via pytest's fixture mechanism, even though it's not directly referenced in the test body.Also applies to: 64-71
77-103: LGTM - Thorough _resolve_annotation_field tests.Excellent coverage including explicit field override, auto-resolution, error on no fields, and alphabetical ordering verification.
109-182: LGTM - Comprehensive COCO export tests.Good coverage including:
- Bbox coordinate verification (denormalized format)
- Error handling for empty annotations
- Explicit class mapping
- Custom output filename
- Multi-datapoint export with count verification
188-218: Helper functions follow established pattern.Same comment as for other test files - these helpers could be shared via conftest.py to reduce duplication across the test modules.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| for attempt in Retrying( | ||
| retry=retry_if_exception_type(RuntimeError), | ||
| stop=stop_after_attempt(5), | ||
| wait=wait_exponential(multiplier=1, min=4, max=10), | ||
| before_sleep=before_sleep_log(logger, logging.WARNING), | ||
| ): | ||
| with attempt: | ||
| content = get() |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
tenacity.Retrying default behavior reraise RetryError vs original exception
💡 Result:
In tenacity, the Retrying/@retry default is:
reraise=False(default) → when the stop condition is reached and the last attempt still failed, tenacity raisestenacity.RetryError(or whatever you set inretry_error_cls). The original exception is stored on theRetryError(e.g., viaretry_err.last_attempt...) but is not what gets raised. [1][2]reraise=True→ when retries are exhausted, tenacity re-raises the last attempt’s original exception instead of raisingRetryError(so callers seeValueError,IOError, etc., as if there were no retry wrapper). [2]
If you kept the default (reraise=False) but want the underlying exception type/traceback, you can catch RetryError and re-raise the stored “last attempt” exception (illustrated in tenacity’s issue tracker). [3]
Sources: [1] tenacity API reference (defaults incl. reraise: bool = False) [2] tenacity docs “Error Handling” section [3] jd/tenacity issue #24 discussion on RetryError vs last exception.
🏁 Script executed:
# Find and examine the _get_blob method and its context
fd -t f "datapoint.py" | head -5Repository: DagsHub/client
Length of output: 132
🏁 Script executed:
# Search for callers of _get_blob to understand exception handling
rg "_get_blob" -A 3 -B 3 -t pyRepository: DagsHub/client
Length of output: 4556
🏁 Script executed:
# Get the complete _get_blob function implementation
sed -n '297,350p' dagshub/data_engine/model/datapoint.pyRepository: DagsHub/client
Length of output: 1294
🏁 Script executed:
# Search for any exception handling that specifically catches RuntimeError
rg "except RuntimeError" -A 2 -B 2 -t pyRepository: DagsHub/client
Length of output: 325
Consider using reraise=True to preserve exception type for callers.
With the default reraise=False, when RuntimeError retries are exhausted, tenacity.Retrying raises RetryError instead of the original RuntimeError. Most callers in query_result.py catch broad Exception and are unaffected, but the direct call in datapoint.py (line ~210) propagates the exception without handling, changing the exception type. To maintain consistent error semantics and avoid surprising exception type changes, add reraise=True:
for attempt in Retrying(
retry=retry_if_exception_type(RuntimeError),
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=1, min=4, max=10),
before_sleep=before_sleep_log(logger, logging.WARNING),
+ reraise=True,
):
Code Review by Qodo
✅ 1.
|
| # Add the source prefix to all annotations | ||
| for ann in annotations: | ||
| ann.filename = os.path.join(self.datasource.source.source_prefix, ann.filename) | ||
|
|
There was a problem hiding this comment.
5. Export mutates filenames 🐞 Bug ✓ Correctness
export_as_yolo/export_as_coco modify annotation objects in-place by rewriting ann.filename to include source_prefix. This permanently mutates datapoint metadata (double-prefixing on repeated exports) and can generate OS-specific separators due to os.path.join with a PurePosixPath prefix.
Agent Prompt
### Issue description
`export_as_yolo()` / `export_as_coco()` rewrite `ann.filename` in-place, which mutates datapoint metadata and can cause repeated exports to double-prefix paths. Additionally, using `os.path.join()` with a `PurePosixPath` prefix can produce platform-specific separators.
### Issue Context
Annotations come from `_get_all_annotations()`, which returns references to the `MetadataAnnotations.annotations` list stored on datapoints.
### Fix Focus Areas
- dagshub/data_engine/model/query_result.py[762-770]
- dagshub/data_engine/model/query_result.py[846-852]
- dagshub/data_engine/model/query_result.py[898-905]
- dagshub/data_engine/model/datasource_state.py[106-113]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
|
Note to ourselves - need to update the docs: To have the correct list of supported formats |
Based on changes to dagshub-annotation-converter to the same effect: DagsHub/dagshub-annotation-converter#25
Implementation Details
Annotation Format Support Expansion
The importer now supports five annotation formats (previously two): YOLO, CVAT, COCO, MOT, and CVAT Video. The import logic includes dedicated loaders for each format:
load_coco_from_fileload_mot_from_dir/load_mot_from_zipVideo Annotation Pipeline
A new video annotation processing pipeline was introduced with dedicated helpers:
is_video_formatproperty identifies MOT and CVAT Video formats for special handling_is_video_annotation_dictdetects when annotation loaders return video-style results (distinguished by integer keys for frame indices vs. string keys for image filenames)_flatten_video_annotationsconsolidates per-frame annotations into single video-entry dictionaries keyed by video name_convert_to_ls_video_tasksroutes video annotations to Label Studio usingvideo_ir_to_ls_video_tasksfor conversionExport Capabilities
Three new export methods were added to QueryResult:
export_as_coco(): Exports with optional custom class mappings and output filenameexport_as_mot(): Exports with inferred or explicit image dimensions and generates MOT directory structure with seqinfo.iniexport_as_cvat_video(): Exports video XML with configurable video name and dimensionsThese exports include automatic data downloads and proper annotation field resolution via the new
_resolve_annotation_fieldmethod, which defaults to the first available annotation field when none is explicitly specified and selects alphabetically among multiples.Metadata Integration
A new
add_coco_annotation()method was added to MetadataAnnotations, enabling direct COCO JSON ingestion from strings. Label Studio video support was integrated by registeringVideoRectangleAnnotationin the task lookup registry.Robustness Improvements
get_blob_fieldsnow logging warnings and gracefully skipping unavailable fields instead of propagating exceptionsTesting Coverage
Comprehensive test modules validate: