-
Notifications
You must be signed in to change notification settings - Fork 235
[fix] automatically regenerate the cache when it doesn't match #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: TRAIN
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -32,7 +32,8 @@ def __init__(self, data_cfg: DataConfig, dataset_cfg: DatasetConfig, phase: str | |
| transforms = [eval(aug)(prob) for aug, prob in augment_cfg.items()] | ||
| self.transform = AugmentationComposer(transforms, self.image_size) | ||
| self.transform.get_more_data = self.get_more_data | ||
| self.data = self.load_data(Path(dataset_cfg.path), phase_name) | ||
| self.data_root = Path(dataset_cfg.path) | ||
| self.data = self.load_data(self.data_root, phase_name) | ||
|
|
||
| def load_data(self, dataset_path: Path, phase_name: str): | ||
| """ | ||
|
|
@@ -54,6 +55,14 @@ def load_data(self, dataset_path: Path, phase_name: str): | |
| else: | ||
| data = torch.load(cache_path) | ||
| logger.info("📦 Loaded {} cache", phase_name) | ||
| # Validate cache | ||
| if data[0][0].parent == Path("images")/phase_name: | ||
| logger.info("✅ Cache validation successful") | ||
| else: | ||
| logger.warning("⚠️ Cache validation failed, regenerating") | ||
| data = self.filter_data(dataset_path, phase_name) | ||
| torch.save(data, cache_path) | ||
|
|
||
| return data | ||
|
|
||
| def filter_data(self, dataset_path: Path, phase_name: str) -> list: | ||
|
|
@@ -100,7 +109,7 @@ def filter_data(self, dataset_path: Path, phase_name: str) -> list: | |
|
|
||
| labels = self.load_valid_labels(image_id, image_seg_annotations) | ||
|
|
||
| img_path = images_path / image_name | ||
| img_path = Path("images") / phase_name / image_name | ||
| data.append((img_path, labels)) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead of The current check for cache is just checking whether the train/test split are same as
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's a good idea that solved the input info and the question of validation. |
||
| valid_inputs += 1 | ||
| logger.info("Recorded {}/{} valid inputs", valid_inputs, len(images_list)) | ||
|
|
@@ -133,6 +142,7 @@ def load_valid_labels(self, label_path: str, seg_data_one_img: list) -> Union[Te | |
|
|
||
| def get_data(self, idx): | ||
| img_path, bboxes = self.data[idx] | ||
| img_path = self.data_root / Path(img_path) | ||
| img = Image.open(img_path).convert("RGB") | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the current proposed solution, we are splitting the image_path in
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder if it's possible to achieve a similar effect as in yolov8, leaving these paths to be specified by the users. The existing code is overly strict in its requirements for the data format.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When open image, it's the same that get the iamge path from image_id.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure, if I fully got your point. If you are suggesting that using image_id (image file name minus the extension) would make things easier, I agree.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe setting the key to the actual image name with file extension is better? Extions will help resolve .jpg/.png ambiguity. |
||
| return img, bboxes, img_path | ||
|
|
||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not verifying the content of the cache. This is just checking whether the
phase_nameis the same in thedataread from cache.Assuming we only save the
image_id(just the file name without extension), we should be verifying two things:data.setof image names (excluding extension) form the json file is equal to thesetofimage_ids recovered from cache.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the second question, I'd like to make an addition. If a complete check is made to see if the cache is consistent with the json file, when dealing with a relatively large data set, the efficiency is similar to regenerating the cache. At present, I only make a comparison for one of them, and there are still quite a few problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. The second part will be costly.