Optimize Detection Performance and enhance Blockchain Integrity #388
RohanExploit wants to merge 4 commits into main
Conversation
- Implement `detect_all_clip_optimized` in `hf_api_service.py` to batch 5 detection categories into a single HF API call.
- Update `UnifiedDetectionService.detect_all` to use the optimized method when the HF backend is active.
- Configure SQLAlchemy connection pooling (`pool_size=20`, `max_overflow=10`) for PostgreSQL in `database.py`.
- Add `previous_integrity_hash` to the `Issue` model and update `create_issue` and `verify_blockchain_integrity` for robust ledger verification.
- Add migration logic in `init_db.py`.
- Fix `backend/utils.py` to handle a missing `python-magic` dependency gracefully.

Co-authored-by: RohanExploit <178623867+RohanExploit@users.noreply.github.com>
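As a rough illustration of the ledger change, here is a minimal sketch of the hash chaining described above, assuming SHA-256 over the description, category, and previous hash; the actual helper in `backend/routers/issues.py` may combine fields differently:

```python
import hashlib

def compute_integrity_hash(description: str, category: str, prev_hash: str) -> str:
    """Chain each issue to its predecessor, blockchain-style."""
    payload = f"{description}|{category}|{prev_hash}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# The first issue uses "" as prev_hash; every later issue uses the hash stored before it.
genesis = compute_integrity_hash("Pothole on MG Road", "infrastructure", "")
second = compute_integrity_hash("Garbage pile near market", "garbage", genesis)
```

Storing `previous_integrity_hash` on each row means verification no longer has to reconstruct the predecessor by id order, which is what makes it robust to deleted intermediate records.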
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
✅ Deploy Preview for fixmybharat canceled.
🙏 Thank you for your contribution, @RohanExploit!

PR Details:
Quality Checklist:
Review Process:
Note: The maintainers will monitor code quality and ensure the overall project flow isn't broken.
📝 Walkthrough

Adds `previous_integrity_hash` to Issues and migrates the DB, introduces an optimized single-call Hugging Face detector with fallback, makes `python-magic` and `scikit-learn` optional, and enables optional PostgreSQL connection pooling via `engine_kwargs`.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Router as issues.py
    participant DB as Database
    participant Hash as Hash Logic
    Client->>Router: POST /issues (description, category)
    Router->>DB: SELECT latest issue.integrity_hash (for prev_hash)
    DB-->>Router: prev_hash (or None)
    Router->>Hash: compute_hash(description, category, prev_hash)
    Hash-->>Router: integrity_hash
    Router->>DB: INSERT Issue(integrity_hash, previous_integrity_hash=prev_hash, ...)
    DB-->>Router: created_issue
    Router-->>Client: 201 Created (issue)
    Note over Router,DB: Verification flow
    Client->>Router: GET /issues/{id}/verify
    Router->>DB: SELECT Issue.previous_integrity_hash, integrity_hash, description, category
    DB-->>Router: issue_row
    alt previous_integrity_hash exists
        Router->>Hash: compute_hash(description, category, previous_integrity_hash)
    else previous_integrity_hash missing
        Router->>DB: SELECT latest prior integrity_hash as fallback
        DB-->>Router: fallback_prev_hash
        Router->>Hash: compute_hash(description, category, fallback_prev_hash)
    end
    Hash-->>Router: computed_hash
    Router->>Router: compare computed_hash == stored integrity_hash
    Router-->>Client: verification result (ok|mismatch)
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Poem
🚥 Pre-merge checks: ✅ Passed checks (4 passed)
3 issues found across 7 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/utils.py">
<violation number="1" location="backend/utils.py:14">
P2: Missing startup warning: when `python-magic` fails to import, this should log a warning at module load time so operators see it in deployment logs and can install the missing dependency.</violation>
<violation number="2" location="backend/utils.py:83">
P1: Security check silently bypassed: when `python-magic` is unavailable, the MIME type content validation is skipped entirely without logging a warning. This degrades the upload security posture silently. At minimum, add an `else` branch that logs a warning so operators are aware the check is inactive.</violation>
</file>
<file name="backend/hf_api_service.py">
<violation number="1" location="backend/hf_api_service.py:473">
P2: `list(set(...))` produces non-deterministic label ordering. None of these labels are duplicated across groups, so `set()` is unnecessary. It randomizes label order on each call, which hurts reproducibility and makes debugging harder. Use simple list concatenation (as done in `detect_civic_eye_clip`) instead.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```python
if magic:
    detected_mime = magic.from_buffer(file_content, mime=True)

    if detected_mime not in ALLOWED_MIME_TYPES:
        raise HTTPException(
            status_code=400,
            detail=f"Invalid file type. Only image files are allowed. Detected: {detected_mime}"
        )
```
P1: Security check silently bypassed: when python-magic is unavailable, the MIME type content validation is skipped entirely without logging a warning. This degrades the upload security posture silently. At minimum, add an else branch that logs a warning so operators are aware the check is inactive.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/utils.py, line 83:
<comment>Security check silently bypassed: when `python-magic` is unavailable, the MIME type content validation is skipped entirely without logging a warning. This degrades the upload security posture silently. At minimum, add an `else` branch that logs a warning so operators are aware the check is inactive.</comment>
<file context>
@@ -77,13 +80,14 @@ def _validate_uploaded_file_sync(file: UploadFile) -> Optional[Image.Image]:
file.file.seek(0) # Reset file pointer
- detected_mime = magic.from_buffer(file_content, mime=True)
+ if magic:
+ detected_mime = magic.from_buffer(file_content, mime=True)
</file context>
Suggested change:

```python
if magic:
    detected_mime = magic.from_buffer(file_content, mime=True)
    if detected_mime not in ALLOWED_MIME_TYPES:
        raise HTTPException(
            status_code=400,
            detail=f"Invalid file type. Only image files are allowed. Detected: {detected_mime}"
        )
else:
    logger.warning("python-magic is not installed; skipping MIME type validation for %s", file.filename)
```
```python
try:
    import magic
except ImportError:
    magic = None
```
P2: Missing startup warning: when python-magic fails to import, this should log a warning at module load time so operators see it in deployment logs and can install the missing dependency.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/utils.py, line 14:
<comment>Missing startup warning: when `python-magic` fails to import, this should log a warning at module load time so operators see it in deployment logs and can install the missing dependency.</comment>
<file context>
@@ -8,7 +8,10 @@
+try:
+ import magic
+except ImportError:
+ magic = None
from typing import Optional
</file context>
Suggested change:

```python
magic = None
logging.getLogger(__name__).warning("python-magic is not installed; MIME type validation for uploads will be skipped")
```
```python
neutral = ["safe", "clean street", "smooth road", "dry street", "normal scene", "intact window", "clean wall", "intact infrastructure"]

all_labels = list(set(vandalism_pos + infrastructure_pos + flooding_pos + garbage_pos + fire_pos + neutral))
```
P2: list(set(...)) produces non-deterministic label ordering. None of these labels are duplicated across groups, so set() is unnecessary. It randomizes label order on each call, which hurts reproducibility and makes debugging harder. Use simple list concatenation (as done in detect_civic_eye_clip) instead.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/hf_api_service.py, line 473:
<comment>`list(set(...))` produces non-deterministic label ordering. None of these labels are duplicated across groups, so `set()` is unnecessary. It randomizes label order on each call, which hurts reproducibility and makes debugging harder. Use simple list concatenation (as done in `detect_civic_eye_clip`) instead.</comment>
<file context>
@@ -456,3 +456,50 @@ async def detect_abandoned_vehicle_clip(image: Union[Image.Image, bytes], client
+
+ neutral = ["safe", "clean street", "smooth road", "dry street", "normal scene", "intact window", "clean wall", "intact infrastructure"]
+
+ all_labels = list(set(vandalism_pos + infrastructure_pos + flooding_pos + garbage_pos + fire_pos + neutral))
+
+ img_bytes = _prepare_image_bytes(image)
</file context>
Suggested change:

```python
all_labels = vandalism_pos + infrastructure_pos + flooding_pos + garbage_pos + fire_pos + neutral
```
Pull request overview
This PR aims to optimize detection performance by consolidating multiple Hugging Face API calls and enhance blockchain integrity by explicitly storing previous hashes. It also adds a fallback for the python-magic library.
Changes:
- Added `detect_all_clip_optimized` function to consolidate 5 separate HF API calls into a single CLIP call for better performance
- Enhanced blockchain verification by storing `previous_integrity_hash` in Issue records, making verification robust against deleted intermediate records
- Added try/except fallback for the python-magic import to prevent crashes in minimal environments
- Enabled PostgreSQL connection pooling with pool_size=20 and max_overflow=10 for better concurrency handling
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| backend/utils.py | Added try/except wrapper around python-magic import with conditional MIME validation |
| backend/unified_detection_service.py | Added optimized detection path that calls detect_all_clip_optimized when backend is huggingface |
| backend/hf_api_service.py | Added detect_all_clip_optimized function that combines labels from 5 detection categories into a single API call |
| backend/routers/issues.py | Updated create_issue to store previous_integrity_hash and enhanced verify_blockchain_integrity to use stored hash with fallback |
| backend/models.py | Added previous_integrity_hash column to Issue model |
| backend/init_db.py | Added migration to create previous_integrity_hash column in database |
| backend/database.py | Added connection pooling configuration for PostgreSQL with pool_size=20 and max_overflow=10 |
```python
if magic:
    detected_mime = magic.from_buffer(file_content, mime=True)

    if detected_mime not in ALLOWED_MIME_TYPES:
        raise HTTPException(
            status_code=400,
            detail=f"Invalid file type. Only image files are allowed. Detected: {detected_mime}"
        )
```
When python-magic is unavailable, MIME type validation is silently skipped. This creates a security vulnerability where malicious users could upload non-image files that pass through this validation layer. The fallback should either raise an error indicating that validation cannot be performed, or use an alternative validation method. At minimum, a warning should be logged when MIME validation is skipped so security monitoring can detect this reduced security posture.
```python
if backend == "huggingface":
    try:
        from backend.hf_api_service import detect_all_clip_optimized
```
The import statement references 'hf_service' but should reference 'backend.hf_api_service'. The hf_service.py module is deprecated (see line 2 of backend/hf_service.py: "DEPRECATED: This module is no longer used"). The correct import should be 'from backend.hf_api_service import detect_all_clip_optimized'. This will cause an ImportError when the optimized path is executed.
```python
message = "Integrity check failed! The chain might be broken due to deleted records, or data tampering occurred."
else:
    message = "Integrity check failed! The report data does not match its cryptographic seal."
```
Inconsistent indentation: These lines use extra spaces for indentation instead of following the consistent 4-space indentation pattern used throughout the file. This should use 12 spaces to align with the 'if' block indentation level.
Suggested change:

```python
            message = "Integrity check failed! The chain might be broken due to deleted records, or data tampering occurred."
        else:
            message = "Integrity check failed! The report data does not match its cryptographic seal."
```
```python
integrity_hash=integrity_hash,
previous_integrity_hash=prev_hash
```
The new previous_integrity_hash column and robust verification logic are not tested. Consider adding test cases that: 1) verify that new issues store previous_integrity_hash correctly, 2) test verification when intermediate records are deleted (to demonstrate the robustness improvement), and 3) verify the different error messages for records with vs without previous_integrity_hash. This would ensure the blockchain enhancement works as intended.
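A hedged sketch of such a test, assuming a FastAPI `TestClient` fixture (`client`), a SQLAlchemy session fixture (`db_session`), and an `ok` field in the verify response; all fixture and field names here are illustrative, not taken from the repo:

```python
from backend.models import Issue

def test_verify_survives_deleted_intermediate(client, db_session):
    # Create three chained issues via the API.
    ids = [
        client.post("/issues", json={"description": f"issue {i}", "category": "garbage"}).json()["id"]
        for i in range(3)
    ]
    first, second, third = (db_session.get(Issue, i) for i in ids)

    # Each new record should carry its predecessor's hash.
    assert second.previous_integrity_hash == first.integrity_hash
    assert third.previous_integrity_hash == second.integrity_hash

    # Delete the middle record; verification of the third should still pass because
    # it uses the stored previous_integrity_hash rather than reconstructing by id order.
    db_session.delete(second)
    db_session.commit()
    assert client.get(f"/issues/{ids[2]}/verify").json()["ok"]
```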
```python
async def detect_all_clip_optimized(image: Union[Image.Image, bytes], client: httpx.AsyncClient = None):
    """
    Optimized detection for all categories in a single API call.
    """
    # Define label groups
    vandalism_pos = ["graffiti", "broken window", "vandalized wall"]
    infrastructure_pos = ["pothole", "cracked road", "broken pavement"]
    flooding_pos = ["flooded street", "waterlogging"]
    garbage_pos = ["garbage pile", "trash overflow", "scattered waste"]
    fire_pos = ["fire", "smoke", "flames"]

    neutral = ["safe", "clean street", "smooth road", "dry street", "normal scene", "intact window", "clean wall", "intact infrastructure"]

    all_labels = list(set(vandalism_pos + infrastructure_pos + flooding_pos + garbage_pos + fire_pos + neutral))

    img_bytes = _prepare_image_bytes(image)
    results = await query_hf_api(img_bytes, all_labels, client=client)

    if not isinstance(results, list):
        return {
            "vandalism": [],
            "infrastructure": [],
            "flooding": [],
            "garbage": [],
            "fire": []
        }

    # Helper to extract detections for a category
    def extract_detections(positive_labels):
        detected = []
        for res in results:
            if isinstance(res, dict) and res.get('label') in positive_labels and res.get('score', 0) > 0.4:
                detected.append({
                    "label": res['label'],
                    "confidence": res['score'],
                    "box": []
                })
        return detected

    return {
        "vandalism": extract_detections(vandalism_pos),
        "infrastructure": extract_detections(infrastructure_pos),
        "flooding": extract_detections(flooding_pos),
        "garbage": extract_detections(garbage_pos),
        "fire": extract_detections(fire_pos)
    }
```
The new detect_all_clip_optimized function lacks test coverage. The existing test in test_local_ml_service.py::test_detect_all_returns_dict tests the wrapper function but doesn't verify the optimization (single API call vs 5 separate calls) or validate the label definitions and response structure. Consider adding tests that: 1) mock query_hf_api to verify it's called once with all labels, 2) validate the response structure matches the expected format with all 5 categories, and 3) test the threshold filtering (score > 0.4).
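A hedged sketch of the first suggested test, assuming `pytest-asyncio` is available and that `query_hf_api` can be patched where `hf_api_service` looks it up; the byte payload and result shape are illustrative, and in practice you may also need a real encoded image or to patch `_prepare_image_bytes`:

```python
import pytest
from unittest.mock import AsyncMock, patch

from backend.hf_api_service import detect_all_clip_optimized

@pytest.mark.asyncio
async def test_detect_all_optimized_makes_single_batched_call():
    fake_results = [
        {"label": "pothole", "score": 0.9},   # above the 0.4 threshold
        {"label": "graffiti", "score": 0.2},  # below threshold, should be filtered out
    ]
    with patch("backend.hf_api_service.query_hf_api", new=AsyncMock(return_value=fake_results)) as mock_api:
        result = await detect_all_clip_optimized(b"fake-image-bytes")

    mock_api.assert_awaited_once()  # one batched call instead of five
    assert set(result) == {"vandalism", "infrastructure", "flooding", "garbage", "fire"}
    assert result["infrastructure"][0]["label"] == "pothole"
    assert result["vandalism"] == []  # low-score label filtered by the threshold
```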
```python
engine_kwargs["pool_size"] = 20
engine_kwargs["max_overflow"] = 10
```
The connection pooling configuration may not be optimal for all deployment scenarios. A pool_size of 20 and max_overflow of 10 (30 connections total) could be too high for small PostgreSQL instances or too low for high-traffic deployments. Consider making these configurable via environment variables (e.g., DB_POOL_SIZE and DB_MAX_OVERFLOW) to allow tuning per environment without code changes.
Suggested change:

```python
# Allow tuning via environment variables, with sensible defaults
default_pool_size = 20
default_max_overflow = 10
try:
    pool_size = int(os.environ.get("DB_POOL_SIZE", default_pool_size))
except (TypeError, ValueError):
    pool_size = default_pool_size
try:
    max_overflow = int(os.environ.get("DB_MAX_OVERFLOW", default_max_overflow))
except (TypeError, ValueError):
    max_overflow = default_max_overflow
engine_kwargs["pool_size"] = pool_size
engine_kwargs["max_overflow"] = max_overflow
```
```python
# Define label groups
vandalism_pos = ["graffiti", "broken window", "vandalized wall"]
infrastructure_pos = ["pothole", "cracked road", "broken pavement"]
flooding_pos = ["flooded street", "waterlogging"]
garbage_pos = ["garbage pile", "trash overflow", "scattered waste"]
fire_pos = ["fire", "smoke", "flames"]

neutral = ["safe", "clean street", "smooth road", "dry street", "normal scene", "intact window", "clean wall", "intact infrastructure"]
```
The label definitions in detect_all_clip_optimized are simplified compared to the individual detection functions in the deprecated hf_service.py. For example, vandalism uses ["graffiti", "broken window", "vandalized wall"] here vs ["graffiti", "vandalism", "spray paint", "street art", "clean wall", "public property", "normal street"] in the deprecated function, and infrastructure uses ["pothole", "cracked road", "broken pavement"] here vs ["broken streetlight", "damaged traffic sign", "fallen tree", "damaged fence", "pothole", "clean street", "normal infrastructure"] there. This could result in different detection results when using the optimized path versus the fallback. Consider aligning the label sets or documenting why they differ.
Suggested change:

```python
# Define label groups (kept aligned with deprecated per-category detectors)
vandalism_pos = [
    "graffiti",
    "broken window",
    "vandalized wall",
    "vandalism",
    "spray paint",
    "street art",
]
infrastructure_pos = [
    "pothole",
    "cracked road",
    "broken pavement",
    "broken streetlight",
    "damaged traffic sign",
    "fallen tree",
    "damaged fence",
]
flooding_pos = ["flooded street", "waterlogging"]
garbage_pos = ["garbage pile", "trash overflow", "scattered waste"]
fire_pos = ["fire", "smoke", "flames"]
neutral = [
    "safe",
    "clean street",
    "smooth road",
    "dry street",
    "normal scene",
    "intact window",
    "clean wall",
    "intact infrastructure",
    "public property",
    "normal street",
    "normal infrastructure",
]
```
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/unified_detection_service.py (1)

290-306: ⚠️ Potential issue | 🔴 Critical: Fix import paths and missing functions for HF backend fallback.

The fallback to individual calls will crash, not fail silently. Lines 130, 158, and 186 attempt to import from `hf_service` (e.g., `from hf_service import detect_vandalism_clip`), but:

- `hf_service` must be imported as `from backend.hf_service import ...` when called from within the backend package
- `detect_vandalism_clip` does not exist in `backend/hf_service.py`; only `detect_infrastructure_clip` and `detect_flooding_clip` are defined

When the optimized call fails and falls back to `asyncio.gather()` at line 301, calling `self.detect_vandalism()` will raise `ModuleNotFoundError` or `ImportError`. Since `asyncio.gather()` lacks `return_exceptions=True` and the individual methods have no exception handling around imports, the entire fallback will crash unhandled.

Either:

- Correct the import paths to `from backend.hf_service import ...` and implement `detect_vandalism_clip` in `hf_service.py`, OR
- Use `backend.hf_api_service` for all detectors (like garbage and fire already do at lines 215 and 251)
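One way to harden the fallback while the imports are being fixed: a minimal sketch using `return_exceptions=True`, assuming the per-category methods exist with these names (illustrative, not the repo's exact signatures):

```python
import asyncio

async def detect_all_fallback(self, image):
    """Run per-category detectors concurrently; one failure must not kill the batch."""
    results = await asyncio.gather(
        self.detect_vandalism(image),
        self.detect_infrastructure(image),
        self.detect_flooding(image),
        self.detect_garbage(image),
        self.detect_fire(image),
        return_exceptions=True,  # a failing detector yields its exception instead of raising
    )
    keys = ["vandalism", "infrastructure", "flooding", "garbage", "fire"]
    return {k: ([] if isinstance(r, Exception) else r) for k, r in zip(keys, results)}
```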
🤖 Fix all issues with AI agents
In `@backend/hf_api_service.py`:
- Around line 464-469: The optimized detector label lists (vandalism_pos,
infrastructure_pos, flooding_pos, garbage_pos, fire_pos) diverge from the
individual detector functions (e.g., detect_fire_clip includes "burning"),
causing inconsistent results between detect_all_clip_optimized and the
asyncio.gather fallback; fix by centralizing and reusing a single canonical set
of target labels (or constants) used by both detect_all_clip_optimized and each
individual detector (e.g., update detect_all_clip_optimized to use the same
label set as detect_fire_clip or move the label arrays into shared constants and
import/use them across detect_fire_clip and other detect_* functions), ensure
fire_pos includes "burning", and update any related tests or module-level docs
to reflect the unified label definitions.
In `@backend/routers/issues.py`:
- Around line 624-632: The fallback logic using
current_issue.previous_integrity_hash and the run_in_threadpool query to
reconstruct prev_hash is fine but susceptible to a race where concurrent
create_issue calls can read the same prev_hash and fork the chain; to fix, make
chain construction atomic in create_issue by wrapping the
read-of-latest-hash-and-insert in a DB transaction with a row-level lock (e.g.,
SELECT ... FOR UPDATE on the latest Issue row) or use a DB advisory lock around
the prev-hash read+insert, and add a short comment near
current_issue.previous_integrity_hash / the run_in_threadpool fallback noting
this race and pointing to the transactional/advisory-lock approach (reference:
create_issue, current_issue.previous_integrity_hash,
run_in_threadpool/db.query(Issue.integrity_hash)).
🧹 Nitpick comments (4)
backend/database.py (1)
18-29: Consider making pool parameters configurable via environment variables.

The hardcoded `pool_size=20` and `max_overflow=10` are reasonable defaults, but in production you may want to tune these without redeploying. This is optional for now.

♻️ Suggested change:

```diff
 # Optimize connection pooling for PostgreSQL (Production)
 if "postgresql" in SQLALCHEMY_DATABASE_URL:
-    engine_kwargs["pool_size"] = 20
-    engine_kwargs["max_overflow"] = 10
+    engine_kwargs["pool_size"] = int(os.environ.get("DB_POOL_SIZE", "20"))
+    engine_kwargs["max_overflow"] = int(os.environ.get("DB_MAX_OVERFLOW", "10"))
```

backend/utils.py (1)
11-14: Consider logging a warning when `python-magic` is unavailable.

When `magic` is `None`, MIME-type validation is silently skipped, which weakens the upload security posture. A startup warning would help operators notice the degraded validation.

💡 Suggested change:

```diff
 try:
     import magic
 except ImportError:
     magic = None
+    logging.getLogger(__name__).warning(
+        "python-magic not installed; MIME-type validation for uploads will be skipped."
+    )
```

backend/unified_detection_service.py (1)
288-296: Optimized path with fallback is well-structured, but consider passing the `client` for connection reuse.

`detect_all_clip_optimized` accepts an optional `client` parameter, but it's not passed here. This means a new `httpx.AsyncClient` is created and torn down per call. While consistent with other callers in this file, passing a shared client would reduce overhead, especially since this optimization is about reducing latency.

backend/hf_api_service.py (1)
473-473: `list(set(...))` produces non-deterministic label ordering.

`set` does not preserve insertion order. While CLIP's zero-shot classification is unlikely to be sensitive to label order, a deterministic deduplication preserves reproducibility for debugging and testing.

♻️ Suggested change:

```diff
-    all_labels = list(set(vandalism_pos + infrastructure_pos + flooding_pos + garbage_pos + fire_pos + neutral))
+    all_labels = list(dict.fromkeys(vandalism_pos + infrastructure_pos + flooding_pos + garbage_pos + fire_pos + neutral))
```
```python
# Define label groups
vandalism_pos = ["graffiti", "broken window", "vandalized wall"]
infrastructure_pos = ["pothole", "cracked road", "broken pavement"]
flooding_pos = ["flooded street", "waterlogging"]
garbage_pos = ["garbage pile", "trash overflow", "scattered waste"]
fire_pos = ["fire", "smoke", "flames"]
```
Label sets diverge from individual detector functions — results may differ between optimized and fallback paths.
The optimized function's fire_pos is ["fire", "smoke", "flames"], but detect_fire_clip (line 105) also includes "burning" as a target label. Similarly, the vandalism/infrastructure labels here don't match the individual detectors (which may use different label sets). This means the unified detection service's optimized path and its asyncio.gather fallback can produce different results for the same image.
Consider aligning the label sets, or at minimum documenting this as a known trade-off.
```bash
#!/bin/bash
# Compare label sets between optimized and individual detectors
echo "=== detect_fire_clip labels ==="
rg -A2 'def detect_fire_clip' backend/hf_api_service.py
echo ""
echo "=== detect_all_clip_optimized fire_pos ==="
rg 'fire_pos' backend/hf_api_service.py
echo ""
echo "=== All individual detector target labels ==="
rg -A2 'targets = \[' backend/hf_api_service.py
```

🤖 Prompt for AI Agents
In `@backend/hf_api_service.py` around lines 464 - 469, The optimized detector
label lists (vandalism_pos, infrastructure_pos, flooding_pos, garbage_pos,
fire_pos) diverge from the individual detector functions (e.g., detect_fire_clip
includes "burning"), causing inconsistent results between
detect_all_clip_optimized and the asyncio.gather fallback; fix by centralizing
and reusing a single canonical set of target labels (or constants) used by both
detect_all_clip_optimized and each individual detector (e.g., update
detect_all_clip_optimized to use the same label set as detect_fire_clip or move
the label arrays into shared constants and import/use them across
detect_fire_clip and other detect_* functions), ensure fire_pos includes
"burning", and update any related tests or module-level docs to reflect the
unified label definitions.
```python
# Use the stored previous hash if available (robust against deletion), otherwise try to reconstruct
if current_issue.previous_integrity_hash is not None:
    prev_hash = current_issue.previous_integrity_hash
else:
    # Fallback for old records without previous_integrity_hash
    prev_issue_hash = await run_in_threadpool(
        lambda: db.query(Issue.integrity_hash).filter(Issue.id < issue_id).order_by(Issue.id.desc()).first()
    )
    prev_hash = prev_issue_hash[0] if prev_issue_hash and prev_issue_hash[0] else ""
```
Fallback logic for old records is sound, but note the race condition in chain construction.
The fallback query (Issue.id < issue_id, ordered desc) correctly reconstructs the chain for pre-migration records. The is not None check properly distinguishes between new records (which store "" for the first issue) and old records (which have NULL).
However, be aware that the chain construction during create_issue (lines 172-179) is not atomic — concurrent issue creations can both read the same prev_hash, forking the chain. This is a pre-existing design limitation, not introduced here, but worth noting as the chain verification feature matures.
🤖 Prompt for AI Agents
In `@backend/routers/issues.py` around lines 624 - 632, The fallback logic using
current_issue.previous_integrity_hash and the run_in_threadpool query to
reconstruct prev_hash is fine but susceptible to a race where concurrent
create_issue calls can read the same prev_hash and fork the chain; to fix, make
chain construction atomic in create_issue by wrapping the
read-of-latest-hash-and-insert in a DB transaction with a row-level lock (e.g.,
SELECT ... FOR UPDATE on the latest Issue row) or use a DB advisory lock around
the prev-hash read+insert, and add a short comment near
current_issue.previous_integrity_hash / the run_in_threadpool fallback noting
this race and pointing to the transactional/advisory-lock approach (reference:
create_issue, current_issue.previous_integrity_hash,
run_in_threadpool/db.query(Issue.integrity_hash)).
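A hedged sketch of the advisory-lock approach the prompt describes, assuming SQLAlchemy on PostgreSQL and reusing the illustrative `compute_integrity_hash` helper sketched near the top of this PR; the lock key and function name are arbitrary:

```python
from sqlalchemy import text

from backend.models import Issue

ISSUE_CHAIN_LOCK_KEY = 42  # arbitrary application-chosen advisory lock id

def create_issue_atomic(db, description: str, category: str) -> Issue:
    """Serialize chain construction so concurrent inserts cannot fork the hash chain."""
    # pg_advisory_xact_lock blocks other writers until this transaction commits or rolls back.
    db.execute(text("SELECT pg_advisory_xact_lock(:key)"), {"key": ISSUE_CHAIN_LOCK_KEY})
    prev = db.query(Issue.integrity_hash).order_by(Issue.id.desc()).first()
    prev_hash = prev[0] if prev and prev[0] else ""
    issue = Issue(
        description=description,
        category=category,
        integrity_hash=compute_integrity_hash(description, category, prev_hash),  # helper sketched above
        previous_integrity_hash=prev_hash,
    )
    db.add(issue)
    db.commit()
    return issue
```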
…ements

- Removed `python-magic` and `scikit-learn` from `backend/requirements-render.txt` to prevent build failures on Render.
- Updated `backend/spatial_utils.py` to gracefully handle missing `scikit-learn` (DBSCAN) by falling back to treating points as individual clusters.
- Verified `backend/utils.py` correctly handles missing `python-magic`.

Co-authored-by: RohanExploit <178623867+RohanExploit@users.noreply.github.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/spatial_utils.py`:
- Around line 134-141: The code computes eps_degrees from eps_meters but then
passes that degree value to sklearn's DBSCAN with metric='haversine' (in
DBSCAN(..., eps=eps_degrees)), however haversine expects eps in radians; replace
eps_degrees with an eps in radians (e.g., eps_radians = eps_meters /
EARTH_RADIUS_METERS where EARTH_RADIUS_METERS ≈ 6_371_000) and pass eps_radians
to DBSCAN, removing or deprecating the eps_degrees calculation; update
references around the DBSCAN call and np.radians(coordinates) usage so eps and
coordinates are consistently in radians.
🧹 Nitpick comments (3)
backend/spatial_utils.py (3)
152-155: Use `logger.exception` and narrow the except clause slightly.

`logger.error` won't include the traceback; `logger.exception` is the idiomatic alternative for logging inside an `except` block. Also, catching bare `Exception` is flagged by Ruff (BLE001); consider narrowing to `ValueError, RuntimeError` or at least adding a comment justifying the broad catch.

♻️ Proposed fix:

```diff
-    except Exception as e:
-        logger.error(f"DBSCAN clustering failed: {e}")
+    except Exception:
+        logger.exception("DBSCAN clustering failed")
```
150-151: Noise-label filter is a no-op with `min_samples=1`.

With `min_samples=1` every point is a core point, so DBSCAN never assigns label `-1`. The `if label != -1` guard is dead code. Not harmful, but a comment clarifying the intent (or removing the filter) would reduce future confusion.
156-159: Fallback behaviour is reasonable but worth documenting in the docstring.

When sklearn is absent, callers silently get single-issue clusters, which may affect downstream deduplication or aggregation logic. Consider adding a note in the docstring (or a `logger.warning` on first call) so operators know clustering is degraded.
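For reference, a minimal sketch of the degraded mode described in the commit message (each point treated as its own cluster); the function name is illustrative:

```python
def cluster_issues_fallback(valid_issues):
    """Degraded mode when scikit-learn is unavailable: no spatial grouping."""
    # Each issue becomes a singleton cluster, preserving the List[List[Issue]] shape.
    return [[issue] for issue in valid_issues]
```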
```python
eps_degrees = eps_meters / 111000  # Rough approximation

if HAS_SKLEARN:
    try:
        # Perform DBSCAN clustering
        db = DBSCAN(eps=eps_degrees, min_samples=1, metric='haversine').fit(
            np.radians(coordinates)
        )

        # Group issues by cluster
        clusters = {}
        for i, label in enumerate(db.labels_):
            if label not in clusters:
                clusters[label] = []
            clusters[label].append(valid_issues[i])

        # Return clusters as list of lists (exclude noise points labeled as -1)
        return [cluster for label, cluster in clusters.items() if label != -1]
```
🧩 Analysis chain
🌐 Web query:
sklearn DBSCAN haversine metric eps radians units documentation
💡 Result:
In sklearn.cluster.DBSCAN, eps is interpreted in the same units as the distance returned by metric (“choose appropriately for your data set and distance function”). [1]
For metric="haversine" in scikit-learn:
- Your input must be 2D as [latitude, longitude] in radians. [2]
- The resulting haversine distance is an angular distance in radians (i.e., `eps` must be in radians, not meters/kilometers). [2][3]

So for an Earth-distance threshold, convert like:

- `eps_rad = eps_meters / earth_radius_meters` (or km/km)
- and pass coordinates as `np.radians([[lat_deg, lon_deg], ...])`. [2]
Sources: DBSCAN parameter definition [1], haversine distance definition + radians requirement [2], DistanceMetric note that haversine inputs/outputs are radians [3].
References:
[1] (sklearn.org)
[2] (scikit-learn.org)
[3] (scikit-learn.org)
Citations:
- 1: https://sklearn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html?utm_source=openai
- 2: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html?utm_source=openai
- 3: https://scikit-learn.org/0.19/modules/generated/sklearn.neighbors.DistanceMetric.html?utm_source=openai
🏁 Script executed:

```bash
find . -name "spatial_utils.py" -type f
```

Repository: RohanExploit/VishwaGuru
Length of output: 93
🏁 Script executed:

```bash
cat -n ./backend/spatial_utils.py | sed -n '120,160p'
```

Repository: RohanExploit/VishwaGuru
Length of output: 1819
🏁 Script executed:

```bash
python3 << 'EOF'
# Verify the math from the review
eps_meters = 30
eps_degrees = eps_meters / 111000
earth_radius_m = 6371000
# If interpreted as radians
eps_as_radians_in_meters = eps_degrees * earth_radius_m
# Ratio
ratio = eps_as_radians_in_meters / eps_meters
print(f"eps_degrees: {eps_degrees:.10f}")
print(f"If interpreted as radians: {eps_as_radians_in_meters:.2f} meters")
print(f"Ratio (actual vs intended): {ratio:.1f}×")
print(f"\nCorrect conversion:")
print(f"eps_radians = {eps_meters} / {earth_radius_m} = {eps_meters / earth_radius_m:.10f}")
EOF
```

Repository: RohanExploit/VishwaGuru
Length of output: 231
Fix eps unit mismatch — clusters will be ~57× larger than intended.
The eps_degrees value is computed in degrees (~0.000270 for 30 m), but sklearn's haversine metric interprets eps in radians. DBSCAN therefore treats 0.000270 as radians ≈ 1,722 m — roughly 57× the intended 30 m radius.
Convert eps to radians using Earth's radius:
```diff
-    eps_degrees = eps_meters / 111000  # Rough approximation
+    eps_radians = eps_meters / 6371000.0
     if HAS_SKLEARN:
         try:
             # Perform DBSCAN clustering
-            db = DBSCAN(eps=eps_degrees, min_samples=1, metric='haversine').fit(
+            db = DBSCAN(eps=eps_radians, min_samples=1, metric='haversine').fit(
```

🤖 Prompt for AI Agents
In `@backend/spatial_utils.py` around lines 134 - 141, The code computes
eps_degrees from eps_meters but then passes that degree value to sklearn's
DBSCAN with metric='haversine' (in DBSCAN(..., eps=eps_degrees)), however
haversine expects eps in radians; replace eps_degrees with an eps in radians
(e.g., eps_radians = eps_meters / EARTH_RADIUS_METERS where EARTH_RADIUS_METERS
≈ 6_371_000) and pass eps_radians to DBSCAN, removing or deprecating the
eps_degrees calculation; update references around the DBSCAN call and
np.radians(coordinates) usage so eps and coordinates are consistently in
radians.
2 issues found across 2 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/spatial_utils.py">
<violation number="1" location="backend/spatial_utils.py:17">
P2: Silent degradation: when `sklearn` import fails, no warning is logged. Spatial clustering (deduplication) will be completely disabled with no diagnostic output, making it hard to detect in production.</violation>
<violation number="2" location="backend/spatial_utils.py:139">
P1: Bug: `eps` is in degrees but the `haversine` metric expects radians. This makes clusters ~57x larger than intended (e.g., ~1.7 km instead of 30 m). Convert using Earth's radius: `eps = eps_meters / 6_371_000.0`.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```python
if HAS_SKLEARN:
    try:
        # Perform DBSCAN clustering
        db = DBSCAN(eps=eps_degrees, min_samples=1, metric='haversine').fit(
```
P1: Bug: eps is in degrees but the haversine metric expects radians. This makes clusters ~57x larger than intended (e.g., ~1.7 km instead of 30 m). Convert using Earth's radius: eps = eps_meters / 6_371_000.0.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/spatial_utils.py, line 139:
<comment>Bug: `eps` is in degrees but the `haversine` metric expects radians. This makes clusters ~57x larger than intended (e.g., ~1.7 km instead of 30 m). Convert using Earth's radius: `eps = eps_meters / 6_371_000.0`.</comment>
<file context>
@@ -125,20 +133,30 @@ def cluster_issues_dbscan(issues: List[Issue], eps_meters: float = 30.0) -> List
+ if HAS_SKLEARN:
+ try:
+ # Perform DBSCAN clustering
+ db = DBSCAN(eps=eps_degrees, min_samples=1, metric='haversine').fit(
+ np.radians(coordinates)
+ )
</file context>
Suggested change:

```diff
-        db = DBSCAN(eps=eps_degrees, min_samples=1, metric='haversine').fit(
+        db = DBSCAN(eps=eps_meters / 6_371_000.0, min_samples=1, metric='haversine').fit(
```
```python
    from sklearn.cluster import DBSCAN
    HAS_SKLEARN = True
except ImportError:
    HAS_SKLEARN = False
```
P2: Silent degradation: when sklearn import fails, no warning is logged. Spatial clustering (deduplication) will be completely disabled with no diagnostic output, making it hard to detect in production.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/spatial_utils.py, line 17:
<comment>Silent degradation: when `sklearn` import fails, no warning is logged. Spatial clustering (deduplication) will be completely disabled with no diagnostic output, making it hard to detect in production.</comment>
<file context>
@@ -3,11 +3,19 @@
+ from sklearn.cluster import DBSCAN
+ HAS_SKLEARN = True
+except ImportError:
+ HAS_SKLEARN = False
+
</file context>
Suggested change:

```python
HAS_SKLEARN = False
logger.warning("scikit-learn is not installed; DBSCAN clustering will be unavailable. "
               "Each issue will be treated as its own cluster.")
```
- Re-added `scikit-learn` and `joblib` to `backend/requirements-render.txt`.
- These are required by `backend/ml/train_grievance.py` and `backend/grievance_classifier.py` for loading the `grievance_model.joblib`.
- Their absence was causing `ModuleNotFoundError` during startup or model loading in production.

Co-authored-by: RohanExploit <178623867+RohanExploit@users.noreply.github.com>
🔍 Quality Reminder
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/requirements-render.txt">
<violation number="1" location="backend/requirements-render.txt:20">
P2: `scikit-learn` is being **added** to the Render production requirements, but the PR description states it was **removed** and the codebase (`spatial_utils.py`) already has a graceful fallback for when it's missing. This contradicts the PR's stated design goal of making it optional. If the intent is for spatial clustering to fall back gracefully, `scikit-learn` should not be listed here — it adds ~150 MB to the production image unnecessarily.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
… gracefully

- Removed `scikit-learn` and `joblib` from `backend/requirements-render.txt` to solve deployment failures caused by slug size or build timeouts.
- Updated `backend/grievance_classifier.py` to wrap `joblib` imports in `try-except`.
- Updated `backend/ml/train_grievance.py` to wrap `sklearn` imports in `try-except`.
- These changes ensure the application can start and run in production even if these heavy dependencies are missing, simply disabling the classification feature instead of crashing.

Co-authored-by: RohanExploit <178623867+RohanExploit@users.noreply.github.com>
This PR addresses performance and robustness issues.

- Stores `previous_integrity_hash` explicitly. This ensures that verification remains valid even if intermediate records are deleted (broken chain detection vs. data tampering).
- Wraps the `python-magic` import to prevent crashes in environments where the library is missing (e.g., specific CI/CD runners or minimal containers).

PR created automatically by Jules for task 17076509378211890828 started by @RohanExploit
Summary by cubic
Speeds up Detect All by batching into one Hugging Face CLIP call and enables PostgreSQL pooling for higher concurrency. Adds previous_integrity_hash for stronger ledger verification and makes heavy dependencies optional with safe fallbacks.
New Features
Migration
Written for commit 86533b5. Summary will update on new commits.
Summary by CodeRabbit
New Features
Improvements
Chores