cometchat · swapnil-cometchat · Jan 6, 2026 · Jan 8, 2026 · Jan 13, 2026 · Jan 16, 2026
diff --git a/docs.json b/docs.json
diff --git a/fundamentals/cometchat-on-prem/docker/air-gapped-deployment.mdx b/fundamentals/cometchat-on-prem/docker/air-gapped-deployment.mdx
@@ -0,0 +1,23 @@
+---
+title: "Air-Gapped Deployment"
+sidebarTitle: "Air-Gapped"
+---
+
+Guidelines for deploying the platform in offline or isolated (air-gapped) environments.
+
+## Offline installation steps
+
+- Export required Docker images with `docker save`
+- Transfer images via removable media, secure copy (SSH), or an isolated internal network
+- Import images on the target system with `docker load`
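+
+The steps above can be sketched as a small script. This is a minimal sketch, not an official procedure: the image names and archive name are placeholders, and the hypothetical `run` helper only prints each command unless `RUN=1` is set.
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Placeholder values (assumptions): replace with the images from your docker-compose.yml.
+IMAGES=("registry.example.com/chat-api:1.0.0" "registry.example.com/websocket:1.0.0")
+ARCHIVE="onprem-images.tar"
+
+# Execute a command when RUN=1; otherwise only print it.
+run() {
+  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "dry-run: $*"; fi
+}
+
+# Connected machine: export the images and print a checksum to compare after transfer.
+export_images() {
+  run docker save -o "$ARCHIVE" "${IMAGES[@]}"
+  run sha256sum "$ARCHIVE"
+}
+
+# Target system: import the transferred archive and list the loaded images.
+import_images() {
+  run docker load -i "$ARCHIVE"
+  run docker image ls
+}
+
+case "${1:-}" in
+  export) export_images ;;
+  import) import_images ;;
+  *) echo "usage: $0 export|import" ;;
+esac
+```
+
+Run it with `export` on the connected machine, move the archive across the air gap, then run it with `import` on the target system.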
+
+## Local registry
+
+- Host images in Harbor, Nexus, or a private Docker registry
+- Enforce role-based access control (RBAC) and image retention policies
+
+## Limitations in air-gapped mode
+
+- No access to external push notification services
+- No S3 or other cloud object storage unless an S3-compatible store is hosted internally
+- No cloud-hosted analytics, logging, or monitoring integrations
diff --git a/fundamentals/cometchat-on-prem/docker/configuration-reference.mdx b/fundamentals/cometchat-on-prem/docker/configuration-reference.mdx
@@ -0,0 +1,120 @@
+---
+title: "Configuration Reference"
+sidebarTitle: "Configuration"
+---
+
+Use this reference when updating domains, migrating environments, troubleshooting service misconfiguration, or performing production deployments. Values are sourced from `docker-compose.yml`, service-level `.env` files, and the domain update guide.
+
+## Global notes
+
+- Each service reads its environment variables from the `.env` file in its own directory.
+- Domain values must be updated consistently across API, WebSocket, Notifications, Webhooks, and NGINX configurations.
+- Changing the primary domain impacts reverse proxy routing, OAuth headers, CORS, webhook endpoints, and TiDB host references.
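+
+After a domain change, one way to check consistency is to search the configuration tree for leftover references to the old domain. The sketch below is illustrative only: `find_stale_domain` is a hypothetical helper, and the directory passed to it is whatever directory holds your service `.env` files and NGINX configuration.
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+# List files under a directory that still mention the old domain.
+# Usage: find_stale_domain <old-domain> <config-dir>
+find_stale_domain() {
+  local old_domain="$1" config_dir="$2"
+  # -r: recurse, -I: skip binary files, -l: print file names only, -F: match literally.
+  grep -rIlF -- "$old_domain" "$config_dir" || true
+}
+
+# When called with two arguments, run the search directly.
+if [ "$#" -eq 2 ]; then
+  find_stale_domain "$1" "$2"
+fi
+```
+
+An empty result suggests that no file in the tree still references the old domain.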
+
+## Chat API
+
+Update these values when changing domains:
+
+- `MAIN_DOMAIN="<your-domain>"`
+- `EXTENSION_DOMAIN="<your-domain>"`
+- `WEBHOOKS_BASE_URL="https://webhooks.<your-domain>/v1/webhooks"`
+- `TRIGGERS_BASE_URL="https://webhooks.<your-domain>/v1/triggers"`
+- `EXTENSION_BASE_URL="https://notifications.<your-domain>"`
+- `MODERATION_ENABLED=true`
+- `RULES_BASE_URL="https://moderation.<your-domain>/v1/moderation-service"`
+- `ADMIN_API_HOST="api.<your-domain>"`
+- `CLIENT_API_HOST="apiclient.<your-domain>"`
+- `ALLOWED_API_DOMAINS="<your-domain>,<additional-domain>"`
+- `DB_HOST="tidb.<your-domain>"`
+- `DB_HOST_CREATOR="tidb.<your-domain>"`
+- `V3_CHAT_HOST="websocket.<your-domain>"`
+
+## Management API (MGMT API)
+
+- `ADMIN_API_HOST="api.<your-domain>"`
+- `CLIENT_API_HOST="apiclient.<your-domain>"`
+- `APP_HOST="dashboard.<your-domain>"`
+- `API_HOST="https://mgmt-api.<your-domain>"`
+- `MGMT_DOMAIN="<your-domain>"`
+- `MGMT_DOMAIN_TO_REPLACE="<your-domain>"`
+- `RULES_BASE_URL="https://moderation.<your-domain>/v1/moderation"`
+- `ACCESS_CONTROL_ALLOW_ORIGIN="<your-domain>,<additional-domain>"`
+
+## WebSocket
+
+Hostnames are derived automatically from NGINX and Chat API configuration; no manual domain updates are required.
+
+## Notifications service
+
+- `CC_DOMAIN="<your-domain>"` (controls routing, token validation, and push delivery)
+
+## Moderation service
+
+- `CHAT_API_URL="<your-domain>"` (used for rule evaluation, metadata retrieval, and decision submission)
+
+## Webhooks service
+
+- `CHAT_API_DOMAIN="<your-domain>"` (must match the Chat API domain exactly to avoid retries or signature verification failures)
+
+## Extensions
+
+```json
+"DOMAINS": [
+  "<allowed-domain-1>",
+  "<allowed-domain-2>",
+  "<your-domain>"
+],
+"DOMAIN_NAME": "<your-domain>"
+```
+
+These values define the CORS policy and allowed origins for extension traffic.
+
+## Receipt Updater
+
+- `RECEIPTS_MYSQL_HOST="tidb.<your-domain>"` (used for delivery receipts, read receipts, and thread metadata)
+
+## SQL Consumer
+
+```json
+"CONNECTION_CONFIG": {
+  "host": "<tidb-host>"
+},
+"ALTER_USER_CONFIG": {
+  "host": "<tidb-host>"
+},
+"API_CONFIG": {
+  "API_DOMAIN": "<api-domain>"
+}
+```
+
+These settings control database migrations, multi-tenant provisioning, and internal requests to the Chat API.
+
+## NGINX configuration files
+
+Update domain values in:
+
+- `chatapi.conf`
+- `extensions.conf`
+- `mgmtapi.conf`
+- `notifications.conf`
+- `dashboard.conf`
+- `globalwebhooks.conf`
+- `moderation.conf`
+- `websocket.conf`
+
+These files govern TLS termination, routing, reverse proxy rules, and WebSocket upgrades.
+
+## Summary of domain values to update
+
+- Chat API, Client API, and Management API
+- Notifications, Moderation, Webhooks, and Extensions services
+- NGINX reverse proxy hostnames
+- TiDB host references
+- WebSocket host configuration in Chat API
+
diff --git a/fundamentals/cometchat-on-prem/docker/monitoring.mdx b/fundamentals/cometchat-on-prem/docker/monitoring.mdx
@@ -0,0 +1,175 @@
+---
+title: "Monitoring"
+sidebarTitle: "Monitoring"
+---
+
+Monitoring provides visibility into system health and operations and supports SLA compliance for CometChat On-Prem deployments.
+
+## Monitoring stack
+
+The following open-source tools form the monitoring and observability stack for CometChat On-Prem deployments:
+
+- **Prometheus**: Collects and stores metrics from all services
+- **Grafana**: Visualizes metrics with dashboards and alerts
+- **Loki**: Stores and queries logs from all containers
+- **Promtail**: Tails logs from Docker containers and pushes them to Loki
+- **Node Exporter**: Collects host-level metrics (CPU, memory, disk, network)
+- **cAdvisor**: Collects container-level resource usage metrics
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                         Grafana                              │
+│              (Dashboards & Visualization)                    │
+└──────────────┬─────────────────────────┬────────────────────┘
+               │                         │
+               │ Queries                 │ Queries
+               ▼                         ▼
+    ┌──────────────────┐      ┌──────────────────┐
+    │   Prometheus     │      │      Loki        │
+    │ (Metrics Store)  │      │   (Log Store)    │
+    └────────┬─────────┘      └────────┬─────────┘
+             │                         ▲
+             │ Scrapes (/metrics)      │ Pushes
+             ▼                         │
+    ┌─────────────────────────────────────────┐
+    │  Node Exporter  │  cAdvisor  │ Promtail │
+    │  (Host Metrics) │ (Container)│  (Logs)  │
+    └─────────────────────────────────────────┘
+             │                │              │
+             └────────────────┴──────────────┘
+                            │
+                    ┌───────▼────────┐
+                    │  Docker Swarm  │
+                    │   CometChat    │
+                    │    Services    │
+                    └────────────────┘
+```
+
+## Key metrics to monitor
+
+### Infrastructure
+- CPU usage per node
+- Memory usage per node
+- Disk space and I/O
+- Network traffic
+- Container resource usage
+
+### Application services
+- WebSocket active connections
+- Chat API request rate and latency
+- API error rates (4xx, 5xx)
+- Service uptime
+
+### Data stores
+- **Kafka**: Consumer lag, message throughput
+- **Redis**: Memory usage, cache hit ratio, connected clients
+- **MongoDB**: Operation latency, connections, replication lag
+- **TiDB**: Query duration, region health, storage capacity
+
+### Load balancer
+- NGINX request rate
+- Response status codes
+- Active connections
+
+## Alerting
+
+Alerts should focus on user impact, capacity risks, and data integrity rather than raw metric noise.
+
+Set up alerts for these critical conditions:
+
+- CPU usage > 80% for 5 minutes
+- Memory usage > 85% for 5 minutes
+- Free disk space < 15%
+- Service down for 2 minutes
+- Database query latency > 100ms
+- Kafka consumer lag > 10,000 messages
+- Redis memory > 90%
+- WebSocket connection errors > 10/second
+- API error rate > 5%
+- Unexpected container restarts
+
+These thresholds are recommended starting points and should be adjusted based on workload characteristics and environment scale.
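+
+The host-level thresholds can be checked ad hoc against the Prometheus HTTP API before they are turned into alert rules. The sketch below assumes Prometheus is reachable at `http://localhost:9090` and that Node Exporter exposes its standard metric names; the `prom_query` helper is hypothetical. The "for 5 minutes" condition belongs in the alert rule itself and is not part of these instant queries.
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+PROM_URL="${PROM_URL:-http://localhost:9090}"
+
+# PromQL expressions built on standard Node Exporter metrics.
+CPU_QUERY='100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
+MEM_QUERY='(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85'
+DISK_QUERY='node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 15'
+
+# Run an instant query against the Prometheus HTTP API and print the JSON response.
+prom_query() {
+  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
+}
+
+# Query live data only when RUN=1 is set.
+if [ "${RUN:-0}" = "1" ]; then
+  for q in "$CPU_QUERY" "$MEM_QUERY" "$DISK_QUERY"; do
+    prom_query "$q"
+    echo
+  done
+fi
+```
+
+Each query returns only the series that currently breach the threshold, so an empty `result` array means the condition is not met.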
+
+## Grafana dashboards
+
+Create dashboards to visualize:
+
+1. **Overview**: System health, active users, request rates, error rates
+2. **Infrastructure**: CPU, memory, disk, network per node
+3. **WebSocket**: Active connections, message throughput, errors
+4. **API**: Request rate, latency, error rates by endpoint
+5. **Databases**: Query performance, connections, replication status
+6. **Kafka**: Consumer lag, throughput, partition health
+7. **Logs & Error Analysis**: Error aggregation, log volume, search, and correlation with metrics
+
+### Logs & Error Analysis dashboard
+
+This dashboard provides centralized visibility into application errors, log patterns, and system anomalies for rapid troubleshooting and incident investigation.
+
+**Key visualizations:**
+
+- **Error Volume by Service**: Time-series graph showing error log count per service, helping identify which components are experiencing issues
+- **Top Error Messages**: Table displaying the most frequent error messages with occurrence counts, enabling quick identification of recurring problems
+- **Log Volume Trends**: Total log volume over time, used to detect unusual spikes that may indicate issues or attacks
+- **Error Rate by Severity**: Breakdown of errors by severity level (CRITICAL, ERROR, WARNING) for prioritization
+- **Service Health Correlation**: Side-by-side view of error logs and service metrics (CPU, memory, latency) to correlate errors with resource constraints
+- **Search & Filter**: Interactive LogQL query panel for ad-hoc log searches and pattern matching
+- **Recent Critical Errors**: Live feed of the latest critical errors across all services for immediate awareness
+
+**Use cases:**
+- Rapid incident investigation by correlating errors with metric anomalies
+- Identifying error patterns and root causes across distributed services
+- Monitoring error trends to detect degradation before user impact
+- Post-incident analysis and root cause identification
+- Compliance and audit trail review
+
+## Log queries
+
+Use Loki's LogQL to search and filter logs across all services:
+
+```logql
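+# Chat API log lines containing "error"
+{service="chat-api"} |= "error"
+
+# WebSocket connection issues
+{service="websocket"} |~ "connection.*failed"
+
+# 5xx responses in NGINX access logs (the optional quote covers the default combined log format)
+{service="nginx"} |~ `HTTP/[0-9.]+"? 5[0-9]{2}`
+
+# High-latency requests (assumes JSON logs with a numeric latency field in milliseconds)
+{service="chat-api"} | json | latency > 1000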
+```
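+
+The same queries can be run from the command line through Loki's HTTP API, which is useful when Grafana is unavailable. This is a minimal sketch that assumes Loki listens on `http://localhost:3100`; the `loki_query` helper is hypothetical, and the `service` label depends on how Promtail labels your containers.
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+LOKI_URL="${LOKI_URL:-http://localhost:3100}"
+
+# Run a LogQL query over Loki's default time range and print the JSON response.
+# Usage: loki_query '<logql>' [limit]
+loki_query() {
+  curl -sG "$LOKI_URL/loki/api/v1/query_range" \
+    --data-urlencode "query=$1" \
+    --data-urlencode "limit=${2:-20}"
+}
+
+# When called with arguments, run the query directly.
+if [ "$#" -ge 1 ]; then
+  loki_query "$@"
+fi
+```
+
+For example, passing `'{service="chat-api"} |= "error"'` as the first argument returns up to 20 matching log lines.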
+
+## Troubleshooting
+
+### Check Grafana dashboards first
+
+Start with the Overview dashboard to determine the blast radius: confirm whether the issue is node-level, service-level, or data-store related before drilling into component-level dashboards.
+
+### Check Prometheus targets
+```bash
+curl http://localhost:9090/api/v1/targets
+```
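+
+The raw targets response is verbose. As a rough sketch, it can be reduced to a count of targets per health state using only standard text tools. The `summarize_health` helper is hypothetical and relies on the compact `"health":"up"` form of the JSON output; a JSON-aware tool is more robust if one is available.
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Count scrape targets by health state from the /api/v1/targets JSON on stdin.
+summarize_health() {
+  { grep -o '"health":"[a-z]*"' || true; } \
+    | cut -d'"' -f4 | sort | uniq -c | awk '{print $2 "=" $1}'
+}
+
+# Query Prometheus only when RUN=1 is set.
+if [ "${RUN:-0}" = "1" ]; then
+  curl -s http://localhost:9090/api/v1/targets | summarize_health
+fi
+```
+
+Any state other than `up` in the output points to a target that Prometheus cannot scrape.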
+
+### Check Loki status
+```bash
+curl http://localhost:3100/ready
+```
+
+### View Promtail logs
+```bash
+docker service logs promtail
+```
+
+### Check service metrics
+```bash
+# Node Exporter
+curl http://localhost:9100/metrics
+
+# cAdvisor
+curl http://localhost:8080/metrics
+```
+