122 changes: 76 additions & 46 deletions docs.json

23 changes: 23 additions & 0 deletions fundamentals/cometchat-on-prem/docker/air-gapped-deployment.mdx
@@ -0,0 +1,23 @@
---
title: "Air-Gapped Deployment"
sidebarTitle: "Air-Gapped"
---

Guidelines for deploying the platform in offline or isolated (air-gapped) environments.

## Offline installation steps

- Export required Docker images with `docker save`
- Transfer images via removable media, secure copy (`scp` over SSH), or an isolated internal network
- Import images on the target system with `docker load`
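
The steps above can be sketched as a pair of shell functions. This is a minimal sketch, not the official procedure: the image names and directories are placeholders, and the `sha256sum` checksum step is an optional extra that the steps above do not require.

```bash
#!/usr/bin/env bash
# Sketch: move images across the air gap with docker save / docker load.
# Image names and paths are hypothetical, not the actual CometChat image list.

# Turn an image reference into a safe archive file name ('/' and ':' become '_').
archive_name() {
  printf '%s.tar\n' "$(printf '%s' "$1" | tr '/:' '__')"
}

# Run on the connected host: write one tar archive per image into a directory.
export_images() {
  local out_dir="$1"; shift
  mkdir -p "$out_dir"
  local image
  for image in "$@"; do
    docker save -o "$out_dir/$(archive_name "$image")" "$image"
  done
  # Record checksums so the transfer can be verified on the other side.
  (cd "$out_dir" && sha256sum -- *.tar > SHA256SUMS)
}

# Run on the air-gapped host: verify checksums, then load every archive.
import_images() {
  local in_dir="$1"
  (cd "$in_dir" && sha256sum -c SHA256SUMS) || return 1
  local archive
  for archive in "$in_dir"/*.tar; do
    docker load -i "$archive"
  done
}

# Example usage (hypothetical image name and mount point):
#   export_images ./bundle registry.example.com/cometchat/chat-api:1.0.0
#   import_images /mnt/usb/bundle
```

Writing one archive per image keeps a failed transfer cheap to retry. `docker save` also accepts several images in a single archive if one file is easier to move.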

## Local registry

- Host images in Harbor, Nexus, or a private Docker registry
- Enforce role-based access control (RBAC) and image retention policies
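
Once images are loaded on a host inside the isolated network, they can be retagged and pushed to the internal registry. The sketch below assumes a hypothetical registry address (`harbor.internal.example/cometchat`) and image name. RBAC and retention policies are configured in the registry itself and are not shown here.

```bash
# Sketch: retag a loaded image for an internal registry and push it.
# The registry host/project and image names are hypothetical placeholders.

# Build the internal reference, keeping only the image's trailing name:tag.
internal_ref() {
  local registry="$1" image="$2"
  printf '%s/%s\n' "$registry" "${image##*/}"
}

# Tag and push each image to the internal registry.
push_to_registry() {
  local registry="$1"; shift
  local image target
  for image in "$@"; do
    target=$(internal_ref "$registry" "$image")
    docker tag "$image" "$target"
    docker push "$target"
  done
}

# Example usage (authenticate first with `docker login harbor.internal.example`):
#   push_to_registry harbor.internal.example/cometchat vendor/chat-api:1.0.0
```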

## Limitations in air-gapped mode

- No access to external push notification services
- No S3 or other cloud object storage unless internally emulated
- No cloud-hosted analytics, logging, or monitoring integrations
120 changes: 120 additions & 0 deletions fundamentals/cometchat-on-prem/docker/configuration-reference.mdx
@@ -0,0 +1,120 @@
---
title: "Configuration Reference"
sidebarTitle: "Configuration"
---

Use this reference when:
- Updating domains
- Migrating environments
- Troubleshooting service misconfiguration
- Performing production deployments

Values are sourced from `docker-compose.yml`, service-level `.env` files, and the domain update guide.

## Global notes

- Each service reads its environment variables from the `.env` file in its own directory.
- Domain values must be updated consistently across API, WebSocket, Notifications, Webhooks, and NGINX configurations.
- Changing the primary domain impacts reverse proxy routing, OAuth headers, CORS, webhook endpoints, and TiDB host references.

## Chat API

Update these values when changing domains:

- `MAIN_DOMAIN="<your-domain>"`
- `EXTENSION_DOMAIN="<your-domain>"`
- `WEBHOOKS_BASE_URL="https://webhooks.<your-domain>/v1/webhooks"`
- `TRIGGERS_BASE_URL="https://webhooks.<your-domain>/v1/triggers"`
- `EXTENSION_BASE_URL="https://notifications.<your-domain>"`
- `MODERATION_ENABLED=true`
- `RULES_BASE_URL="https://moderation.<your-domain>/v1/moderation-service"`
- `ADMIN_API_HOST="api.<your-domain>"`
- `CLIENT_API_HOST="apiclient.<your-domain>"`
- `ALLOWED_API_DOMAINS="<your-domain>,<additional-domain>"`
- `DB_HOST="tidb.<your-domain>"`
- `DB_HOST_CREATOR="tidb.<your-domain>"`
- `V3_CHAT_HOST="websocket.<your-domain>"`

## Management API (MGMT API)

- `ADMIN_API_HOST="api.<your-domain>"`
- `CLIENT_API_HOST="apiclient.<your-domain>"`
- `APP_HOST="dashboard.<your-domain>"`
- `API_HOST="https://mgmt-api.<your-domain>"`
- `MGMT_DOMAIN="<your-domain>"`
- `MGMT_DOMAIN_TO_REPLACE="<your-domain>"`
- `RULES_BASE_URL="https://moderation.<your-domain>/v1/moderation"`
- `ACCESS_CONTROL_ALLOW_ORIGIN="<your-domain>,<additional-domain>"`

## WebSocket

Hostnames are derived automatically from NGINX and Chat API configuration; no manual domain updates are required.

## Notifications service

- `CC_DOMAIN="<your-domain>"` (controls routing, token validation, and push delivery)

## Moderation service

- `CHAT_API_URL="<your-domain>"` (used for rule evaluation, metadata retrieval, and decision submission)

## Webhooks service

- `CHAT_API_DOMAIN="<your-domain>"` - must match the Chat API domain exactly to avoid retries or signature verification failures

## Extensions

```json
"DOMAINS": [
"<allowed-domain-1>",
"<allowed-domain-2>",
"<your-domain>"
],
"DOMAIN_NAME": "<your-domain>"
```

Defines CORS and allowed origins for extension traffic.

## Receipt Updater

- `RECEIPTS_MYSQL_HOST="tidb.<your-domain>"` (used for delivery receipts, read receipts, and thread metadata)

## SQL Consumer

```json
"CONNECTION_CONFIG": {
"host": "<tidb-host>"
},
"ALTER_USER_CONFIG": {
"host": "<tidb-host>"
},
"API_CONFIG": {
"API_DOMAIN": "<api-domain>"
}
```

Controls database migrations, multi-tenant provisioning, and internal requests to Chat API.

## NGINX configuration files

Update domain values in:

- `chatapi.conf`
- `extensions.conf`
- `mgmtapi.conf`
- `notifications.conf`
- `dashboard.conf`
- `globalwebhooks.conf`
- `moderation.conf`
- `websocket.conf`

These govern TLS termination, routing, reverse proxy rules, and WebSocket upgrades.

## Summary of domain values to update

- Chat API, Client API, and Management API
- Notifications, Moderation, Webhooks, and Extensions services
- NGINX reverse proxy hostnames
- TiDB host references
- WebSocket host configuration in Chat API
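
After a domain change, a quick search for leftover references to the old domain helps confirm that every value listed above was updated. The following is a minimal sketch. The file patterns and the `/opt/cometchat` path are assumptions about the deployment layout, not something this reference specifies.

```bash
# Sketch: list every remaining reference to the old domain after a change.
# The directory layout and file patterns are assumptions; adjust to your deployment.
find_stale_domain() {
  local old_domain="$1" config_dir="$2"
  # -F matches the domain literally, -r recurses, -n prints line numbers.
  grep -rnF \
    --include='*.env' --include='.env' --include='*.conf' \
    --include='*.json' --include='*.yml' \
    -- "$old_domain" "$config_dir"
}

# Example usage (grep exits non-zero when nothing matches):
#   if find_stale_domain old.example.com /opt/cometchat; then
#     echo "Stale references found" >&2
#   fi
```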

175 changes: 175 additions & 0 deletions fundamentals/cometchat-on-prem/docker/monitoring.mdx
@@ -0,0 +1,175 @@
---
title: "Monitoring"
sidebarTitle: "Monitoring"
---

Monitoring ensures system health, operational visibility, and SLA compliance for CometChat On-Prem deployments.

## Monitoring stack

The following open-source tools form the monitoring and observability stack for CometChat On-Prem deployments:

- **Prometheus**: Collects and stores metrics from all services
- **Grafana**: Visualizes metrics with dashboards and alerts
- **Loki**: Stores and queries logs from all containers
- **Promtail**: Tails logs from Docker containers and pushes them to Loki
- **Node Exporter**: Collects host-level metrics (CPU, memory, disk, network)
- **cAdvisor**: Collects container-level resource usage metrics

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                           Grafana                           │
│                (Dashboards & Visualization)                 │
└──────────────┬─────────────────────────┬────────────────────┘
               │                         │
               │ Queries                 │ Queries
               ▼                         ▼
      ┌──────────────────┐      ┌──────────────────┐
      │    Prometheus    │      │       Loki       │
      │  (Metrics Store) │      │   (Log Store)    │
      └────────┬─────────┘      └────────┬─────────┘
               │                         │
               │ Scrapes (/metrics)      │ Pushes
               ▼                         ▼
        ┌─────────────────────────────────────────┐
        │ Node Exporter  │ cAdvisor   │ Promtail  │
        │ (Host Metrics) │ (Container)│ (Logs)    │
        └─────────────────────────────────────────┘
               │                │              │
               └────────────────┴──────────────┘
                        ┌───────▼────────┐
                        │  Docker Swarm  │
                        │   CometChat    │
                        │    Services    │
                        └────────────────┘
```

## Key metrics to monitor

### Infrastructure
- CPU usage per node
- Memory usage per node
- Disk space and I/O
- Network traffic
- Container resource usage

### Application services
- WebSocket active connections
- Chat API request rate and latency
- API error rates (4xx, 5xx)
- Service uptime

### Data stores
- **Kafka**: Consumer lag, message throughput
- **Redis**: Memory usage, cache hit ratio, connected clients
- **MongoDB**: Operation latency, connections, replication lag
- **TiDB**: Query duration, region health, storage capacity

### Load balancer
- NGINX request rate
- Response status codes
- Active connections

## Alerting

Alerts should focus on user impact, capacity risks, and data integrity rather than raw metric noise.

Set up alerts for these critical conditions:

- CPU usage > 80% for 5 minutes
- Memory usage > 85% for 5 minutes
- Disk space < 15%
- Service down for 2 minutes
- Database query latency > 100 ms
- Kafka consumer lag > 10,000 messages
- Redis memory > 90%
- WebSocket connection errors > 10/second
- API error rate > 5%
- Container restarts

These thresholds are recommended starting points and should be adjusted based on workload characteristics and environment scale.
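
As a starting point, a few of the infrastructure thresholds above could be expressed as Prometheus alerting rules. The sketch below writes a rules file from the shell. The metric names are the standard Node Exporter series and Prometheus's built-in `up` series, while the file name, group name, and severity labels are assumptions. `up == 0` only reports that a scrape target is unreachable, which approximates "service down".

```bash
# Sketch: write a Prometheus rules file covering four of the thresholds above.
# The file path, group name, and severity labels are hypothetical choices.
write_alert_rules() {
  local rules_file="$1"
  # Quoted heredoc delimiter: nothing inside is expanded by the shell.
  cat > "$rules_file" <<'EOF'
groups:
  - name: cometchat-infrastructure
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 15
        labels:
          severity: critical
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
EOF
}

# Example usage: validate the file before loading it into Prometheus.
#   write_alert_rules ./cometchat-alerts.yml
#   promtool check rules ./cometchat-alerts.yml
```

Application-level alerts (Kafka consumer lag, Redis memory, WebSocket errors) depend on which exporters are deployed, so they are left out of this sketch.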

## Grafana dashboards

Create dashboards to visualize:

1. **Overview**: System health, active users, request rates, error rates
2. **Infrastructure**: CPU, memory, disk, network per node
3. **WebSocket**: Active connections, message throughput, errors
4. **API**: Request rate, latency, error rates by endpoint
5. **Databases**: Query performance, connections, replication status
6. **Kafka**: Consumer lag, throughput, partition health
7. **Logs & Error Analysis**: Error aggregation, log volume, search, and correlation with metrics

### Logs & Error Analysis dashboard

This dashboard provides centralized visibility into application errors, log patterns, and system anomalies for rapid troubleshooting and incident investigation.

**Key Visualizations:**

- **Error Volume by Service**: Time-series graph showing error log count per service, helping identify which components are experiencing issues
- **Top Error Messages**: Table displaying the most frequent error messages with occurrence counts, enabling quick identification of recurring problems
- **Log Volume Trends**: Track total log volume over time to detect unusual spikes that may indicate issues or attacks
- **Error Rate by Severity**: Breakdown of errors by severity level (CRITICAL, ERROR, WARNING) for prioritization
- **Service Health Correlation**: Side-by-side view of error logs and service metrics (CPU, memory, latency) to correlate errors with resource constraints
- **Search & Filter**: Interactive LogQL query panel for ad-hoc log searches and pattern matching
- **Recent Critical Errors**: Live feed of the latest critical errors across all services for immediate awareness

**Use Cases:**
- Rapid incident investigation by correlating errors with metric anomalies
- Identifying error patterns and root causes across distributed services
- Monitoring error trends to detect degradation before user impact
- Post-incident analysis and root cause identification
- Compliance and audit trail review

## Log queries

Use Loki's LogQL to search and filter logs across all services:

```logql
# Chat API log lines containing "error"
{service="chat-api"} |= "error"

# WebSocket connection issues
{service="websocket"} |~ "connection.*failed"

# API 5xx errors
{service="nginx"} |~ "HTTP/[0-9.]+ 5[0-9]{2}"

# High latency requests
{service="chat-api"} | json | latency > 1000
```
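
The same queries can be run outside Grafana through Loki's HTTP API. This is a minimal sketch assuming Loki is reachable on `localhost:3100` and that logs carry a `service` label, as in the queries above. `LOKI_URL` is a hypothetical environment variable introduced only for this example.

```bash
# Sketch: run a LogQL query against Loki's query_range endpoint with curl.
# LOKI_URL is a hypothetical override; the default matches the port used on this page.
loki_query() {
  local query="$1" limit="${2:-20}"
  # -G sends the URL-encoded parameters as a GET query string.
  curl -sS -G "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=${query}" \
    --data-urlencode "limit=${limit}"
}

# Example usage: up to 50 matching lines from Loki's default time range.
#   loki_query '{service="chat-api"} |= "error"' 50
```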

## Troubleshooting

### Check Grafana dashboards first

Start with the Overview dashboard to gauge the blast radius and confirm whether the issue is node-level, service-level, or data-store related. Then drill into the component-level dashboards.

### Check Prometheus targets
```bash
curl http://localhost:9090/api/v1/targets
```
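
The targets response is verbose, so it can help to filter it down to the targets that are not healthy. The sketch below assumes `jq` is installed. It relies on the `activeTargets`, `health`, `scrapeUrl`, and `lastError` fields of the Prometheus targets API response.

```bash
# Sketch: print only the scrape targets that Prometheus reports as unhealthy.
# Requires jq; reads the /api/v1/targets JSON response on stdin.
unhealthy_targets() {
  jq -r '.data.activeTargets[]
         | select(.health != "up")
         | "\(.labels.job)\t\(.scrapeUrl)\t\(.lastError)"'
}

# Example usage:
#   curl -s http://localhost:9090/api/v1/targets | unhealthy_targets
```

Each output line holds the job name, scrape URL, and last scrape error, separated by tabs. No output means every active target is up.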

### Check Loki status
```bash
curl http://localhost:3100/ready
```

### View Promtail logs
```bash
docker service logs promtail
```

### Check service metrics
```bash
# Node Exporter
curl http://localhost:9100/metrics

# cAdvisor
curl http://localhost:8080/metrics
```
