Monitoring in the ClickHouse Cloud Console

Services in ClickHouse Cloud come with out-of-the-box monitoring components that serve users with dashboards and notifications. By default, all users in the Cloud Console can access these dashboards.

Dashboards

Service health

The Service Health dashboard can be used to monitor the high-level health of a service. ClickHouse Cloud scrapes and stores metrics displayed on this dashboard from system tables so they can be viewed when a service is idled.

Resource utilization

The Infrastructure dashboard provides a detailed view of resources being used by the ClickHouse process. ClickHouse Cloud scrapes and stores metrics displayed on this dashboard from system tables so they can be viewed when a service is idled.

Memory and CPU

The Allocated CPU and Allocated Memory graphs display the total compute resources available for each replica in your service. These allocations can be changed by using ClickHouse Cloud's scaling features.

The Memory Usage and CPU Usage graphs estimate how much CPU and memory is actually being utilized by ClickHouse processes in each replica, including queries as well as background processes like merges.

Performance degradation

If the memory or CPU utilization is approaching the allocated memory or CPU, you may begin to experience performance degradation. To resolve, we recommend:

Optimizing your queries
Changing the partitioning of your table engines
Adding more compute resources to your service using scaling

These are the corresponding system table metrics displayed on these graphs:

Graph	Corresponding metric name	Aggregation	Notes
Allocated memory	`CGroupMemoryTotal`	Max
Allocated CPU	`CGroupMaxCPU`	Max
Memory used	`MemoryResident`	Max
CPU used	System CPU metric	Max	`ClickHouseServer_UsageCores` via Prometheus endpoint

Data transfer

Graphs display data ingress and egress from ClickHouse Cloud. Learn more about network data transfer.

Advanced dashboard

This dashboard is a modified version of the built-in advanced observability dashboard, with each series representing metrics per replica. This dashboard can be useful for monitoring and troubleshooting ClickHouse-specific issues.

Note

ClickHouse Cloud scrapes and stores metrics displayed on this dashboard from system tables so they can be viewed even when a service is idled. Accessing these metrics does not issue a query to the underlying service and will not wake idle services.

The table below maps each graph in the Advanced Dashboard to its corresponding ClickHouse metric, system table source, and aggregation type:

Graph	Corresponding ClickHouse metric name	System table	Aggregation Type
Queries/sec	`ProfileEvent_Query`	`metric_log`	Sum / bucketSizeSeconds
Queries running	`CurrentMetric_Query`	`metric_log`	Avg
Merges running	`CurrentMetric_Merge`	`metric_log`	Avg
Selected bytes/sec	`ProfileEvent_SelectedBytes`	`metric_log`	Sum / bucketSizeSeconds
IO Wait	`ProfileEvent_OSIOWaitMicroseconds`	`metric_log`	Sum / bucketSizeSeconds
S3 read wait	`ProfileEvent_ReadBufferFromS3Microseconds`	`metric_log`	Sum / bucketSizeSeconds
S3 read errors/sec	`ProfileEvent_ReadBufferFromS3RequestsErrors`	`metric_log`	Sum / bucketSizeSeconds
CPU wait	`ProfileEvent_OSCPUWaitMicroseconds`	`metric_log`	Sum / bucketSizeSeconds
OS CPU usage (userspace, normalized)	`OSUserTimeNormalized`	`asynchronous_metric_log`
OS CPU usage (kernel, normalized)	`OSSystemTimeNormalized`	`asynchronous_metric_log`
Read from disk	`ProfileEvent_OSReadBytes`	`metric_log`	Sum / bucketSizeSeconds
Read from filesystem	`ProfileEvent_OSReadChars`	`metric_log`	Sum / bucketSizeSeconds
Memory (tracked, bytes)	`CurrentMetric_MemoryTracking`	`metric_log`
Total MergeTree parts	`TotalPartsOfMergeTreeTables`	`asynchronous_metric_log`
Max parts for partition	`MaxPartCountForPartition`	`asynchronous_metric_log`
Read from S3	`ProfileEvent_ReadBufferFromS3Bytes`	`metric_log`	Sum / bucketSizeSeconds
Filesystem cache size	`CurrentMetric_FilesystemCacheSize`	`metric_log`
Disk S3 write req/sec	`ProfileEvent_DiskS3PutObject` + `ProfileEvent_DiskS3UploadPart` + `ProfileEvent_DiskS3CreateMultipartUpload` + `ProfileEvent_DiskS3CompleteMultipartUpload`	`metric_log`	Sum / bucketSizeSeconds
Disk S3 read req/sec	`ProfileEvent_DiskS3GetObject` + `ProfileEvent_DiskS3HeadObject` + `ProfileEvent_DiskS3ListObjects`	`metric_log`	Sum / bucketSizeSeconds
FS cache hit rate	`sum(ProfileEvent_CachedReadBufferReadFromCacheBytes) / (sum(ProfileEvent_CachedReadBufferReadFromCacheBytes) + sum(ProfileEvent_CachedReadBufferReadFromSourceBytes))`	`metric_log`
Page cache hit rate	`greatest(0, (sum(ProfileEvent_OSReadChars) - sum(ProfileEvent_OSReadBytes)) / (sum(ProfileEvent_OSReadChars) + sum(ProfileEvent_ReadBufferFromS3Bytes)))`	`metric_log`
Network receive bytes/sec	`NetworkReceiveBytes`	`asynchronous_metric_log`	Sum / bucketSizeSeconds
Network send bytes/sec	`NetworkSendBytes`	`asynchronous_metric_log`	Sum / bucketSizeSeconds
Concurrent TCP connections	`CurrentMetric_TCPConnection`	`metric_log`
Concurrent MySQL connections	`CurrentMetric_MySQLConnection`	`metric_log`
Concurrent HTTP connections	`CurrentMetric_HTTPConnection`	`metric_log`

For detailed information on each visualization and how to use them for troubleshooting, see the advanced dashboard documentation.

Query insights

The Query Insights feature makes ClickHouse's built-in query log easier to use through various visualizations and tables. ClickHouse's system.query_log table is a key source of information for query optimization, debugging, and monitoring overall cluster health and performance.

After selecting a service, the Monitoring navigation item in the left sidebar expands to reveal a Query insights sub-item:

Top-level metrics

The stat boxes at the top represent basic query metrics over the selected time period. Beneath them, time-series charts show query volume, latency, and error rate broken down by query kind (select, insert, other). The latency chart can be adjusted to display p50, p90, and p99 latencies:

Recent queries

A table displays query log entries grouped by normalized query hash and user over the selected time window. Recent queries can be filtered and sorted by any available field, and the table can be configured to display or hide additional fields such as tables, p90, and p99 latencies:

Query drill-down

Selecting a query from the recent queries table will open a flyout containing metrics and information specific to the selected query:

All metrics in the Query info tab are aggregated metrics, but we can also view metrics from individual runs by selecting the Query history tab:

From this pane, the Settings and Profile Events items for each query run can be expanded to reveal additional information.

Notifications — Configure alerts for scaling events, errors, and billing
Advanced dashboard — Detailed reference for each dashboard visualization
Querying system tables — Run custom SQL queries against system tables for deep introspection
Prometheus endpoint — Export metrics to Grafana, Datadog, or other Prometheus-compatible tools

Dashboards​

Service health​

Resource utilization​

Memory and CPU​

Data transfer​

Advanced dashboard​

Query insights​

Top-level metrics​

Recent queries​

Query drill-down​

Related pages​