Circuit Breaker

CrateDB’s circuit breakers prevent queries and background processes from exhausting node memory and destabilizing the cluster. This works by estimating the memory each operation will require and aborting the operation before the JVM heap space is exhausted.

What is a Circuit Breaker

A circuit breaker is a safeguard that halts work when resource usage crosses a dangerous threshold. Think of the miniature breakers inside a household fuse box: if too many appliances pull current on the same line, the breaker trips and cuts power to prevent the wires from melting. The same principle applies in software, only the resource under pressure is memory, CPU, file descriptors, or an external service.

In CrateDB, the critical resource is RAM. Queries run in parallel across many shards; a single oversize aggregation or JOIN can allocate gigabytes in milliseconds. The breaker detects this and aborts the query with a CircuitBreakingException instead of letting the JVM run out of heap and crash the node.

How Circuit Breakers Work in CrateDB

A query executes as an ordered set of operations. Before running each stage, CrateDB estimates the extra memory that step will need. If the projected total exceeds the breaker limit, the system aborts the query and returns a CircuitBreakingException. This pre-emptive trip prevents the JVM’s garbage collector from reaching an unrecoverable out-of-memory state.

It is important to understand CrateDB doesn’t aspire to do a fully accurate memory accounting, but instead opts for a best-effort approach, since a precise estimate is tricky to achieve.

Types of Circuit Breakers

There are six different Circuit Breaker types which are described in detail in the cluster settings documentation page: query, request, jobs_log, operations_log, total and accounting, which was deprecated and will be removed soon. The total Circuit Breaker, also known as parent, accounts for all others, meaning that it controls the general use of memory, tripping an operation if a combination of the circuit breakers threatens the cluster.

Monitoring & Observability

To monitor the Circuit Breaker usage, follow the JMX monitoring guide. In particular, consult the CircuitBreakers MXBean section.

For hosted deployments, see Cloud monitoring and for self-managed clusters, the on-prem monitoring guide. Both describe the complete path from cluster deployment to collecting metrics and displaying them on a Grafana dashboard.

Exception Handling

CircuitBreakingException[Allocating 2mb for 'query: mergeOnHandler' failed, breaker would use 976.4mb in total. Limit is 972.7mb. Either increase memory and limit, change the query or reduce concurrent query load]
  • Understanding the error

    The memory estimate for mergeOnHandler exceeded the indices.breaker.query.limit, so the query was aborted and the exception returned.

  • Immediate actions

    • Optimize the query - see Query Optimization 101 for detailed guidance.

    • Identify memory-hungry queries - run:

      SELECT  js.id,
              stmt,
              username,
              sum(used_bytes) sum_bytes
      FROM sys.operations op
      JOIN sys.jobs js ON op.job_id = js.id
      GROUP BY js.id, stmt, username
      ORDER BY sum_bytes DESC;
      

      The query above will return all jobs that are currently being executed in the cluster. To check completed jobs and operations, query the corresponding system tables sys.jobs_log and sys.operations_log. Access to these tables are subject to table permissions.

    • Scale the cluster - if breakers trip frequently even after query tuning, evaluate scaling your cluster to get more resources.

Similar exceptions exist for the other breaker types [request], [parent], [jobs_log], etc. If you experience a CircuitBreakingException [parent] it is because other queries/tasks were running simultaneously and their summed estimate exceeded indices.breaker.total.limit.

Further Reading

The Circuit Breaker Mechanism in CrateDB