Digital Domain: Real-time systems monitoring
Developing performance monitoring for a large-scale virtual reality service using CrateDB and Grafana.
Pioneering Streaming Virtual Reality
Digital Domain is a global leader in visual effects, interactive content, and “virtual humans” for use in films and live events. In 2016 they were pioneering the development of new immersive virtual reality experiences.
The engine behind the VR offerings is a cloud system that delivers virtual reality content, including streaming of video and audio from live events.
Uptime, content quality, and fidelity are vitally important to the service. The team therefore monitors the system extensively for problems such as outages, high load, audio and video sync issues, latency, and lag.
Building your own Performance Monitoring...
Rather than use a commercial monitoring system, they chose to build their own. “For a system as large as ours, it’s more economical to build your own monitoring. Time plus effort plus open source software on a couple of AWS instances is way cheaper than commercial monitoring,” said Joe Hacobian, infrastructure engineer at Digital Domain.
The two biggest decisions the team had to make were what to use for visualization and what database to use.
Grafana was the visualization choice. It’s designed for their use case, and they loved the correlated cursors, which keep all the charts in the dashboard synchronized as the user drills down into data. It made it easy to explore the cause and effect of system events.
MySQL? InfluxDB? Elasticsearch? CrateDB?....CrateDB
For the database, they needed something that could process a large volume of system metrics: a time-series database that delivers real-time insights. In the next phase of the project, it would also have to support streaming log analytics, which requires more search capability.
They wanted to avoid using multiple specialized databases. According to Joe, “I’m starting with monitoring metrics, but I plan to add log streaming and have all my ad-hoc analytics in one database.”
They considered the following database options:
- MySQL, but they didn’t believe it could handle the data volume and real-time requirements.
- Elasticsearch, which is capable and great for logs, but aggregating time series wasn’t easy.
- InfluxDB, but they felt it was too one-dimensional, and log analytics was too far outside its sweet spot.
Ultimately they chose CrateDB.
The monitoring system, called SAGE, acts as an API-checking network. Heartbeat and metrics data, nearly 1GB per day, are stored at 1-second intervals in CrateDB and visualized in Grafana. As the resolution changes in the Grafana dashboard, from a weekly view down to a specific minute or second, the graphs update instantly with data from CrateDB.
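To illustrate what a change in dashboard resolution involves, here is a minimal Python sketch that averages 1-second samples into coarser time buckets. It is not SAGE's code: the function name `downsample`, the sample layout of (epoch seconds, value), and the latency figures are all assumptions made for illustration. In a setup like the one described, this aggregation would presumably be done by the database rather than in application code.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds):
    """Average (epoch_seconds, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) tuples sorted by bucket_start.
    """
    if bucket_seconds <= 0:
        raise ValueError("bucket_seconds must be positive")
    buckets = defaultdict(list)
    for ts, value in samples:
        # Floor each timestamp to the start of the bucket it falls into.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical data: two minutes of 1-second latency samples (milliseconds).
samples = [(t, 20.0 if t < 60 else 40.0) for t in range(120)]
per_minute = downsample(samples, 60)
```

Calling `downsample(samples, 1)` keeps every 1-second point, while a larger `bucket_seconds` yields the coarser series a weekly view would plot.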
The whole team has access to the Grafana panel and asks questions all the time about what’s causing weird behavior in the system. Dashboard queries are much easier to understand in CrateDB (SQL) than in NoSQL sources like Elasticsearch. “I’m glad it’s SQL behind those charts. If I had to go through Elasticsearch language to figure out what’s going on behind a graph or answer new questions, we wouldn’t be nearly as responsive to new requirements.”
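As a rough idea of what a readable SQL query behind such a chart could look like, the sketch below builds a per-bucket average using CrateDB's `date_trunc` function. The table name `sage_metrics`, the columns `ts` and `latency_ms`, and the helper `build_latency_query` are hypothetical; the case study does not show the team's actual schema or queries.

```python
# Time units this sketch accepts for date_trunc (a deliberately small subset).
ALLOWED_UNITS = {"second", "minute", "hour", "day", "week"}


def build_latency_query(unit):
    """Return a SQL string that averages latency per time bucket.

    The table `sage_metrics` and the columns `ts` and `latency_ms` are
    hypothetical. The two `?` placeholders stand for the start and end of
    the time range selected in the dashboard.
    """
    if unit not in ALLOWED_UNITS:
        # Whitelist the unit because it is interpolated into the SQL text.
        raise ValueError(f"unsupported unit: {unit}")
    bucket_expr = f"date_trunc('{unit}', ts)"
    return (
        f"SELECT {bucket_expr} AS bucket, "
        "avg(latency_ms) AS avg_latency_ms "
        "FROM sage_metrics "
        "WHERE ts >= ? AND ts < ? "
        f"GROUP BY {bucket_expr} "
        "ORDER BY bucket"
    )


query = build_latency_query("minute")
```

Switching the `unit` argument from `"week"` to `"second"` changes only the bucket expression, which is the kind of change a teammate can read and adjust without learning a separate query language.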