Monte Carlo is the world’s first fully integrated platform for data reliability monitoring. In the same way that New Relic and Datadog help engineering teams manage application reliability, Monte Carlo helps data teams manage data reliability. By creating observability around data, Monte Carlo allows data teams to:
- Know when data breaks as soon as it happens
- Find root cause, fast
- Prevent data issues from occurring
Ultimately, Monte Carlo radically reduces data downtime, allowing data teams to mitigate the impact of bad data on their business while reducing the effort and time spent by data engineers, data scientists and data analysts monitoring, identifying and troubleshooting data issues.
This quick start guide will help you get familiar with the Monte Carlo platform by walking you through some common use cases and examples. By the end, you should be up and running and able to navigate the platform with ease!
Monte Carlo monitors freshness, volume, and schema changes out of the box. No configuration is required beyond connecting us to your data environment. Once connected, we start observing data trends, such as how often ETL cycles run and how much volume to expect after every job. After a week or two, we start notifying you of anomalies in those trends. Monte Carlo can send alerts about these anomalies through several different routes. If you receive an alert through your chosen channel and want to learn more, navigate to the “Incidents” tab in your workspace:
Find the incident you were notified about by filtering for the type of incident (e.g. freshness anomalies, schema changes), the status (e.g. investigating, expected), and even the dataset:
Once you have found the incident you are looking for, you can see some high-level information at a glance. To get more information, click on the blue caret to navigate to the Incident IQ:
In this view, you can modify incident summary information, such as changing the status and assigning an owner, and get an in-depth analysis of the incident, including an incident timeline, recent query logs to and from the table, and BI reports affected downstream:
From here, you can also easily navigate to the Catalog or Pipelines view of the table, click through past incidents, add to the incident timeline, add comments, or use our pre-outlined runbook for quick, efficient incident management.
Many of the routes we send alerts through (e.g. Slack) provide a handy button in the alert so that instead of navigating through the UI to get to your incident, all you have to do is click the button in the alert, and it will take you to the Incident IQ page directly!
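To make the out-of-the-box freshness monitoring described earlier more concrete, here is a minimal sketch of how a freshness anomaly could be detected from a table's update history. It is an illustration only and not Monte Carlo's actual detection model: the function names, the median-based cadence, and the `tolerance` multiplier are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from statistics import median


def learn_update_interval(update_times: list[datetime]) -> timedelta:
    """Estimate the typical gap between consecutive table updates."""
    if len(update_times) < 2:
        raise ValueError("need at least two updates to learn a cadence")
    ordered = sorted(update_times)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return timedelta(seconds=median(gaps))


def is_freshness_anomaly(update_times: list[datetime], now: datetime,
                         tolerance: float = 2.0) -> bool:
    """Flag the table when the time since its last update exceeds
    `tolerance` times the learned interval (hypothetical rule)."""
    typical = learn_update_interval(update_times)
    since_last = now - max(update_times)
    return since_last > typical * tolerance


# Two weeks of history: the table was updated daily at 06:00.
history = [datetime(2024, 1, day, 6, 0) for day in range(1, 15)]
print(is_freshness_anomaly(history, now=datetime(2024, 1, 15, 7, 0)))  # False: 25h since last update
print(is_freshness_anomaly(history, now=datetime(2024, 1, 17, 6, 0)))  # True: 72h since last update
```

The learning period matters here: with only a few observed updates, the estimated cadence is unreliable, which is one way to picture why alerts begin after a week or two of observation.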
You can find a list of all tables, views, and BI dashboards in the “Catalog” tab. To find a specific resource, navigate to the “Catalog” and search for your table:
Click on your desired table to get a high-level overview of the lineage, recent incidents, reports connected to it, freshness, size, and more:
You can also add or update tags from here, see a list of fields, and view the queries that load data to and from the table. If you would like to make changes to a table, this is a great place to start: you can see what the table currently looks like and how your ETL(s) will be affected.
Monte Carlo offers the option to create Service Level Indicator (“SLI”) monitors. If you have any tables that need to be updated by a certain time or be a certain size, you have the ability to set up a Freshness or Volume SLI.
To create this monitor, navigate to “Monitors” from your home page and click on the “CREATE NEW MONITOR” button:
Then, choose “Freshness and volume SLI”:
Select your table or view and SLI type:
Then you can pick the schedule and threshold:
Add a description like “7am SLI for Exec Dashboard Data”, and you are all set to go!
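Conceptually, a Freshness SLI asks "was this table updated by the deadline?" and a Volume SLI asks "is this table at least the expected size?". The sketch below models both checks in plain Python. The class, field, and table names are hypothetical and do not reflect Monte Carlo's internal representation. It also assumes that "updated by 7am" means "updated since midnight and no later than 07:00", evaluated at or after the deadline.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class FreshnessSli:
    table: str
    deadline: time          # the table must be updated by this time each day
    description: str = ""


def freshness_sli_met(sli: FreshnessSli, last_updated: datetime,
                      checked_at: datetime) -> bool:
    """True when the table was updated today, at or before the deadline."""
    start_of_day = datetime.combine(checked_at.date(), time.min)
    deadline_today = datetime.combine(checked_at.date(), sli.deadline)
    return start_of_day <= last_updated <= deadline_today


def volume_sli_met(row_count: int, min_rows: int) -> bool:
    """True when the table holds at least `min_rows` rows."""
    return row_count >= min_rows


sli = FreshnessSli(
    table="analytics.exec_dashboard_data",   # hypothetical table name
    deadline=time(7, 0),
    description="7am SLI for Exec Dashboard Data",
)
check_time = datetime(2024, 1, 15, 7, 0)
print(freshness_sli_met(sli, datetime(2024, 1, 15, 6, 30), check_time))  # True
print(freshness_sli_met(sli, datetime(2024, 1, 14, 23, 0), check_time))  # False: not updated today
```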
One of the most used aspects of Monte Carlo is our Pipeline product. It allows you to view the tables, views, and reports that are upstream and downstream of any node in your pipeline. You can see what is upstream of a dashboard by searching for it in the “Pipelines” tab:
This will default to showing you everything that is upstream of the dashboard:
You can select the types of nodes shown on the page, such as Tables, External, and Looker Dashboards, and the depth (degrees away from the dashboard) if you want to clean up the graph for easier use. In addition, any incidents that impact the dashboard are clearly shown, so you can see at a glance whether your report is up to date or impacted by upstream incidents:
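The upstream view with its node-type and depth filters can be thought of as a depth-limited walk over a lineage graph. The following sketch shows that idea with a breadth-first traversal. The graph, node names, and type labels are made up for illustration, and this is not how the Pipelines product is implemented. One assumption worth noting: the type filter only hides nodes from the result, while the walk still passes through them.

```python
from collections import deque


def upstream_nodes(parents, node_types, start, max_depth=None, show_types=None):
    """Breadth-first walk over upstream edges.

    parents:    maps each node to the nodes it reads from directly
    node_types: maps each node to a type label such as "table"
    Returns {node: depth} for every upstream node within `max_depth`
    whose type is in `show_types` (all types when `show_types` is None).
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if max_depth is not None and depths[current] >= max_depth:
            continue  # do not expand beyond the requested depth
        for parent in parents.get(current, []):
            if parent not in depths:
                depths[parent] = depths[current] + 1
                queue.append(parent)
    return {
        node: depth
        for node, depth in depths.items()
        if node != start
        and (show_types is None or node_types.get(node) in show_types)
    }


# Hypothetical lineage for one dashboard.
parents = {
    "exec_dashboard": ["orders_summary"],
    "orders_summary": ["orders", "customers"],
    "orders": ["s3_raw_orders"],
}
node_types = {
    "exec_dashboard": "dashboard",
    "orders_summary": "table",
    "orders": "table",
    "customers": "table",
    "s3_raw_orders": "external",
}

upstream = upstream_nodes(parents, node_types, "exec_dashboard", max_depth=2)
print(upstream)  # {'orders_summary': 1, 'orders': 2, 'customers': 2}

# Upstream nodes with open incidents tell you whether the dashboard is impacted.
open_incidents = {"customers"}
print(set(upstream) & open_incidents)  # {'customers'}
```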
Monte Carlo offers a range of notification routing out of the box, including Slack, email, PagerDuty, Opsgenie, Mattermost, and webhooks. To set up alerting for your team, navigate to the “Settings” tab, and click on “Notifications” in the left-hand sidebar:
Click on “ADD NOTIFICATION” and pick the channel you want your alert to route to. It is a common use case to set up a PagerDuty or Opsgenie notification for the most important tables for your team, and a Slack or email notification for your less important tables:
Next, choose the incidents you want routed to this channel. You can have all incidents route by default, or you can choose a custom selection, like only custom rule breaches:
If you select rule breaches, we send all rule breaches to your channel by default, or you can pick a select few rules to be notified about:
Finally, you can choose “Custom” for “Datasets & tables” to narrow down the datasets or tables to be notified about:
If you have a large group of tables or datasets you want to be notified about, our API accepts a regex input. Please check out our API docs for information about how to use our API or contact us using the Intercom bot or at [email protected]
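To illustrate how a regex can stand in for a long list of tables, here is a small sketch that routes incidents to channels by matching a pattern against a `dataset.table` name using Python's standard `re` module. It does not call the Monte Carlo API. The rule structure, the incident type labels, and the full-match semantics are assumptions for this example, so check the API docs for the exact input format and regex behavior.

```python
import re
from dataclasses import dataclass


@dataclass
class NotificationRule:
    channel: str                              # e.g. "pagerduty" or "slack"
    table_pattern: str                        # regex matched against "dataset.table"
    incident_types: frozenset = frozenset()   # empty means all incident types


def channels_for(rules, table, incident_type):
    """Return the channels whose rule matches this table and incident type."""
    matched = []
    for rule in rules:
        type_ok = not rule.incident_types or incident_type in rule.incident_types
        if type_ok and re.fullmatch(rule.table_pattern, table):
            matched.append(rule.channel)
    return matched


# Hypothetical routing: page on-call for finance tables, use Slack for staging.
rules = [
    NotificationRule("pagerduty", r"prod_finance\..*"),
    NotificationRule("slack", r"staging_.*\..*", frozenset({"custom_rule_breach"})),
]
print(channels_for(rules, "prod_finance.invoices", "freshness"))           # ['pagerduty']
print(channels_for(rules, "staging_events.clicks", "freshness"))           # []
print(channels_for(rules, "staging_events.clicks", "custom_rule_breach"))  # ['slack']
```

Note the escaped dot (`\.`) between dataset and table: an unescaped `.` matches any character and can select more tables than intended.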
Monte Carlo offers a variety of reports to help you understand your data. If you navigate to the “Dashboard” tab and click on the “Insights Reports” tab, you can see all the available reports:
One of the most useful reports is Key Assets, a collection of your tables and views that we have determined to be “important”. This determination is based on a number of factors, such as the number of reads/writes a day, the number of dependencies, and the number of users querying the table. Importance is scored on a scale of 0.0 to 1.0, with 1.0 assigned to the most important tables. This is a great place to start if you are unsure where to add custom monitors and what to receive notifications about! Many of our users start customizing the product with this report: they identify their most important tables to monitor, set up SQL rules/Field Health/Dimension Tracking, configure notification routing for them, and work their way down from there as they figure out what workflow works best for their teams.
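The exact formula behind the Key Assets score is not described in this guide. Purely to build intuition for how several usage signals can be combined into a single 0.0 to 1.0 value, the sketch below min-max normalizes each signal across tables and averages the results. The signal names, the equal weighting, and the table names are all hypothetical.

```python
def importance_scores(tables):
    """Combine per-table usage signals into a score between 0.0 and 1.0.

    `tables` maps a table name to a dict of raw signals. Each signal is
    min-max normalized across tables, then the normalized signals are
    averaged with equal weight (an assumption made for this sketch).
    """
    if not tables:
        return {}
    signals = ["reads_writes_per_day", "dependencies", "users"]
    scores = {name: 0.0 for name in tables}
    for signal in signals:
        values = [stats[signal] for stats in tables.values()]
        low, high = min(values), max(values)
        spread = high - low
        for name, stats in tables.items():
            normalized = (stats[signal] - low) / spread if spread else 0.0
            scores[name] += normalized / len(signals)
    return scores


tables = {
    "orders":   {"reads_writes_per_day": 900, "dependencies": 12, "users": 40},
    "sessions": {"reads_writes_per_day": 455, "dependencies": 6,  "users": 10},
    "tmp_load": {"reads_writes_per_day": 10,  "dependencies": 0,  "users": 1},
}
# Rank tables from most to least important to decide where to add monitors first.
for name, score in sorted(importance_scores(tables).items(),
                          key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Working down a ranked list like this mirrors the workflow described above: start with monitors and notification routing for the highest-scoring tables, then continue from there.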