What is BigQuery?

BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence. BigQuery's serverless architecture lets you use SQL queries to answer your organization's biggest questions with zero infrastructure management. BigQuery's scalable, distributed analysis engine lets you query terabytes in seconds and petabytes in minutes.

Why Connect BigQuery to Monte Carlo?

Connecting your BigQuery account to Monte Carlo and granting access to the data within it is essential for Monte Carlo to function. Monte Carlo needs to collect metadata about your BigQuery tables to enable our out-of-the-box Freshness and Volume monitors. With access to the query logs of those tables, we can build table and field lineage. With the ability to run SQL queries on those tables, Monte Carlo can run advanced monitors that analyze the data itself.

Monte Carlo is a Data Observability tool after all, so the biggest step in getting value from it is letting it observe your data!

After connecting BigQuery, here is what you can expect to see out of the box:

Freshness and Volume:

Our freshness and volume monitors are active out of the box. By collecting metadata on your tables, Monte Carlo’s machine learning algorithm can set automatic thresholds to alert when a table has gone too long without an update, or when an abnormal number of rows is added or deleted.
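For context, these monitors rely on table-level metadata that BigQuery already exposes. The query below is only an illustration of that metadata, not Monte Carlo's implementation; my_project and my_dataset are placeholders for your own project and dataset.

```bash
# Illustration only: the table metadata (last update time, row count) that
# freshness and volume monitoring is based on. my_project and my_dataset
# are placeholders.
bq query --use_legacy_sql=false '
SELECT
  table_id,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified,  -- freshness signal
  row_count                                                -- volume signal
FROM `my_project.my_dataset.__TABLES__`
ORDER BY last_modified DESC'
```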

Query Logs:

With access to query logs, Monte Carlo can show the number of read and write queries on a table and give you access to the details of those queries. Monte Carlo can even highlight when a query appears to have failed.
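To illustrate the kind of job metadata this exposes (Monte Carlo may collect it through the BigQuery API rather than a query like this), BigQuery's INFORMATION_SCHEMA jobs view records the submitting user, timing, bytes processed, and any error for each query. The region-us qualifier is an assumption; use the region where your data lives.

```bash
# Illustration only: per-query metadata available once jobs can be listed.
# Adjust region-us to the location of your datasets.
bq query --use_legacy_sql=false '
SELECT
  job_id,
  user_email,
  statement_type,
  start_time,
  end_time,
  total_bytes_processed,
  error_result.reason AS error_reason  -- populated when a query failed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY start_time DESC
LIMIT 20'
```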

Table Lineage:

Monte Carlo maps the relationships between your tables and can show upstream and downstream connections. We can even connect anomalies detected in upstream tables to anomalies in downstream tables.

Connecting BigQuery

Now that we know the value connecting BigQuery to Monte Carlo can bring, here are the necessary steps to set up the integration.

πŸ“˜

Admin credentials required

To create the read-only service account for Monte Carlo, you will need Owner permissions on the BigQuery project.

This guide explains how to create a read-only service account for Monte Carlo on BigQuery.

To review all steps necessary to integrate a data warehouse with Monte Carlo, please see here.

Creating a service account for a single BigQuery project

First, create a role for Monte Carlo's service account:

  1. Under IAM & Admin, go to the Roles section in your Google Cloud Platform console.
  2. Select the project to which your BigQuery warehouse belongs using the combo box on the top left of your dashboard.
  3. Click the Create Role button at the top of the tab.
  4. Give the new role a name. We recommend "Data Reliability Monitor".
  5. Change the Role launch stage to "General Availability".
  6. Click Add Permissions and add the permissions specified below to the role. To make the process faster, consider filtering the permission list by the role BigQuery Admin.
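If you prefer the command line, the same role can be created with gcloud using the permissions listed under "Permissions Monte Carlo requires" below. The role ID and project ID here are placeholders; adjust them to your own naming.

```bash
# Sketch: create the custom role from the terminal instead of the console.
# data_reliability_monitor and my-gcp-project are placeholders.
gcloud iam roles create data_reliability_monitor \
  --project="my-gcp-project" \
  --title="Data Reliability Monitor" \
  --stage="GA" \
  --permissions="bigquery.datasets.get,bigquery.datasets.getIamPolicy,bigquery.jobs.get,bigquery.jobs.list,bigquery.jobs.listAll,bigquery.jobs.create,bigquery.tables.get,bigquery.tables.getData,bigquery.tables.list,storage.buckets.list,storage.buckets.get,storage.objects.list,storage.objects.get,resourcemanager.projects.get"
```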

Now, create a service account:

  1. Under IAM & Admin, go to the Service Accounts section in your Google Cloud Platform console.
  2. Click the Create Service Account button at the top of the tab.
  3. Give the account a name and continue. We recommend naming the account "monte-carlo".
  4. Assign the role you previously created to the service account and continue.
  5. Click the Create Key button, select JSON as the type and click Create. A JSON file will download – please keep it safe as it grants access to your BigQuery data.
  6. Click Done to complete the creation of Monte Carlo's service account.

Finally, upload the JSON file you downloaded into Monte Carlo's onboarding wizard to finalize the integration.
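If you created the role from the command line, the service account and key can be sketched the same way. The project ID, custom role ID, and key file name below are placeholders, and the console flow above is equivalent.

```bash
# Sketch of the console steps above using gcloud. PROJECT_ID and the custom
# role ID data_reliability_monitor are placeholders.
PROJECT_ID="my-gcp-project"

# Create the service account
gcloud iam service-accounts create monte-carlo \
  --project="$PROJECT_ID" \
  --display-name="monte-carlo"

# Grant it the custom role created earlier
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:monte-carlo@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="projects/${PROJECT_ID}/roles/data_reliability_monitor"

# Create the JSON key to upload in Monte Carlo's onboarding wizard
gcloud iam service-accounts keys create monte-carlo-key.json \
  --iam-account="monte-carlo@${PROJECT_ID}.iam.gserviceaccount.com"
```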

Permissions Monte Carlo requires

bigquery.datasets.get
bigquery.datasets.getIamPolicy
bigquery.jobs.get
bigquery.jobs.list
bigquery.jobs.listAll
bigquery.jobs.create
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.list
storage.buckets.list
storage.buckets.get
storage.objects.list
storage.objects.get
resourcemanager.projects.get

Monitoring multiple projects using Monte Carlo

Allowing Monte Carlo's service account to access multiple projects can help with the following use cases:

  • Tracking multiple datasets spread across more than one BigQuery project.
  • Tracking query logs and lineage generated by users associated with more than one project.

To add an additional project to Monte Carlo's service role:

  1. Select the project using the combo box on the top left of the Google Cloud Platform console.
  2. Create a role that grants BigQuery access with the appropriate permissions following the instructions above. Optionally, you may use the gcloud iam roles copy command in your terminal to copy the role over from another project (this is much quicker!); see the sketch after this list.
  3. Under IAM & Admin, go to the IAM section in your Google Cloud Platform console.
  4. Click the Add button, provide Monte Carlo's service account email address and click Save.
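The following sketch shows the role copy and the IAM grant for the additional project; all project IDs and the role ID are placeholders.

```bash
# Sketch: copy the custom role into an additional project and grant the
# existing Monte Carlo service account access there. All IDs are placeholders.
SOURCE_PROJECT="my-gcp-project"
DEST_PROJECT="my-other-project"

# Copy the role definition from the original project
gcloud iam roles copy \
  --source="data_reliability_monitor" \
  --source-project="$SOURCE_PROJECT" \
  --destination="data_reliability_monitor" \
  --dest-project="$DEST_PROJECT"

# Grant the service account the copied role in the additional project
gcloud projects add-iam-policy-binding "$DEST_PROJECT" \
  --member="serviceAccount:monte-carlo@${SOURCE_PROJECT}.iam.gserviceaccount.com" \
  --role="projects/${DEST_PROJECT}/roles/data_reliability_monitor"
```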

You may add any number of projects to Monte Carlo's service account.

BigQuery Permissions

| BigQuery Permission | Monte Carlo Requirement |
| --- | --- |
| bigquery.datasets.get | Get metadata about a dataset. Allows Monte Carlo to see the INFORMATION_SCHEMA views as well as list the available datasets in BigQuery. |
| bigquery.datasets.getIamPolicy | Read a dataset's IAM policy. Allows Monte Carlo to read a list of available datasets in BigQuery. Datasets store tables in a project. |
| bigquery.jobs.get | Get data and metadata on any job (query) in BigQuery. Allows Monte Carlo to read BigQuery query logs. |
| bigquery.jobs.list | List all jobs and retrieve metadata on any job submitted by any user. For jobs submitted by other users, details and metadata are redacted. Allows Monte Carlo to get metadata on BigQuery query logs, such as start time, end time, and amount of data processed. |
| bigquery.jobs.listAll | List all jobs and retrieve metadata on any job submitted by any user. Allows Monte Carlo to get metadata on BigQuery query logs. |
| bigquery.jobs.create | Run jobs (including queries) within the project. Allows Monte Carlo to run SELECT queries on the INFORMATION_SCHEMA views and on the tables to be monitored. |
| bigquery.tables.get | Get BigQuery table metadata. Allows Monte Carlo to get table metadata such as creation time, number of rows, and byte size of the data. |
| bigquery.tables.getData | Get table data; this permission is required for querying table data. Allows Monte Carlo to run SELECT queries for opt-in monitors on aggregated statistics. |
| bigquery.tables.list | List tables and metadata on tables. Allows Monte Carlo to see table metadata and see all tables in a dataset. |
| resourcemanager.projects.get | Read project metadata. Allows Monte Carlo to list the GCP projects it can access and iterate through them to collect BigQuery metadata. The service account can only see projects it has been granted access to, so this permission is necessary. |

External Table Permissions

The following permissions are required for External Tables in BigQuery.

| BigQuery Permission | Monte Carlo Requirement |
| --- | --- |
| storage.buckets.list | List buckets in a project and read bucket metadata, excluding IAM policies, when listing. Allows Monte Carlo to see metadata of External Tables. |
| storage.buckets.get | Read bucket metadata, excluding IAM policies, and list or read the Pub/Sub notification configurations on a bucket. Allows Monte Carlo to get metadata of External Tables. |
| storage.objects.list | List objects in a bucket and read object metadata, excluding ACLs, when listing. Allows Monte Carlo to see metadata of External Tables. |
| storage.objects.get | Read object data and metadata, excluding ACLs. Allows Monte Carlo to get metadata of External Tables. |

Limitations

  • Field Lineage is not available for nested struct data types.

How Do I Know it Worked? (BigQuery Connection Validation Tests)

After the initial integration of BigQuery and Monte Carlo, you should see your assets in Monte Carlo within a few minutes. If they are still missing, you can run a validation test on your BigQuery connection by going to the Integration tab on the Settings page and selecting Test from the menu next to your BigQuery connection:

When running the tests we expect to see all green checkmarks as below. Any error that is encountered should come with steps to fix it:

If assets from a specific database or schema are missing you can use the Data Asset validation to test our ability to access them:
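If assets still do not appear, one way to sanity-check the credentials outside Monte Carlo (an assumption, not part of Monte Carlo's own validation) is to authenticate as the service account and query metadata from a dataset you expect to be monitored; the key file, project, and dataset names below are placeholders.

```bash
# Sketch: manually confirm the service account can read table metadata.
# monte-carlo-key.json, my-gcp-project, and my_dataset are placeholders.
gcloud auth activate-service-account \
  --key-file="monte-carlo-key.json"

bq query --use_legacy_sql=false --project_id="my-gcp-project" '
SELECT table_name, table_type
FROM `my-gcp-project.my_dataset.INFORMATION_SCHEMA.TABLES`
LIMIT 10'
```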