EMR/Presto logs

🚧

S3 events are the recommended mechanism for fetching query logs from EMR.

📘

Prerequisites

Requires permission to create IAM roles and policies in AWS.

To enable query log ingestion by Monte Carlo from an S3 location, follow these steps:

  1. Create a role that allows S3 access for Monte Carlo's data collector.
  2. Provide the role's information to Monte Carlo to validate and complete the integration.

Log formats supported by Monte Carlo

Monte Carlo can currently ingest and process the following formats:

  1. Hive logs created by AWS EMR using its default logging configuration.
  2. Presto query logs exported to S3. The logs are expected to have the following schema:
{
   "queryId":"20200219_173831_00731_6rarz",
   "query":"\nselect * from some_table\n\n",
   "sessionSchema":"default",
   "sessionCatalog":"hive",
   "user":"joe",
   "userAgent":"python-requests/2.18.4",
   "principal":null,
   "sourceIp":"1.2.3.4",
   "coordinatorIp":"1.2.3.5",
   "connectorType":"pyhive",
   "environment":"prod",
   "startTime":1582133911984,
   "endTime":1582134280539,
   "outputRows":5955227,
   "outputBytes":5884105195,
   "writtenRows":0,
   "writtenBytes":0,
   "peakUserMemoryBytes":6864015462,
   "cpuTime":1547293,
   "queryFailureType":null,
   "queryFailureMessage":null,
   "queryFailureCode":null
}
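Before pointing Monte Carlo at a bucket, it can be useful to spot-check that exported records match this schema. The following is a minimal sketch, not part of Monte Carlo's tooling: the function names are hypothetical, treating every field in the sample as required is an assumption, and so is reading startTime and endTime as epoch milliseconds (which is consistent with the timestamp embedded in the sample queryId).

```python
import json

# Field names copied from the sample record above. Treating every one of them
# as required is an assumption made for this sketch.
EXPECTED_FIELDS = frozenset({
    "queryId", "query", "sessionSchema", "sessionCatalog", "user",
    "userAgent", "principal", "sourceIp", "coordinatorIp", "connectorType",
    "environment", "startTime", "endTime", "outputRows", "outputBytes",
    "writtenRows", "writtenBytes", "peakUserMemoryBytes", "cpuTime",
    "queryFailureType", "queryFailureMessage", "queryFailureCode",
})


def parse_presto_log_record(text: str) -> dict:
    """Parse one JSON log record and check that every expected key is present."""
    record = json.loads(text)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = EXPECTED_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


def wall_time_seconds(record: dict) -> float:
    """Elapsed query time, assuming startTime/endTime are epoch milliseconds."""
    return (record["endTime"] - record["startTime"]) / 1000
```

Null values (as in principal or the queryFailure fields of the sample) pass the check, since only the presence of each key is verified.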

Creating an IAM role for log access

To give Monte Carlo access to logs on S3, create an IAM role with the necessary API permissions:

  1. Copy the policy below, replacing <bucket> with the name of the S3 bucket that holds your query logs.
{
    "Statement": [
        {
            "Action": [
                "s3:GetObjectAcl",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<bucket>/*",
                "arn:aws:s3:::<bucket>"
            ]
        }
    ],
    "Version": "2012-10-17"
}
  2. Follow the steps outlined here to create the IAM role. You will attach the policy from step 1 to this role as part of the process.
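If you manage several log buckets, substituting the bucket name by hand is easy to get wrong. The sketch below generates the same policy document for a given bucket. The function name build_log_access_policy and the bucket name example-query-logs are hypothetical; the actions and resource patterns are copied from the policy above.

```python
import json


def build_log_access_policy(bucket: str) -> dict:
    """Return the read-only S3 policy shown above for one bucket name."""
    # A bare bucket name is expected here, not an s3:// URL or a key prefix.
    if not bucket or "/" in bucket or ":" in bucket:
        raise ValueError("bucket must be a bare S3 bucket name")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObjectAcl",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket}/*",  # objects in the bucket
                    f"arn:aws:s3:::{bucket}",    # the bucket itself
                ],
            }
        ],
    }


if __name__ == "__main__":
    # "example-query-logs" is a placeholder; use your own bucket name.
    print(json.dumps(build_log_access_policy("example-query-logs"), indent=4))
```

The printed JSON can be pasted into the policy editor when you create the role.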

Providing role information to Monte Carlo

You will provide connection details for EMR/Presto logs using Monte Carlo's CLI:

  1. Please follow this guide to install and configure the CLI.
  2. Run montecarlo integrations add-presto-logs to set up Presto logs, or montecarlo integrations add-hive-logs to set up Hive logs from EMR. The help output for both commands is shown below:
$ montecarlo integrations add-presto-logs --help
Usage: montecarlo integrations add-presto-logs [OPTIONS]

  Setup a Presto logs integration (S3). For query logs.

Options:
  --bucket TEXT        S3 Bucket where query logs are contained.  [required]
  --prefix TEXT        Path to query logs.  [required]
  --role TEXT          Assumable role ARN to use for accessing AWS resources.
  --external-id TEXT   An external id, per assumable role conditions.
  --collector-id UUID  ID for the data collector. To disambiguate accounts
                       with multiple collectors.

  --skip-validation    Skip all connection tests. This option cannot be used
                       with 'validate-only'.

  --validate-only      Run connection tests without adding. This option cannot
                       be used with 'skip-validation'.

  --auto-yes           Skip any interactive approval.  [default: False]
  --option-file FILE   Read configuration from FILE.
  --help               Show this message and exit.
$ montecarlo integrations add-hive-logs --help
Usage: montecarlo integrations add-hive-logs [OPTIONS]

  Setup a Hive EMR logs integration (S3). For query logs.

Options:
  --bucket TEXT        S3 Bucket where query logs are contained.  [required]
  --prefix TEXT        Path to query logs.  [required]
  --role TEXT          Assumable role ARN to use for accessing AWS resources.
  --external-id TEXT   An external id, per assumable role conditions.
  --collector-id UUID  ID for the data collector. To disambiguate accounts
                       with multiple collectors.

  --skip-validation    Skip all connection tests. This option cannot be used
                       with 'validate-only'.

  --validate-only      Run connection tests without adding. This option cannot
                       be used with 'skip-validation'.

  --auto-yes           Skip any interactive approval.  [default: False]
  --option-file FILE   Read configuration from FILE.
  --help               Show this message and exit.
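To make the options above concrete, the following sketch assembles an invocation of either command from Python. It uses only flags listed in the help output; the function name build_add_logs_command is hypothetical, and the bucket, prefix, and role ARN in the usage section are placeholders, not real values.

```python
import shlex


def build_add_logs_command(kind, bucket, prefix, role=None,
                           external_id=None, validate_only=False):
    """Build the argument list for add-presto-logs or add-hive-logs."""
    if kind not in ("presto", "hive"):
        raise ValueError("kind must be 'presto' or 'hive'")
    cmd = [
        "montecarlo", "integrations", f"add-{kind}-logs",
        "--bucket", bucket,   # required by the CLI
        "--prefix", prefix,   # required by the CLI
    ]
    if role:
        cmd += ["--role", role]
    if external_id:
        cmd += ["--external-id", external_id]
    if validate_only:
        # Runs the connection tests without adding the integration.
        cmd.append("--validate-only")
    return cmd


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    cmd = build_add_logs_command(
        "presto",
        bucket="example-query-logs",
        prefix="presto/logs/",
        role="arn:aws:iam::123456789012:role/example-log-reader",
        validate_only=True,
    )
    print(shlex.join(cmd))
```

Once the printed command looks right, the list can be passed to subprocess.run(cmd, check=True) on a machine where the CLI is installed and configured. Starting with --validate-only lets you confirm the role and bucket before the integration is added.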