S3 Events - Query Logs

📘

Prerequisites

To complete this guide, you will need admin credentials for AWS.

To fetch query logs at scale for tables stored on S3, follow these steps:

  1. Identify data lake buckets.
  2. Configure event notifications for data lake buckets.
  3. Enable events to complete the integration.

Identify Data Lake Buckets

Identify buckets that store query logs for your EMR data lake. Your representative can help with this.

Configure Event Notifications

👍

The CLI can be used to configure event notifications automatically. See here.

If you prefer to use the AWS console to configure S3 event notifications for a data lake bucket, follow one of the guides below, based on your environment.

| Guide | Are the bucket and the data collector in the same region? | Are the bucket and the data collector in the same AWS account? | Are there existing S3 event triggers? |
| --- | --- | --- | --- |
| Scenario one | Yes | Doesn't matter | No |
| Scenario two | No | Yes | No |
| Scenario three | Doesn't matter | Yes | Yes |
| Scenario four | No | No | No |
| Scenario five | No | No | Yes |
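
The scenario table above can also be read as a small decision function. The following is a minimal sketch, not part of the Monte Carlo tooling: the function name `pick_scenario` and its parameters are hypothetical, and it returns `None` for the one combination the table does not list.

```python
from typing import Optional


def pick_scenario(same_region: bool, same_account: bool,
                  existing_triggers: bool) -> Optional[str]:
    """Return the matching guide from the table, or None if no row matches."""
    if not existing_triggers:
        if same_region:
            return "Scenario one"  # the account does not matter
        return "Scenario two" if same_account else "Scenario four"
    # From here on, the bucket already has S3 event triggers.
    if same_account:
        return "Scenario three"  # the region does not matter
    if not same_region:
        return "Scenario five"
    # Same region, different account, existing triggers: not in the table.
    return None


print(pick_scenario(same_region=False, same_account=True, existing_triggers=False))
# prints: Scenario two
```

If the function returns `None`, your environment is not covered by the table and your representative is the best point of contact.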

Enable events

You will enable events using Monte Carlo's CLI:

  1. Please follow this guide to install and configure the CLI.
  2. Please use the command montecarlo integrations toggle-ql-events to enable events.

📘

If you configured access to the bucket using an assumable role, use the options --role and --external-id.

$ montecarlo integrations toggle-ql-events --help
Usage: montecarlo integrations toggle-ql-events [OPTIONS]

  Toggle S3 query log events for a Hive lake in EMR. For tracking data
  freshness and volume at scale. Requires s3 notifications to be configured
  first.

Options:
  --enable / --disable  Enable or disable events. Enables if not specified.
  --role TEXT           Assumable role ARN to use for accessing AWS resources.
  --external-id TEXT    An external id, per assumable role conditions.
  --type [hive-emr]     Type of the integration.  [default: hive-emr;
                        required]

  --option-file FILE    Read configuration from FILE.
  --help                Show this message and exit.
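
If you script the toggle, the command line can be assembled from the options shown in the help text above. This is a minimal sketch that only builds and prints the command; the helper name `toggle_ql_events_command`, the role ARN and the external ID are placeholders, not values from this guide.

```python
import shlex
from typing import List, Optional


def toggle_ql_events_command(enable: bool = True,
                             role_arn: Optional[str] = None,
                             external_id: Optional[str] = None) -> List[str]:
    """Build the argument list using only the flags listed in --help."""
    args = ["montecarlo", "integrations", "toggle-ql-events",
            "--enable" if enable else "--disable"]
    if role_arn:
        args += ["--role", role_arn]
    if external_id:
        args += ["--external-id", external_id]
    return args


# Placeholder values; substitute your own assumable role and external ID.
command = toggle_ql_events_command(
    role_arn="arn:aws:iam::123456789012:role/example-role",
    external_id="example-external-id",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the CLI is installed and configured.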

Scenario One

Follow these steps to enable S3 events if your needs fit under "scenario one":

  1. Retrieve relevant SQS ARNs
  2. Retrieve your account ID
  3. Open the S3 event management pane
  4. Update the SQS access policy
  5. Create event notification
  6. Set up permissions to the bucket

Retrieve Relevant SQS ARNs
Follow these steps to get the relevant SQS ARNs. If the data collector is managed by Monte Carlo, please reach out to your representative for these values instead.

  1. Open the CloudFormation console and search for the Monte Carlo data collector. Select the stack.
  2. Select the “Outputs” tab.
  3. Save the Metadata Queue ARN for later.
    Key: MetadataEventQueue

Retrieve your Account ID
Follow these steps to retrieve your account ID. If the data collector is managed by Monte Carlo, please reach out to your representative for these values instead.

Be sure you are logged in to the same account as the Monte Carlo Collector before proceeding.

  1. From the console, select your username in the upper right corner.
  2. Select “My Account”.
  3. Save the Account ID (without dashes) for later.

Open the S3 Event Management Pane
Follow these steps to locate the event configuration page for the bucket you want to enable events for.

  1. Open the S3 console and search for the bucket that you would like to enable events for.
  2. Select the bucket.
  3. Select “Copy Bucket ARN” and save the bucket ARN for later.
  4. Select the “Properties” tab. Leave this page open; you will come back to it later.

Update the SQS Access Policy
Follow these steps to allow your S3 bucket to write to the relevant queue. If the data collector is managed by Monte Carlo, you can skip these steps by sending the S3 bucket ARN to your representative, who will in turn send you the SQS ARN and the relevant account ID.

  1. Open the SQS console in the account the Monte Carlo Collector was deployed to.
  2. Search for the queue. The name follows this structure: {CF_STACK}-QueryLogEventQueue-{RANDOM_STR}.
  3. Select the queue and confirm that the ARN matches the ARN you saved previously.
  4. Select the “Access Policy” tab and select “Edit”.

If the access policy is empty or looks something like this:

{
  "Version": "2012-10-17",
  "Id": "arn:aws:sqs:<region>:<account>:<name>/SQSDefaultPolicy"
}

Paste the following policy, replacing the values in angle brackets:

  • The COLLECTOR_ACCOUNT_ID is the account ID you saved in the "Retrieve your account ID" subsection.
  • The EVENT_QUEUE_ARN is the ARN you saved in the "Retrieve relevant SQS ARNs" subsection.
  • The S3_ARN is the bucket ARN, which you saved in the "Open the S3 event management pane" subsection.
{
   "Version":"2008-10-17",
   "Statement":[
      {
         "Sid":"__owner",
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::<COLLECTOR_ACCOUNT_ID>:root"
         },
         "Action":"SQS:*",
         "Resource":"<EVENT_QUEUE_ARN>"
      },
      {
         "Sid":"__sender",
         "Effect":"Allow",
         "Principal":{
            "AWS":"*"
         },
         "Action":"SQS:SendMessage",
         "Resource":"<EVENT_QUEUE_ARN>",
         "Condition":{
            "ArnLike":{
               "aws:SourceArn":[
                  "<S3_ARN>"
               ]
            }
         }
      }
   ]
}
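
To avoid hand-editing the placeholders, the same policy can be generated from the three saved values. This is a minimal sketch using only the Python standard library; the function name `build_queue_policy` and the example ARNs are hypothetical, and the output still has to be pasted into the SQS console as described above.

```python
import json


def build_queue_policy(collector_account_id: str, event_queue_arn: str,
                       source_arn: str) -> dict:
    """Fill in the access policy shown above with the saved values."""
    return {
        "Version": "2008-10-17",
        "Statement": [
            {
                "Sid": "__owner",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{collector_account_id}:root"},
                "Action": "SQS:*",
                "Resource": event_queue_arn,
            },
            {
                "Sid": "__sender",
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "SQS:SendMessage",
                "Resource": event_queue_arn,
                # Only the listed source ARNs may send messages to the queue.
                "Condition": {"ArnLike": {"aws:SourceArn": [source_arn]}},
            },
        ],
    }


# Placeholder values for illustration only.
policy = build_queue_policy(
    collector_account_id="123456789012",
    event_queue_arn="arn:aws:sqs:us-east-1:123456789012:example-queue",
    source_arn="arn:aws:s3:::example-bucket",
)
print(json.dumps(policy, indent=2))
```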

However, if the access policy already has a statement with the Sid “__sender” (i.e. it looks like the policy above), append your S3_ARN to the SourceArn list instead. The S3_ARN was saved in the "Open the S3 event management pane" subsection.

"aws:SourceArn": [
            "arn:aws:s3:::existing_bucket",
            "<S3_ARN>"
          ]
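
Appending to an existing policy is easy to get wrong by hand, for example when aws:SourceArn is a single string rather than a list. The following is a minimal sketch of the append step, assuming the policy has already been loaded into a Python dictionary; the function name `add_sender_source` is hypothetical.

```python
import copy


def add_sender_source(policy: dict, new_arn: str) -> dict:
    """Return a copy of the policy with new_arn added to the __sender statement."""
    updated = copy.deepcopy(policy)
    for statement in updated.get("Statement", []):
        if statement.get("Sid") != "__sender":
            continue
        arn_like = statement.setdefault("Condition", {}).setdefault("ArnLike", {})
        sources = arn_like.get("aws:SourceArn", [])
        if isinstance(sources, str):
            sources = [sources]  # normalize a single ARN to a list
        if new_arn not in sources:
            sources = sources + [new_arn]  # keep existing entries, avoid duplicates
        arn_like["aws:SourceArn"] = sources
        return updated
    raise ValueError("no statement with Sid '__sender' found")


existing = {
    "Version": "2008-10-17",
    "Statement": [{
        "Sid": "__sender",
        "Effect": "Allow",
        "Principal": {"AWS": "*"},
        "Action": "SQS:SendMessage",
        "Resource": "arn:aws:sqs:us-east-1:123456789012:example-queue",
        "Condition": {"ArnLike": {"aws:SourceArn": ["arn:aws:s3:::existing_bucket"]}},
    }],
}
updated = add_sender_source(existing, "arn:aws:s3:::example-bucket")
print(updated["Statement"][0]["Condition"]["ArnLike"]["aws:SourceArn"])
# prints: ['arn:aws:s3:::existing_bucket', 'arn:aws:s3:::example-bucket']
```

The function works on a copy, so the original policy stays available for comparison before you save the change.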

Create Event Notifications
Depending on the type of integration, you will create one or more S3 event notifications. These will differ by the prefixes and suffixes used as filters.

| Integration | Default prefix | Suffix |
| --- | --- | --- |
| Hive / EMR (2 event notifications) | elasticmapreduce/ | /hive.log.gz |
| Hive / EMR (2 event notifications) | elasticmapreduce/ | /hive-server2.log.gz |

Identify in the table the type of integration you are configuring and repeat these steps for each prefix/suffix pair.

  1. Return to the page you left open in step 4 of the "Open the S3 event management pane" subsection.
  2. Select “Create event notification” under Event notifications.
  3. Fill in a meaningful name.
  4. Specify the prefix and the suffix.
  5. Select “All object create events” under Event types.
  6. Enter the SQS queue ARN you saved in the "Retrieve relevant SQS ARNs" subsection as the Destination queue ARN.
  7. Save changes.
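
For reference, the two console notifications correspond to a notification configuration like the one sketched below. This is an illustration under stated assumptions, not the documented procedure: the function name, the "mc-query-logs" naming scheme and the queue ARN are hypothetical. The dictionary follows the shape accepted by the S3 PutBucketNotificationConfiguration API (for example boto3's `put_bucket_notification_configuration`), which replaces the bucket's entire notification configuration, so any existing entries would need to be merged in first.

```python
from typing import Iterable, Tuple

# Prefix/suffix pairs from the Hive / EMR row of the table above.
HIVE_EMR_FILTERS = [
    ("elasticmapreduce/", "/hive.log.gz"),
    ("elasticmapreduce/", "/hive-server2.log.gz"),
]


def build_queue_notifications(queue_arn: str,
                              filters: Iterable[Tuple[str, str]] = HIVE_EMR_FILTERS,
                              name_prefix: str = "mc-query-logs") -> dict:
    """Build one SQS notification entry per prefix/suffix pair."""
    configs = []
    for index, (prefix, suffix) in enumerate(filters, start=1):
        configs.append({
            "Id": f"{name_prefix}-{index}",    # hypothetical naming scheme
            "QueueArn": queue_arn,
            "Events": ["s3:ObjectCreated:*"],  # "All object create events"
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": prefix},
                {"Name": "suffix", "Value": suffix},
            ]}},
        })
    return {"QueueConfigurations": configs}


config = build_queue_notifications("arn:aws:sqs:us-east-1:123456789012:example-queue")
for entry in config["QueueConfigurations"]:
    print(entry["Id"], entry["Filter"]["Key"]["FilterRules"])
```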

Set Up Permissions to the Bucket
In order to provide access to the query log bucket, you will create a policy and attach it to a role.

On the IAM console, create the policy below. Replace <S3_ARN> with the bucket ARN, which you saved in the "Open the S3 event management pane" subsection. Note that the resource must end with /*; otherwise notifications will not work.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "<S3_ARN>/*"
        }
    ]
}
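
Because the trailing /* is easy to omit, a small helper can add it when generating the policy. This is a minimal sketch with a hypothetical function name and a placeholder bucket ARN.

```python
import json


def build_bucket_read_policy(bucket_arn: str) -> dict:
    """Build the s3:GetObject policy, making sure the resource ends with /*."""
    # Accept "arn", "arn/" or "arn/*" and normalize all of them to "arn/*".
    base = bucket_arn[:-2] if bucket_arn.endswith("/*") else bucket_arn.rstrip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{base}/*",
            }
        ],
    }


print(json.dumps(build_bucket_read_policy("arn:aws:s3:::example-bucket"), indent=4))
```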

You can choose between two alternatives to use this policy:

  1. Use an existing role: on the IAM console, search for the role {CF_STACK}-EventLambdaExecutionRole-{RANDOM_STR} and attach the policy to it.

  2. [Recommended] Create an assumable role: follow the steps outlined here to create a role.

👍

That's it! You are now set up with S3 query logs.

Scenario Two

Follow these steps to enable S3 events if your needs fit under "scenario two":

  1. Retrieve relevant SQS ARNs
  2. Retrieve your account ID
  3. Open the S3 event management pane
  4. Create a SNS Topic
  5. Update the SQS access policy
  6. Create event notification
  7. Create a SNS subscription
  8. Set up permissions to the bucket

Retrieve Relevant SQS ARNs
Follow these steps to get the relevant SQS ARNs. If the data collector is managed by Monte Carlo, please reach out to your representative for these values instead.

  1. Open the CloudFormation console and search for the Monte Carlo data collector. Select the stack.
  2. Select the “Outputs” tab.
  3. Save the Metadata Queue ARN for later.
    Key: MetadataEventQueue

Retrieve your Account ID
Follow these steps to retrieve your account ID. If the data collector is managed by Monte Carlo, please reach out to your representative for these values instead.

Be sure you are logged in to the same account as the Monte Carlo Collector before proceeding.

  1. From the console, select your username in the upper right corner.
  2. Select “My Account”.
  3. Save the Account ID (without dashes) for later.

Open the S3 Event Management Pane
Follow these steps to locate the event configuration page for the bucket you want to enable events for.

  1. Open the S3 console and search for the bucket that you would like to enable events for.
  2. Select the bucket.
  3. Select “Copy Bucket ARN” and save the bucket ARN for later.
  4. Select the “Properties” tab. Leave this page open; you will come back to it later.

Create a SNS Topic

📘

What region should I create my topic in?

Make sure you are in the same region as the bucket you want to add an event for.

  1. Open the SNS console and select “Topics”.
  2. Select “Create Topic”. Choose the "Standard" type, enter a meaningful name and fill in any optional fields.

🚧

It’s highly recommended to enable delivery status logging for SQS.

  3. Select “Create Topic” and save the Topic ARN for later.
  4. Append the following statement to the access policy of the topic you just created:
  • SNS_ARN is the ARN from above.
  • S3_ARN is the bucket ARN, which you saved in the "Open the S3 event management pane" subsection.
{
    "Effect": "Allow",
    "Principal": {
        "AWS": "*"
    },
    "Action": "SNS:Publish",
    "Resource": "<SNS_ARN>",
    "Condition": {
        "StringEquals": {
            "aws:SourceArn": "<S3_ARN>"
        }
    }
}

You may need to include a "Sid" here too.

  5. Save changes.
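
The topic policy update above can be sketched as follows, assuming the topic's current access policy has been loaded into a Python dictionary. The function names and the Sid value "__s3_publish" are hypothetical; the Sid is optional and only added when supplied, in line with the note above.

```python
import copy
from typing import Optional


def s3_publish_statement(sns_arn: str, s3_arn: str,
                         sid: Optional[str] = None) -> dict:
    """Build the statement that lets the bucket publish to the topic."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": "*"},
        "Action": "SNS:Publish",
        "Resource": sns_arn,
        "Condition": {"StringEquals": {"aws:SourceArn": s3_arn}},
    }
    if sid:
        statement = {"Sid": sid, **statement}
    return statement


def append_statement(topic_policy: dict, statement: dict) -> dict:
    """Return a copy of the topic policy with the statement appended."""
    updated = copy.deepcopy(topic_policy)
    updated.setdefault("Statement", []).append(statement)
    return updated


# Placeholder policy and ARNs for illustration only.
current_policy = {"Version": "2008-10-17", "Statement": []}
new_policy = append_statement(
    current_policy,
    s3_publish_statement(
        sns_arn="arn:aws:sns:us-east-1:123456789012:example-topic",
        s3_arn="arn:aws:s3:::example-bucket",
        sid="__s3_publish",
    ),
)
print(len(new_policy["Statement"]))
# prints: 1
```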

Update the SQS Access Policy
Follow these steps to allow your SNS topic to write to the relevant queue. If the data collector is managed by Monte Carlo, you can skip these steps by sending the SNS topic ARN to your representative, who will in turn send you the SQS ARN and the relevant account ID.

  1. Open the SQS console in the account the Monte Carlo Collector was deployed to.
  2. Search for the queue. The name follows this structure: {CF_STACK}-MetadataEventQueue-{RANDOM_STR}.
  3. Select the queue and confirm that the ARN matches the ARN you saved previously.
  4. Select the “Access Policy” tab and select “Edit”.

If the access policy is empty or looks something like this:

{
  "Version": "2012-10-17",
  "Id": "arn:aws:sqs:<region>:<account>:<name>/SQSDefaultPolicy"
}

Paste the following policy, replacing the values in angle brackets:

  • The COLLECTOR_ACCOUNT_ID is the account ID you saved in the "Retrieve your account ID" subsection.
  • The EVENT_QUEUE_ARN is the ARN you saved in the "Retrieve relevant SQS ARNs" subsection.
  • The SNS_ARN is the SNS topic ARN, which you saved in the "Create a SNS Topic" subsection.

🚧

Be sure to use the SNS topic ARN and not the S3 bucket ARN here!

{
   "Version":"2008-10-17",
   "Statement":[
      {
         "Sid":"__owner",
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::<COLLECTOR_ACCOUNT_ID>:root"
         },
         "Action":"SQS:*",
         "Resource":"<EVENT_QUEUE_ARN>"
      },
      {
         "Sid":"__sender",
         "Effect":"Allow",
         "Principal":{
            "AWS":"*"
         },
         "Action":"SQS:SendMessage",
         "Resource":"<EVENT_QUEUE_ARN>",
         "Condition":{
            "ArnLike":{
               "aws:SourceArn":[
                  "<SNS_ARN>"
               ]
            }
         }
      }
   ]
}

However, if the access policy already has a statement with the Sid “__sender” (i.e. it looks like the policy above), append your SNS_ARN to the SourceArn list instead. The SNS_ARN was saved in the "Create a SNS Topic" subsection.

"aws:SourceArn": [
            "arn:aws:s3:::existing_bucket",
            "<SNS_ARN>"
          ]

Create Event Notifications
Depending on the type of integration, you will create one or more S3 event notifications. These will differ by the prefixes and suffixes used as filters.

| Integration | Default prefix | Suffix |
| --- | --- | --- |
| Hive / EMR (2 event notifications) | elasticmapreduce/ | /hive.log.gz |
| Hive / EMR (2 event notifications) | elasticmapreduce/ | /hive-server2.log.gz |

Identify in the table the type of integration you are configuring and repeat these steps for each pair of prefix and suffix:

  1. Return to the page you left open in step 4 of the "Open the S3 event management pane" subsection.
  2. Select “Create event notification” under Event notifications.
  3. Fill in a meaningful name.
  4. Specify the prefix and the suffix.
  5. Select “All object create events” under Event types.
  6. Enter the SNS topic ARN you saved in the "Create a SNS Topic" subsection as the Destination topic ARN.
  7. Save changes.

Create a SNS subscription
Follow these steps to subscribe the SQS queue to the SNS topic you created and enabled notifications for.

  1. Open the SNS console and select “Subscriptions”.
  2. Select “Create Subscription”.
  3. Select the topic ARN you saved above in the "Create a SNS Topic" subsection.
  4. Select Amazon SQS as the protocol.
  5. Select (or paste) the SQS ARN you saved above in the "Retrieve relevant SQS ARNs" subsection.

🚧

Be sure to select “Enable raw message delivery”!

  6. Select “Create Subscription”.
  7. Validate that the status is “Confirmed”.
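
The subscription settings chosen in the console map onto the parameters of the SNS Subscribe API. The sketch below only builds those parameters; the function name and ARNs are hypothetical. The keys match what boto3's `sns.subscribe` accepts, so the dictionary could be passed as keyword arguments if you script this step.

```python
def build_subscription_request(topic_arn: str, queue_arn: str) -> dict:
    """Parameters for subscribing the SQS queue to the SNS topic."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "sqs",
        "Endpoint": queue_arn,
        # Equivalent to selecting "Enable raw message delivery" in the console.
        # SNS subscription attribute values are strings, not booleans.
        "Attributes": {"RawMessageDelivery": "true"},
    }


# Placeholder ARNs for illustration only.
request = build_subscription_request(
    topic_arn="arn:aws:sns:us-east-1:123456789012:example-topic",
    queue_arn="arn:aws:sqs:us-east-1:123456789012:example-queue",
)
print(request)
```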

Set Up Permissions to the Bucket
In order to provide access to the query log bucket, you will create a policy and attach it to a role.

On the IAM console, create the policy below. Replace <S3_ARN> with the bucket ARN, which you saved in the "Open the S3 event management pane" subsection. Note that the resource must end with /*; otherwise notifications will not work.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "<S3_ARN>/*"
        }
    ]
}

You can choose between two alternatives to use this policy:

  1. Use an existing role: on the IAM console, search for the role {CF_STACK}-EventLambdaExecutionRole-{RANDOM_STR} and attach the policy to it.

  2. [Recommended] Create an assumable role: follow the steps outlined here to create a role.

👍

That's it! You are now set up with S3 query logs.

Scenario Three

🚧

Heads up

These steps may temporarily affect (disable) existing event notifications!

Follow these steps to enable S3 events if your needs fit under "scenario three":

Follow the same steps as Scenario two. In addition, temporarily delete the conflicting event trigger and subscribe its destination to the SNS topic you created. This pattern is known as SNS fanout: a single topic delivers each message to multiple destinations, which allows for parallel asynchronous processing.

👍

That's it! You are now set up with S3 query logs.

Scenario Four

Follow these steps to enable S3 events if your needs fit under "scenario four":

  1. Retrieve relevant SQS ARNs
  2. Retrieve your account ID
  3. Open the S3 event management pane
  4. Create a SNS Topic
  5. Update the SQS access policy
  6. Create event notification
  7. Create a SNS subscription
  8. Set up permissions to the bucket

Retrieve Relevant SQS ARNs
Follow these steps to get the relevant SQS ARNs. If the data collector is managed by Monte Carlo, please reach out to your representative for these values instead.

  1. Open the CloudFormation console and search for the Monte Carlo data collector. Select the stack.
  2. Select the “Outputs” tab.
  3. Save the Metadata Queue ARN for later.
    Key: MetadataEventQueue

Retrieve your Account ID
Follow these steps to retrieve your account ID. If the data collector is managed by Monte Carlo, please reach out to your representative for these values instead.

Be sure you are logged in to the same account as the Monte Carlo Collector before proceeding.

  1. From the console, select your username in the upper right corner.
  2. Select “My Account”.
  3. Save the Account ID (without dashes) for later.

Open the S3 Event Management Pane
Follow these steps to locate the event configuration page for the bucket you want to enable events for.

  1. Open the S3 console and search for the bucket that you would like to enable events for.
  2. Select the bucket.
  3. Select “Copy Bucket ARN” and save the bucket ARN for later.
  4. Select the “Properties” tab. Leave this page open; you will come back to it later.

Create a SNS Topic

📘

What region should I create my topic in?

Make sure you are in the same region as the bucket you want to add an event for.

  1. Open the SNS console and select “Topics”.
  2. Select “Create Topic”. Choose "Standard" type, enter a meaningful name and fill any optional fields.

🚧

It’s highly recommended to enable delivery status logging for SQS!

  3. Select “Create Topic” and save the Topic ARN for later.
  4. Append the following statement to the access policy of the topic you just created:
  • SNS_ARN is the ARN from above.
  • S3_ARN is the bucket ARN, which you saved in the "Open the S3 event management pane" subsection.
{
    "Effect": "Allow",
    "Principal": {
        "AWS": "*"
    },
    "Action": "SNS:Publish",
    "Resource": "<SNS_ARN>",
    "Condition": {
        "StringEquals": {
            "aws:SourceArn": "<S3_ARN>"
        }
    }
}

You may need to include a "Sid" here too.

  5. Append a second statement to the access policy of the topic you just created, so that the collector account can subscribe to it:
  • COLLECTOR_ACCOUNT_ID is the account ID you saved in the "Retrieve your account ID" subsection.
  • SNS_ARN is the ARN from above.
{
    "Sid": "__dc_sub",
    "Effect": "Allow",
    "Principal": {
        "AWS": "<COLLECTOR_ACCOUNT_ID>"
    },
    "Action": "sns:Subscribe",
    "Resource": "<SNS_ARN>"
}
  6. Save changes.
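
If you manage the topic policy as a document rather than through the console, the cross-account statement can be added in a way that is safe to re-run. This is a minimal sketch, assuming the topic policy has been loaded into a Python dictionary; the function names and example values are hypothetical, while the Sid "__dc_sub" comes from the statement above.

```python
import copy


def subscribe_statement(collector_account_id: str, sns_arn: str) -> dict:
    """Statement that lets the collector account subscribe to the topic."""
    return {
        "Sid": "__dc_sub",
        "Effect": "Allow",
        "Principal": {"AWS": collector_account_id},
        "Action": "sns:Subscribe",
        "Resource": sns_arn,
    }


def upsert_statement(topic_policy: dict, statement: dict) -> dict:
    """Return a copy of the policy, replacing any statement with the same Sid."""
    updated = copy.deepcopy(topic_policy)
    kept = [s for s in updated.get("Statement", [])
            if s.get("Sid") != statement["Sid"]]
    updated["Statement"] = kept + [statement]
    return updated


# Placeholder policy and values for illustration only.
topic_policy = {"Version": "2008-10-17", "Statement": [{"Sid": "__default"}]}
stmt = subscribe_statement("123456789012",
                           "arn:aws:sns:us-east-1:111122223333:example-topic")
once = upsert_statement(topic_policy, stmt)
twice = upsert_statement(once, stmt)
print(len(once["Statement"]), len(twice["Statement"]))
# prints: 2 2
```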

Update the SQS Access Policy
Follow these steps to allow your SNS topic to write to the relevant queue. If the data collector is managed by Monte Carlo, you can skip these steps by sending the SNS topic ARN to your representative, who will in turn send you the SQS ARN and the relevant account ID.

  1. Open the SQS console in the account the Monte Carlo Collector was deployed to.
  2. Search for the queue. The name follows this structure: {CF_STACK}-MetadataEventQueue-{RANDOM_STR}.
  3. Select the queue and confirm that the ARN matches the ARN you saved previously.
  4. Select the “Access Policy” tab and select “Edit”.

If the access policy is empty or looks something like this:

{
  "Version": "2012-10-17",
  "Id": "arn:aws:sqs:<region>:<account>:<name>/SQSDefaultPolicy"
}

Paste the following policy, replacing the values in angle brackets:

  • The COLLECTOR_ACCOUNT_ID is the account ID you saved in the "Retrieve your account ID" subsection.
  • The EVENT_QUEUE_ARN is the ARN you saved in the "Retrieve relevant SQS ARNs" subsection.
  • The SNS_ARN is the SNS topic ARN, which you saved in the "Create a SNS Topic" subsection.

🚧

Be sure to use the SNS topic ARN and not the S3 bucket ARN here!

{
   "Version":"2008-10-17",
   "Statement":[
      {
         "Sid":"__owner",
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::<COLLECTOR_ACCOUNT_ID>:root"
         },
         "Action":"SQS:*",
         "Resource":"<EVENT_QUEUE_ARN>"
      },
      {
         "Sid":"__sender",
         "Effect":"Allow",
         "Principal":{
            "AWS":"*"
         },
         "Action":"SQS:SendMessage",
         "Resource":"<EVENT_QUEUE_ARN>",
         "Condition":{
            "ArnLike":{
               "aws:SourceArn":[
                  "<SNS_ARN>"
               ]
            }
         }
      }
   ]
}

However, if the access policy already has a statement with the Sid “__sender” (i.e. it looks like the policy above), append your SNS_ARN to the SourceArn list instead. The SNS_ARN was saved in the "Create a SNS Topic" subsection.

"aws:SourceArn": [
            "arn:aws:s3:::existing_bucket",
            "<SNS_ARN>"
          ]

Create Event Notifications
Depending on the type of integration, you will create one or more S3 event notifications. These will differ by the prefixes and suffixes used as filters.

| Integration | Default prefix | Suffix |
| --- | --- | --- |
| Hive / EMR (2 event notifications) | elasticmapreduce/ | /hive.log.gz |
| Hive / EMR (2 event notifications) | elasticmapreduce/ | /hive-server2.log.gz |

Identify in the table the type of integration you are configuring and repeat these steps for each pair of prefix and suffix:

  1. Return to the page you left open in step 4 of the "Open the S3 event management pane" subsection.
  2. Select “Create event notification” under Event notifications.
  3. Fill in a meaningful name.
  4. Specify the prefix and the suffix.
  5. Select “All object create events” under Event types.
  6. Enter the SNS topic ARN you saved in the "Create a SNS Topic" subsection as the Destination topic ARN.
  7. Save changes.

Create a SNS Subscription
Follow these steps to subscribe the SQS queue to the SNS topic you created and enabled notifications for.

  1. Open the SNS console and select “Subscriptions”.
  2. Select “Create Subscription”.
  3. Select the topic ARN you saved above in the "Create a SNS Topic" subsection.
  4. Select Amazon SQS as the protocol.
  5. Select (or paste) the SQS ARN you saved above in the "Retrieve relevant SQS ARNs" subsection.

🚧

Be sure to select “Enable raw message delivery”!

  6. Select “Create Subscription”.
  7. Validate that the status is “Confirmed”.

📘

The subscription may not necessarily confirm right away. Please check back in 24 hours.

Set Up Permissions to the Bucket
In order to provide access to the query log bucket, you will create a policy and attach it to a role.

On the IAM console, create the policy below. Replace <S3_ARN> with the bucket ARN, which you saved in the "Open the S3 event management pane" subsection. Note that the resource must end with /*; otherwise notifications will not work.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "<S3_ARN>/*"
        }
    ]
}

You can choose between two alternatives to use this policy:

  1. Use an existing role: on the IAM console, search for the role {CF_STACK}-EventLambdaExecutionRole-{RANDOM_STR} and attach the policy to it.

  2. [Recommended] Create an assumable role: follow the steps outlined here to create a role.

👍

That's it! You are now set up with S3 query logs.

Scenario Five

🚧

Heads up

These steps may temporarily affect (disable) existing event notifications!

Follow these steps to enable S3 events if your needs fit under "scenario five":

Follow the same steps as Scenario four. In addition, temporarily delete the conflicting event trigger and subscribe its destination to the SNS topic you created. This pattern is known as SNS fanout: a single topic delivers each message to multiple destinations, which allows for parallel asynchronous processing.

👍

That's it! You are now set up with S3 query logs.

