DynamoDB - MapReduce

Amazon Elastic MapReduce (EMR) lets you process big data quickly and efficiently. EMR runs Apache Hadoop on EC2 instances, but simplifies the process. You use Apache Hive to query MapReduce job flows through HiveQL, a SQL-like query language. Apache Hive also serves as a way to optimize queries and your applications.
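
For a sense of how that looks in practice, the sketch below is a plain HiveQL aggregate over a hypothetical Hive table named product_catalog; the table and column names are illustrative only.

-- Hypothetical HiveQL query: the syntax closely mirrors standard SQL
SELECT category, COUNT(*) AS item_count
FROM product_catalog
GROUP BY category
ORDER BY item_count DESC;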

You can use the EMR tab of the management console, the EMR CLI, an API, or an SDK to launch a job flow. You also have the option to run Hive interactively or use a script.

EMR read/write operations consume table throughput; for large requests, however, it performs retries with the protection of a backoff algorithm. In addition, running EMR concurrently with other operations and tasks can result in throttling.
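
When throughput consumption is a concern, the EMR Hive integration for DynamoDB exposes throughput-percentage settings that cap how much of a table's provisioned capacity a job may consume. The values below are a sketch, to be tuned per table and workload.

-- Cap this Hive job at roughly half of the table's provisioned capacity
-- (illustrative values, not recommendations)
SET dynamodb.throughput.read.percent=0.5;
SET dynamodb.throughput.write.percent=0.5;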

The DynamoDB/EMR integration does not support binary and binary set attributes.

DynamoDB / EMR Integration Prerequisites

Review this checklist of required items before using EMR −

  • An AWS account
  • A completed table under the same account used in EMR operations
  • A custom Hive version with DynamoDB connectivity
  • DynamoDB connectivity support
  • An S3 bucket (optional)
  • An SSH client (optional)
  • An EC2 key pair (optional)

Setting up Hive

Before using EMR, generate a key pair to run Hive interactively. The key pair allows you to connect to EC2 instances and the master nodes of job flows.

You can accomplish this by performing the following steps −

  • Sign in to the management console and open the EC2 console located at https://console.aws.amazon.com/ec2/.

  • Select a region in the upper, right-hand portion of the console. Make sure the region matches the DynamoDB region.

  • In the navigation pane, select Key Pairs.

  • Select Create Key Pair.

  • In the Key pair name field, enter a name and select Create.

  • Download the resulting private key file, which uses the following format: filename.pem.

Note. You cannot connect to EC2 instances without a key pair.

Hive cluster

Create a Hive-enabled cluster before running Hive. Doing so builds the necessary application and infrastructure environment for the Hive-to-DynamoDB connection.

You can accomplish this task using the following steps −

  • Access the EMR console.

  • Select Create Cluster.

  • On the Create Cluster screen, set the cluster configuration with a descriptive name for the cluster, select Yes for termination protection, check Enabled for logging, enter an S3 destination for the log folder, and check Enabled for debugging.

  • On the Software Configuration screen, make sure the fields hold Amazon for Hadoop distribution, the latest version for AMI version, the default Hive version for Applications to be installed − Hive, and the default Pig version for Applications to be installed − Pig.

  • On the Hardware Configuration screen, make sure the fields hold Launch into EC2-Classic for Network, No Preference for EC2 Availability Zone, the default for Master-Amazon EC2 instance type, no check for Request Spot Instances, the default for Core-Amazon EC2 instance type, 2 for Count, no check for Request Spot Instances, the default for Task-Amazon EC2 instance type, 0 for Count, and no check for Request Spot Instances.

Be sure to set a limit that provides enough capacity to prevent cluster failure.

  • On the Security and Access screen, make sure the fields hold your key pair in EC2 key pair, No other IAM users in IAM user access, and Proceed without roles in IAM role.

  • Review the Bootstrap Actions screen, but don't change it.

  • Review the settings and select Create Cluster when you're done.

The cluster's summary pane appears when the cluster starts.

Activate SSH session

You need an active SSH session to connect to the master node and perform CLI operations. Locate the master node by selecting the cluster in the EMR console; it lists the master node as Master Public DNS Name.

Install PuTTY if you don't have it. Then launch PuTTYgen and select Load. Choose the PEM file and open it. PuTTYgen notifies you of a successful import. Select Save private key to save in PuTTY private key (PPK) format, and select Yes to save without a passphrase. Then enter a name for the PuTTY key, click Save, and close PuTTYgen.

Use PuTTY to establish a connection with the master node by first launching PuTTY. Choose Session from the Category list. Enter hadoop@DNS (with DNS being the master node's public DNS name) in the Host Name field. Expand Connection > SSH in the Category list and choose Auth. In the controlling options screen, select Browse for the private key file for authentication. Then select your private key file and open it. Select Yes for the security alert pop-up.

When you connect to the master node, the Hadoop command prompt appears, which means you can start an interactive Hive session.

Hive Table

Hive serves as a data warehouse tool that allows you to query EMR clusters using HiveQL. The previous steps leave you with a working prompt. Run Hive commands interactively by simply typing "hive" followed by any commands you want. See our Hive tutorial for more information on Hive.
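
As an illustration of what you can run at that prompt, the sketch below maps a hypothetical DynamoDB table named Thread to an external Hive table and then queries it through HiveQL. The table name, Hive column names, and column mapping are assumptions made for the example, not values from this tutorial's setup.

-- Expose a hypothetical DynamoDB table named "Thread" to Hive
CREATE EXTERNAL TABLE hive_thread (
  forum_name STRING,
  subject STRING,
  replies BIGINT
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Thread",
  "dynamodb.column.mapping" = "forum_name:ForumName,subject:Subject,replies:Replies"
);

-- Query the DynamoDB data with a SQL-like statement
SELECT forum_name, COUNT(*) AS thread_count
FROM hive_thread
GROUP BY forum_name;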