Project MySQL CSV to DynamoDB
Purpose
The purpose of this project is to show a way to take an RDS CSV export of a MySQL table that is sitting on S3 and import it into DynamoDB. This tutorial uses AWS EMR (Elastic MapReduce) along with HIVE (the AWS version). If your data is not already in RDS, or you cannot migrate it into RDS, there is a more "hands-on" approach to importing your data into DynamoDB, discussed at the bottom of this page; that approach is also required for tables in which you must preserve the existing ordering of an array of values (that is, where you do not want the ordering changed automatically when the data is imported via AWS EMR/HIVE).

IMPORTANT UPDATE NOTE:
Amazon AWS now offers a rather convenient service for migrating virtually any relational database (such as MySQL or even Microsoft SQL Server) directly into DynamoDB with considerably less effort and complexity than doing it all yourself. It was not available when I was tasked with migrating a 100+ million row database (hence this webpage documenting the process), but today I would opt for the AWS service.

That service is called DMS (Database Migration Service) and has some rather powerful features, like converting your database into another database type (even if you do not want to use DynamoDB), so check it out. Your source database can even remain in operation while the ETL (Extract, Transform, Load) process is happening. Nifty.

Some Notes About DynamoDB:
While I do not want to talk about DynamoDB at great length here, I will mention a few developments that have been made with it which you may or may not be aware of:
  • Fully managed by AWS, stored on SSDs, encrypted at rest, point-in-time recovery up to 35 days, incremental backup strategy.
  • You can have the database in multiple regions and have each regional replica update the others - something referred to as Global Tables (with streams enabled) and auto-replication.
  • Data is spread over 3 different data centers for automatic fail-over; if a fail-over event happens you do not need to update an application plugged into DynamoDB.
  • You can opt to use DAX (DynamoDB Accelerator), which can further improve read speed by up to 10x.
  • Accessing DynamoDB, such as for administration, can be done via DX (Direct Connect), site-to-site VPN, and VPC endpoints, governed, of course, by IAM roles and policies.
Okay! Back to using EMR/HIVE to get a database into DynamoDB the original way.


  1. The first thing that needs to be done is to create a table in DynamoDB that the CSV table data (exported from RDS to S3) will go into. At this stage you only need to specify the name of the DynamoDB table along with a key of some sort (unlike MySQL, you do not have to define every column in the table). To keep it simple, a standard primary key will be used (known as a hash key in DynamoDB).

  1. You can also define a secondary index (an alternate key that lets the data be queried in a different way).

  1. You will be required to specify the desired responsiveness of your DynamoDB table via provisioned read and write capacity units. For reads, the determination comes down to a formula: one read capacity unit supports one strongly consistent read per second (or two eventually consistent reads per second) for an item up to 4 KB in size.
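As a worked example of that read formula (the item size and request rate here are made-up numbers):

      Read capacity units = ceil(item size / 4 KB) x strongly consistent reads per second
      Example: 6 KB items read 10 times per second = ceil(6 / 4) x 10 = 2 x 10 = 20 read capacity units
      (Eventually consistent reads need half that: 10 read capacity units.)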

  1. Like the previous step, a formula determines writes: one write capacity unit supports one write per second for an item up to 1 KB in size (a worked example follows this list).
  2. It is important to remember that for each xlarge node in the EMR cluster approximately 10 write capacity units are available. This means that a cluster with a master and one slave node will have about 10 write capacity units, so your DynamoDB table (at least for the data import process) should have its write capacity set likewise.
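The write-side formula works the same way as the read side but against a 1 KB item size (again, the numbers are only illustrative):

      Write capacity units = ceil(item size / 1 KB) x writes per second
      Example: 1.5 KB items written 10 times per second = ceil(1.5 / 1) x 10 = 2 x 10 = 20 write capacity units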

  1. At this step you can specify whether streams (a log of item-level changes) should be enabled, along with where to send notifications when read or write capacity is exceeded.

  1. At this step you can review what you have specified and proceed to create the DynamoDB table.

  1. After a few minutes you should be able to see the new DynamoDB table in the AWS console as shown here.

  1. Now it is time to go to AWS EMR (under Analytics in the console) and create a small EMR cluster so that we can take the content of the CSV on S3 and put it into the DynamoDB table we created earlier.
  2. As you can see here, the settings are straightforward.
  3. As noted earlier, each xlarge node provides approximately 10 write capacity units, so the size of the cluster and the write capacity of the DynamoDB table should be matched (at least for the duration of the data import).

  1. After the EMR cluster has been created (it may take a few minutes) you will be provided a DNS address (or endpoint, as other parts of AWS refer to it). Since we are using HIVE in the EMR cluster, take that endpoint (ensure your IP is added to the security group and you have the key pair) and log in as hadoop in an SSH client.
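For example, assuming a key pair file named MyKeyPair.pem and substituting the master public DNS name your cluster actually shows, the login looks like this:

      ssh -i MyKeyPair.pem hadoop@ec2-xx-xxx-xx-xx.compute-1.amazonaws.com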

  1. When you SSH into the cluster you should see the welcome message as shown here. To get to HIVE, simply type hive and press Enter.

  1. Before continuing, there are some catches you should be aware of regarding how the Data Pipeline exports a MySQL table out of RDS and onto S3.
  2. If your MySQL table contained an array of values in the form of ["1", "2"] as a column value, when that data is exported to a CSV on S3 it will be transformed to "[\"1\", \"2\"]", which renders the data useless from this point forward. By the time it makes it into DynamoDB via HIVE it will be further mangled to something like \"[\\"1\\", and everything after the first comma in your array will be lost as well.
  3. To get around this mangling issue with the CSV export from RDS (using Data Pipeline and the export-to-S3 template) and with how HIVE interprets the CSV, remove all double-quotes from the column and change the comma to a different delimiter character such as ^, as in the example below.
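For example (with made-up values), a MySQL array column holding:

      ["pancakes", "waffles", "syrup"]

would be stored instead as:

      pancakes^waffles^syrup

leaving no double-quotes or commas for the Data Pipeline export or HIVE to mangle.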

  1. With HIVE you will need to define all the columns from the table you exported to a CSV. Here, the original MySQL table structure is shown.
  2. For HIVE syntax, see this webpage.
  3. Also see this webpage.
  4. IMPORTANT NOTE #1: If you are storing an array of values in a column, when it is imported into DynamoDB it will be automatically re-ordered alphanumerically with NO option for you to override that behavior. For some this may not be a problem, but if your values need to be preserved in their original order, you will have to find a different way of getting your data into a DynamoDB table (see the last section of this page).
  5. IMPORTANT NOTE #2: If you have an array of values where ANY value is repeated (such as 1,2,1), DynamoDB will throw an error and stop importing data from HIVE. Every value in your array of values must be unique. This behavior may be related to the non-overridable alphanumeric sort the AWS EMR/HIVE import process performs on an array of values.

  1. This step shows an example of the HIVE syntax that your original MySQL table syntax will need to be transformed into for this tutorial.
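As a minimal sketch (the table name, column names, and S3 path here are hypothetical; match them to your own export), the HIVE definition of the CSV sitting on S3 could look like this:

      CREATE EXTERNAL TABLE s3_export (
        atestid BIGINT,
        values_col STRING,
        names_col STRING
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LOCATION 's3://your-bucket/your-export-folder/';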

  1. The next thing that needs to be done is to take the transformed MySQL table syntax (which should now be HIVE compatible) and enter it at the HIVE command line.

  1. At this step you create the syntax of the table structure that you will be moving your data into. Remember that DynamoDB has few data types. Moreover, the AWS implementation of HIVE in the EMR cluster does not even support all the types DynamoDB does have (as of 11/2015). For example, DynamoDB supports LISTs and MAPs (which could, for example, preserve array value ordering if needed), but the AWS implementation of HIVE only recognizes MAP on the import side of your CSV data and does not support it on the export side to DynamoDB. So, keep it simple.
  2. Likewise, recent AWS DynamoDB documentation suggests support for decimal data types (announced around August 2015). However, that is NOT actually the case with AWS EMR/HIVE going to DynamoDB. That is, specifying a data type of NUMBER (unique to DynamoDB), DECIMAL, DECIMAL (with parameters), or FLOAT all fail...the ONLY data type that preserves a MySQL data type of DECIMAL is DOUBLE.

  1. At this step, enter the table structure for the data as it will appear in DynamoDB, as sketched below.
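A minimal sketch of that DynamoDB-mapped table, using the DynamoDB storage handler that ships with the AWS version of HIVE (the table name, column names, and mapping here are hypothetical and must line up with the DynamoDB table created earlier; the types are kept simple per the previous step):

      CREATE EXTERNAL TABLE ddb_target (
        atestid BIGINT,
        values_col STRING,
        names_col STRING
      )
      STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
      TBLPROPERTIES (
        "dynamodb.table.name" = "mytable",
        "dynamodb.column.mapping" = "atestid:atestid,values_col:values,names_col:names"
      );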

  1. Now, with the mapping in place, you can import the data from the CSV on S3 through the AWS EMR/HIVE cluster and into DynamoDB by using the INSERT command, as shown below.
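Assuming the two hypothetical table definitions sketched above, the import itself is a single statement:

      INSERT OVERWRITE TABLE ddb_target SELECT * FROM s3_export;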

  1. Since we have a small set of data for this tutorial, the migration takes an average of only 18 seconds to complete. Once complete, you should be able to go into the AWS DynamoDB console (just a webpage) and see the data that was migrated.
  2. If you have a larger table of approximately 2 million rows averaging 2 writes per row then, with a 2-server cluster (1 master, 1 slave), it will take about a week to complete the import into DynamoDB.
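That estimate is just the write-capacity arithmetic: 2 million rows x 2 writes per row = 4 million write units; at roughly 10 writes per second that is about 400,000 seconds, or roughly 4.6 days of continuous writing, which lands at about a week once overhead is factored in.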

  1. Finally, one last thing to mention if you have exported a CSV from RDS using Data Pipeline and the option to export a MySQL table to S3, and you plan on using AWS EMR/HIVE to import that data into DynamoDB...
  2. If a column contains a NULL value, DynamoDB will throw an error, so every column of every row has to have a value present. You might think that in our modern technological age, and given that DynamoDB is a "schema-less" database (i.e., a variable-column database), the import process via AWS HIVE to DynamoDB would simply skip a NULL column of a row being imported. Unfortunately, that is not the case as of 11/2015. Be sure to plan accordingly; one workaround is sketched below.
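One way around this (a sketch, reusing the hypothetical tables from earlier and assuming a placeholder value your application can tolerate) is to substitute a value for NULL on the SELECT side of the import:

      INSERT OVERWRITE TABLE ddb_target
      SELECT
        atestid,
        COALESCE(values_col, 'none'),
        COALESCE(names_col, 'none')
      FROM s3_export;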


Importing Data Into DynamoDB From MySQL While Preserving Existing Array Value Ordering

  1. You can import MySQL table data into DynamoDB and have the existing ordering of array values preserved in one or more columns, provided you (1) manually export the data yourself out of MySQL to a text file and then to a folder on S3, (2) make use of the LIST type, and (3) manually import the data file into DynamoDB using AWS Data Pipeline and the option "Import DynamoDB Backup Data From S3".
  2. The example below uses both a list of string values and a list of number values so you can see the syntax for each, since you may use one or both of them.
  3. Example (represents one row of data in MySQL): {"atestid":{"n":"1"},"values":{"l":[{"s":"pancakes"},{"s":"waffles"},{"s":"syrup"},{"s":"waffles"},{"s":"butter"}]},"names":{"l":[{"n":"0"},{"n":"22"},{"n":"36"},{"n":"22"},{"n":"39"}]}}
  4. You can download a 3-row flat file example here

