
Event-Driven Architectures With AWS DynamoDB Streams

November 2, 2018

We’ve been working with AWS DynamoDB Streams on our customer projects, and we’ve found it to be a really powerful solution for creating event-driven pipelines. DynamoDB Streams is one of the more recent features added to DynamoDB and can help address some of the common challenges when migrating from a relational schema to DynamoDB.

What Are DynamoDB Streams?

If you don’t know the basics of DynamoDB, there are many good articles, blog posts, and of course the AWS documentation. Here, I’d like to focus on some of the use cases and benefits of one of its more recent features: DynamoDB Streams.

DynamoDB Streams enable you to trigger downstream actions based on the activity occurring in a DynamoDB table. Every time an item is added, changed, or removed, a stream event capturing that change is emitted. This feature is very powerful when integrated with other AWS components.

When streams are enabled for a given table, AWS stands up a new service endpoint for making API requests specific to stream events. You can select from four options for the stream events:

  1. KEYS_ONLY – the key attributes of the changed item
  2. NEW_IMAGE – what the item changed to
  3. OLD_IMAGE – what the item changed from
  4. NEW_AND_OLD_IMAGES – the item before and after the change

Your use case will drive this choice: for a full changelog you will want NEW_AND_OLD_IMAGES, while for many derived-data use cases the NEW_IMAGE alone will be enough.
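
As a concrete sketch, enabling a stream is a single table update. The example below uses Python and boto3; the table name orders and the chosen view type are just illustrative assumptions:

import boto3

dynamodb = boto3.client("dynamodb")

# Enable a stream on an existing table and choose the view type.
# "orders" is a hypothetical table name used throughout this article's examples.
dynamodb.update_table(
    TableName="orders",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)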

For example, let’s say we have a table called orders that contains the items making up an order with a given UUID. Each table item contains the quantity and unit price of a given product. Now suppose we insert a new item for an order:

{
  "id":"57c191c4-7bdc-40fe-a8e1-2a6c4e762847",
  "quantity":1,
  "unit_price":100,
  "upc":"123ABC"
}
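
For reference, a minimal sketch of inserting this item with boto3’s resource interface might look like the following (the orders table name and its id partition key are assumptions for this example):

import boto3

# Hypothetical "orders" table with "id" as its partition key.
orders = boto3.resource("dynamodb").Table("orders")

orders.put_item(
    Item={
        "id": "57c191c4-7bdc-40fe-a8e1-2a6c4e762847",
        "quantity": 1,
        "unit_price": 100,
        "upc": "123ABC",
    }
)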
Internally, DynamoDB stores the above item like this:
{ 
 "id":{ 
   "S":"57c191c4-7bdc-40fe-a8e1-2a6c4e762847"
 },
 "quantity":{ 
   "N":"1"
 },
 "unit_price":{ 
   "N":"100"
 },
 "upc":{ 
   "S":"123ABC"
 }
}

This format is important once we see what the stream events look like. This representation is called DynamoDB JSON, and it contains type metadata on each attribute. If you are using an AWS SDK, you can work with the simplified format if you prefer. Assuming we have configured the stream with NEW_AND_OLD_IMAGES, the stream event triggered when we add a single item looks like this:

{ 
 "Records":[ 
 { 
   "eventID":"93a3db2a50027c0fcd34e9ccbb2bcced",
   "eventVersion":"1.1",
   "dynamodb":{ 
     "SequenceNumber":"5200000000001475472295",
     "Keys":{ 
       "id":{ 
          "S":"57c191c4-7bdc-40fe-a8e1-2a6c4e762847"
        }
      },
      "SizeBytes":145,
      "NewImage":{ 
        "quantity":{ 
          "N":"1"
        },
       "upc":{ 
         "S":"123ABC"
       },
       "unit_price":{ 
         "N":"100"
       },
       "id":{ 
         "S":"57c191c4-7bdc-40fe-a8e1-2a6c4e762847"
       }
     },
     "ApproximateCreationDateTime":1540511460.0,
     "StreamViewType":"NEW_AND_OLD_IMAGES"
   },
   "awsRegion":"us-east-2",
   "eventName":"INSERT",
   "eventSource":"aws:dynamodb"
   }
 ]
}
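
If you’d rather work with plain Python values than DynamoDB JSON, boto3 ships a TypeDeserializer that converts the typed attributes back to the simplified form. A small sketch applied to the NewImage above:

from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def simplify(image):
    # Convert a DynamoDB JSON image into plain Python values.
    return {name: deserializer.deserialize(value) for name, value in image.items()}

# e.g. simplify(record["dynamodb"]["NewImage"])
# -> {"id": "57c191c4-...", "quantity": Decimal("1"), "unit_price": Decimal("100"), "upc": "123ABC"}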

The INSERT event above contains all of the item’s data. Now, let’s say we update this item with a new quantity of 2. The stream event generated by this modification would look like this:

{
 "Records":[
 {
   "eventID":"196ded95de8c5e4d9b7a89ed84ad2825",
   "eventVersion":"1.1",
     "dynamodb":{
       "OldImage":{
         "quantity":{
           "N":"1"
         },
         "upc":{
           "S":"123ABC"
         },
         "unit_price":{
           "N":"100"
         },
         "id":{
           "S":"57c191c4-7bdc-40fe-a8e1-2a6c4e762847"
         }
       },
       "SequenceNumber":"5300000000001475557369",
       "Keys":{
         "id":{
           "S":"57c191c4-7bdc-40fe-a8e1-2a6c4e762847"
         }
       },
       "SizeBytes":233,
       "NewImage":{
         "quantity":{
           "N":"2"
         },
         "upc":{
           "S":"123ABC"
         },
         "unit_price":{
           "N":"100"
         },
         "id":{
           "S":"57c191c4-7bdc-40fe-a8e1-2a6c4e762847"
         }
       },
       "ApproximateCreationDateTime":1540512540.0,
       "StreamViewType":"NEW_AND_OLD_IMAGES"
     },
     "awsRegion":"us-east-2",
     "eventName":"MODIFY",
     "eventSource":"aws:dynamodb"
   }
 ]
}

This update triggered a MODIFY event. Both the before and after versions of the modified item are included since the stream is configured with NEW_AND_OLD_IMAGES.

Processing the stream events can be done with the Kinesis Client Library (via the DynamoDB Streams Kinesis Adapter) or, in a serverless approach, with AWS Lambda, essentially providing a proxy to other AWS services. Each stream record appears exactly once, and records are available on the stream for 24 hours.
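
As a minimal sketch, a Lambda function attached to the stream via an event source mapping simply iterates over the batch of records it receives (the field names match the event examples above; the print is a placeholder for real work):

def handler(event, context):
    # Lambda delivers DynamoDB stream records in batches under "Records".
    for record in event["Records"]:
        event_name = record["eventName"]                  # INSERT, MODIFY, or REMOVE
        keys = record["dynamodb"]["Keys"]
        new_image = record["dynamodb"].get("NewImage")    # present for INSERT/MODIFY
        old_image = record["dynamodb"].get("OldImage")    # present for MODIFY/REMOVE
        print(event_name, keys, new_image, old_image)     # placeholder for downstream actions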

Why Do I Care About DynamoDB Streams?

DynamoDB Streams provides an event-driven, scalable, serverless approach to processing the changes occurring on a DynamoDB table. It gives you a path to data pipelines, derived data, and data-driven actions without having to publish your own change events or poll the table with scans.

What Are The Best Use Cases For DynamoDB Streams?

Data-Driven Notifications – Consider a customer order represented as an item in DynamoDB. Triggering notifications based on changes to delivery dates, status, or location can be done by using Streams to integrate with your notification pipeline. The 24-hour retention on the stream is likely to be longer than any meaningful retry scenario for a reputable notification provider, but we could even stream to an SNS topic that publishes to an SQS queue for watertight durability.
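
A hedged sketch of that integration: a Lambda on the stream that publishes to an SNS topic whenever an order’s status changes. The status attribute and the topic ARN are assumptions, not part of the order item shown earlier:

import boto3

sns = boto3.client("sns")

# Assumed topic ARN; replace with your own notification topic.
TOPIC_ARN = "arn:aws:sns:us-east-2:123456789012:order-updates"

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":
            continue
        old = record["dynamodb"]["OldImage"]
        new = record["dynamodb"]["NewImage"]
        # "status" is a hypothetical attribute on the order item.
        if "status" in new and old.get("status") != new["status"]:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Message=f"Order {new['id']['S']} status changed to {new['status']['S']}",
            )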

Historical Data Generation – Continuing the same example, DynamoDB Streams could be configured to record all of the new images in another table. By using the same partition key (the order UUID) and the change timestamp as the sort key, we immediately have the right DynamoDB storage scheme for an Order History API.
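
Because stream images are already in DynamoDB JSON, a Lambda can write them straight through to a history table with the low-level client. A sketch, assuming a hypothetical order_history table keyed by id with a changed_at sort key:

import boto3

dynamodb = boto3.client("dynamodb")

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        # NewImage is already in DynamoDB JSON, so it can be written as-is.
        item = dict(record["dynamodb"]["NewImage"])
        # Use the change timestamp as the sort key (attribute name is assumed).
        item["changed_at"] = {"N": str(record["dynamodb"]["ApproximateCreationDateTime"])}
        dynamodb.put_item(TableName="order_history", Item=item)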

Derived Data / Metadata – Now suppose you want to collect metadata or aggregate data around the items ordered, updating product metadata tables with entries recording purchase dates/times or information on the other products purchased in the same order. The data collection could be done by publishing derived records to a table centered around the product and using the product id for partitioning.
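
One way to sketch this is an atomic counter on a product metadata table, keyed by UPC and incremented for each inserted order item. The product_metadata table and purchase_count attribute are assumptions for illustration:

import boto3

dynamodb = boto3.client("dynamodb")

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        image = record["dynamodb"]["NewImage"]
        # Atomically add the ordered quantity to a per-product counter.
        dynamodb.update_item(
            TableName="product_metadata",
            Key={"upc": image["upc"]},
            UpdateExpression="ADD purchase_count :q",
            ExpressionAttributeValues={":q": image["quantity"]},
        )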

Analytics Pipeline – Finally, let’s say you want to perform complex analytical queries on your order and product data as well as the derived metadata. DynamoDB itself does not support this access pattern, but it will support a data pipeline to a datastore that excels at these types of queries: Amazon Redshift. Building this pipeline can be accomplished by writing the DynamoDB stream records to Kinesis Data Firehose (using Lambda or a Kinesis Client Library application), then on to S3 for batching into Redshift.
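
A minimal sketch of the first hop, forwarding each stream record to a Kinesis Data Firehose delivery stream (the delivery stream name is an assumption; Firehose then handles the S3 batching and the load into Redshift):

import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    for record in event["Records"]:
        # Newline-delimited JSON is convenient for S3 batching and Redshift loads.
        payload = json.dumps(record["dynamodb"]) + "\n"
        firehose.put_record(
            DeliveryStreamName="orders-to-redshift",   # assumed delivery stream name
            Record={"Data": payload.encode("utf-8")},
        )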

DynamoDB Streams gives us the power to build event-driven processing and data pipelines from our DynamoDB data with relative ease. It acts essentially as a changelog of table activity, and by piping those changes to and through other AWS components, it can support clean, event-driven architectures for certain use cases.

If you’d like to discuss any questions or talk about how any of these concepts could apply to your problem space, we’d love to chat with you about it. Contact us at hello@spindance.com.