Single Table Python Example with AWS DynamoDB

Single Table design has been around since NoSQL started and it is not unique to AWS DynamoDB. I used it myself starting in 2010 with Apache Cassandra and even gave a talk on it at the 2011 Cassandra conference. I am not a single table purist by any means, but it comes in handy for certain use cases and can make sense to reach for at times.

Basically, the idea is to make your row keys composites and scan on ranges of those composites to pull out the columns you are interested in. Huh? Let me explain a little bit more with an example and then we will dive into some code. Also, if you want to get deeper into this topic, there is a great book by Alex DeBrie called The DynamoDB Book.

Let's use the case of having to design a database to track the outcomes of the predictions from Machine Learning models day over day. We will get into the ML aspect of that in future posts to tie it together better, but for now let's leave that as an implementation detail we will interface with.

Ok, so now, back of the napkin, here are the entities I can come up with that we need. Google has a great writeup about the topic here. For now the go-to systems in this space are non-existent (except for SaaS offerings), and I believe everything in the industry is custom built into a standard CI/CD pipeline or a vendor product. The problem is that standard CI/CD doesn't account for having to optimize metrics based on the predictions occurring in your data set as outcomes from, let's say, users that have to be tracked and managed.

  • User – The system using the database will have users to manage.
  • Groups – A way to organize Users and their associations (maybe other things too, besides Users?)
  • System – The system using the database will have other systems to manage (such as through an API) and tracing of data movement is desired so we want this as a first class citizen.
  • Workspace – a place for users and systems to have a snapshot of data to run changes to a model’s hyperparameters, to tune a cut for a new release to a target environment.
  • Model – our crystal ball.
  • Configuration – these are settings that might apply to data filtering or some other aspect besides the model object’s attributes.
  • Release – change management around code, configs, secrets, controls and other items of interest to audit against.
  • Prediction(s) – from a release, our model is going to be making predictions on new data it sees. We want to represent this intersection with all of the results.
  • Metrics – these are data points in your data sets. You want to optimize these based on the predictions that are happening day over day (or whatever your time frame is but for us here it is daily).
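To keep the composite keys for all of these entities consistent, each entity type can get a short key prefix. Here is a hypothetical mapping as a sketch: the "wc" and "w" prefixes are the ones used later in this post, while the rest are my own illustrative guesses, not telesto's actual encoding.

```python
# Hypothetical short key prefixes per entity type. "wc" and "w" are used in
# the examples later in this post; the rest are illustrative guesses.
KEY_PREFIXES = {
    "WorkspaceCategory": "wc",
    "Workspace": "w",
    "User": "u",
    "Group": "g",
    "System": "s",
    "Model": "m",
    "Configuration": "c",
    "Release": "r",
    "Prediction": "p",
    "Metric": "mt",
}

def prefixed_id(entity_type, ksuid):
    """Build one segment of a composite key, e.g. 'w#<ksuid>'."""
    return f"{KEY_PREFIXES[entity_type]}#{ksuid}"
```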

That's enough for now. There are plenty more to add, but this is enough to get everything built and move forward supporting it with more version upgrades. So next, let's step back for a second and start to think about all of the associations between these entities and all of the attributes (columns) these entities are going to have. One typical next step is to go out and use some tool that does migrations, or maybe you write your DDL yourself; either way, you are writing out your schema.

We are going to completely skip the step of writing out your schema.

Instead, you are going right to your code. But before we go to your code, let's look under the hood at how this magic happens, with code I wrote for you to use in your projects in Python (or convert to other languages). The table structure is simple. There is a PK (Partition Key), an SK (Sort Key), and then the rest are all of your attributes/columns. With that PK and SK you have all you need from DynamoDB to encode your entities in the data, and the schema will be the columns plus composite values for PK and SK. Here is an example of what the data looks like in the database and what it means.
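To make that concrete, here is a minimal sketch of the table definition this pattern needs: just PK and SK declared, everything else schemaless. The table name "pogo-main" is from later in this post; the parameter structure follows the DynamoDB CreateTable API, not the repo's terraform file.

```python
# Single-table definition: only PK and SK are declared as key attributes;
# all other attributes/columns are schemaless and vary per entity type.
table_params = {
    "TableName": "pogo-main",
    "KeySchema": [
        {"AttributeName": "PK", "KeyType": "HASH"},   # partition key
        {"AttributeName": "SK", "KeyType": "RANGE"},  # sort key
    ],
    "AttributeDefinitions": [
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

# With AWS credentials configured, you would run:
#   boto3.client("dynamodb", region_name="us-east-2").create_table(**table_params)
```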

PK = “wc#11e9828ab35373446530b203bf4b8a16363e4bab/w#11e9978c5bb87c87d9059df0b3befb98fbd7009b”

SK = “w#11e9978c5bb87c87d9059df0b3befb98fbd7009b”

label = “optimizing position in portfolio JIRA-7002”

… (more columns)

The “wc” stands for “Workspace Category” because you need a way to organize all of them. “w” stands for “Workspace”, and the KSUID gives us a unique row to represent this workspace and all of the columns you need about it (up to 400KB per item; if you need more, use Apache Cassandra, which allows up to 2GB). You can store maps, sets, lists, scalars, etc. in your attributes/columns.
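A quick sketch of how these composite keys could be built and taken apart. The prefixes and IDs match the example above; the helper names are mine, not telesto's.

```python
def compose_key(*parts):
    """Join (prefix, id) pairs into a composite key like 'wc#<id>/w#<id>'."""
    return "/".join(f"{prefix}#{ident}" for prefix, ident in parts)

def parse_key(key):
    """Split a composite key back into (prefix, id) pairs."""
    return [tuple(segment.split("#", 1)) for segment in key.split("/")]

pk = compose_key(("wc", "11e9828ab35373446530b203bf4b8a16363e4bab"),
                 ("w", "11e9978c5bb87c87d9059df0b3befb98fbd7009b"))
sk = compose_key(("w", "11e9978c5bb87c87d9059df0b3befb98fbd7009b"))
```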

Now, you usually want to scan on the sort key, since the scan will be within one partition key. The reason I didn't do that here is because this layout gives each child its own 400KB of space for holding data; if we didn't do that, all the children would share one item's 400KB. Scanning with the sort key works well if you are doing time series, let's say, and you have something that looks like this.

PK = “w#11e9978c5bb87c87d9059df0b3befb98fbd7009b/day”

SK = “day#2013-11-20”

Now you can scan a group of days like this:

PK = “w#11e9978c5bb87c87d9059df0b3befb98fbd7009b/day”

SK (begin) = “day#2013-06-01” SK (end) = “day#2013-06-30”
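Under the hood, that range scan maps onto a DynamoDB Query with a BETWEEN condition on the sort key. Here is a sketch of the low-level request, with the parameter shapes following the DynamoDB Query API and the key values taken from the example above:

```python
# Query all "day" rows for one workspace within a date range.
# The sort key is lexicographically ordered, so ISO dates sort correctly.
query_params = {
    "TableName": "pogo-main",
    "KeyConditionExpression": "PK = :pk AND SK BETWEEN :start AND :end",
    "ExpressionAttributeValues": {
        ":pk": {"S": "w#11e9978c5bb87c87d9059df0b3befb98fbd7009b/day"},
        ":start": {"S": "day#2013-06-01"},
        ":end": {"S": "day#2013-06-30"},
    },
}

# With AWS credentials configured:
#   boto3.client("dynamodb", region_name="us-east-2").query(**query_params)
```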

Let's look at the code now, because it will start to make sense when you see how it works. All of the code for this project is here in telesto. Take a look at example.py, which you can run if you have DynamoDB set up. I have a terraform file in there for you to create one. It assumes us-east-2, but you can change that; I didn't put a state file in the repo, so you can deal with that yourself too if you want.

import boto3

# Entity, Client, and EntityAttribute come from the telesto library in the repo.

class WorkspaceCategory(Entity):
    def __init__(self, client, pk=None, sk=None, parent=None):
        pk = "wc"  # every workspace category row uses the "wc" key prefix
        super().__init__(client=client, pk=pk, sk=sk, parent=parent)

boto3_client = boto3.resource("dynamodb", region_name="us-east-2")

client = Client(boto3_client, "pogo-main")
wc = WorkspaceCategory(client=client)
columns = EntityAttribute(label="testing")

testing = wc.create(columns)

# Now we have a testing row created in the database for this workspace
# category; next let's increment a counter on it.

wc.inc(testing, "example") # example=1 
wc.inc(testing, "example") # example=2 
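I haven't shown telesto's internals, but an atomic increment like inc maps naturally onto DynamoDB's UpdateItem with an ADD action. A hedged sketch of that request shape (the key values here are placeholders, not telesto's actual encoding):

```python
# Atomic counter increment via UpdateItem's ADD action: if "example" does not
# exist yet it starts at 0, so the first call yields 1, the second 2, etc.
update_params = {
    "TableName": "pogo-main",
    "Key": {
        "PK": {"S": "<pk-of-testing-row>"},  # placeholder key values
        "SK": {"S": "<sk-of-testing-row>"},
    },
    "UpdateExpression": "ADD #c :one",
    "ExpressionAttributeNames": {"#c": "example"},
    "ExpressionAttributeValues": {":one": {"N": "1"}},
    "ReturnValues": "UPDATED_NEW",
}

# With AWS credentials configured:
#   boto3.client("dynamodb", region_name="us-east-2").update_item(**update_params)
```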

# Now let's create a new workspace within the category.

class Workspace(Entity):
    def __init__(self, client, parent):
        sk = "w"
        super().__init__(client=client, sk=sk, parent=parent)

# We make the testing entity item the parent so everything done with this
# Workspace entity is under it.
workspace = Workspace(client=client, parent=testing)

# Two things are happening here. First, we are creating the workspace like in
# our data example above with PK and SK, and this is how simple telesto makes
# the code. Second, parent_counter="count_of_workspaces" tells the underlying
# code to increment a counter on the parent object so we can denormalize our
# counts, and we do this within a transaction too.
my_workspace = workspace.create(DomainAttribute(label="my_workspace", parent_counter="count_of_workspaces"))
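That create-the-child-plus-increment-the-parent-counter step can be done atomically with DynamoDB's TransactWriteItems. Here is a sketch of the request shape; the key values are placeholders for illustration, not telesto's actual encoding.

```python
# One transaction: put the new workspace row AND bump the denormalized
# counter on the parent, so both succeed or neither does.
transact_params = {
    "TransactItems": [
        {   # 1. create the new workspace item (placeholder key values)
            "Put": {
                "TableName": "pogo-main",
                "Item": {
                    "PK": {"S": "<parent-pk>/w#<new-ksuid>"},
                    "SK": {"S": "w#<new-ksuid>"},
                    "label": {"S": "my_workspace"},
                },
            }
        },
        {   # 2. atomically increment the parent's denormalized count
            "Update": {
                "TableName": "pogo-main",
                "Key": {"PK": {"S": "<parent-pk>"}, "SK": {"S": "<parent-sk>"}},
                "UpdateExpression": "ADD count_of_workspaces :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}},
            }
        },
    ]
}

# With AWS credentials configured:
#   boto3.client("dynamodb", region_name="us-east-2").transact_write_items(**transact_params)
```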

# Let's create one more workspace so we have two.
workspace.create(DomainAttribute(label="JIRA10024", parent_counter="count_of_workspaces"))

# Now we can see all the workspaces for the testing workspace category.
# We pass the instance of the "Workspace" entity so the get_children function
# knows the secret string ("w", set in the class definition) to use when
# composing the key.
response = workspace.all(testing)

Again, you can get the source code here, and I plan to keep working on the client and a sample implementation of the code. I will blog as I go with meaningful updates.

Thanx =8^) Joe Stein

https://www.twitter.com/charmalloc

https://www.linkedin.com/in/charmalloc
