For a side project I’m building an application for mileage registration, something you need to do when driving a company car. Normally you have to pay tax for the private use of a company car, unless you drive fewer than 500 private kilometers a year. To qualify for this exemption you need to be able to hand over a mileage registration (kilometerregistratie) to the tax authorities that proves you drove fewer than 500 private kilometers. This registration is quite comprehensive: every trip needs to be registered, including the mileage and location at the start and end.
My idea is to build a web site (or an app, I’m not sure yet) where you can enter this data. The site or app talks to a web API that stores its data in DynamoDB. I’m doing this project mainly to learn new things like Node, Angular and, as I already mentioned, DynamoDB.
Natural or Surrogate?
One of the things I really can’t wrap my head around is key design in DynamoDB. Like most developers I have always stored my data in a relational database, and almost without exception you would use an auto-incrementing integer as the primary key. There would hardly ever be a case where you would use a natural key. There are a lot of reasons for doing so, mainly performance, but also because natural keys tend to change anyway.
To get some inspiration I looked at some of the examples given by Amazon. The interesting thing is that in these examples they tend to use natural keys like date and title. This freaks me out a little: what happens when the value of one of these keys changes? Do you delete the old item and add a copy with the new key? And what about references to that item? It would be quite impossible to change those values later. ‘Sorry, you can’t change the title of this forum.’ That would be quite limiting.
At first I came up with the following table design:
|Table|Hash key|Range key|
|---|---|---|
|Vehicle|UserID|RegistrationNumber|
|Trip|UserID-RegistrationNumber|Date|
As you can see this design was based on the Forums example from Amazon.
Although the design looked okay at first, I ran into quite a few issues while implementing it in a web API. The main problem was the key design of the Trip table. The keys were chosen this way because of the way the trips would be displayed in the application: you would always want to show a list of trips for a particular vehicle ordered by date, for instance to hand it over to the tax authorities. This scenario was covered pretty well by this key design.
The issue was with the Range Key. Choosing the trip date as the Range Key meant that it became really hard, or virtually impossible, to change this value, and you would want to be able to update the trip date. For instance, when a trip is saved with today’s date by default, you could enter the wrong date by mistake while saving a trip you made yesterday.
Also, the sequence of API requests for updating the date would become a bit unnatural. The client would either have to delete and recreate the trip:
-> DELETE /api/vehicles/67-PK-KS/trips/2015-10-02T14:34Z
<- 200 OK
-> POST /api/vehicles/67-PK-KS/trips
<- 201 Created
… or I would have to come up with a new URL for the resource after the date changed. I could have done it with one of the 3xx HTTP status codes, but that didn’t sound like a good idea either.
And what about two trips on the same date? I could include the time at millisecond precision to avoid collisions, but it all started to feel a bit odd.
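At the storage level this delete-and-recreate is forced by DynamoDB itself: UpdateItem cannot modify key attributes. A minimal sketch of the two requests, assuming the Trip table is keyed by a UserID-RegistrationNumber hash key and a Date range key (the names are my assumptions):

```javascript
// Changing a trip's date when the date is part of the primary key means
// deleting the old item and putting a new one. These parameter objects
// could be passed to the AWS SDK DocumentClient's delete() and put().
function buildDateChange(vehicleKey, oldDate, newDate, otherAttributes) {
  const deleteParams = {
    TableName: 'Trip',
    Key: { 'UserID-RegistrationNumber': vehicleKey, Date: oldDate }
  };
  const putParams = {
    TableName: 'Trip',
    // Re-create the item under the new key, copying the other attributes.
    Item: Object.assign({}, otherAttributes, {
      'UserID-RegistrationNumber': vehicleKey,
      Date: newDate
    })
  };
  return { deleteParams, putParams };
}
```

Note that these are two separate requests, so a failure in between could lose the trip entirely; that is part of what made this design feel wrong.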
Surrogate it has to be
I experimented a little with other natural keys, like the start mileage or a combination of start and end mileage. But all of these had the same issue: the inability to change their values later. The Trip entity simply didn’t have any successful natural key candidates, so I had to introduce a surrogate key, in this case some sort of UUID.
Now that I had sorted out the problems with updating trips, I had introduced a new problem. Hash keys are stored in random order, so it would become much harder to get all trips for a particular vehicle ordered by date. A Scan operation would get me all the trips filtered by vehicle, but they wouldn’t be in the right order yet: sorting would still have to be done after all the data is returned. This meant that to show a page of, say, 10 trips, every trip for a vehicle would have to be loaded and sorted on each request. Performance would be really bad, and it would also become really pricey, because a Scan consumes read capacity for every item it examines, so you would run out of the provisioned capacity really fast.
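To make that cost concrete, here is roughly what listing a page of trips would take without an index: a Scan whose filter still reads every item in the table, followed by sorting in the application (table and attribute names are my assumptions):

```javascript
// Scan parameters for the AWS SDK DocumentClient. The FilterExpression
// only trims the response; DynamoDB still reads every item in the table.
const scanParams = {
  TableName: 'Trip',
  FilterExpression: 'RegistrationNumber = :r',
  ExpressionAttributeValues: { ':r': '67-PK-KS' }
};

// Sorting and paging then have to happen client side, over ALL the trips.
function pageOfTrips(scannedItems, pageSize) {
  return scannedItems
    .slice()
    .sort((a, b) => (a.Date < b.Date ? 1 : -1)) // newest first
    .slice(0, pageSize);
}
```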
Luckily there’s a solution for this problem: DynamoDB now comes with secondary indexes. There are two types. First there are local secondary indexes, which have the same hash key but a different range key. Second there are global secondary indexes, which allow you to choose another hash key. The downside of a global secondary index is that every write to a table that has one consumes some extra capacity, so you need to reserve more capacity for such tables and writing becomes more expensive.
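As an illustration of that extra capacity, here is roughly what the Trip table definition looks like with a global secondary index attached; the index name and throughput numbers are placeholder assumptions:

```javascript
// CreateTable parameters (AWS SDK for JavaScript shape). Note that the
// global secondary index carries its OWN ProvisionedThroughput: every
// write to the table also consumes write capacity on the index.
const createTripTableParams = {
  TableName: 'Trip',
  AttributeDefinitions: [
    { AttributeName: 'TripId', AttributeType: 'S' },
    { AttributeName: 'UserID-RegistrationNumber', AttributeType: 'S' },
    { AttributeName: 'Date', AttributeType: 'S' }
  ],
  KeySchema: [{ AttributeName: 'TripId', KeyType: 'HASH' }],
  ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 },
  GlobalSecondaryIndexes: [
    {
      IndexName: 'VehicleDateIndex',
      KeySchema: [
        { AttributeName: 'UserID-RegistrationNumber', KeyType: 'HASH' },
        { AttributeName: 'Date', KeyType: 'RANGE' }
      ],
      Projection: { ProjectionType: 'ALL' },
      ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 }
    }
  ]
};
```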
In my case a global secondary index was required:
|Hash key|Range key|
|---|---|
|UserID-RegistrationNumber|Date|
This looks exactly like my first table design, although this one is maintained automatically by DynamoDB. You might wonder where the UserID-RegistrationNumber key comes from. It’s an attribute I need to add to the Trip table; its value is constructed by combining the UserID and the vehicle’s registration number. This is needed to be able to perform a Query request directly on the index, without having to filter on UserID (either client side or using a Scan operation).
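With that attribute in place, fetching one page of a vehicle’s trips becomes a single Query against the index. A sketch of the request parameters, assuming the index is named VehicleDateIndex and the composite value uses ‘/’ as a separator (both my assumptions):

```javascript
// Build Query parameters for the global secondary index. The composite
// attribute name contains dashes, so it has to be aliased through
// ExpressionAttributeNames to be usable in the key condition.
function buildTripQuery(userId, registrationNumber, pageSize) {
  return {
    TableName: 'Trip',
    IndexName: 'VehicleDateIndex',
    KeyConditionExpression: '#vehicle = :v',
    ExpressionAttributeNames: { '#vehicle': 'UserID-RegistrationNumber' },
    ExpressionAttributeValues: { ':v': userId + '/' + registrationNumber },
    ScanIndexForward: false, // newest trips first (descending range key)
    Limit: pageSize          // one page, instead of loading every trip
  };
}
```

Because the index is sorted on the range key, the ScanIndexForward and Limit options give me exactly the “last 10 trips for this vehicle” page that the Scan approach could not.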
Key design for DynamoDB tables can be quite challenging and can be really different from relational database design. You need to carefully consider all the scenarios in which the data is used, as these largely determine which keys you pick and which secondary indexes you create. These are things you need to consider when designing a relational database too, but DynamoDB’s limitations make them even more important.