Storing 7 million records in Azure

Large numbers are hard to imagine. A thousand sounds like a lot and a million even more. I recently read an example of how to get a better feel for the magnitude of difference between these numbers:

A thousand seconds is nearly 17 minutes.
A million seconds is 11.5 days.
A billion seconds is nearly 32 years.

I wish I had read this before I started my latest assignment. Then I would have known that seven million records is a massive amount to process.

Back into computer history

The assignment was to import the data from an old mainframe system into a shiny new system built in .NET using the latest technologies. The old system runs on an HP 3000 computer the size of a refrigerator, and its data is stored in KSAM files. These files would be exported and fed into the component I worked on, which would ingest the data and import it into the new system.

1973 Promo Photo – HP 3000. Courtesy of HP Computer Museum.

The old system has been developed over the last 40 years, and building a replacement is a big task. For the time being both systems would run side by side, so data from the old system has to be imported into the new system on a daily basis to keep everything in sync.

My first challenge was to read the individual records from the files and store them into a database for further processing. There were about 20 of these KSAM files. The smallest file contained 56 records and the largest had more than 6 million records. In total there were more than 7 million records.

The team had worked on an earlier version of the importer that processed much smaller files. They had run into issues inserting many rows into a SQL Server database: it was just not fast enough to process the larger files within 24 hours. My task was to look for a database system that was fast enough.

Azure DocumentDB is not fast enough

The new system runs in Azure, so we first looked at Azure's own storage options. Our first candidate was Azure DocumentDB. We liked the flexibility of this database system, and it is very similar to MongoDB – a database some team members already had experience with.

After building a first implementation we found that inserting into Azure DocumentDB was a lot faster than into SQL Server, but still not fast enough. A major drawback is that batch operations are not supported out of the box: every insert requires a separate request, and although we ran multiple requests in parallel, it was still too slow. Inserting the records from the largest file took about 12 hours, leaving not enough time for further processing.
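For illustration, this is roughly what that looked like. It is a minimal sketch assuming the DocumentDB .NET SDK, with a placeholder record type and collection link rather than our actual code:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

public class DocumentDbImporter
{
    private readonly DocumentClient client;
    // Self-link of the target collection, e.g. "dbs/import/colls/records" (placeholder).
    private readonly string collectionLink;

    public DocumentDbImporter(DocumentClient client, string collectionLink)
    {
        this.client = client;
        this.collectionLink = collectionLink;
    }

    // Every record is a separate request; the only lever is running them in parallel.
    public async Task ImportAsync(IEnumerable<object> records, int degreeOfParallelism = 50)
    {
        foreach (var chunk in records
            .Select((record, index) => new { record, index })
            .GroupBy(x => x.index / degreeOfParallelism, x => x.record))
        {
            await Task.WhenAll(chunk.Select(record =>
                client.CreateDocumentAsync(collectionLink, record)));
        }
    }
}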

Azure Table Storage does it

Our next try was Azure Table Storage. This database system is designed for high scalability and availability, and it accepts some constraints to reach that goal. Each entity needs two keys – a partition key and a row key – which are used to store and retrieve the data efficiently. The querying options are limited: there is no way to sort the data, because it is always stored and returned in the order of those two keys. There is no option to create secondary indexes either, so a query on any other property leads to a table scan, which can be quite slow when there are a lot of records. This all forces you to think carefully about how you store your data in Azure Table Storage. The way the data is used defines the design of the table structures, and you may need to store the same data in multiple tables with different keys to be able to query by them.

We expected that the batch operation capabilities of Azure Table Storage would help speed up the inserts of our millions of records. This was inspired by an article by Troy Hunt in which he writes that he was able to store 154 million records at a speed of 22,500 rows per second. That sounded pretty impressive, and if we could do something similar our problem would be solved.
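Batched inserts work roughly like this with the Azure Storage client library (Microsoft.WindowsAzure.Storage). This is a minimal sketch, not our actual importer; the RecordEntity type and the table name are placeholders. A batch may contain at most 100 entities and all of them must target the same partition key, so the entities are grouped first:

using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class RecordEntity : TableEntity
{
    public string Payload { get; set; }
}

public static class TableStorageImporter
{
    public static void Import(string connectionString, IEnumerable<RecordEntity> entities)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudTable table = account.CreateCloudTableClient().GetTableReference("records");
        table.CreateIfNotExists();

        // Group by partition key: a batch may only target a single partition.
        foreach (var partition in entities.GroupBy(e => e.PartitionKey))
        {
            var batch = new TableBatchOperation();
            foreach (RecordEntity entity in partition)
            {
                batch.Insert(entity);
                if (batch.Count == 100) // maximum batch size
                {
                    table.ExecuteBatch(batch);
                    batch = new TableBatchOperation();
                }
            }
            if (batch.Count > 0)
            {
                table.ExecuteBatch(batch);
            }
        }
    }
}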

After working around the limitations of Azure Table Storage, like the character restrictions on property names and key values, we were able to store all seven million records in 4 hours and 20 minutes. That comes down to roughly 470 records per second. Not as impressive as Troy Hunt's results, but his records were quite small and ours contained a lot more data; the data rate was probably about the same.
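To give an idea of the kind of workaround involved: partition and row key values may not contain characters such as '/', '\', '#', '?' or control characters, so key values derived from the source data had to be cleaned up first. A simplified sketch, not the actual production code:

using System.Linq;

public static class TableKeys
{
    // Characters that are not allowed in PartitionKey and RowKey values.
    private static readonly char[] Forbidden = { '/', '\\', '#', '?' };

    public static string Sanitize(string value)
    {
        // Replace forbidden and control characters with an underscore.
        var cleaned = value.Select(c =>
            Forbidden.Contains(c) || char.IsControl(c) ? '_' : c);
        return new string(cleaned.ToArray());
    }
}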

Next…

The next challenge was to compare the current import to the previous one to detect any changes. I'll write about that in my next post.

Wrapping My Head Around DynamoDB Keys

Mileage Registration

For a side project I’m building an application for mileage registration, something you need to do when driving a company car. Normally you have to pay tax for the private use of a company car, unless you drive less than 500 private kilometers a year. To qualify for that exemption you need to be able to hand over a mileage registration (kilometerregistratie) to the tax authorities that proves you drove less than 500 private kilometers. This registration is quite comprehensive: every trip needs to be registered, including the mileage and location at the start and end.

My idea is to build a web site (or app, I’m not sure yet) where you can enter this data. The site or app talks to a web API that stores its data in DynamoDB. I’m doing this project mainly to learn new stuff like Node, Angular and, as I already mentioned, DynamoDB.

Natural or Surrogate?

One of the things I really can’t wrap my head around is key design in DynamoDB. Like most developers I have always stored my data in a relational database, almost without exception using an auto-incrementing integer as the primary key. There is hardly ever a case for a natural key. There are a lot of reasons for this, mainly performance, but also because natural keys tend to change anyway.

To get some inspiration I looked at some of the examples given by Amazon. The interesting thing is that these examples tend to use natural keys like date and title. This freaks me out a little: what happens when the value of one of these keys changes? Do you delete the old item and add a copy with the new key? And what about references to that item? It becomes practically impossible to change those values later. ‘Sorry, you can’t change the title of this forum.’ That would be quite limiting.

First Design

At first I came up with the following table design:

User
  Hash key:  UserID

Vehicle
  Hash key:  UserID
  Range key: RegistrationNumber

Trip
  Hash key:  UserID-RegistrationNumber
  Range key: TripDatetime

As you can see this design was based on the Forums example from Amazon.

Although the design looked okay at first, I ran into quite a few issues while implementing it in a web API. The main problem was the key design of the Trip table. The keys were chosen this way because of how trips would be displayed in the application: you always want to show a list of trips for a particular vehicle ordered by date, for instance to hand it over to the tax authorities. That scenario was covered pretty well by this key design.

The issue was the Range Key. Choosing the trip date as the Range Key meant that it became really hard, or virtually impossible, to change that value. And you do want to be able to update the trip date: for instance, when a trip is saved with today’s date by default, you could enter the wrong date by mistake while saving a trip you made yesterday.
The sequence of API requests for updating the date would also become a bit unnatural. The client would either have to delete and recreate:

-> DELETE /api/vehicles/67-PK-KS/trips/2015-10-02T14:34Z
<- 200 OK
-> POST /api/vehicles/67-PK-KS/trips
<- 201 Created

…or I had to come up with a new URL for the resource after the date changed. I could have handled that with one of the 3xx HTTP status codes, but that didn’t seem like a good idea either.

And what about two trips on the same date? I could include the time down to the millisecond to avoid collisions, but it all started to feel a bit odd.

Surrogate it has to be

I experimented a little with other natural keys, like the start mileage or a combination of start and end mileage. But all of these had the same issue: the inability to change the values later. The Trip entity simply didn’t have any successful natural key candidates, so I had to introduce a surrogate key, in this case some sort of UUID.

Trip
  Hash key:  TripID

Now that I had sorted out the problems with updating trips, I had introduced a new one. Hash keys are stored in random order, so it becomes much harder to get all trips for a particular vehicle ordered by date. A Scan operation would get me all the trips filtered by vehicle, but they wouldn’t be in the right order yet; sorting would still have to be done after all the data is returned. This meant that to show a page of, say, 10 trips, every request would have to load and sort all trips for a vehicle. Performance would be really bad, and it would also become really pricey because you pay Amazon for data transfer and you would run out of the provisioned capacity really fast.

Secondary Indexes

Luckily there’s a solution for this problem: DynamoDB now comes with secondary indexes. There are two types. Local secondary indexes have the same hash key as the table but a different range key; global secondary indexes allow you to choose another hash key. The downside of a global secondary index is that every write to the table also consumes capacity for the index, so you need to provision more capacity for tables with this type of index and writing becomes more expensive.

In my case a global secondary index was required:

Trip Index
  Hash key:  UserID-RegistrationNumber
  Range key: TripDatetime

This looks exactly like my first table design, except that this one is maintained automatically by DynamoDB. You might wonder where the UserID-RegistrationNumber key comes from. It is an attribute I need to add to the Trip table, and its value is constructed by combining the UserID and the vehicle’s registration number. This makes it possible to perform a Query request directly on the index instead of having to filter on UserID (either client side or with a Scan operation).
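The project itself talks to DynamoDB from Node, but to make the idea concrete, here is a rough sketch of such a query using the AWS SDK for .NET. The table, index and attribute names follow the design above; the index name TripIndex is an assumption, since index names cannot contain spaces:

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class TripQueries
{
    // Returns all trips for one vehicle of one user, sorted by TripDatetime.
    public static Task<QueryResponse> GetTripsAsync(
        IAmazonDynamoDB client, string userId, string registrationNumber)
    {
        var request = new QueryRequest
        {
            TableName = "Trip",
            IndexName = "TripIndex", // the global secondary index
            // The attribute name contains a '-', so it needs a name placeholder.
            KeyConditionExpression = "#key = :key",
            ExpressionAttributeNames = new Dictionary<string, string>
            {
                { "#key", "UserID-RegistrationNumber" }
            },
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                { ":key", new AttributeValue { S = userId + "-" + registrationNumber } }
            },
            ScanIndexForward = true // ascending by the range key (TripDatetime)
        };
        return client.QueryAsync(request);
    }
}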

Conclusion

Key design for DynamoDB tables can be quite challenging and is really different from relational database design. You need to carefully consider every scenario in which the data is used, because that largely determines which keys you pick and which secondary indexes you create. These are things you also need to consider when designing a relational database, but DynamoDB’s limitations make them even more important.

Return empty collection instead of null

In the solution I’m currently working on I see the following code a lot:

public IEnumerable<Order> GetOrdersByStatus(OrderStatus status)
{
    DataSet orderSet;

    // Get orders from database

    if (DataHelper.IsEmpty(orderSet))
    {
        return null;
    }
    List<Order> orders = new List<Order>();
    
    // convert to Orders

    return orders;
}

As you can see, the method returns null when there are no results. The problem with this becomes obvious when you use the method:

IEnumerable<Order> orders = GetOrdersByStatus(OrderStatus.Open);
if (orders != null)
{
    foreach(Order order in orders)
    {
        // do something with orders
    }
}

The caller of the method always has to check for a null value. If one forgets, you end up with a NullReferenceException, which is a pain to debug.

I prefer to always return an empty collection when there are no orders, for instance:

public IEnumerable<Order> GetOrdersByStatus(OrderStatus status)
{
    DataSet orderSet;

    // Get orders from database

    if (DataHelper.IsEmpty(orderSet))
    {
        return Enumerable.Empty<Order>();
    }
    List<Order> orders = new List<Order>();
    
    // convert to Orders

    return orders;
}

This will make it much easier for the caller to process the result:

IEnumerable<Order> orders = GetOrdersByStatus(OrderStatus.Open);
foreach(Order order in orders)
{
    // do something with orders
}

The behavior will be the same, but it’s easier to get right.
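And when you are stuck with an API you cannot change that might still return null, the caller can normalize the result with the null-coalescing operator, for example:

IEnumerable<Order> orders = GetOrdersByStatus(OrderStatus.Open)
    ?? Enumerable.Empty<Order>();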

How to Copy SQL Server Database in Amazon RDS

The easiest way of duplicating a database in SQL Server is by using backup and restore. Unfortunately this feature is not available in Amazon RDS because there is no file system access.

This is a walkthrough for copying a database in RDS:

Open SQL Server Management Studio

Open SQL Server Management Studio and connect to the database server. All steps are also available in Express Edition.

Generate script

First you need to generate a script for all the objects in the database, excluding the data. You do this by right-clicking the source database and choosing Tasks > Generate Scripts…

Select the objects you want to duplicate in your database.

Generate-Scripts-1

If you don’t have any DDL triggers of your own, don’t select them at all. Amazon adds a DDL trigger called rds_deny_backups_trigger to each database to prevent you from creating a backup. If you do script DDL triggers, make sure this one is excluded from the generated script.

Also, don’t select the master user that was created when the RDS database server was initialized; every new database already has this user.

Click Next.

Generate-Scripts-2

Select Save to new query window. You could opt to save it to a file, but in my case I only need the script once.

Click on Advanced.

Generate-Scripts-3
  • Set Script USE DATABASE to False
  • Set Script Indexes to True
  • Set Script Triggers to True

Click Next and Next.

Wait until the script is created.

Create new database

Now create the new, empty database by right-clicking Databases and selecting New Database…

Enter the name of the database and press OK.

Change the connection of the generated script window to the new database and run the script.

Disable triggers and foreign key constraints

Run the following script on the new database to disable triggers and foreign key constraints:

--Disable foreign keys on all tables
DECLARE @table_name SYSNAME;
DECLARE @schema_name SYSNAME;
DECLARE @cmd NVARCHAR(MAX);
DECLARE table_cursor CURSOR FOR
    SELECT sys.tables.name, sys.schemas.name AS schemaName
    FROM sys.tables
    INNER JOIN sys.schemas ON sys.tables.schema_id = sys.schemas.schema_id;

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @table_name, @schema_name;

WHILE @@FETCH_STATUS = 0
BEGIN
    SELECT @cmd = 'ALTER TABLE ' + QUOTENAME(@schema_name) + '.' + QUOTENAME(@table_name) + ' NOCHECK CONSTRAINT ALL';
    EXEC (@cmd);
    FETCH NEXT FROM table_cursor INTO @table_name, @schema_name;
END

CLOSE table_cursor;
DEALLOCATE table_cursor;
GO

--Disable triggers on all tables
--Note: this part assumes all tables are in the dbo schema.
DECLARE @enable BIT = 0;
DECLARE @trigger SYSNAME;
DECLARE @table SYSNAME;
DECLARE @cmd NVARCHAR(MAX);
DECLARE trigger_cursor CURSOR FOR
    SELECT trigger_object.name AS trigger_name,
           table_object.name AS table_name
    FROM sysobjects trigger_object
    JOIN sysobjects table_object ON trigger_object.parent_obj = table_object.id
    WHERE trigger_object.type = 'TR';

OPEN trigger_cursor;
FETCH NEXT FROM trigger_cursor INTO @trigger, @table;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @enable = 1
        SET @cmd = 'ENABLE ';
    ELSE
        SET @cmd = 'DISABLE ';

    SET @cmd = @cmd + 'TRIGGER dbo.' + QUOTENAME(@trigger) + ' ON dbo.' + QUOTENAME(@table);
    EXEC (@cmd);
    FETCH NEXT FROM trigger_cursor INTO @trigger, @table;
END

CLOSE trigger_cursor;
DEALLOCATE trigger_cursor;

GO

Copy all data

The next step is to copy all the data from the source database to the new database. For this you use the SQL Server Import and Export Wizard, which you open by right-clicking the source database and choosing Tasks > Export Data…

First enter the connection details for the source database.

import-export-1

Then enter the connection details for the target database.

import-export-2

Highlight all tables by holding the Shift key and clicking the first and last table, then click one of the check boxes to the left of the tables. This selects all tables. Make sure not to select any views, because there are no destination tables to insert into.

import-export-4

Keep the tables highlighted and click Edit Mappings…

import-export-5

Check Enable identity insert. Close the dialog by clicking OK.

Click Next.

Choose Run immediately.

import-export-6

Click Next, Next and Finish.

Wait until the import completes.

Enable triggers and foreign key constraints

Enable all triggers and foreign key constraints on the target database by running this script:

--Enable foreign keys on all tables
DECLARE @table_name SYSNAME;
DECLARE @schema_name SYSNAME;
DECLARE @cmd NVARCHAR(MAX);
DECLARE table_cursor CURSOR FOR
    SELECT sys.tables.name, sys.schemas.name AS schemaName
    FROM sys.tables
    INNER JOIN sys.schemas ON sys.tables.schema_id = sys.schemas.schema_id;

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @table_name, @schema_name;

WHILE @@FETCH_STATUS = 0
BEGIN
    SELECT @cmd = 'ALTER TABLE ' + QUOTENAME(@schema_name) + '.' + QUOTENAME(@table_name) + ' CHECK CONSTRAINT ALL';
    EXEC (@cmd);
    FETCH NEXT FROM table_cursor INTO @table_name, @schema_name;
END

CLOSE table_cursor;
DEALLOCATE table_cursor;
GO
--Enable triggers on all tables
--Note: this part assumes all tables are in the dbo schema.
DECLARE @enable BIT = 1;
DECLARE @trigger SYSNAME;
DECLARE @table SYSNAME;
DECLARE @cmd NVARCHAR(MAX);
DECLARE trigger_cursor CURSOR FOR
    SELECT trigger_object.name AS trigger_name,
           table_object.name AS table_name
    FROM sysobjects trigger_object
    JOIN sysobjects table_object ON trigger_object.parent_obj = table_object.id
    WHERE trigger_object.type = 'TR';

OPEN trigger_cursor;
FETCH NEXT FROM trigger_cursor INTO @trigger, @table;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @enable = 1
        SET @cmd = 'ENABLE ';
    ELSE
        SET @cmd = 'DISABLE ';

    SET @cmd = @cmd + 'TRIGGER dbo.' + QUOTENAME(@trigger) + ' ON dbo.' + QUOTENAME(@table);
    EXEC (@cmd);
    FETCH NEXT FROM trigger_cursor INTO @trigger, @table;
END

CLOSE trigger_cursor;
DEALLOCATE trigger_cursor;
GO

Set permissions

Only the users were imported, not their roles and permissions. These have to be added manually.

Update

Dan commented that the SQL scripts didn’t support schemas. He provided me with an updated version that takes schemas into account. I updated the scripts in this post. Thanks Dan for this great contribution! An update for triggers will follow soon.

Error “Keyword not supported: ‘data source’.” using Octopus Deploy

We recently started using Octopus Deploy for our deployments. After the last deployment I got this strange error:

System.ArgumentException: Keyword not supported: 'data source'.
Stack Trace:
at System.Data.EntityClient.DbConnectionOptions.ParseInternal(Hashtable parsetable, String connectionString, Hashtable synonyms)
at System.Data.EntityClient.DbConnectionOptions..ctor(String connectionString, Hashtable synonyms)
at System.Data.EntityClient.EntityConnection.ChangeConnectionString(String newConnectionString)
at System.Data.Entity.Internal.LazyInternalConnection.TryInitializeFromAppConfig(String name, AppConfig config)
at System.Data.Entity.Internal.LazyInternalConnection.Initialize()
...

The exception stack trace indicated that it had something to do with Entity Framework. I assumed the connection string could have been misconfigured on the server. We use the Configuration variables feature of Octopus Deploy to replace some connection strings in the web.config file.

The value for the connection string looked alright:

Quot-Octopus-Deploy

The connection string is for Entity Framework; its value contains another connection string inside quotes.

Then I had a look at the actual web.config file on the server. This showed something interesting in the connectionStrings section:

<add name="DataModelContainer" connectionString="metadata=[...] connection string=&amp;quot;data source=[....]&amp;quot;" providerName="System.Data.EntityClient"/>

Apparently the ‘&quot;’ was encoded twice to ‘&amp;quot;’. This was done when Octopus Deploy replaced the connection string value.

The solution was not to use the XML-encoded value in Octopus Deploy: the ‘&quot;’ had to be replaced with a plain double quote (").

Passing a List to a Stored Procedure Using a User Defined Table Type

By default stored procedures only allow you to use scalar values as parameters. Sometimes you would like to use a list of values, for instance when selecting a set of items or inserting multiple values in one statement.

Previously I would either use a comma-separated string that had to be parsed, or an XML parameter. Recently I discovered a third – probably better – way of doing this: using a User Defined Table Type. This is supported from SQL Server 2008 onwards.

First you create the User Defined Table Type:

CREATE TYPE VarcharList AS TABLE (
	Value varchar(50) NOT NULL PRIMARY KEY
)

Then you use it as a parameter in a stored procedure:

CREATE PROCEDURE GetItemsByIDs
    @ids VarcharList READONLY
AS
BEGIN
    SELECT * FROM Items
    WHERE ID IN (SELECT Value FROM @ids);
END

The parameter has to be defined as READONLY to make this work.

In the code calling the stored procedure you add a parameter of type SqlDbType.Structured and set its TypeName property to the name of the table type you just created.

SqlParameter sqlParameter = command.Parameters.Add("ids",
    SqlDbType.Structured);
sqlParameter.TypeName = "VarcharList";
var sqlMetaData = new SqlMetaData("Value", 
    SqlDbType.VarChar, 50);
sqlParameter.Value = IDs.Select(id =>
{
    var record = new SqlDataRecord(sqlMetaData);
    record.SetString(0, id);
    return record;
});

One of the possible types you can use as the parameter value is IEnumerable<SqlDataRecord>. In the example above LINQ is used to create it.
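For completeness, here is a sketch of what the full call could look like, including the namespaces involved (SqlMetaData and SqlDataRecord live in Microsoft.SqlServer.Server). The Items table and its Name column are illustrative assumptions:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using Microsoft.SqlServer.Server;

public class ItemRepository
{
    private readonly string connectionString;

    public ItemRepository(string connectionString)
    {
        this.connectionString = connectionString;
    }

    public IList<string> GetItemNamesByIds(ICollection<string> ids)
    {
        var names = new List<string>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("GetItemsByIDs", connection))
        {
            command.CommandType = CommandType.StoredProcedure;

            // Table-valued parameter backed by the VarcharList type.
            SqlParameter parameter = command.Parameters.Add("ids", SqlDbType.Structured);
            parameter.TypeName = "VarcharList";
            var metaData = new SqlMetaData("Value", SqlDbType.VarChar, 50);
            parameter.Value = ids.Select(id =>
            {
                var record = new SqlDataRecord(metaData);
                record.SetString(0, id);
                return record;
            });

            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Assumes the Items table has a Name column.
                    names.Add(reader.GetString(reader.GetOrdinal("Name")));
                }
            }
        }
        return names;
    }
}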

Interestingly you will get an error when the parameter value doesn’t contain any rows. I added an extra argument check to prevent this:

public IList<Item> GetItemsByIds(ICollection<string> ids)
{
    if (ids == null) throw new ArgumentNullException("ids");
    if (ids.Count == 0) throw new ArgumentException
        ("Collection cannot be empty", "ids");

I have not done any performance tests, but I imagine this will perform better than the other options because the parameter is already structured like SQL Server's native types.