Large numbers are hard to imagine. A thousand sounds like a lot and a million even more. I recently read an example of how to get a better feel for the magnitude of difference between these numbers:
A thousand seconds is nearly 17 minutes.
A million seconds is 11.5 days.
A billion seconds is nearly 32 years.
I wish I had read this before I started on my latest assignment. In that case I would have known that seven million records are a massive amount to process.
Back into computer history
The assignment was to import the data from a old mainframe system into a shiny new system that was build in .NET using the latest technologies. This old system runs on a HP 3000 computer that has the size of a refrigerator. The data is stored in KSAM format. These files would be exported and fed into the component I worked on. This component would ingest the data and import it into the new system.
The old system was developed over the last 40 years and it is a big task to build a new system to replace it. For the time being both systems would run side by side. Data from the old system has to be imported into the new system on a daily basis to keep everything in sync.
My first challenge was to read the individual records from the files and store them into a database for further processing. There were about 20 of these KSAM files. The smallest file contained 56 records and the largest had more than 6 million records. In total there were more than 7 million records.
The team had worked on a earlier version of the importer that would process much smaller files. They experienced issues with inserting many rows into a SQL Server database. It was just not fast enough to process the larger files within 24 hours. My task was to look for a database system that was fast enough to do this.
Azure DocumentDB is not fast enough
The new system is running in Azure so we first looked at the storage options of Azure. Our first candidate was Azure DocumentDB. We liked the flexibility of this database system and it was very similar to MongoDB – a database system some team members already had experience with.
After creating a first implementation we found out that inserting into Azure DocumentDB was a lot faster than in SQL Server, but not quite fast enough. One of the major drawbacks was that batch operations are not supported out of the box. For every insert a separate request had to be made and although we ran multiple requests in parallel, it still was too slow. Inserting the records from the largest file took about 12 hours, leaving not enough time for further processing.
Azure Table Storage does it
Our next try was by using Azure Table Storage. This database system is designed for high scalability and availability. It comes with some constraints or limitations to reach this design goal. Each entry needs to have two keys which are used for storing and retrieving the data in an efficient way. Also the querying options are limited. There is no way to sort the data. The data is sorted by its two keys and that is the way it is returned. There is no option to create secondary indexes, so queries by properties other than these keys will lead to a table scan. This can be quite slow when there are a lot of records. This all forces you to think carefully about how you want to store your data in Azure Table Storage. They way the data is used defines the design of the table structures. You may need to store the same data in multiple tables using different keys to be able to query by these keys.
We expected that the batch operation capabilities of Azure Table Storage would help speed up the inserts of our millions of records. This was inspired by an article from Troy Hunt in which he writes that he was able to store 154 million records with a speed of 22,500 rows per second. That sounded pretty impressive to us and if we would be able to do something similar then our problem would be solved.
So after working around the limitations of Azure Table Storage, like the character limitation for property names and key values we were able to store all seven million records in 4 hours and 20 minutes. This comes down to about 470 records per second. Not as impressive as Troy Hunt’s results, but his records were quite small and our records contained a lot more data. The data rate was probably about the same.
The next challenge was to compare the current import to the previous to detect any changes. I’ll write about that in my next post.