
Moving 21 million documents to Microsoft Azure with Spring Batch

Background

While updating a messaging system, we decided to move all the unstructured content out of a relational database and into a dedicated document store. Keeping only an identifier for each document considerably reduces the size of the database, saving storage and easing database administration. Driven by ease of management, security and cost, the chosen document store was a Software as a Service solution: SharePoint Online, which runs on Microsoft's cloud, Azure.

Problem to solve

Very briefly: move 21 million documents from our on-premises infrastructure to the cloud in less than 30 days.

In more detail

As with all migrations, we need to move the data and map each source identifier to the ID in the new location, so that references to the migrated documents can be updated.
The high-level stages for this move were ...
Scan the eligible items to be migrated and store their identification, together with the details needed to access their content, in a single table. This table keeps track of the items to move while presenting a simplified view of the source data model (a minimal sketch of such an item follows this list).
Move the items to the cloud and save the document store identifier for each one. Given the high volume, millions of documents, this step has to allow for interruptions and restarts.
Finally, update the "structured only" database to add references to the migrated documents.
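As an illustration, a row in that tracking table might carry the source identifier, the details needed to reach the content, and, once migrated, the SharePoint identifier. The class below is a hypothetical sketch, not the actual data model used in the migration; it is also reused in the later Spring Batch sketch.

// Hypothetical item tracked during the migration; the field names are assumptions,
// not the actual source data model.
public class DocumentItem {

    private long sourceId;        // identifier in the relational database
    private String sourcePath;    // where the unstructured content can be read from
    private String targetId;      // SharePoint Online identifier, filled in once migrated

    public long getSourceId() { return sourceId; }
    public void setSourceId(long sourceId) { this.sourceId = sourceId; }

    public String getSourcePath() { return sourcePath; }
    public void setSourcePath(String sourcePath) { this.sourcePath = sourcePath; }

    public String getTargetId() { return targetId; }
    public void setTargetId(String targetId) { this.targetId = targetId; }
}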

Quick start with great products

Office 365 SDK
First of all, the interface to Microsoft Azure and SharePoint Online: Microsoft has published a REST API and a bulk upload tool. The bulk upload tool is a very clever mechanism to push a large volume of content as a big blob to the cloud. Once available on Azure, you can move it to the final destination in SharePoint Online. At the end, you can locate every single item by path: on your source drive, a file with path a\b\c\d\file.txt would, after migration, be accessible via the SharePoint site root URL followed by the path a/b/c/d/file.txt. It is then possible to obtain the document identifier from the path using the REST API.

The REST API allows access from the technology stack we use: the Java language, Linux servers and WebLogic application servers. Beyond the API, Microsoft has published an SDK for Java. Reusing and extending the SDK was preferred over implementing a new REST API client from scratch. It provides out of the box an integration with the Azure Authentication library for Java, and only a few issues had to be sorted out to make the SDK work.

After a few tests, the performance of an identifier lookup (to translate a document path to an identifier) was of the same order of magnitude as creating a document through the REST API. Since a lookup would still have been needed for every document pushed as part of the blob, the bulk upload offered little benefit; I therefore skipped the step of migrating all content as a blob and did the migration with the REST API only.
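For illustration, the path-to-identifier lookup can be expressed against the standard SharePoint REST endpoint GetFileByServerRelativeUrl. The sketch below is a hypothetical example, not the SDK code used in the migration; authentication token handling and JSON parsing are left out.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

public class PathToIdLookup {

    /**
     * Resolves a server-relative path (e.g. /sites/archive/a/b/c/d/file.txt) to the
     * file's metadata, from which the UniqueId can be extracted by the caller.
     * The site URL and bearer token are assumptions supplied by the caller.
     */
    public static String fetchFileMetadata(String siteUrl, String serverRelativePath,
                                           String bearerToken) throws Exception {
        URL url = new URL(siteUrl + "/_api/web/GetFileByServerRelativeUrl('"
                + serverRelativePath + "')?$select=UniqueId");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestProperty("Accept", "application/json");
        connection.setRequestProperty("Authorization", "Bearer " + bearerToken);
        try (Scanner scanner = new Scanner(connection.getInputStream(), "UTF-8")) {
            return scanner.useDelimiter("\\A").next();   // raw JSON response
        }
    }
}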

Spring Batch
For implementing long-running tasks, Spring Batch is extremely well suited. It provides a structure to describe your long-running task: a job in batch jargon. A job consists of at least one step. A step can either be something completely custom, in which case it implements the Tasklet interface, or it can be decomposed as ...
A Reader reads the items to process. All of this is chunk oriented, so you specify how many items are read per processing chunk.
The reader hands over to an optional Processor, which works at the item level. A Composite Processor can be used with a list of processors that are invoked one after the other: the result of processor 1 is fed as input to processor 2, whose result is fed as input to processor 3, and so on.
Finally, a Writer completes the step. A writer gets the entire chunk in one go. It is also important to note (for jobs with more than one step) that steps are sequential: step 2 will only start once step 1 has completed successfully.
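As a rough illustration of that structure, a chunk-oriented step wiring a reader, a composite processor and a writer could look like the sketch below. The bean and class names (DocumentItem, the reader, processors and writer) are assumptions for the example, not the actual migration code.

import java.util.List;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.CompositeItemProcessor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class MigrationJobConfig {

    @Bean
    public Job migrationJob(JobBuilderFactory jobs, Step uploadStep) {
        // Steps run sequentially; this sketch has a single step.
        return jobs.get("migrationJob").start(uploadStep).build();
    }

    @Bean
    public Step uploadStep(StepBuilderFactory steps,
                           ItemReader<DocumentItem> reader,
                           List<ItemProcessor<DocumentItem, DocumentItem>> processors,
                           ItemWriter<DocumentItem> writer) throws Exception {
        // Processors are chained: the output of one is fed as input to the next.
        CompositeItemProcessor<DocumentItem, DocumentItem> composite = new CompositeItemProcessor<>();
        composite.setDelegates(processors);
        composite.afterPropertiesSet();

        return steps.get("uploadStep")
                .<DocumentItem, DocumentItem>chunk(100)   // items read, processed and written per chunk
                .reader(reader)
                .processor(composite)
                .writer(writer)
                .build();
    }
}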

Throwing threads does not perform

The methods of the SharePoint SDK return ListenableFutures, for instance the method to create a document. "Why would I want to deal with Futures?" I thought at the start, so the methods were initially called in a blocking way: the executing thread waits until the asynchronous method m() completes and returns a result. To increase performance, I thought: let's have many threads pushing documents in parallel. Performance measured after some tests indicated that 220 days would be required to complete the migration with this implementation. Option canned! The created threads were merely waiting for the network calls to complete and as such were not improving the overall performance. On the contrary, they were deteriorating it, as the server limits how many threads can be created in parallel.
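As an illustration of that first, blocking style (the exact SDK method is not reproduced here, so only the future it returns is shown):

import com.google.common.util.concurrent.ListenableFuture;

public class BlockingUsage {

    /**
     * Hypothetical blocking usage of an asynchronous SDK call: the future would be
     * returned by a method such as the one creating a document.
     */
    public static <T> T waitFor(ListenableFuture<T> result) throws Exception {
        // get() blocks the executing thread until the network call completes,
        // so the extra threads spend their time idle, waiting on I/O.
        return result.get();
    }
}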

Using non blocking operations

The second instalment used one thread only. When a chunk of N items was ready to be written, the single execution thread fired all creations in a non-blocking way, so there were at most N network calls in flight. The main thread then waited for the completion of all calls.
The write operation succeeded only if all N calls succeeded.
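A rough sketch of that pattern, assuming Guava-style ListenableFutures and a placeholder interface for the SDK call that creates a document:

import java.util.ArrayList;
import java.util.List;

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;

public class NonBlockingWriter {

    /**
     * Fires one asynchronous creation per item in the chunk, then waits once
     * for all of them. startUpload() stands in for the real SDK call.
     */
    public static void write(List<DocumentItem> chunk, Uploader uploader) throws Exception {
        List<ListenableFuture<String>> inFlight = new ArrayList<>();
        for (DocumentItem item : chunk) {
            // Non-blocking: the call returns immediately with a future.
            inFlight.add(uploader.startUpload(item));
        }
        // Single blocking point: fails if any of the N calls failed.
        List<String> documentIds = Futures.allAsList(inFlight).get();
        for (int i = 0; i < chunk.size(); i++) {
            chunk.get(i).setTargetId(documentIds.get(i));
        }
    }

    /** Placeholder for the SDK call that creates a document asynchronously. */
    public interface Uploader {
        ListenableFuture<String> startUpload(DocumentItem item);
    }
}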

Retry mechanisms

A piece of code that runs perfectly against an on-premises service may fail frequently in a cloud deployment. Various errors appear: connection errors, session expiry. To smooth over these errors, methods were invoked with the async-retry mechanism from Tomasz Nurkiewicz:
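The snippet below is a sketch based on the library's documented usage, not the exact code from the migration; lookupDocumentId is a placeholder for the real call.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

import com.nurkiewicz.asyncretry.AsyncRetryExecutor;
import com.nurkiewicz.asyncretry.RetryExecutor;

public class RetryingLookup {

    private final RetryExecutor retryExecutor;

    public RetryingLookup() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Retry transient cloud errors with exponential backoff.
        this.retryExecutor = new AsyncRetryExecutor(scheduler)
                .retryOn(Exception.class)
                .withExponentialBackoff(500, 2)   // 500 ms, then 1 s, 2 s, ...
                .withMaxDelay(10_000)             // never wait more than 10 s between attempts
                .withUniformJitter()
                .withMaxRetries(5);
    }

    public CompletableFuture<String> lookupWithRetry(String path) {
        // The wrapped call is retried according to the strategy above.
        return retryExecutor.getWithRetry(ctx -> lookupDocumentId(path));
    }

    private String lookupDocumentId(String path) {
        throw new UnsupportedOperationException("placeholder for the real lookup call");
    }
}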
An individual retry strategy improves stability greatly. Another improvement was to introduce a retry at the chunk level for the items that had failed.

Conclusion

While running the migration we noticed some variability in the throughput: from 40 thousand documents moved per hour to over 100 thousand. This makes one think that the chunk size, the equivalent of the pipe diameter in fluid mechanics, should be adjustable rather than fixed (a hypothetical feedback rule is sketched after this list):
  • if the failure rate increases, it is a sign that the throughput may be too high. The chunk size should be reduced.
  • if there are no errors, one could attempt to increase the chunk size.
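As a purely hypothetical illustration of that feedback rule (nothing like this was implemented in the migration, which used a fixed chunk size), the adjustment could look like:

/**
 * Hypothetical feedback rule for an adjustable chunk size; not part of the
 * actual migration.
 */
public class AdaptiveChunkSize {

    private int chunkSize = 100;   // starting value, an arbitrary assumption

    public int nextChunkSize(int failedItemsInLastChunk) {
        if (failedItemsInLastChunk > 0) {
            // Failures suggest the throughput is too high: shrink the chunk.
            chunkSize = Math.max(10, chunkSize / 2);
        } else {
            // No errors: cautiously probe a larger chunk.
            chunkSize = Math.min(1_000, chunkSize + 10);
        }
        return chunkSize;
    }
}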
This experience made me really understand some of the reactive programming style (see the excellent intro article), where the subscriber influences the throughput.
