
Moving 21 million documents to Microsoft Azure with Spring Batch

Background

While updating a messaging system we decided to move all the unstructured content out of a relational database and into a dedicated document store. Keeping only an identifier for each document considerably reduces the size of the database, which saves storage and simplifies database administration. Driven by ease of management, security and cost, the chosen document store was a Software as a Service solution: SharePoint Online, which runs on Microsoft's Azure cloud.

Problem to solve

Very briefly: move 21 million documents from our on-premises system to the cloud in less than 30 days.

In more detail

As with all migrations, we need to move the data and map each source identifier to its ID in the new location, so that references to the migrated documents can be updated.
The high-level stages for this move were:
  • Scan the eligible items to be migrated and store their identifiers and the details needed to access their content in a single table. This table keeps track of the items to move while presenting a simplified view of the source data model.
  • Move the items to the cloud and save the document store identifier for each one. Given the high volume, millions of items, this step has to allow for interruptions and restarts.
  • Finally, update the "structured only" database to add references to the migrated documents.

Quick start with great products

Office 365 SDK
First of all, the interface to Microsoft Azure and SharePoint Online: Microsoft has published a REST API and a bulk upload tool. The bulk upload tool is a very clever mechanism to push a large volume of content to the cloud as one big blob. Once the blob is available on Azure, you can move its content to the final destination in SharePoint Online, and at the end every single item can be located by its path. A file with path a\b\c\d\file.txt on the source drive would be accessible, after migration, at the SharePoint site root URL followed by the path a/b/c/d/file.txt. The REST API can then be used to obtain the document identifier from that path.

The REST API is accessible from the technology stack we use: the Java language, Linux servers and WebLogic application servers. Beyond the API, Microsoft has published an SDK for Java. Reusing and extending the SDK was preferred over implementing a new REST API client from scratch: it provides out of the box an integration with the Azure Authentication Library for Java, and only a few issues had to be sorted out to make the SDK work.

After a few tests, an identifier lookup (translating a document path to an identifier) turned out to cost about the same as creating the document through the REST API. Since the blob route would still have required one REST lookup per document to retrieve its identifier, I skipped the step of migrating all the content as a blob and did the whole migration with the REST API only.
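For illustration, a minimal sketch of the path-to-identifier lookup against the SharePoint REST API. The site URL is a placeholder, authentication is reduced to a pre-obtained bearer token, and the project actually went through the SDK rather than a hand-rolled HTTP client:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DocumentIdLookup {

    // Placeholder site URL, not the real migration target.
    private static final String SITE_URL = "https://contoso.sharepoint.com/sites/archive";

    // Looks up a migrated file by its server-relative path and returns the raw JSON response;
    // the UniqueId field would then be extracted with a JSON parser (omitted here).
    static String fetchUniqueIdJson(HttpClient http, String accessToken, String serverRelativePath)
            throws Exception {
        // URL-encoding of the path is omitted for brevity.
        String url = SITE_URL + "/_api/web/GetFileByServerRelativeUrl('" + serverRelativePath + "')"
                + "?$select=UniqueId";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Bearer " + accessToken) // token obtained from Azure AD beforehand
                .header("Accept", "application/json;odata=nometadata")
                .GET()
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}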

Spring Batch
For implementing long-running tasks, Spring Batch is extremely well suited. It provides a structure to describe your long-running task: a job, in batch jargon. A job consists of at least one step. A step can either be something completely custom, in which case it implements the Tasklet interface, or it can be decomposed as follows:
  • A Reader reads the items to process. All of this is chunk oriented, so you specify how many items are read per processing chunk.
  • The reader hands over to an optional Processor, which works at the item level. A composite processor can be used with a list of processors that are invoked one after the other: the result of processor 1 is fed as input to processor 2, whose result is fed as input to processor 3, and so on.
  • Finally, a Writer completes the step. A writer gets the entire chunk in one go. It is also important to note (for jobs that have more than one step) that steps are sequential: step 2 will only start once step 1 has completed successfully.
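As an illustration, a minimal sketch of such a chunk-oriented step with Spring Batch Java configuration. The reader, processors and writer below are placeholders standing in for the real tracking-table reader and SharePoint writer:

import java.util.Arrays;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.CompositeItemProcessor;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class MigrationJobConfig {

    @Bean
    public Job migrationJob(JobBuilderFactory jobs, Step uploadStep) {
        // A job is made of one or more sequential steps; here a single chunk-oriented step.
        return jobs.get("migrationJob").start(uploadStep).build();
    }

    @Bean
    public Step uploadStep(StepBuilderFactory steps) {
        return steps.get("uploadStep")
                // Chunk size: how many items are read before the whole chunk is handed to the writer.
                .<String, String>chunk(100)
                // Placeholder reader: in the real job this would read the tracking table.
                .reader(new ListItemReader<>(Arrays.asList("a\\b\\c\\file1.txt", "a\\b\\c\\file2.txt")))
                .processor(compositeProcessor())
                // Placeholder writer: in the real job this would push the chunk to SharePoint Online.
                .writer(items -> System.out.println("writing chunk of " + items.size() + " items"))
                .build();
    }

    @Bean
    public ItemProcessor<String, String> compositeProcessor() {
        // Processors run in order: the output of processor 1 is the input of processor 2, and so on.
        ItemProcessor<String, String> normalisePath = path -> path.replace('\\', '/');
        ItemProcessor<String, String> prefixSite = path -> "/sites/archive/" + path;
        CompositeItemProcessor<String, String> composite = new CompositeItemProcessor<>();
        composite.setDelegates(Arrays.asList(normalisePath, prefixSite));
        return composite;
    }
}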

Throwing threads at the problem does not perform

The methods of the SharePoint SDK return ListenableFutures, for instance the method that creates a document. "Why would I want to deal with Futures?" I thought at the start, so the methods were first used in a blocking way: calling get() on the returned future blocks the executing thread until the asynchronous method completes and returns its result. To increase performance, I thought: let's have many threads pushing documents in parallel. Performance measured after some tests indicated that 220 days would be required to complete the migration with this implementation. Option canned! The created threads were merely waiting for the network calls to complete and as such were not improving the overall throughput. On the contrary, they were degrading it, as there was a limit on the server to how many threads could run in parallel.
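A sketch of that first attempt, assuming an SDK-style client whose create method returns a ListenableFuture. The DocumentClient interface and its method are placeholders, not the actual Office 365 SDK signatures:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.google.common.util.concurrent.ListenableFuture;

public class BlockingUploader {

    // Placeholder for the SDK client: the real Office 365 SDK method differs,
    // but it similarly returns a future rather than the created item itself.
    interface DocumentClient {
        ListenableFuture<String> createDocument(String path, byte[] content);
    }

    record Document(String path, byte[] content) {}

    private final DocumentClient client;

    BlockingUploader(DocumentClient client) {
        this.client = client;
    }

    // First attempt: many threads, each one blocking on get() until its upload finishes.
    void uploadInParallel(List<Document> documents) {
        ExecutorService pool = Executors.newFixedThreadPool(50); // workers mostly sit idle waiting on I/O
        for (Document doc : documents) {
            pool.submit(() -> {
                // get() blocks the worker thread until the asynchronous call completes.
                String sharePointId = client.createDocument(doc.path(), doc.content()).get();
                return sharePointId;
            });
        }
        pool.shutdown();
    }
}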

Using non-blocking operations

The second instalment used one thread only. When a chunk of N items was ready to be written, the single execution thread fired all the creations in a non-blocking way, so there were at most N network calls in flight. The main thread then waited for the completion of all the calls.
The write operation succeeds only if all N calls succeed.
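A sketch of that single-threaded, non-blocking chunk writer, reusing the placeholder DocumentClient from the previous sketch and Guava's Futures.allAsList to wait for the whole chunk:

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemWriter;

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;

public class NonBlockingChunkWriter implements ItemWriter<BlockingUploader.Document> {

    private final BlockingUploader.DocumentClient client;

    public NonBlockingChunkWriter(BlockingUploader.DocumentClient client) {
        this.client = client;
    }

    @Override
    public void write(List<? extends BlockingUploader.Document> chunk) throws Exception {
        // Fire all N creations without blocking: at most N network calls are in flight.
        List<ListenableFuture<String>> inFlight = new ArrayList<>();
        for (BlockingUploader.Document doc : chunk) {
            inFlight.add(client.createDocument(doc.path(), doc.content()));
        }
        // The single thread now waits for the whole chunk; if any call fails, get() throws
        // and Spring Batch marks the chunk as failed.
        List<String> sharePointIds = Futures.allAsList(inFlight).get();
        // The identifiers would then be saved back to the tracking table (omitted here).
    }
}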

Retry mechanisms

A piece of code that runs perfectly against an on-premises service may fail frequently in a cloud deployment. Various errors appear: connection failures, session expiry. To smooth over these errors, methods were invoked with the async-retry mechanism from Tomasz Nurkiewicz.
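A sketch of how such a wrapped call could look, based on the library's documented builder API; the backoff values and the document-creation task here are placeholders:

import java.net.SocketException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

import com.nurkiewicz.asyncretry.AsyncRetryExecutor;
import com.nurkiewicz.asyncretry.RetryExecutor;

public class RetryingCreate {

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Retry transient cloud errors with exponential backoff and a bounded number of attempts.
        RetryExecutor retryExecutor = new AsyncRetryExecutor(scheduler)
                .retryOn(SocketException.class)
                .withExponentialBackoff(500, 2)   // start at 500 ms, double after each failed attempt
                .withMaxDelay(10_000)             // never wait more than 10 s between attempts
                .withUniformJitter()
                .withMaxRetries(5);

        // Placeholder task: in the migration this wrapped the SharePoint document creation call.
        String documentId = retryExecutor.getWithRetry(() -> createDocumentSomehow()).join();
        System.out.println("created document " + documentId);

        scheduler.shutdown();
    }

    private static String createDocumentSomehow() throws SocketException {
        // Stand-in for the real upload; in a cloud environment this call fails from time to time.
        return "doc-identifier";
    }
}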
An individual retry strategy improves stability greatly. Another improvement was to introduce a retry at the chunk level for the items that had failed.

Conclusion

While running the migration we noticed some variability in the throughput: from 40 thousand documents moved per hour to over 100 thousand. This makes one think that the chunk size, the equivalent of the pipe diameter in fluid mechanics, should be adjustable rather than fixed:
  • if the failure rate increases, it is a sign that the throughput may be too high, and the chunk size should be reduced;
  • if there are no errors, one could attempt to increase the chunk size.
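As a purely hypothetical sketch (this was not implemented during the migration, and the thresholds and bounds are made up), such a heuristic could look like this:

public class AdaptiveChunkSize {

    private static final int MIN = 10;
    private static final int MAX = 1_000;

    private int chunkSize = 100;

    // Hypothetical heuristic: shrink the chunk when failures appear,
    // grow it slowly when a chunk goes through cleanly.
    int nextChunkSize(int itemsInChunk, int failedItems) {
        double failureRate = itemsInChunk == 0 ? 0 : (double) failedItems / itemsInChunk;
        if (failureRate > 0.05) {
            chunkSize = Math.max(MIN, chunkSize / 2);
        } else if (failedItems == 0) {
            chunkSize = Math.min(MAX, chunkSize + chunkSize / 10);
        }
        return chunkSize;
    }
}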
This experience made me really understand part of the reactive programming style (see the excellent intro article), where the subscriber influences the throughput.

