Processing tens of TB of data daily

20 minutes

Any

English 

Each of the tens of thousands of customer data marts in our fast-growing GoodData SaaS platform (www.gooddata.com) has a unique data model. The service that refreshes this data daily transforms tens of terabytes of denormalized CSV files into clean output suitable for loading into database tables on several hundred nodes in three private cloud clusters. The majority of the nearly one million tasks processed daily finish in less than a second. Some take hours to finish and use a hash structure that allocates more than a hundred GB of memory.

Our product was distilled from a Perl code base more than ten years ago. This talk covers the long evolution of the code base and the architecture of the data mart upload service: from a basic Perl implementation using a Perl hash and Storable serialization to a Perl code generator that uses optimized C functions, Judy arrays for the hash structure, and custom serialization with incremental caches on local nodes, and that runs on Erlang middleware.