Processing tens of TB of data daily

By:

Michal Jurosz (mj41)

From:

Brno.pm

Date:

Thursday, 10 August 2017 12:15

Duration:

20 minutes

Target audience:

Any

Language:

English

Abstract:

Each of the tens of thousands of customer data marts in our fast-growing GoodData SaaS platform (www.gooddata.com) has a unique data model. The service that refreshes the data daily transforms tens of terabytes of denormalized CSV files into clean output suitable for loading into database tables on several hundred nodes in three private cloud clusters. The majority of the nearly one million tasks processed daily finish in less than a second. Some of them take hours to finish and use a hash structure that allocates more than a hundred GB of memory.

Our product was distilled from a Perl code base more than ten years ago. This talk will cover the long evolution of the code base and the architecture of the data mart upload service: from the basic Perl implementation with a Perl hash and Storable serialization to a Perl code generator that uses optimized C functions, Judy arrays for the hash structure, and custom serialization with incremental caches on local nodes, and that runs on Erlang middleware.

Tags:

c cloud development jit memory optimization performance perl speed xs

Attended by:
Michal Jurosz (mj41)
Thomas Klausner (domm)
Diego Kuperman (diegok)
Martin Barth (ufobat)
Jose manuel De arce
Patrick Ringl (pari)
Jarkko Haapalainen (tojo)
Nohfu8Ie eeki2Eej (uch5Isi7)
Iaroslav Poliakov
H.Merijn Brand (Tux)
DrForr
Matthew Chubb (mchubb)
Felix Antonius Wilhelm Ostmann (Sadrak)
Daniel Egeberg
Lidia Corde
Dirk De Nijs (ddn123456)
Wim Boogaerts
Jan Seidl (JaSei)
Tomáš Ciml
Alessandra Traini (leluccia)
Miroslav Tynovsky
Errietta Kostala
John Lightsey (J.D.)
Tom Hukins
adela popa
Renee Bäcker (reneeb)
Rish
Lukáš Rampa
Oleksii Kysil
Tom Koelman
Xavier Arroyo
David H. Adler (dha)
Lucie Mohelníková (Lysiii)
Mo Dulies
Szymon Nieznański
Dan Muey
Rikus Goodell
Ningna Wang
Joelle Maslak
Andreea Hosu (Andreea)
Dave Sherohman (dsheroh)
Dennis Schuster
Michal Josef Špaček (skim)
Thomas Reifenberger
