A Shared Memory Communication System for Big Data Simulation Workflows

Alexander Lemon
MS Thesis Proposal
Wednesday, December 6, 11:00 AM
3346 TMCB
Advisor: Quinn Snell


There is a pressing need for new data analysis methods that can sift through scientific simulation data and produce meaningful results. Current methods are restricted both in the types of analysis they support and in the amount of data they can handle, and new methods could give scientists a large productivity boost. Such methods could be simple to develop in big data processing systems such as Apache Spark, which is designed to process many input files in parallel while treating them logically as one large dataset. This design makes Spark well suited to processing simulation output, but simulation processes need to send data to Spark through a fast transport such as shared memory rather than through the filesystem. By using Spark's Apache Mesos interface to force colocation of Spark executors with simulation processes, and by enabling fast local inter-process communication through shared memory, we can quickly transport bulk data into the Java Virtual Machine, removing the current Spark ingestion bottleneck. Removing this bottleneck will enable authors to write Spark-based analysis workflows that avoid the current serial analysis bottleneck, leading to a significant speedup in scientific simulation pipelines.
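
The shared memory handoff between a simulation process and a colocated analysis process can be illustrated with a minimal sketch. It is not the proposed implementation: in the proposal the reader would be a Spark executor inside the Java Virtual Machine, and Python is used here only to keep the example short. The segment layout (an 8-byte value count followed by float64 values) and the names publish, consume, and demo are assumptions made for illustration.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical layout: an 8-byte little-endian count, then that many float64 values.
HEADER = struct.Struct("<Q")


def publish(values):
    """Simulation side: copy a block of float64 values into a new shared segment."""
    payload = struct.pack(f"<{len(values)}d", *values)
    shm = shared_memory.SharedMemory(create=True, size=HEADER.size + len(payload))
    HEADER.pack_into(shm.buf, 0, len(values))
    shm.buf[HEADER.size:HEADER.size + len(payload)] = payload
    return shm  # the caller keeps the handle so the segment stays alive


def consume(name):
    """Analysis side: attach to the segment by name and decode its contents."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # The stored count is used because the OS may round the segment size up.
        (count,) = HEADER.unpack_from(shm.buf, 0)
        return list(struct.unpack_from(f"<{count}d", shm.buf, HEADER.size))
    finally:
        shm.close()


def demo():
    """Round-trip a small block of values through shared memory."""
    segment = publish([1.5, 2.5, 4.0])
    try:
        return consume(segment.name)
    finally:
        segment.close()
        segment.unlink()  # release the segment once the reader is done
```

In a real deployment the two sides would run as separate processes on the same node, which is why colocation matters: only the segment name needs to be exchanged, and the bulk data never touches the filesystem.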