Lossy Compression for Scientific Data
Abstract
Large-scale numerical simulations and experiments are generating very large data sets that are difficult to analyze, store, and transfer. This problem will only be exacerbated for future generations of systems. Data reduction is becoming a necessity in order to limit the time lost in data transfer and storage. Lossless and lossy compression are attractive and efficient techniques that significantly reduce data set sizes while remaining largely agnostic to the application. This tutorial will review the state of the art in lossless and lossy compression of scientific data sets, discuss two lossy compressors (SZ and ZFP) in detail, and introduce metrics for assessing compression error. The tutorial will also cover the characterization of data sets with respect to compression and introduce Z-checker, a tool to assess compression error.
More specifically, the tutorial will introduce motivating examples and basic compression techniques, cover the role of Shannon entropy, present the main types of advanced data transformation, prediction, and quantization techniques, and survey the most popular coding techniques. The tutorial will use real-world compressors (GZIP, JPEG, FPZIP, SZ, ZFP, etc.) and data sets from simulations and instruments to illustrate the different compression techniques and their performance. This half-day tutorial has been improved based on the evaluations of the two highly attended and highly rated tutorials given on this topic at ISC17 and SC17.
Topic area: Data, Storage & Visualization
Keywords: Big Data Analytics, Data Reduction, Extreme Scale Simulations
Agenda
- Introduction to the tutorial
- Why compression, why lossy compression
  - Introduction to application use cases
  - Introduction to compression techniques: decorrelation, coding
  - The role of Shannon entropy (see the entropy sketch below)
  - Limits of lossless compression
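As a preview of the Shannon entropy discussion, the sketch below estimates the byte-level entropy of an array; byte_entropy is an illustrative helper (assuming NumPy), not part of any compressor covered in the tutorial. Because a lossless coder cannot beat the entropy of the stream it sees, and the low-order mantissa bytes of typical floating-point data look nearly random, byte-level entropy already hints at why lossless compression of scientific data rarely exceeds a factor of about two.

    # Illustrative sketch: Shannon entropy of an array's raw bytes (assumes NumPy).
    import numpy as np

    def byte_entropy(data: np.ndarray) -> float:
        """H = -sum(p * log2 p) over the histogram of the array's bytes, in bits/byte."""
        raw = np.frombuffer(data.tobytes(), dtype=np.uint8)
        counts = np.bincount(raw, minlength=256)
        p = counts[counts > 0] / raw.size
        return float(-(p * np.log2(p)).sum())

    smooth = np.sin(np.linspace(0.0, 8.0 * np.pi, 1_000_000))  # smooth synthetic field
    noise = np.random.rand(1_000_000)                          # noise-like data
    # An ideal byte-level lossless coder needs at least H/8 of the original size.
    print(byte_entropy(smooth), byte_entropy(noise))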
- Understanding the compressibility of data sets
  - What to look for: smoothness, autocorrelation, single snapshot vs. time series, etc. (see the autocorrelation sketch below)
  - Example data sets:
    - Particle data sets from molecular dynamics and cosmology simulations
    - Instrument data sets from different light sources
    - Multidimensional regularly structured data sets from climate simulations
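To make the autocorrelation indicator above concrete, here is a minimal sketch (assuming NumPy; the smooth and noisy arrays are synthetic stand-ins, not the tutorial data sets). A lag-1 autocorrelation near 1 indicates a smooth, predictable field that compresses well; a value near 0 indicates noise-like data that does not.

    # Illustrative sketch: lag-k autocorrelation as a rough compressibility indicator.
    import numpy as np

    def autocorrelation(x: np.ndarray, lag: int = 1) -> float:
        """Pearson correlation between x[t] and x[t+lag] on the flattened field."""
        x = x.ravel().astype(np.float64)
        a = x[:-lag] - x[:-lag].mean()
        b = x[lag:] - x[lag:].mean()
        return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

    smooth = np.cos(np.linspace(0.0, 4.0 * np.pi, 100_000))  # climate-like smooth field
    noisy = np.random.randn(100_000)                         # particle-like noisy data
    print(autocorrelation(smooth), autocorrelation(noisy))   # close to 1.0 vs. close to 0.0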
- Techniques for compression
  - Decorrelation/conditioning
  - Sorting
  - Floating-point-specific lossless techniques (normalization, leading-zero encoding, etc.)
  - Prediction and fitting (see the prediction sketch below)
  - Transforms (DCT, wavelet, etc.)
  - Decomposition (SVD, tensor)
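The following sketch illustrates the prediction idea with the simplest possible predictor, previous-value prediction in 1D (assuming NumPy). It is not the predictor used by SZ or FPZIP, which apply higher-order, multidimensional predictors, but it shows why prediction decorrelates data: the residuals are tightly concentrated around zero and therefore much cheaper to code than the original values.

    # Illustrative sketch: previous-value prediction as a decorrelation step (assumes NumPy).
    import numpy as np

    def predict_residuals(x: np.ndarray) -> np.ndarray:
        """r[0] = x[0]; r[i] = x[i] - x[i-1] for i > 0."""
        r = np.empty_like(x)
        r[0] = x[0]
        r[1:] = x[1:] - x[:-1]
        return r

    def reconstruct(r: np.ndarray) -> np.ndarray:
        """Invert the predictor: running sum of the residuals."""
        return np.cumsum(r)

    field = np.sin(np.linspace(0.0, 2.0 * np.pi, 1_000))
    residuals = predict_residuals(field)
    assert np.allclose(reconstruct(residuals), field)
    print(field.std(), residuals.std())  # the residuals have a much smaller spread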
- Techniques specific to lossy compressors
  - Quantization (scalar, vector, input, output, residual) (see the quantization sketch below)
  - Filtering/thresholding (frequency, coefficient reduction)
  - Bit-stream truncation
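Here is a minimal sketch of error-bounded scalar quantization (assuming NumPy). It is in the spirit of what SZ applies to prediction residuals, but it is only an illustration of the principle, not the SZ implementation: each value is mapped to an integer bin of width 2*eps, which bounds the pointwise reconstruction error by eps and yields small integers that the later coding stage handles cheaply.

    # Illustrative sketch: uniform scalar quantization with absolute error bound eps.
    import numpy as np

    def quantize(x: np.ndarray, eps: float) -> np.ndarray:
        """Map each value to an integer bin of width 2*eps."""
        return np.round(x / (2.0 * eps)).astype(np.int64)

    def dequantize(q: np.ndarray, eps: float) -> np.ndarray:
        """Reconstruct the bin centers; the pointwise error is at most eps."""
        return q.astype(np.float64) * (2.0 * eps)

    data = np.random.rand(10_000)
    eps = 1e-3
    bins = quantize(data, eps)
    reconstructed = dequantize(bins, eps)
    print(np.max(np.abs(reconstructed - data)), "<= eps =", eps)
    print(bins.min(), bins.max())  # a narrow integer range codes very compactly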
- Coding
  - Run-length and variable-length coding (entropy coding) (see the run-length sketch below)
  - Decorrelation/conditioning
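The run-length coder sketched below (plain Python; the function names are illustrative) shows how coding exploits the output of quantization: long runs of identical bins, typically the zero bin of quantized residuals, collapse into a few (symbol, count) pairs before or instead of full entropy coding.

    # Illustrative sketch: run-length encoding of a stream of quantization bins.
    from typing import Iterable, List, Tuple

    def rle_encode(symbols: Iterable[int]) -> List[Tuple[int, int]]:
        """Collapse consecutive repeats into (symbol, run_length) pairs."""
        runs: List[Tuple[int, int]] = []
        for s in symbols:
            if runs and runs[-1][0] == s:
                runs[-1] = (s, runs[-1][1] + 1)
            else:
                runs.append((s, 1))
        return runs

    def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
        """Expand (symbol, run_length) pairs back into the original stream."""
        return [s for s, n in runs for _ in range(n)]

    bins = [0, 0, 0, 0, 1, 1, 0, 0, 0, -1, 0, 0]
    runs = rle_encode(bins)
    assert rle_decode(runs) == bins
    print(runs)  # [(0, 4), (1, 2), (0, 3), (-1, 1), (0, 2)]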
- Metrics and error controls
  - What matters for applications / data analytics
  - Error controls (see the error-metrics sketch below)
  - Z-checker framework
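The sketch below computes three of the error metrics discussed in this part of the tutorial and reported by Z-checker: maximum absolute error, RMSE, and PSNR. It assumes NumPy, and the function name is illustrative; it is not Z-checker's API.

    # Illustrative sketch: common compression-error metrics (not Z-checker's API).
    import numpy as np

    def error_metrics(original: np.ndarray, decompressed: np.ndarray) -> dict:
        diff = decompressed.astype(np.float64) - original.astype(np.float64)
        max_abs_err = float(np.max(np.abs(diff)))
        rmse = float(np.sqrt(np.mean(diff ** 2)))
        value_range = float(original.max() - original.min())
        # PSNR as commonly used for scientific data: 20 * log10(value range / RMSE).
        psnr = 20.0 * np.log10(value_range / rmse) if rmse > 0.0 else float("inf")
        return {"max_abs_err": max_abs_err, "rmse": rmse, "psnr_db": float(psnr)}

    original = np.linspace(0.0, 1.0, 100_000)
    decompressed = original + np.random.uniform(-1e-4, 1e-4, original.shape)
    print(error_metrics(original, decompressed))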
- Details of compressors and compression results
  - FPZIP
  - ZFP (see the usage sketch below)
  - SZ
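As a taste of how such a compressor is driven in practice, the sketch below uses zfp's Python bindings (zfpy) in fixed-accuracy mode. The compress_numpy/decompress_numpy calls and the tolerance keyword exist in recent zfpy releases, but this is a sketch under that assumption; check the documentation of the installed version, as the bindings may differ.

    # Illustrative sketch: fixed-accuracy compression with zfp's Python bindings (zfpy).
    # Requires the zfpy package (pip install zfpy); API names may vary across releases.
    import numpy as np
    import zfpy

    data = np.random.rand(64, 64, 64)                       # stand-in for a 3D field
    compressed = zfpy.compress_numpy(data, tolerance=1e-3)  # absolute error tolerance
    decompressed = zfpy.decompress_numpy(compressed)

    print("compression ratio:", data.nbytes / len(compressed))
    print("max pointwise error:", np.max(np.abs(decompressed - data)))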
Target audience
We expect a wide diversity of attendees, since the tutorial is relevant to users (who run simulations or transfer and store very large data sets), researchers (application developers, compressor developers, algorithm developers), students (who need compression or are developing new compressors), and practitioners (facility administrators who want to save machine time and storage space).
Material samples
All materials are available here, including the slide decks and the data sets that will be used to demonstrate the compressors and Z-checker. The link is permanent, so participants will be able to access the material after the tutorial.
Presenters
Franck Cappello is an expert in High Performance Computing with more than 20 years of experience. His research interests are fault tolerance and lossy compression. He initiated the SZ lossy compressor software at ANL; SZ is one of the most effective existing lossy compressors for scientific datasets. Franck is involved in several projects of the Exascale Computing Project: he leads the EZ software technology project on lossy compression, leads the data reduction topic of the CODAR co-design project, and co-leads the data analytics topic of the "Computing the Sky at Extreme Scale" application project, where he focuses particularly on lossy compression of N-body datasets. He contributed to several exascale roadmapping efforts (IESP and EESI). Franck is a Fellow of the IEEE.
Peter Lindstrom is a computer scientist and project leader in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His primary research areas include data compression, scientific visualization, computer graphics, and geometric modeling. Peter has developed numerous lossy and lossless open-source compressors for floating-point and mesh-based scientific data, including zfp, fpzip, and hzip, and has published two dozen papers on data compression over the past 15 years in top venues such as ACM SIGGRAPH, IEEE VIS, ACM/IEEE SC, and the IEEE Data Compression Conference. He leads the zfp data compression effort as part of the US DOE's Exascale Computing Project. His zfp software has been adopted by HPC I/O libraries such as HDF5, ADIOS, and Silo, and is being integrated into popular visualization tools such as VisIt and VTK-m. Peter previously served as Editor-in-Chief of Graphical Models and is currently a Senior Member of the IEEE.