The SDF Manifesto: Simple Data Format - GHF July 10, 2006

One of the constantly recurring challenges we face as numerical modelers is how to get the raw binary output from numerical simulations, especially large, multi-dimensional arrays, into a form where the data is easy to manipulate and analyze on a wide variety of commonly used platforms. There is currently no standardized way of reading and writing output data from numerical simulations that preserves full precision, is simple and easy to use, supports large (> 2GB) files, works in C, Fortran77, Fortran95, and IDL/GDL, compiles out of the box on the most common operating systems (Linux, Windows, OS X, Solaris), and is not affected by platform endianness. HDF5 may address most of these needs, but in my experience it is hard to compile, somewhat cumbersome to use, and works, for example, only in the very latest versions of IDL.

Simple Data Format (SDF) addresses all of these issues. It is written from the perspective of a physical scientist and is motivated by their needs, rather than from the perspective of a professional computer scientist or programmer.

The following are the goals and design features for the SDF library of I/O routines:

(0) SDF is simple. Within a simulation code, one can read or write datasets ranging from a single variable to large, multi-dimensional arrays by adding single calls to sdf_read or sdf_write. Because it does not include a huge number of options and features, the code is easy to use and understand. In principle, no other SDF functions are needed, provided the user knows all of the relevant properties of the data to be read or written. The simplicity means, however, that SDF may lack the generality or flexibility of other data formats.

Synopsis:

(1) SDF captures the output of the data with full binary precision. Compression of the data in the I/O files is not a priority.

(2) Binary data is stored in big-endian byte order.
SDF converts between little-endian and big-endian format on little-endian machines in a completely user-transparent fashion.

(3) An SDF file is divided into a header portion, which describes the data, and the data portion itself, which generally makes up the majority of the file.

(4) The data portion is divided into a series of "datasets", numbered from 0 to norder-1, where norder is the total number of datasets in the file. The header portion contains information about the data ("metadata") in a series of short character strings, with one string per dataset. The amount of metadata is kept to an absolute minimum.

(5) For each dataset, the corresponding string in the header is a series of tokens separated by spaces. The tokens, in order, are:

    0. The order in which this data occurs in the file.
    1. A short character string label (no blanks) identifying the data; most typically, this would be the name of a variable.
    2. The data type: 'i', 'f', 'c', 'b' (integer, float, complex, or byte) [more types to be considered later].
    3. The number of bytes per word: typically '4' or '8'. For complex variables, this means the number of bytes for the real or imaginary parts considered separately. For byte data, this is 1.
    4. The number of dimensions (ndim) of the array (if an array). A single variable would be '1', as would a 1-d array.
    5. The "ndim" values of the array sizes (as string tokens).

So, for example, if the file were to begin with the data "vx", a 101 by 102 array of double precision floats, its string descriptor might look like: "0 vx f 8 2 101 102". If this were followed by a single 4-byte integer "nx", it would have a string descriptor of "1 nx i 4 1 1". If this were followed in turn by the 6D single precision float array "qu" dimensioned 10x11x12x20x30x40, its string descriptor would look like "2 qu f 4 6 10 11 12 20 30 40".
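
To make the token layout in item (5) concrete, here is a minimal C sketch that splits one descriptor string into its fields and works out how many bytes of data it describes. This is an illustration only, not code from the SDF library: the names desc_t, parse_descriptor, data_bytes, and the limit MAX_DIMS are all hypothetical, and the sketch assumes the tokens are separated by single spaces exactly as in the examples above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_DIMS 16   /* hypothetical limit chosen for this sketch */

/* Hypothetical container for one header string; not an SDF library type. */
typedef struct {
    int64_t order;          /* token 0: position of the dataset in the file */
    char    label[64];      /* token 1: label, no blanks */
    char    type;           /* token 2: 'i', 'f', 'c', or 'b' */
    int     nbpw;           /* token 3: bytes per word */
    int     ndim;           /* token 4: number of dimensions */
    int64_t dims[MAX_DIMS]; /* token 5: the ndim array sizes */
} desc_t;

/* Parse one descriptor string. Returns 0 on success, -1 if it is malformed. */
int parse_descriptor(const char *line, desc_t *d)
{
    int used = 0;

    assert(line != NULL && d != NULL);

    /* The first five tokens have fixed meaning; %n records how far we read. */
    if (sscanf(line, "%" SCNd64 " %63s %c %d %d%n",
               &d->order, d->label, &d->type, &d->nbpw, &d->ndim, &used) != 5)
        return -1;
    if (d->ndim < 1 || d->ndim > MAX_DIMS)
        return -1;

    /* The remaining ndim tokens are the array sizes. */
    line += used;
    for (int i = 0; i < d->ndim; i++) {
        if (sscanf(line, "%" SCNd64 "%n", &d->dims[i], &used) != 1)
            return -1;
        line += used;
    }
    return 0;
}

/* Number of bytes of data described by d, computed in 64-bit arithmetic.
   Complex data stores a real and an imaginary part, each nbpw bytes. */
int64_t data_bytes(const desc_t *d)
{
    int64_t n = (d->type == 'c') ? 2 : 1;

    for (int i = 0; i < d->ndim; i++)
        n *= d->dims[i];
    return n * d->nbpw;
}
```

For the descriptor "0 vx f 8 2 101 102", data_bytes gives 101 * 102 * 8 = 82416 bytes, and for "1 nx i 4 1 1" it gives 4 bytes.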

(6) 64-bit integer arithmetic is used to describe and compute array indices and file positions, so that large (> 2GB) files can be accommodated on all platforms and operating systems.

(7) It is not necessary to read all the data in a file just to extract a small part of it. For example, consider output from an MHD simulation containing 8 physical variables, of which the user only needs 1: you do not have to read the other 7 variables if you do not want to, as you would for files written with unformatted Fortran I/O.

(8) The source code of SDF is written in C and compiles on Linux, Windows, Solaris, and OS X with a variety of different compilers, although the default compiler for SDF is gcc. There are Fortran-callable versions of the SDF functions, as well as versions callable from C. SDF was designed specifically to be easy to incorporate into legacy Fortran77 codes, but it also works with Fortran 90/95. A fully functional IDL version of SDF has also been written.

(9) SDF uses a simple and automatic transformation between column-major and row-major array indexing, so that headaches involved with the interaction of C code and Fortran code are minimized. This is accomplished by adopting the standard that array dimensions written in the dataset descriptor strings are always stored in row-major order, so that one always knows whether row/column-major conversion is necessary. In particular, arrays which are written with SDF in Fortran or IDL and read back into either Fortran or IDL with SDF will automatically be dimensioned correctly. A C program reading the same dataset in SDF will have the array index order flipped into row-major order, as one would expect.

(10) SDF includes functions (sdf_transpose and sdf_transpose_f77) for doing an in-place multi-dimensional transpose and/or index reversal, so that the index order of the data to be written out can be changed if that is more convenient for the user.
In particular, this allows one to transform index order in a C program so that the index order is the same as in IDL or Fortran, or vice versa.

(11) Defined user-callable SDF functions: sdf_read, sdf_write, sdf_query, sdf_sizes, sdf_details, sdf_rm, sdf_delete, sdf_insert, sdf_replace, sdf_labmatch, and sdf_transpose. Fortran-callable versions have a suffix of _f77. For the C user interface, the additional functions sdf_mk_2d through sdf_mk_5d create dynamically allocated, multi-dimensional arrays which can be written with sdf_write, and the functions sdf_1d_to_2d through sdf_1d_to_5d convert arrays read with sdf_read to dynamically allocated, multi-dimensional arrays accessible with standard [i][j]... index notation. IDL versions of the SDF I/O procedures exist, as well as the IDL-specific procedures sdf_read_all and sdf_write_all, which allow one to read all variables from an SDF file into an IDL session, or to write out all variables in an IDL session to an SDF file. The IDL version also includes the functions sdf_read_var, which reads into an IDL session a variable whose label matches a user-defined label, and sdf_read_arr, which reads in *all* occurrences of datasets whose label matches the user-defined string. The latter capability is useful in constructing time-series arrays from time-dependent simulations stored in single SDF files.
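
The index-order convention in items (9) and (10) rests on a simple fact: a column-major (Fortran or IDL) array and a row-major (C) array occupy memory identically if the order of the dimensions and indices is reversed. The C sketch below illustrates that equivalence with zero-based indices. It is not taken from the SDF source and says nothing about how sdf_transpose is implemented; the function names offset_col_major, offset_row_major, and reverse_i64 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Linear offset of element idx[] in a column-major (Fortran/IDL) array:
   the first index varies fastest. Indices are zero-based. */
int64_t offset_col_major(int ndim, const int64_t *dims, const int64_t *idx)
{
    int64_t off = 0;

    assert(ndim > 0);
    for (int k = ndim - 1; k >= 0; k--)
        off = off * dims[k] + idx[k];
    return off;
}

/* Linear offset of element idx[] in a row-major (C) array:
   the last index varies fastest. Indices are zero-based. */
int64_t offset_row_major(int ndim, const int64_t *dims, const int64_t *idx)
{
    int64_t off = 0;

    assert(ndim > 0);
    for (int k = 0; k < ndim; k++)
        off = off * dims[k] + idx[k];
    return off;
}

/* Reverse a list of dimensions or indices in place. */
void reverse_i64(int ndim, int64_t *v)
{
    for (int a = 0, b = ndim - 1; a < b; a++, b--) {
        int64_t tmp = v[a];
        v[a] = v[b];
        v[b] = tmp;
    }
}
```

For example, element (3, 5) of a Fortran-ordered array dimensioned 101 by 102 sits at offset 3 + 101 * 5 = 508, which is the same offset as element [5][3] of a C array dimensioned [102][101]. No data needs to move; only the order in which the dimensions are listed changes.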