My latest dataset uses a bunch of NPY files with JSON metadata to organize them (see here). It’s quite clunky, but fast to read and has many features (like efficiently taking a subset of the traces, or of the data fields). To transfer it as a single file, it is assembled into a tar archive and then compressed with zstd.
In general, general-purpose file formats (Avro, JSON, CSV, SQLite…) are too slow. Columnar file formats (like Parquet) also tend to be somewhat slow in my experiments, since they have to handle more complex data than simply “a big array of numbers”. I didn’t test how fast TRS is, but its feature set is too limited for my needs.
That leaves mostly HDF5, NPY and zarr.
Conceptually, HDF5 is good, but we experienced some performance issues with it (its performance is not very predictable; I think this was most apparent in multi-threaded workloads), and it is so complex that it is hard to debug.
NPY is the simplest (so simple that implementing a reader for it is very easy), but it lacks features: you just store one matrix, and it is very fast to read (past the header, you just copy the bytes into memory). NPZ (also supported by numpy) is simply a ZIP of NPY files, which adds a lot of flexibility. However, numpy’s reader/writer goes through the unoptimized Python ZipFile module, making it slower for no good reason. NPZ is very interesting as a file format (e.g. you get a lot of flexibility in the choice of compression algorithm), but its only current implementation, in numpy, is slow.
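To illustrate how simple NPY is, here is a minimal reader sketch (it only handles version 1.0 headers and skips the corner cases that numpy’s own np.lib.format deals with):

```python
import ast
import numpy as np

def read_npy(path):
    """Minimal NPY v1.0 reader: parse the small text header, then treat
    the rest of the file as the raw array bytes."""
    with open(path, "rb") as f:
        assert f.read(6) == b"\x93NUMPY", "not an NPY file"
        major, minor = f.read(2)
        assert (major, minor) == (1, 0), "only NPY v1.0 handled in this sketch"
        header_len = int.from_bytes(f.read(2), "little")
        header = ast.literal_eval(f.read(header_len).decode("latin1"))
        data = f.read()  # everything after the header is the array data
    arr = np.frombuffer(data, dtype=np.dtype(header["descr"]))
    return arr.reshape(header["shape"],
                       order="F" if header["fortran_order"] else "C")
```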
Zarr is a data representation rather than a file format. Its filesystem storage uses many files, so they need to be packed for transfer, like my custom NPY+JSON format. Zarr provides mostly what we need (efficient storage of multidimensional arrays) with a lot of flexibility. However, it is not clear to me whether we would use many of its features. It also seems to be evolving fast, which might make it less than ideal for archiving datasets.
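For reference, the kind of usage I have in mind looks roughly like this (zarr-python 2.x style API; v3 changes it, and the names/shapes here are made up):

```python
import zarr

# Hypothetical trace matrix: names, shapes and metadata are made up.
z = zarr.open(
    "traces.zarr", mode="w",
    shape=(100_000, 5_000),   # n_traces x n_samples
    chunks=(1_000, 5_000),    # one chunk = 1000 full traces
    dtype="i2",
)
z.attrs["metadata"] = {"device": "example-board"}  # small JSON-like metadata
z[:1_000] = 0  # each written chunk ends up as a separate file under traces.zarr/
```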
So overall, none of the existing solutions satisfies my requirements, which are roughly:
- Fast to read (sequential read of the whole dataset)
- A file format (to be able to transfer/archive the dataset as a big file - many files are annoying).
- No need for on-disk decompression/unpacking: that is cumbersome as it doubles the required free space, and it also has all the problems of using many files (more ways to corrupt the files, higher management complexity than a single file).
- Some compression to reduce transfer and storage size (if chosen well, this will also make reading the dataset faster, since disk bandwidth is often the limiting factor).
- Ability to store some small arbitrary metadata (JSON-like data structure).
Nice to have: chunking the dataset into multiple files (very large file downloads are often annoying), which should ideally be usable individually (if I don’t want to download a full dataset).
Not required: high performance when reading only some POIs in the traces, or only some of the traces. When I need to do this, I just create a new dataset which is a subset of the big one, since this gives better performance overall. I tend to re-run analysis scripts many times, so the creation time of the new dataset is amortized, while data formats that optimize for this use case are still not as fast as sequentially reading a full matrix.
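(Concretely, that subsetting step is just something like this; file names and indices are made up:)

```python
import numpy as np

# Hypothetical file names; mmap_mode avoids loading the full matrix into RAM.
big = np.load("traces.npy", mmap_mode="r")         # shape (n_traces, n_samples)
pois = [120, 512, 513, 2048]                       # made-up points of interest
subset = np.ascontiguousarray(big[:10_000, pois])  # copy out the slice
np.save("traces_subset.npy", subset)
```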
Any suggestions welcome! (If I had to build something, I would look into some kind of ZIP of zarr or NPY files - it might not be much work to get something that works and is fast.)
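To make that last idea a bit more concrete, here is a minimal sketch of the “ZIP of NPY” variant (function names and layout are just an illustration, not a finished design): each array becomes an NPY member and the metadata a JSON member of a single ZIP, stored uncompressed here so that reads stay plain copies (per-member compression could be added).

```python
import io
import json
import zipfile
import numpy as np

def write_dataset(path, arrays, metadata):
    """Pack named arrays plus a JSON metadata blob into one ZIP file
    (uncompressed members; compression could be chosen per member)."""
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("metadata.json", json.dumps(metadata))
        for name, arr in arrays.items():
            buf = io.BytesIO()
            np.save(buf, arr)
            zf.writestr(f"{name}.npy", buf.getvalue())

def read_dataset(path):
    """Read everything back; each NPY member is parsed by numpy itself."""
    with zipfile.ZipFile(path, "r") as zf:
        metadata = json.loads(zf.read("metadata.json"))
        arrays = {
            name[:-4]: np.load(io.BytesIO(zf.read(name)))
            for name in zf.namelist() if name.endswith(".npy")
        }
    return arrays, metadata

# Hypothetical usage:
# write_dataset("dataset.zip", {"traces": traces, "labels": labels},
#               {"device": "example-board"})
# arrays, meta = read_dataset("dataset.zip")
```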