Size of float in disk_log

Richard Cameron camster@REDACTED
Mon Jan 9 11:27:31 CET 2006


I'm using the (extremely useful) disk_log module to dump a time
series of numerical data to disk. It's running in internal mode and
behaving wonderfully, with just one minor gripe:

It seems to use a *hell of a lot* of disk space. The main culprit
seems to be the way floating-point numbers are serialized (I have
lots and lots of them, though I acknowledge that most telephony
applications don't). Running Unix "strings" on the log file yields
lots of stuff like this:

c1.79827350000000003553e+00

which, at 27 bytes, I'm guessing is a factor of 27/8 (about 3.4)
larger than the raw 64-bit double (dumped out under some appropriate
endianness convention).

Also, atoms seem to be getting dumped out textually every time. So a  
record with a #very_long_record_name_indeed{} is going to require  
lots of space on the disk every time it's written. If disk_logs are  
always read sequentially, is there any reason why these atoms  
couldn't be "interned" and represented by integers on subsequent  
appearances?
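
A minimal sketch of the interning idea, again in Python and with an
entirely hypothetical encoding (the "D" and "R" tags and the 2-byte
fields are invented for illustration and are not anything disk_log
does): the first appearance of an atom is written in full, and later
appearances are written as an index into a table that the reader
rebuilds as it goes. This only works because the log is read
sequentially from the start.

```python
def encode_atoms(names):
    """Write each atom name in full once, then refer to it by index."""
    table = {}
    out = bytearray()
    for name in names:
        if name in table:
            # Seen before: 'R' tag plus a 2-byte table index.
            out += b"R" + table[name].to_bytes(2, "big")
        else:
            # First appearance: 'D' tag, 2-byte length, then the text.
            raw = name.encode("latin-1")
            table[name] = len(table)
            out += b"D" + len(raw).to_bytes(2, "big") + raw
    return bytes(out)


def decode_atoms(data):
    """Rebuild the atom table while reading sequentially."""
    table = []
    names = []
    pos = 0
    while pos < len(data):
        tag = data[pos:pos + 1]
        field = int.from_bytes(data[pos + 1:pos + 3], "big")
        pos += 3
        if tag == b"D":
            name = data[pos:pos + field].decode("latin-1")
            pos += field
            table.append(name)
        else:
            name = table[field]
        names.append(name)
    return names


names = ["very_long_record_name_indeed"] * 1000
encoded = encode_atoms(names)
plain_size = sum(3 + len(n) for n in names)  # every name written in full

assert decode_atoms(encoded) == names
print(plain_size)    # 31000
print(len(encoded))  # 3028
```

The cost of this design is that a reader can no longer start in the
middle of the log, since a reference is meaningless without the
earlier definition.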

Running bzip2 -9 on my log file yields a 97% reduction. I know I
could use disk_log's external mode to roll my own format, but I
don't think my requirements are terribly atypical. So, I suppose my
questions are:

1) Has anyone already thought about tweaking disk_log to make it less  
disk-hungry, or

2) am I missing some vital point which means that there really isn't  
such a big problem here?

I suppose one extreme alternative is to write a driver for the HDF5  
library <http://hdf.ncsa.uiuc.edu/HDF5/>, although that's probably  
going to be of little use to anyone except those dealing with lots of  
numerical data.

Richard.


