Size of float in disk_log
Richard Cameron
camster@REDACTED
Mon Jan 9 11:27:31 CET 2006
I'm using the (extremely useful) disk_log module to dump out a
timeseries of numerical data to the disk. It's running in internal
mode and behaving wonderfully with just one minor gripe:
It seems to use a *hell of a lot* of disk space. The main culprit
seems to be the serialization of floating point numbers (of which I
have lots and lots, though I acknowledge that most telephony
applications don't). Doing a unix "strings" on the log file yields
lots of stuff like this:
c1.79827350000000003553e+00
which I'm guessing takes a factor of 27/8 (≈3.4) more space than
storing the raw 64-bit double (under some appropriate endianness
convention).
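For reference, a quick shell experiment makes the overhead measurable.
The "c" in the strings output is the FLOAT_EXT tag (99); the
{minor_version, 1} option to term_to_binary/2 is an assumption here,
since only releases that support it will emit the compact encoding:

```erlang
%% A float in the text encoding: version byte (131) + FLOAT_EXT
%% tag (99, the "c" seen above) + a 31-byte "%.20e" string.
TextSize = size(term_to_binary(1.798273500000000036)),
%% If the release supports term_to_binary/2 with {minor_version, 1}
%% (an assumption; check your OTP version), floats instead go out as
%% NEW_FLOAT_EXT: a tag (70) + a raw 8-byte IEEE 754 double.
BinSize = size(term_to_binary(1.798273500000000036,
                              [{minor_version, 1}])),
io:format("text: ~p bytes, binary: ~p bytes~n", [TextSize, BinSize]).
```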
Also, atoms seem to be getting dumped out textually every time. So a
record with a #very_long_record_name_indeed{} is going to require
lots of space on the disk every time it's written. If disk_logs are
always read sequentially, is there any reason why these atoms
couldn't be "interned" and represented by integers on subsequent
appearances?
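To put a number on the atom cost: ATOM_EXT carries a tag, a 2-byte
length, and the atom's full text, so the record tag is re-serialized
in every entry. A shell sketch (the tuple contents are made up):

```erlang
%% The difference between these two sizes is almost entirely the
%% extra characters in the longer record tag, paid on every write.
Short = size(term_to_binary({r, 1.0})),
Long  = size(term_to_binary({very_long_record_name_indeed, 1.0})),
io:format("short tag: ~p bytes, long tag: ~p bytes~n", [Short, Long]).
```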
Doing a bzip2 -9 on my log file yields a 97% reduction. I know I
could use disk_log's external format to roll my own encoding, but I
don't think my requirements are terribly atypical. So, I suppose my
questions are:
1) Has anyone already thought about tweaking disk_log to make it less
disk-hungry, or
2) am I missing some vital point which means that there really isn't
such a big problem here?
I suppose one extreme alternative is to write a driver for the HDF5
library <http://hdf.ncsa.uiuc.edu/HDF5/>, although that's probably
going to be of little use to anyone except those dealing with lots of
numerical data.
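A middle ground short of HDF5, sketched here under the assumption of
a log already opened in internal mode (the helper names are made up):
compress each term yourself with term_to_binary/2's [compressed]
option and log the resulting binary, since binaries are stored
verbatim inside the external format.

```erlang
%% Hypothetical helper: zlib-compress the term and log the binary.
log_compressed(Log, Term) ->
    disk_log:log(Log, term_to_binary(Term, [compressed])).

%% Reading back: each item returned by disk_log:chunk/2 is the
%% compressed binary, so undo the extra layer per item.
read_compressed(Bin) ->
    binary_to_term(Bin).
```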
Richard.
More information about the erlang-questions mailing list