[erlang-questions] extracting sub-terms from term_to_binary encoded terms without unpacking first

Joe Armstrong <>
Thu Nov 17 15:52:18 CET 2011


Here's a programming technique that might be useful, which I haven't seen
described before ...

I've been playing with unpacking binaries produced by term_to_binary(Term) in
other languages. Specifically, I do term_to_binary in Erlang, creating a binary,
and I send the binary to javascript. The javascript code does not by default
decode the entire binary, but accesses sub-terms through selector functions
(you only need element, hd and tl).

This technique seems much nicer than mucking around with JSON:
binary formats are way easier to manipulate than text formats that
need parsing.

Now of course you can do the same thing in Erlang. You do not have to
do binary_to_term(B) to extract a sub-term, but can traverse the internal
structure of the external format and pull out exactly what you want and
nothing else.

I often store large terms in files and databases using term_to_binary
and I extract data by first doing binary_to_term and
then pattern matching on the result.

For example if I create a binary with:

   > B = term_to_binary({foo,bar,[a,b]})

And I want to extract the 'b' sub term, I'd normally write

     {_, _, [_,X]} = binary_to_term(B)

But why bother to unpack? I could just as well write

     X = hd(tl(element(3,B)))

This is not the regular hd, tl and element but a hacked version that can
traverse the external format.
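
A minimal sketch of what such selectors might look like in Erlang. This is
not the actual code referred to above: the module name packed and the
functions open, elem, head, tail and value are made up for illustration
(renamed so they do not clash with the auto-imported BIFs). It assumes an
uncompressed term_to_binary result, handles only the common tags of the
external term format, and does not handle lists of small integers, which are
encoded as STRING_EXT.

```erlang
-module(packed).
-export([open/1, elem/2, head/1, tail/1, value/1]).

%% A cursor is a binary that starts at a tagged sub-term (trailing bytes are
%% ignored), or {tail, Len, Bin} for a position inside a LIST_EXT.

%% Strip the version byte (131) of the external term format.
open(<<131, Term/binary>>) -> Term.

%% N-th element (1-based) of a packed tuple, found by skipping N-1 terms.
elem(N, <<104, Arity, Rest/binary>>) when N >= 1, N =< Arity ->
    skip_n(N - 1, Rest);
elem(N, <<105, Arity:32, Rest/binary>>) when N >= 1, N =< Arity ->
    skip_n(N - 1, Rest).

%% First element of a packed non-empty list.
head(<<108, Len:32, Rest/binary>>) when Len > 0 -> Rest;
head({tail, Len, Rest}) when Len > 0 -> Rest.

%% Everything after the first element of a packed non-empty list.
tail(<<108, Len:32, Rest/binary>>) -> tail_of(Len, Rest);
tail({tail, Len, Rest}) -> tail_of(Len, Rest).

tail_of(1, Rest) -> skip(Rest);                 % now at the list's tail term
tail_of(Len, Rest) when Len > 1 -> {tail, Len - 1, skip(Rest)}.

%% Decode only the sub-term under the cursor.
value({tail, Len, Bin}) ->
    Size = byte_size(Bin) - byte_size(skip_n(Len + 1, Bin)),
    binary_to_term(<<131, 108, Len:32, Bin:Size/binary>>);
value(Bin) when is_binary(Bin) ->
    Size = byte_size(Bin) - byte_size(skip(Bin)),
    binary_to_term(<<131, Bin:Size/binary>>).

%% Skip one encoded term without building it; return what follows.
skip(<<97, _, R/binary>>) -> R;                         % SMALL_INTEGER_EXT
skip(<<98, _:32, R/binary>>) -> R;                      % INTEGER_EXT
skip(<<70, _:64, R/binary>>) -> R;                      % NEW_FLOAT_EXT
skip(<<106, R/binary>>) -> R;                           % NIL_EXT
skip(<<T, L:16, _:L/binary, R/binary>>)                 % ATOM_EXT,
  when T =:= 100; T =:= 118; T =:= 107 -> R;            % ATOM_UTF8_EXT, STRING_EXT
skip(<<T, L, _:L/binary, R/binary>>)                    % SMALL_ATOM_EXT,
  when T =:= 115; T =:= 119 -> R;                       % SMALL_ATOM_UTF8_EXT
skip(<<109, L:32, _:L/binary, R/binary>>) -> R;         % BINARY_EXT
skip(<<110, L, _Sign, _:L/binary, R/binary>>) -> R;     % SMALL_BIG_EXT
skip(<<111, L:32, _Sign, _:L/binary, R/binary>>) -> R;  % LARGE_BIG_EXT
skip(<<104, A, R/binary>>) -> skip_n(A, R);             % SMALL_TUPLE_EXT
skip(<<105, A:32, R/binary>>) -> skip_n(A, R);          % LARGE_TUPLE_EXT
skip(<<108, L:32, R/binary>>) -> skip_n(L + 1, R);      % LIST_EXT: elements + tail
skip(<<116, A:32, R/binary>>) -> skip_n(2 * A, R).      % MAP_EXT: key/value pairs

skip_n(0, Bin) -> Bin;
skip_n(N, Bin) when N > 0 -> skip_n(N - 1, skip(Bin)).
```

With this sketch, the expression above would be spelled
packed:value(packed:head(packed:tail(packed:elem(3, packed:open(B))))).
Only the final value call builds a term; the selectors just move a cursor
through the binary.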

If the term inside the external format is large and I only want to
extract a few parameters, then this method should be a lot faster than
actually building a large term just to throw it away after pattern matching.
This should be a GC and cache friendly way of doing things.

In a similar vein one could think of pattern matching being extended over
packed terms.

If this were so I could write:

     T = {foo,bar,[a,b]},
     B = term_to_binary(T),
     match(B).

match({_,_,[_,X]}) -> X.

Doing so would mean that once we have packed terms using term_to_binary we
could leave them
alone and extract data from them without having to completely unpack them.
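
Without a language extension, something similar can be approximated with an
ordinary function that walks a pattern and the packed term in lockstep. The
following is only a sketch under assumptions of its own: the module name
packed_match is made up, patterns are written as plain terms in the style of
ETS match patterns ('_' for don't-care, atoms such as '$1' for variables),
only small tuples and non-empty proper lists can be destructured, and
literals in patterns are not supported.

```erlang
-module(packed_match).
-export([match/2]).

%% match(Pattern, Packed) -> [{Var, Value}]; crashes if the shapes differ.
match(Pattern, <<131, Bin/binary>>) ->
    {Bindings, _Rest} = m(Pattern, Bin, []),
    lists:reverse(Bindings).

%% '_' skips the sub-term without decoding it.
m('_', Bin, Bs) ->
    {Bs, skip(Bin)};
%% '$...' atoms decode just this sub-term and bind it.
m(P, Bin, Bs) when is_atom(P) ->
    case atom_to_list(P) of
        [$$ | _] ->
            Rest = skip(Bin),
            Size = byte_size(Bin) - byte_size(Rest),
            {[{P, binary_to_term(<<131, Bin:Size/binary>>)} | Bs], Rest}
    end;
%% Tuple pattern against SMALL_TUPLE_EXT of the same arity.
m(P, <<104, Arity, Bin/binary>>, Bs) when tuple_size(P) =:= Arity ->
    m_seq(tuple_to_list(P), Bin, Bs);
%% List pattern against a proper LIST_EXT of the same length.
m(P, <<108, Len:32, Bin/binary>>, Bs) when length(P) =:= Len ->
    {Bs1, <<106, Rest/binary>>} = m_seq(P, Bin, Bs),
    {Bs1, Rest}.

m_seq([], Bin, Bs) -> {Bs, Bin};
m_seq([P | Ps], Bin, Bs) ->
    {Bs1, Rest} = m(P, Bin, Bs),
    m_seq(Ps, Rest, Bs1).

%% Skip one encoded term without building it; return what follows.
skip(<<97, _, R/binary>>) -> R;                         % SMALL_INTEGER_EXT
skip(<<98, _:32, R/binary>>) -> R;                      % INTEGER_EXT
skip(<<70, _:64, R/binary>>) -> R;                      % NEW_FLOAT_EXT
skip(<<106, R/binary>>) -> R;                           % NIL_EXT
skip(<<T, L:16, _:L/binary, R/binary>>)                 % ATOM_EXT,
  when T =:= 100; T =:= 118; T =:= 107 -> R;            % ATOM_UTF8_EXT, STRING_EXT
skip(<<T, L, _:L/binary, R/binary>>)                    % SMALL_ATOM_EXT,
  when T =:= 115; T =:= 119 -> R;                       % SMALL_ATOM_UTF8_EXT
skip(<<109, L:32, _:L/binary, R/binary>>) -> R;         % BINARY_EXT
skip(<<104, A, R/binary>>) -> skip_n(A, R);             % SMALL_TUPLE_EXT
skip(<<105, A:32, R/binary>>) -> skip_n(A, R);          % LARGE_TUPLE_EXT
skip(<<108, L:32, R/binary>>) -> skip_n(L + 1, R).      % LIST_EXT: elements + tail

skip_n(0, Bin) -> Bin;
skip_n(N, Bin) when N > 0 -> skip_n(N - 1, skip(Bin)).
```

For the example above, packed_match:match({'_','_',['_','$1']}, B) returns
[{'$1',b}]. The skipped sub-terms are still scanned to find where they end,
but no Erlang terms are built for them.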

This should be very cache friendly - Erlang terms can be scattered all over
the place in virtual memory, but in the external form the whole term is kept
together in memory.

This is actually pretty useful - I have a data structure representing a
book - somewhere near the beginning there is a title, and the entire book is
stored on disk as a term_to_binary encoded blob. Now I have a large number
of these representing ebooks. If I want to list all titles I certainly do
not want to completely unpack everything, I only want to extract the title
field and nothing else. ...
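
When the layout is known, the field can even be picked out with a single
binary pattern. The sketch below assumes, purely for illustration, that a
book is stored as an uncompressed {book, Title, Body} tuple with Title as a
binary; the real layout is not given here, and the module name book_title
is made up.

```erlang
-module(book_title).
-export([title/1]).

%% Assumed layout: term_to_binary({book, Title, Body}), Title a binary.
%% 131 = version byte, 104 = SMALL_TUPLE_EXT, 3 = arity.
title(<<131, 104, 3, Rest/binary>>) ->
    %% Step over the first element (the tag atom), then read the
    %% BINARY_EXT (109) that follows: 4-byte length, then the bytes.
    <<109, Len:32, Title:Len/binary, _/binary>> = skip_atom(Rest),
    Title.

%% Atoms appear with a 2-byte or a 1-byte length depending on the tag.
skip_atom(<<T, L:16, _:L/binary, R/binary>>)    % ATOM_EXT, ATOM_UTF8_EXT
  when T =:= 100; T =:= 118 -> R;
skip_atom(<<T, L, _:L/binary, R/binary>>)       % SMALL_ATOM_EXT, SMALL_ATOM_UTF8_EXT
  when T =:= 115; T =:= 119 -> R.
```

Nothing after the title is examined, so the cost does not depend on how big
the rest of the book is.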

Cheers

/Joe