[erlang-questions] extracting sub-terms from term_to_binary encoded terms without unpacking first

Thu Nov 17 16:40:53 CET 2011

On Thu, Nov 17, 2011 at 4:12 PM, Max Bourinov <bourinov@REDACTED> wrote:

> Hi Joe,
>
> I also have many term_to_binary calls and as a result many-many binary
> data chunks stored in DB.
>
> As I understand it, to use your technique I need pimped versions of the
> hd, tl and element functions. Can you please provide a little more detail
> about it?
>

The external format is described in

http://www.erlang.org/doc/apps/erts/erl_ext_dist.html

To test my understanding of this I wrote a simple program that
reconstructs a term from the external format (enclosed)

A JavaScript program that does almost the same thing as my program is here:

https://github.com/rtomayko/node-bertrpc/blob/master/src/bert.js

As you can see decoding a term is easy.

The tuple {a,b,c} gets encoded as

<<131,104,3,100,0,1,97,100,0,1,98,100,0,1,99>>

131 means "external format"
104,3 means a tuple with 3 elements
100,0,1,97 means the atom a (tag 100, a two-byte length, then the name) and so on
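
A minimal sketch of reading those leading bytes with binary pattern
matching. The module name ext_peek and the function tag/1 are made up for
illustration; the tag numbers come from the external format document linked
above. Tag 119 (a short UTF-8 atom) is included on the assumption that newer
Erlang releases emit it instead of 100; only a handful of tags are covered.

```erlang
-module(ext_peek).
-export([tag/1]).

%% tag(Bin): name the outermost term of an external-format binary
%% without decoding anything below it. Hypothetical helper.
tag(<<131, Term/binary>>) -> term_tag(Term).

term_tag(<<104, Arity, _/binary>>)               -> {small_tuple, Arity};
term_tag(<<105, Arity:32, _/binary>>)            -> {large_tuple, Arity};
term_tag(<<108, Len:32, _/binary>>)              -> {list, Len};
term_tag(<<106, _/binary>>)                      -> nil;
term_tag(<<100, L:16, Name:L/binary, _/binary>>) -> {atom, Name};
term_tag(<<119, L, Name:L/binary, _/binary>>)    -> {atom, Name};
term_tag(<<97, N, _/binary>>)                    -> {small_integer, N};
term_tag(<<Other, _/binary>>)                    -> {unhandled_tag, Other}.
```

Calling ext_peek:tag/1 on the binary shown above gives {small_tuple,3}.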

If you keep a pointer into this structure pointing at the second byte (call
this P) and want to implement element(3, P), then you just check that P
points to a tuple and skip over the first two elements to reach the third.
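
One way this could look in plain Erlang, as a sketch rather than the
enclosed program. The names ext_element, elem/2, skip/1 and value/1 are
hypothetical (elem avoids clashing with the element/2 BIF). P is the binary
with the leading 131 already stripped, and skip/1 handles only a subset of
the tags in the external format document.

```erlang
-module(ext_element).
-export([elem/2, skip/1, value/1]).

%% elem(K, P): P starts at a tuple tag. Returns the binary that starts
%% at the K'th element; no Erlang term is built for the elements.
elem(K, <<104, Arity, Rest/binary>>) when K >= 1, K =< Arity ->
    skip_n(K - 1, Rest);
elem(K, <<105, Arity:32, Rest/binary>>) when K >= 1, K =< Arity ->
    skip_n(K - 1, Rest).

skip_n(0, Bin) -> Bin;
skip_n(N, Bin) -> skip_n(N - 1, skip(Bin)).

%% skip(Bin): step over one encoded term and return what follows it.
skip(<<97, _, R/binary>>)                     -> R; % SMALL_INTEGER_EXT
skip(<<98, _:32, R/binary>>)                  -> R; % INTEGER_EXT
skip(<<70, _:64, R/binary>>)                  -> R; % NEW_FLOAT_EXT
skip(<<100, L:16, _:L/binary, R/binary>>)     -> R; % ATOM_EXT
skip(<<118, L:16, _:L/binary, R/binary>>)     -> R; % ATOM_UTF8_EXT
skip(<<115, L, _:L/binary, R/binary>>)        -> R; % SMALL_ATOM_EXT
skip(<<119, L, _:L/binary, R/binary>>)        -> R; % SMALL_ATOM_UTF8_EXT
skip(<<106, R/binary>>)                       -> R; % NIL_EXT
skip(<<107, L:16, _:L/binary, R/binary>>)     -> R; % STRING_EXT
skip(<<109, L:32, _:L/binary, R/binary>>)     -> R; % BINARY_EXT
skip(<<110, L, _Sign, _:L/binary, R/binary>>) -> R; % SMALL_BIG_EXT
skip(<<104, A, R/binary>>)    -> skip_n(A, R);      % SMALL_TUPLE_EXT
skip(<<105, A:32, R/binary>>) -> skip_n(A, R);      % LARGE_TUPLE_EXT
skip(<<108, N:32, R/binary>>) -> skip_n(N + 1, R).  % LIST_EXT: N elements + tail

%% value(P): decode only the term that P points at. Its bytes are cut
%% out and prefixed with 131 so that binary_to_term/1 accepts them.
value(P) ->
    Len = byte_size(P) - byte_size(skip(P)),
    <<Enc:Len/binary, _/binary>> = P,
    binary_to_term(<<131, Enc/binary>>).
```

With <<131,P/binary>> = term_to_binary({foo,bar,[a,b]}), the call
ext_element:value(ext_element:elem(2, P)) returns bar, and only the bytes
of that one atom are ever decoded.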

The external format is not actually designed for rapid random access but
it's not too bad, so this is pretty easy.

Turning the code that I posted here into a set of access functions is easy.
You only need element(K) to step into tuples, and hd and tl to step into
lists (though nth would be a good idea).
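
For lists there is one wrinkle: a list is encoded as a count followed by
the elements and a tail, so the tail of a list is not itself a complete
encoded list. The sketch below therefore assumes a cursor of the form
{Remaining, Binary}. The names ext_list, open/1, head/1, tail/1, nth/2 and
value/1 are hypothetical, skip/1 covers only atoms, small integers, nil,
tuples and nested lists, and lists of bytes (encoded as strings, tag 107)
are not handled.

```erlang
-module(ext_list).
-export([open/1, head/1, tail/1, nth/2, value/1]).

%% open(P): P starts at a list tag (131 already stripped). Returns a
%% cursor {Remaining, Bin} where Bin starts at the next element.
open(<<108, N:32, Rest/binary>>) -> {N, Rest};   % LIST_EXT
open(<<106, _/binary>> = P)      -> {0, P}.      % NIL_EXT: empty list

%% head/1 returns the binary starting at the first remaining element,
%% tail/1 returns the cursor advanced past that element.
head({N, Bin}) when N > 0 -> Bin.
tail({N, Bin}) when N > 0 -> {N - 1, skip(Bin)}.

nth(1, Cursor)            -> head(Cursor);
nth(K, Cursor) when K > 1 -> nth(K - 1, tail(Cursor)).

%% skip(Bin): step over one encoded term (small subset of tags).
skip(<<97, _, R/binary>>)                 -> R;  % SMALL_INTEGER_EXT
skip(<<100, L:16, _:L/binary, R/binary>>) -> R;  % ATOM_EXT
skip(<<119, L, _:L/binary, R/binary>>)    -> R;  % SMALL_ATOM_UTF8_EXT
skip(<<106, R/binary>>)                   -> R;  % NIL_EXT
skip(<<104, A, R/binary>>)    -> skip_n(A, R);      % SMALL_TUPLE_EXT
skip(<<108, N:32, R/binary>>) -> skip_n(N + 1, R).  % LIST_EXT: elements + tail

skip_n(0, Bin) -> Bin;
skip_n(N, Bin) -> skip_n(N - 1, skip(Bin)).

%% value(P): decode only the single term that P points at.
value(P) ->
    Len = byte_size(P) - byte_size(skip(P)),
    <<Enc:Len/binary, _/binary>> = P,
    binary_to_term(<<131, Enc/binary>>).
```

With <<131,P/binary>> = term_to_binary([a,b]), the call
ext_list:value(ext_list:nth(2, ext_list:open(P))) returns b.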

For this to be efficient you would need to do this as a NIF, since the
Erlang code that does the same thing would create some garbage as it
executes.

I haven't done any of this - my goal is to create binaries in Erlang
and decode them in JavaScript.

Cheers

/Joe

> Best regards,
> Max
>
>
>
>
> On Thu, Nov 17, 2011 at 5:52 PM, Joe Armstrong <erlang@REDACTED> wrote:
>
>> Here's a programming technique that might be useful which I haven't seen
>> described before ...
>>
>> I've been playing with unpacking binaries produced by
>> term_to_binary(Term) in other languages. Specifically, I do
>> term_to_binary in Erlang to create a binary and I send the binary to
>> JavaScript. The JavaScript code does not by default decode the entire
>> binary, but accesses sub-terms through selector functions (you only need
>> element, hd and tl).
>>
>> This technique seems much nicer than mucking around with JSON:
>> binary formats are way easier to manipulate than text formats that
>> need parsing.
>>
>> Now of course you can do the same thing in Erlang. You do not have to
>> do binary_to_term(B) to extract a sub-term, but can traverse the internal
>> structure of the external format and pull out exactly what you want and
>> nothing else.
>>
>> I often store large terms in files and databases using term_to_binary
>> and I extract data by first doing binary_to_term and
>> then pattern matching on the result.
>>
>> For example if I create a binary with:
>>
>>    > B = term_to_binary({foo,bar,[a,b]})
>>
>> And I want to extract the 'b' sub term, I'd normally write
>>
>>      {_, _, [_,X]} = binary_to_term(B)
>>
>> But why bother to unpack? I could just as well write
>>
>>      X = hd(tl(element(3,B)))
>>
>> These are not the regular hd, tl and element but hacked versions that
>> can traverse the external format.
>>
>> If the term inside the external format is large and I only want to
>> extract a few parameters, then this method should be a lot faster than
>> actually building a large term just to throw it away after pattern
>> matching. This should be a GC and cache friendly way of doing things.
>>
>> In a similar vein one could think of pattern matching being extended over
>> packed terms.
>>
>> If this were so I could write:
>>
>>      T = {foo,bar,[a,b]},
>>      B = term_to_binary(T),
>>      match(B).
>>
>> match({_,_,[_,X]}) -> X.
>>
>> Doing so would mean that once we have packed terms using term_to_binary
>> we could leave them alone and extract data from them without having to
>> completely unpack them.
>>
>> This should be very cache friendly - Erlang terms can be scattered all
>> over the place in virtual memory, but in the external form the whole term
>> is kept together in memory.
>>
>> This is actually pretty useful - I have a data structure representing a
>> book, and somewhere near the beginning there is a title. The entire book
>> is stored on disk as a term_to_binary encoded blob. Now I have a large
>> number of these representing ebooks. If I want to list all titles I
>> certainly do not want to completely unpack everything; I only want to
>> extract the title field and nothing else. ...
>>
>> Cheers
>>
>> /Joe
-------------- next part --------------
A non-text attachment was scrubbed...
Name: decode_bin.erl
Type: text/x-erlang
Size: 2368 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20111117/8cfee985/attachment.bin>