[erlang-questions] Managing a huge binary file

Jesper Louis Andersen jesper.louis.andersen@REDACTED
Tue Apr 30 13:42:22 CEST 2013


On Apr 30, 2013, at 3:23 AM, Salvador Tamarit <stamarit@REDACTED> wrote:

> Suppose that we have a huge binary file (8GB or so) containing all the
> integers from 0 to a given (very big) number. The numbers are unsorted
> and some of them (70 numbers for instance) has been removed.
> 
> The goal is to find the the Nth removed number (in order). For
> instance, if the file contains the numbers from 0 to 99, and then
> number 101, the 1st removed would be 100.

Quick shot at a possible solution: Variant of a Fenwick-tree.

Essentially, the idea is to count bit-patterns in the numbers and let each bit correspond to a slot in an array. Now you can "count" how many times a given pattern is there. With a bit of mangling, it may be possible to walk over all the data building up the counts in the array and then finally scrutinize it for 0'es and then work backwards what missing pattern you have.

Another variant: Radix tree or trie structure on bit-patterns since this greatly compresses the in-memory representation. Hash Array Mapped Tries comes to mind here. You can easily walk it in order to determine "holes" in the structure afterwards.

Jesper Louis Andersen
  Erlang Solutions Ltd., Copenhagen






More information about the erlang-questions mailing list