[erlang-questions] Managing a huge binary file

Richard A. O'Keefe ok@REDACTED
Tue Apr 30 04:00:05 CEST 2013


On 30/04/2013, at 1:23 PM, Salvador Tamarit wrote:

> Suppose that we have a huge binary file (8GB or so) containing all the
> integers from 0 to a given (very big) number. The numbers are unsorted
> and some of them (70 numbers for instance) has been removed.

This sounds like a well known problem with an answer in
http://books.google.co.nz/books?id=415loiMd_c0C&lpg=PP1&dq=muthukrishnan+data+stream+algorithms&hl=el&pg=PA1&redir_esc=y

with the twist that you are asked to find a specific missing number
(in value order) rather than all of them.

When you talk about reading the numbers 4 bytes at a time, that tells
me the numbers are not >that< "huge".

One very simple step would be to set up
    number-present : map (0..6535) -> nat := (all -> 0)_
    for each number in the input
        number-present(high 16 bits of that number) +:= 1
Now you can tell which block of numbers must contain the
number you are looking for, and you can make a second pass
through the data counting numbers in that block, and then
it's a simple scan through the block.

So if the numbers are 0..N-1, the problem can be solved in
O(N**1/2) space in two passes, or in O(N**1/3) space in
three passes, &c.

This is _not_ mathematically optimal, but it's quite easy to
program, and if you really had this problem, I'd rather have
the data read twice by a simple program I trusted than once
by a clever program that was almost certainly wrong.





More information about the erlang-questions mailing list