[erlang-questions] Managing a huge binary file

Salvador Tamarit stamarit@REDACTED
Tue Apr 30 13:40:06 CEST 2013


2013/4/30 Richard A. O'Keefe <ok@REDACTED>:
>
> On 30/04/2013, at 1:23 PM, Salvador Tamarit wrote:
>
>> Suppose that we have a huge binary file (8GB or so) containing all the
>> integers from 0 to a given (very big) number. The numbers are unsorted
>> and some of them (70 numbers for instance) has been removed.
>
> This sounds like a well known problem with an answer in
> http://books.google.co.nz/books?id=415loiMd_c0C&lpg=PP1&dq=muthukrishnan+data+stream+algorithms&hl=el&pg=PA1&redir_esc=y
>

That's great. I was suspicious that it existed, but I didn't know how
to find it :P Thank you very much Richard.

> with the twist that you are asked to find a specific missing number
> (in value order) rather than all of them.
>
> When you talk about reading the numbers 4 bytes at a time, that tells
> me the numbers are not >that< "huge".
>
> One very simple step would be to set up
>     number-present : map (0..6535) -> nat := (all -> 0)_
>     for each number in the input
>         number-present(high 16 bits of that number) +:= 1
> Now you can tell which block of numbers must contain the
> number you are looking for, and you can make a second pass
> through the data counting numbers in that block, and then
> it's a simple scan through the block.


It would solve the problem of being able to load the data into memory,
but I still having problems reading the whole file. It takes too much
time. I know my function is very naive, however I'm not sure how could
I improve it. Only reading integers from file, without further
processing, takes a lot of time (minutes) in a fast laptop.


read_integers(FDevice) ->
   case file:read(FDevice, 4) of
     {ok,<<_:2/little-unit:8,Readed:2/little-unit:8>>} ->
        read_integers(FDevice);
     eof ->
        ok
   end.

>
> So if the numbers are 0..N-1, the problem can be solved in
> O(N**1/2) space in two passes, or in O(N**1/3) space in
> three passes, &c.
>
> This is _not_ mathematically optimal, but it's quite easy to
> program, and if you really had this problem, I'd rather have
> the data read twice by a simple program I trusted than once
> by a clever program that was almost certainly wrong.
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions



More information about the erlang-questions mailing list