[erlang-questions] Managing a huge binary file
Tue Apr 30 13:40:06 CEST 2013
2013/4/30 Richard A. O'Keefe <>:
> On 30/04/2013, at 1:23 PM, Salvador Tamarit wrote:
>> Suppose that we have a huge binary file (8GB or so) containing all the
>> integers from 0 to a given (very big) number. The numbers are unsorted
>> and some of them (70 numbers for instance) has been removed.
> This sounds like a well known problem with an answer in
That's great. I was suspicious that it existed, but I didn't know how
to find it :P Thank you very much Richard.
> with the twist that you are asked to find a specific missing number
> (in value order) rather than all of them.
> When you talk about reading the numbers 4 bytes at a time, that tells
> me the numbers are not >that< "huge".
> One very simple step would be to set up
> number-present : map (0..6535) -> nat := (all -> 0)_
> for each number in the input
> number-present(high 16 bits of that number) +:= 1
> Now you can tell which block of numbers must contain the
> number you are looking for, and you can make a second pass
> through the data counting numbers in that block, and then
> it's a simple scan through the block.
It would solve the problem of being able to load the data into memory,
but I still having problems reading the whole file. It takes too much
time. I know my function is very naive, however I'm not sure how could
I improve it. Only reading integers from file, without further
processing, takes a lot of time (minutes) in a fast laptop.
case file:read(FDevice, 4) of
> So if the numbers are 0..N-1, the problem can be solved in
> O(N**1/2) space in two passes, or in O(N**1/3) space in
> three passes, &c.
> This is _not_ mathematically optimal, but it's quite easy to
> program, and if you really had this problem, I'd rather have
> the data read twice by a simple program I trusted than once
> by a clever program that was almost certainly wrong.
> erlang-questions mailing list
More information about the erlang-questions