<div dir="ltr">This may be a really dumb question, but is there a reason you can't keep the numbers that have been removed in a separate file?<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Apr 30, 2013 at 4:40 AM, Salvador Tamarit <span dir="ltr"><<a href="mailto:stamarit@dsic.upv.es" target="_blank">stamarit@dsic.upv.es</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">2013/4/30 Richard A. O'Keefe <<a href="mailto:ok@cs.otago.ac.nz">ok@cs.otago.ac.nz</a>>:<br>
><br>
> On 30/04/2013, at 1:23 PM, Salvador Tamarit wrote:<br>
><br>
>> Suppose that we have a huge binary file (8GB or so) containing all the<br>
>> integers from 0 to a given (very big) number. The numbers are unsorted<br>
>> and some of them (70 numbers for instance) has been removed.<br>
><br>
> This sounds like a well known problem with an answer in<br>
> <a href="http://books.google.co.nz/books?id=415loiMd_c0C&lpg=PP1&dq=muthukrishnan+data+stream+algorithms&hl=el&pg=PA1&redir_esc=y" target="_blank">http://books.google.co.nz/books?id=415loiMd_c0C&lpg=PP1&dq=muthukrishnan+data+stream+algorithms&hl=el&pg=PA1&redir_esc=y</a><br>
><br>
<br>
That's great. I was suspicious that it existed, but I didn't know how<br>
to find it :P Thank you very much Richard.<br>
<br>
> with the twist that you are asked to find a specific missing number<br>
> (in value order) rather than all of them.<br>
><br>
> When you talk about reading the numbers 4 bytes at a time, that tells<br>
> me the numbers are not >that< "huge".<br>
><br>
> One very simple step would be to set up<br>
> number-present : map (0..6535) -> nat := (all -> 0)_<br>
> for each number in the input<br>
> number-present(high 16 bits of that number) +:= 1<br>
> Now you can tell which block of numbers must contain the<br>
> number you are looking for, and you can make a second pass<br>
> through the data counting numbers in that block, and then<br>
> it's a simple scan through the block.<br>
<br>
<br>
It would solve the problem of being able to load the data into memory,<br>
but I still having problems reading the whole file. It takes too much<br>
time. I know my function is very naive, however I'm not sure how could<br>
I improve it. Only reading integers from file, without further<br>
processing, takes a lot of time (minutes) in a fast laptop.<br>
<br>
<br>
read_integers(FDevice) -><br>
case file:read(FDevice, 4) of<br>
{ok,<<_:2/little-unit:8,Readed:2/little-unit:8>>} -><br>
read_integers(FDevice);<br>
eof -><br>
ok<br>
end.<br>
<br>
><br>
> So if the numbers are 0..N-1, the problem can be solved in<br>
> O(N**1/2) space in two passes, or in O(N**1/3) space in<br>
> three passes, &c.<br>
><br>
> This is _not_ mathematically optimal, but it's quite easy to<br>
> program, and if you really had this problem, I'd rather have<br>
> the data read twice by a simple program I trusted than once<br>
> by a clever program that was almost certainly wrong.<br>
><br>
><br>
> _______________________________________________<br>
> erlang-questions mailing list<br>
> <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>
> <a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>
_______________________________________________<br>
erlang-questions mailing list<br>
<a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>
</blockquote></div><br></div>