<div dir="ltr">This may be a really dumb question, but is there a reason you can't keep the numbers that have been removed in a separate file?<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Apr 30, 2013 at 4:40 AM, Salvador Tamarit <span dir="ltr"><<a href="mailto:stamarit@dsic.upv.es" target="_blank">stamarit@dsic.upv.es</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">2013/4/30 Richard A. O'Keefe <<a href="mailto:ok@cs.otago.ac.nz">ok@cs.otago.ac.nz</a>>:<br>

><br>

> On 30/04/2013, at 1:23 PM, Salvador Tamarit wrote:<br>

><br>

>> Suppose that we have a huge binary file (8GB or so) containing all the<br>

>> integers from 0 to a given (very big) number. The numbers are unsorted<br>

>> and some of them (70 numbers for instance) has been removed.<br>

><br>

> This sounds like a well known problem with an answer in<br>

> <a href="http://books.google.co.nz/books?id=415loiMd_c0C&lpg=PP1&dq=muthukrishnan+data+stream+algorithms&hl=el&pg=PA1&redir_esc=y" target="_blank">http://books.google.co.nz/books?id=415loiMd_c0C&lpg=PP1&dq=muthukrishnan+data+stream+algorithms&hl=el&pg=PA1&redir_esc=y</a><br>


><br>

<br>

That's great. I was suspicious that it existed, but I didn't know how<br>

to find it :P Thank you very much Richard.<br>

<br>

> with the twist that you are asked to find a specific missing number<br>

> (in value order) rather than all of them.<br>

><br>

> When you talk about reading the numbers 4 bytes at a time, that tells<br>

> me the numbers are not >that< "huge".<br>

><br>

> One very simple step would be to set up<br>

>     number-present : map (0..6535) -> nat := (all -> 0)_<br>

>     for each number in the input<br>

>         number-present(high 16 bits of that number) +:= 1<br>

> Now you can tell which block of numbers must contain the<br>

> number you are looking for, and you can make a second pass<br>

> through the data counting numbers in that block, and then<br>

> it's a simple scan through the block.<br>

<br>

<br>

It would solve the problem of being able to load the data into memory,<br>

but I still having problems reading the whole file. It takes too much<br>

time. I know my function is very naive, however I'm not sure how could<br>

I improve it. Only reading integers from file, without further<br>

processing, takes a lot of time (minutes) in a fast laptop.<br>

<br>

<br>

read_integers(FDevice) -><br>

   case file:read(FDevice, 4) of<br>

     {ok,<<_:2/little-unit:8,Readed:2/little-unit:8>>} -><br>

        read_integers(FDevice);<br>

     eof -><br>

        ok<br>

   end.<br>

<br>

><br>

> So if the numbers are 0..N-1, the problem can be solved in<br>

> O(N**1/2) space in two passes, or in O(N**1/3) space in<br>

> three passes, &c.<br>

><br>

> This is _not_ mathematically optimal, but it's quite easy to<br>

> program, and if you really had this problem, I'd rather have<br>

> the data read twice by a simple program I trusted than once<br>

> by a clever program that was almost certainly wrong.<br>

><br>

><br>

> _______________________________________________<br>

> erlang-questions mailing list<br>

> <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>

> <a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

_______________________________________________<br>

erlang-questions mailing list<br>

<a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>

<a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

</blockquote></div><br></div>