[erlang-questions] Managing a huge binary file

Jayson Barley jayson.barley@REDACTED
Wed May 1 00:25:22 CEST 2013


This may be a really dumb question, but is there a reason you can't keep
the numbers that have been removed in a separate file?


On Tue, Apr 30, 2013 at 4:40 AM, Salvador Tamarit <stamarit@REDACTED>wrote:

> 2013/4/30 Richard A. O'Keefe <ok@REDACTED>:
> >
> > On 30/04/2013, at 1:23 PM, Salvador Tamarit wrote:
> >
> >> Suppose that we have a huge binary file (8GB or so) containing all the
> >> integers from 0 to a given (very big) number. The numbers are unsorted
> >> and some of them (70 numbers for instance) has been removed.
> >
> > This sounds like a well known problem with an answer in
> >
> http://books.google.co.nz/books?id=415loiMd_c0C&lpg=PP1&dq=muthukrishnan+data+stream+algorithms&hl=el&pg=PA1&redir_esc=y
> >
>
> That's great. I was suspicious that it existed, but I didn't know how
> to find it :P Thank you very much Richard.
>
> > with the twist that you are asked to find a specific missing number
> > (in value order) rather than all of them.
> >
> > When you talk about reading the numbers 4 bytes at a time, that tells
> > me the numbers are not >that< "huge".
> >
> > One very simple step would be to set up
> >     number-present : map (0..6535) -> nat := (all -> 0)_
> >     for each number in the input
> >         number-present(high 16 bits of that number) +:= 1
> > Now you can tell which block of numbers must contain the
> > number you are looking for, and you can make a second pass
> > through the data counting numbers in that block, and then
> > it's a simple scan through the block.
>
>
> It would solve the problem of being able to load the data into memory,
> but I still having problems reading the whole file. It takes too much
> time. I know my function is very naive, however I'm not sure how could
> I improve it. Only reading integers from file, without further
> processing, takes a lot of time (minutes) in a fast laptop.
>
>
> read_integers(FDevice) ->
>    case file:read(FDevice, 4) of
>      {ok,<<_:2/little-unit:8,Readed:2/little-unit:8>>} ->
>         read_integers(FDevice);
>      eof ->
>         ok
>    end.
>
> >
> > So if the numbers are 0..N-1, the problem can be solved in
> > O(N**1/2) space in two passes, or in O(N**1/3) space in
> > three passes, &c.
> >
> > This is _not_ mathematically optimal, but it's quite easy to
> > program, and if you really had this problem, I'd rather have
> > the data read twice by a simple program I trusted than once
> > by a clever program that was almost certainly wrong.
> >
> >
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20130430/f696c52b/attachment.htm>


More information about the erlang-questions mailing list