[erlang-questions] Using ETS for large amounts of data?

Tue Aug 10 11:50:13 CEST 2010

The absolute fastest option (including startup) should be a NIF with an
mmapped file. But if you want to stay in Erlang, you can use
file:read_file/1 if memory consumption is not a problem, and then
binary:part/3 (or binary:at/2 for a single byte) if you can use R14.
Otherwise I would use

<<_:Index/binary, V1:8, V2:8, V3:8, ..., _/binary>> = Bin.

(Index here is the byte offset of the record; the field sizes are in bits.)
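
To make the in-memory approach concrete, here is a minimal sketch that
decodes one 26-byte record out of the big binary with the bit syntax. The
module and function names are made up for illustration, and the field order
and big-endian byte order are assumptions: they have to match whatever the
perl pack template on the encode side actually writes.

```erlang
-module(record_lookup).
-export([load/1, record_at/2]).

%% 4 x uint8 + 5 x uint16 + 1 x uint32 + 2 x 32-bit float = 26 bytes.
-define(RECORD_SIZE, 26).

%% Read the whole data file into memory as one binary.
load(Path) ->
    {ok, Bin} = file:read_file(Path),
    Bin.

%% Decode record number N (zero-based).
%% Field order and big-endian encoding are assumptions.
record_at(Bin, N) ->
    Offset = N * ?RECORD_SIZE,
    <<_:Offset/binary,
      A1:8, A2:8, A3:8, A4:8,
      B1:16/big, B2:16/big, B3:16/big, B4:16/big, B5:16/big,
      C:32/big,
      F1:32/float-big, F2:32/float-big,
      _/binary>> = Bin,
    {A1, A2, A3, A4, B1, B2, B3, B4, B5, C, F1, F2}.
```

The match skips Offset bytes without copying them, so the cost should not
depend much on where the record sits in the binary, but that is worth
measuring rather than assuming.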

If memory consumption is a problem, I would use file:open/2 with [read, raw,
binary, read_ahead] and file:pread/3, and maybe write my own LRU
record or block cache.
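
A minimal sketch of the on-disk variant, assuming the same fixed 26-byte
records. The module and function names are hypothetical, and no LRU cache is
included; this only shows the positioned read, leaving caching to the OS.

```erlang
-module(record_file).
-export([open/1, read_record/2, close/1]).

-define(RECORD_SIZE, 26).

%% Open the data file for raw, binary, positioned reads.
open(Path) ->
    file:open(Path, [read, raw, binary, read_ahead]).

%% Fetch the raw bytes of record N (zero-based) straight from disk.
read_record(Fd, N) ->
    case file:pread(Fd, N * ?RECORD_SIZE, ?RECORD_SIZE) of
        {ok, <<Record:?RECORD_SIZE/binary>>} -> {ok, Record};
        {ok, _Short} -> {error, truncated};   % file ends inside a record
        eof -> {error, eof};                  % N is past the last record
        {error, _} = Err -> Err
    end.

close(Fd) ->
    file:close(Fd).
```

The returned 26-byte binary can then be decoded with the bit syntax exactly
as in the in-memory case.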

Anyway, if you are curious, write your own benchmark and measure. It
is worth a thousand pieces of advice. You should start with the simplest
solution and then measure. If it is not fast enough, try another
approach, then measure and compare, again and again until you
are satisfied. You will learn a lot during this process.
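
As a starting point, a rough benchmark sketch comparing the two extraction
styles asked about below (bit-syntax match versus split_binary/2) could look
like this. The module name, function names and the way lookup indexes are
spread out with erlang:phash2/2 are my own choices for illustration, not a
recommended methodology; it uses timer:tc/3 so that it should also run on
R13B04.

```erlang
-module(record_bench).
-export([run/2, by_match/2, by_split/2]).

-define(RECORD_SIZE, 26).

%% Extract record N with a single bit-syntax match.
by_match(Bin, N) ->
    Offset = N * ?RECORD_SIZE,
    <<_:Offset/binary, Record:?RECORD_SIZE/binary, _/binary>> = Bin,
    Record.

%% Extract record N with split_binary/2 followed by a match.
by_split(Bin, N) ->
    {_Before, At} = split_binary(Bin, N * ?RECORD_SIZE),
    <<Record:?RECORD_SIZE/binary, _/binary>> = At,
    Record.

%% Run Lookups lookups with each approach.
%% Returns [{Approach, Microseconds}]. Bin must hold at least one record.
run(Bin, Lookups) ->
    Max = byte_size(Bin) div ?RECORD_SIZE,
    %% Deterministic but scattered record numbers in 0..Max-1.
    Indexes = [erlang:phash2(I, Max) || I <- lists:seq(1, Lookups)],
    [{F, measure(F, Bin, Indexes)} || F <- [by_match, by_split]].

measure(F, Bin, Indexes) ->
    {Micros, ok} = timer:tc(lists, foreach,
                            [fun(N) -> ?MODULE:F(Bin, N) end, Indexes]),
    Micros.
```

Something like record_bench:run(Bin, 1000000) on the real data file, repeated
a few times, should show whether the difference matters at all for your
workload.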

On Tue, Aug 10, 2010 at 1:13 AM, Anthony Molinaro
<anthonym@REDACTED> wrote:
>
> On Mon, Aug 09, 2010 at 02:22:15PM -0700, Paul Mineiro wrote:
>> On Mon, 9 Aug 2010, Anthony Molinaro wrote:
>>
>> > Hi,
>> >
>> >   I've got some data in a file of the form
>> >
>> > start_integer|end_integer|data_field1|data_field2|..|data_fieldN
>> >
>> > Where N is 12.  Most fields are actually smallish integers.
>> > The total number of entries in the file is 9964217 lines.
>> > and the total file size is 752265457.
>> >
>> > I want to load these into an erlang process and essentially do lookups
>> > of the form
>> >
>> >   given an integer M
>> >
>> >   if M >= start_integer && M <= end_integer then
>> >     return data
>> ...
>> > So is ets up to this task at all? and if not any suggestions?
>>
>> Here's my initial thoughts:
>>
>> 1) use an interval tree to encode (start, end) -> index
>>    1a) http://en.wikipedia.org/wiki/Interval_tree
>>    1b) if your intervals do not overlap you can use ets ordered set for
>> this part, but use the beginning of the interval for the key, and then use
>> ets:prev/2 to find the entry
>> 2) put the rest of the data fields in a huge flat binary and look them up
>> by reading the index from the interval tree ... or hey just seek around a
>> disk file, and let the OS cache keep the hot spots warm.
>
> Since my intervals are closed and non intersecting, it looks like storing
> just the interval start and an index is a win.   I was able to create an
> ets file with the starts and random indexes, then save it with ets:tab2file,
> then load it quickly and do ets:prev calls.
>
> So now for the data file.  It looks like my structure is actually
>
> 4 uint8 integers
> 5 uint16 integers
> 1 uint32 integer
> 2 floats (4 byte precision should be fine)
>
> So fixed length record size of 26 bytes for a total of 259069642 bytes.
>
> Since I'm generating these files from perl, I'll use pack on the encode
> side, then the bit syntax to decode, but I'm wondering what the most
> efficient way of pulling the individual record out of the large binary
> is (I plan to use file:read_file/1 to get the binary)?
>
> Is it using bit syntax to pull out 26 bytes?
>
> <<_:Index/binary, Record:26/binary, _/binary>> = BigBinary,
> <<V1:8, V2:8, V3....>> = Record
>
> or is it better to use something like split_binary/2?
>
> {_Before, At} = split_binary(BigBinary, Index),
> <<V1:8, V2:8, V3....>> = At
>
> I was going to try some timing tests, but figured I'd see if anyone has any
> idea (I'm using R13B04 if that matters).
>
> Thanks for the help,
>
> -Anthony
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro                           <anthonym@REDACTED>
>

-- 
--Hynek (Pichi) Vychodil