[erlang-questions] pre-load large data files when the application starts

Garrett Smith g@REDACTED
Fri Mar 25 19:25:55 CET 2016


On Fri, Mar 25, 2016 at 1:11 PM Benoit Chesneau <bchesneau@REDACTED> wrote:

> On Fri, Mar 25, 2016 at 7:06 PM Garrett Smith <g@REDACTED> wrote:
>
>> On Fri, Mar 25, 2016 at 12:09 PM Benoit Chesneau <bchesneau@REDACTED>
>> wrote:
>>
>>> Hi all,
>>>
>>> I have a large data file provided as comma-separated values (unicode
>>> data). I need to load and parse it ASAP, since it will be used by all
>>> the functions.
>>>
>>
>> What's the interface?
>>
>>
>>> The current implementation consists of parsing the file and generating
>>> either a source file or an include file that is then compiled. My issue
>>> with it for now is that the compilation uses more than 1GB of memory and
>>> then crashes on small machines or containers.
>>>
>>> Other solutions I tried:
>>>
>>> - use merl + `-on_load` to build a module on first call of the module
>>> (takes too long the first time)
>>> - store an ets file and load it later (see the sketch below), which can
>>> be an issue if you need to create an escript with all the modules later
>>> - load and parse in a gen_server (same result as using merl)
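>>>
>>> For reference, the ets file route is basically this (a rough sketch
>>> only; the table and file names here are made up):
>>>
>>>     %% build step, run once when packaging:
>>>     Tab = ets:new(idna_table, [set, named_table]),
>>>     %% ... insert the parsed CSV rows into Tab ...
>>>     ok = ets:tab2file(Tab, "priv/idna.ets"),
>>>
>>>     %% at application start:
>>>     {ok, idna_table} = ets:file2tab("priv/idna.ets").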
>>>
>>> Things I have in mind:
>>>
>>> - generate a DETS file or a small binary tree on disk and cache the
>>> content on demand
>>> - generate a beam file and ship it
>>>
>>> Is there anything else I can do? I am curious how others handle this
>>> case.
>>>
>>
>> I think this depends entirely on your interface :)
>>
>> Do you have to scan the entire table? If so, why? If not, why not treat
>> this as an indexing problem and start from disk, assuming you can defer
>> loading of any data until it's read?
>>
>
>
> Sorry, I should have just posted the code I was working on (the advantage
> of working on open source stuff).
>
> The code I'm referring to is here: https://github.com/benoitc/erlang-idna
> and the recent change I described is here:
> https://github.com/benoitc/erlang-idna/tree/precompile
>
> The table really needs to be in memory somehow, or needs to be accessed
> very fast while reading, since it will be used to encode any domain name
> used in a request (XMPP, HTTP, ...).
>
> It basically checks the codepoint of each character in a string and tries
> to compose/decompose it.
>

Messing around with includes, beam generation, etc. seems like a possible
plan B or C. I'd start with dets or leveldb and measure performance. But it
seems like you're well down all those roads anyway.
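
To illustrate the "start from disk" idea, a minimal sketch, untested - the
module, table, and file names are made up, and I'm assuming your table is
keyed by codepoint:

    -module(idna_lookup).
    -export([open/0, lookup/1]).

    %% Open the on-disk dets table and create an ets cache for hot entries.
    open() ->
        {ok, idna_dets} = dets:open_file(idna_dets,
                                         [{file, "priv/idna.dets"}]),
        idna_cache = ets:new(idna_cache, [named_table, public,
                                          {read_concurrency, true}]),
        ok.

    %% Hit the ets cache first; fall back to dets and cache on a miss.
    lookup(CodePoint) ->
        case ets:lookup(idna_cache, CodePoint) of
            [{_, Props}] ->
                Props;
            [] ->
                case dets:lookup(idna_dets, CodePoint) of
                    [{_, Props} = Entry] ->
                        ets:insert(idna_cache, Entry),
                        Props;
                    [] ->
                        undefined
                end
        end.

You pay one disk read the first time a codepoint is seen, and ets speed
after that, without ever loading the whole table up front.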

So then it's a matter of publishing your results :)

Btw, this reminds me of a larger scale version of this:

https://github.com/gar1t/erlang-bench/blob/master/name-lookup.escript

I routinely create an erlang-bench script to satisfy my curiosity and
explore different methods. It might not be a good fit as a harness for your
test, given the scale - but it might be useful to sniff test smaller
variants.
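
The pattern there is trivial to adapt - something in this spirit (a sketch;
the iteration count and the lookup funs you pass in are placeholders):

    %% Time N calls of Fun and print the total in microseconds.
    bench(Name, Fun, N) ->
        {Usec, _} = timer:tc(fun() -> repeat(N, Fun) end),
        io:format("~s: ~b us~n", [Name, Usec]).

    repeat(0, _Fun) -> ok;
    repeat(N, Fun) -> Fun(), repeat(N - 1, Fun).

Call it once per method - say bench("dets", fun() ->
dets:lookup(idna_dets, 16#00E9) end, 100000) versus the ets and map
equivalents - for a quick read on the relative costs.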