[erlang-questions] Erlang data structure

Wed Jan 30 17:29:11 CET 2013

On Jan 29, 2013, at 1:49 PM, a h <ah.kode@REDACTED> wrote:

> Hi Everyone,
> 
> Please suggest me the best way to implement the below requirement:
> I have one file-A which contains below format of data
> A10,A11, A12, A13
> A20, A21, A22, A23,
> 
> and B file as:
> B10, B11, B12, B13
> B20, B21, B22, B23
> 
> I need to process these two file but my system(framework in Erlang) can able to send data chunks of A to my application and the path for B file. 
> My application need to perform that if A10 = B10 then do something.
> What  I am doing is storing the B file in ETS and then when i receive 1st line from framework, I am searching the same in ETS.
> 
> But I am facing an Issue that storing the B file(which is too huge in GB) in ETS result in Heap Crash.

If you cannot process your data on-line, the usual trick is to process them off-line.

* Call Unix sort(1) on the file. Note that GNU sort is parallel. Once the data is sorted, it can be processed with an ordered merge, which is much faster and runs in constant memory.
* Investigate whether the file_sorter module and disk_log fit your case.
* The data is in CSV format, so you can load it into PostgreSQL (look up the COPY command), index it, and then do a join on the tables. This makes sure you get data in the correct order. You can use the 'epgsql' application from Erlang to read out data in chunks over a cursor.
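
To make the ordered merge from the first suggestion concrete, here is a minimal sketch. It assumes both inputs are available as files already sorted on their first comma-separated field in byte order (for example with `LC_ALL=C sort -t, -k1,1`), and that keys are unique in the B file. The module name merge_join, the function join_files/3 and the callback Fun are hypothetical names made up for illustration; they are not part of your framework.

```erlang
-module(merge_join).
-export([join_files/3]).

%% Hypothetical entry point. Both files must already be sorted on the
%% first comma-separated field, in byte order. Fun(LineA, LineB) is
%% called for every pair of lines whose keys are equal.
join_files(PathA, PathB, Fun) ->
    {ok, A} = file:open(PathA, [read, raw, binary, read_ahead]),
    {ok, B} = file:open(PathB, [read, raw, binary, read_ahead]),
    try
        join(next(A), next(B), {A, B}, Fun)
    after
        file:close(A),
        file:close(B)
    end.

%% Ordered merge: only one line from each file is held in memory.
join(eof, _, _, _) -> ok;
join(_, eof, _, _) -> ok;
join({K, LineA}, {K, LineB} = RowB, {A, _} = Fds, Fun) ->
    %% Same key on both sides. Advance A only, so that repeated keys
    %% in A still match the same B line (assumes B keys are unique).
    Fun(LineA, LineB),
    join(next(A), RowB, Fds, Fun);
join({KA, _}, {KB, _} = RowB, {A, _} = Fds, Fun) when KA < KB ->
    %% A is behind: skip the A line.
    join(next(A), RowB, Fds, Fun);
join(RowA, _RowB, {_, B} = Fds, Fun) ->
    %% B is behind: skip the B line.
    join(RowA, next(B), Fds, Fun).

%% Read one line and pair it with its key.
next(Fd) ->
    case file:read_line(Fd) of
        {ok, Line} -> {key(Line), Line};
        eof -> eof
    end.

%% The key is everything before the first comma or line ending.
key(Line) ->
    [Key | _] = binary:split(Line, [<<",">>, <<"\n">>, <<"\r">>]),
    Key.
```

If the A data only ever arrives in chunks from the framework, the same join logic applies as long as those chunks arrive in sorted key order; otherwise A would have to be written to disk and sorted first.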

Alternatives:

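* Avoid storing A10, B10 and so on as lists; store them as binaries instead. This cuts memory use by a factor of roughly 16 on 64-bit architectures, since a list costs two machine words (16 bytes) per character while a binary costs about one byte per character.
* If you can accept approximate correctness, store only a hash of each B10 value and compare it against the hash of A10. The answer is then heuristically correct: a hit may occasionally be a hash collision rather than a real match.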
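
Both alternatives can be sketched in one small module. This assumes that only the first field of each B line is needed for the comparison. The module name b_index, the functions load/2 and member/3, and the mode atoms exact and hashed are hypothetical names chosen for illustration.

```erlang
-module(b_index).
-export([load/2, member/3]).

%% Mode is `exact` (keep every key as a binary) or `hashed`
%% (keep only a 32-bit hash; lookups may return false positives).
load(Path, Mode) ->
    Tab = ets:new(b_keys, [set]),
    {ok, Fd} = file:open(Path, [read, raw, binary, read_ahead]),
    try
        load_lines(Fd, Tab, Mode)
    after
        file:close(Fd)
    end,
    Tab.

load_lines(Fd, Tab, Mode) ->
    case file:read_line(Fd) of
        {ok, Line} ->
            ets:insert(Tab, {entry(key(Line), Mode)}),
            load_lines(Fd, Tab, Mode);
        eof ->
            ok
    end.

%% True if Key (a binary such as <<"A10">>) was seen in the B file.
%% In `hashed` mode, true can also mean a hash collision.
member(Key, Tab, Mode) ->
    ets:member(Tab, entry(Key, Mode)).

%% binary:copy/1 detaches the key from the line it was split from.
entry(Key, exact) -> binary:copy(Key);
entry(Key, hashed) -> erlang:phash2(Key, 1 bsl 32).

%% The key is everything before the first comma or line ending.
key(Line) ->
    [Key | _] = binary:split(Line, [<<",">>, <<"\n">>, <<"\r">>]),
    Key.
```

For example, `Tab = b_index:load("B.csv", exact)` followed by `b_index:member(<<"B10">>, Tab, exact)` checks a single key, where "B.csv" stands in for the path your framework hands you. The hashed mode keeps one small integer per line regardless of key length, at the price of the occasional false positive.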

Jesper Louis Andersen
  Erlang Solutions Ltd., Copenhagen