I've been writing an Erlang app (my first one) for the last 2 and 1/2 months, and one of its duties is to parse/transform _large_ CSV files. Some of them are 11+ million lines and 600+ MB in size, and they will only get larger as time goes on. Luckily I've got 128GB of RAM to work with, so I decided to do everything in memory via binaries. My initial, naive attempt was to read the file into one large binary and then split that into its individual columns. The end result would be a list of lists, each sublist being a list of binaries representing the columns.<div>
<br></div><div>e.g.</div><div><br></div><div>[[<<"col1">>,<<"col2">>],</div><div> [<<"val1">>,<<"val2">>]]</div><div><br></div><div>This chewed up a _lot_ of memory, and my runtimes were nothing to write home about. A few days ago I sat down and tried to read up more on binaries, their idiomatic usage, pitfalls, etc. I ended up re-implementing my CSV functions to build a new binary while parsing the source binary. My memory usage went down drastically (although it still spikes because of ephemeral garbage, mostly lots of temporary binaries) and my runtimes improved greatly!</div>
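<div><br></div><div>For readers who want something concrete, here is a minimal sketch of the naive list-of-lists approach described above. It is not the attached code: the module name csv_naive and the function parse/1 are made up for illustration, and it assumes newline-terminated rows, comma separators, and no quoted fields.</div>

```erlang
-module(csv_naive).
-export([parse/1]).

%% Hypothetical helper, not the code attached to this post.
%% Assumes "\n" line endings, "," separators and no quoted fields.
parse(Bin) when is_binary(Bin) ->
    %% 'trim' drops the empty trailing part left by a final newline.
    Lines = binary:split(Bin, <<"\n">>, [global, trim]),
    %% Every column becomes its own binary term held in a nested list.
    [binary:split(Line, <<",">>, [global]) || Line <- Lines].
```

<div>Calling csv_naive:parse(<<"col1,col2\nval1,val2\n">>) gives the two-row structure shown in the example above.</div>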
<div><br></div><div>Anyways, while reading the docs on binary handling in the efficiency guide I got the notion that if you match/split a large binary, the resulting sub-binary will reference the larger, off-heap one. That is, if you pull a chunk out of a large binary, its storage is essentially free. However, after running some tests I feel like a lot of heap memory is allocated. I wrote a more formal test tonight, and honestly, I'm not sure I understand things any better.</div>
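<div><br></div><div>A small sketch of how that sharing can be observed, using binary:referenced_byte_size/1 from the standard binary module. The module name subbin_demo and the sizes are arbitrary choices for illustration; the chunk is deliberately kept well above 64 bytes, the limit under which binaries are stored on the process heap.</div>

```erlang
-module(subbin_demo).
-export([demo/0]).

%% Hypothetical demo module; sizes are arbitrary illustration values.
demo() ->
    %% 8000 bytes, far above the 64-byte heap-binary limit, so the data
    %% lives off-heap as a reference-counted binary.
    Big = binary:copy(<<"abcdefgh">>, 1000),
    %% Matching out a chunk yields a sub-binary that points into Big.
    <<Chunk:1000/binary, _Rest/binary>> = Big,
    %% binary:copy/1 makes a fresh binary that no longer refers to Big.
    Copy = binary:copy(Chunk),
    [{chunk_size, byte_size(Chunk)},
     {chunk_referenced, binary:referenced_byte_size(Chunk)},
     {copy_referenced, binary:referenced_byte_size(Copy)}].
```

<div>The chunk is 1000 bytes long but reports the full 8000 bytes as referenced, while the copy references only its own 1000 bytes. The payload is shared, but the sub-binary term itself (and the list cell that holds it) still lives on the process heap.</div>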
<div><br></div><div>I created a bunch of test runs on a 27MB CSV file with 597971 lines, each with 8 columns.</div><div><br></div><div>list_of_cols: break each column into its own binary</div><div>list_of_lines: break each line into its own binary</div>
<div>list_of_32byte: break into 32 byte chunks</div><div>list_of_64byte: " "</div><div>list_of_128byte:</div><div>...</div><div><br></div><div>Each test shows certain memory stats for each stage of the test: start, after send data, after run, after GC, and after shutdown. The main thing to focus on is the difference in "processes used" memory between the "after send data" and "after run" stages. Here are my results and attached is the code.</div>
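<div><br></div><div>For reference, the three columns in the output below correspond to memory types that erlang:memory/1 can report. A rough sketch of how such a snapshot could be taken at each stage follows; the module name mem_snapshot, the function names, and the output layout are hypothetical and do not reproduce the attached code.</div>

```erlang
-module(mem_snapshot).
-export([snapshot/0, report/1]).

%% Hypothetical helper, not the code attached to this post.
%% Returns [{Type, Bytes}] for the three memory types of interest.
snapshot() ->
    erlang:memory([total, processes_used, binary]).

%% Prints one labelled snapshot, converting bytes to MB (2^20 bytes).
report(Label) ->
    io:format("*** ~s~n", [Label]),
    lists:foreach(
      fun({Type, Bytes}) ->
              io:format("~16s | ~.2fMB~n",
                        [atom_to_list(Type), Bytes / (1024 * 1024)])
      end,
      snapshot()).
```

<div>Something like mem_snapshot:report("After run") would then be called at each stage; for the "after GC" stage, erlang:garbage_collect/1 can force a collection on the worker process first.</div>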
<div><br></div><div><div>2> binary_test:run_tests().</div><div>*** After read</div><div>                 total  |        processes used  |                binary  </div><div>               35.57MB  |                0.81MB  |               26.78MB</div>
<div><br></div><div>==========RUNNING list_of_cols============================</div><div>*** Start</div><div>                 total  |        processes used  |                binary  </div><div>               35.53MB  |                0.77MB  |               26.78MB</div>
<div><br></div><div>*** After send data</div><div>                 total  |        processes used  |                binary  </div><div>               35.53MB  |                0.77MB  |               26.78MB</div><div><br>
</div><div>*** After run</div><div>                 total  |        processes used  |                binary  </div><div>              400.45MB  |              365.68MB  |               26.79MB</div><div><br></div><div>*** After GC</div>
<div>                 total  |        processes used  |                binary  </div><div>              400.45MB  |              365.68MB  |               26.78MB</div><div><br></div><div>*** After shutdown</div><div>                 total  |        processes used  |                binary  </div>
<div>               35.55MB  |                0.78MB  |               26.78MB</div><div><br></div><div>==========DONE list_of_cols===============================</div><div><br></div><div>==========RUNNING list_of_lines===========================</div>
<div>*** Start</div><div>                 total  |        processes used  |                binary  </div><div>               35.55MB  |                0.78MB  |               26.78MB</div><div><br></div><div>*** After send data</div>
<div>                 total  |        processes used  |                binary  </div><div>               35.55MB  |                0.78MB  |               26.78MB</div><div><br></div><div>*** After run</div><div>                 total  |        processes used  |                binary  </div>
<div>               84.53MB  |               49.75MB  |               26.79MB</div><div><br></div><div>*** After GC</div><div>                 total  |        processes used  |                binary  </div><div>               84.53MB  |               49.75MB  |               26.78MB</div>
<div><br></div><div>*** After shutdown</div><div>                 total  |        processes used  |                binary  </div><div>               35.55MB  |                0.78MB  |               26.78MB</div><div><br>
</div><div>==========DONE list_of_lines==============================</div><div><br></div><div>==========RUNNING list_of_32byte==========================</div><div>*** Start</div><div>                 total  |        processes used  |                binary  </div>
<div>               35.55MB  |                0.78MB  |               26.78MB</div><div><br></div><div>*** After send data</div><div>                 total  |        processes used  |                binary  </div><div>               35.56MB  |                0.79MB  |               26.79MB</div>
<div><br></div><div>*** After run</div><div>                 total  |        processes used  |                binary  </div><div>               96.76MB  |               61.99MB  |               26.78MB</div><div><br></div>
<div>*** After GC</div><div>                 total  |        processes used  |                binary  </div><div>              112.07MB  |               77.30MB  |               26.78MB</div><div><br></div><div>*** After shutdown</div>
<div>                 total  |        processes used  |                binary  </div><div>               35.55MB  |                0.78MB  |               26.78MB</div><div><br></div><div>==========DONE list_of_32byte=============================</div>
<div><br></div><div>==========RUNNING list_of_64byte==========================</div><div>*** Start</div><div>                 total  |        processes used  |                binary  </div><div>               35.55MB  |                0.78MB  |               26.78MB</div>
<div><br></div><div>*** After send data</div><div>                 total  |        processes used  |                binary  </div><div>               35.56MB  |                0.79MB  |               26.78MB</div><div><br>
</div><div>*** After run</div><div>                 total  |        processes used  |                binary  </div><div>               66.90MB  |               32.13MB  |               26.78MB</div><div><br></div><div>*** After GC</div>
<div>                 total  |        processes used  |                binary  </div><div>               66.90MB  |               32.13MB  |               26.78MB</div><div><br></div><div>*** After shutdown</div><div>                 total  |        processes used  |                binary  </div>
<div>               35.56MB  |                0.78MB  |               26.78MB</div><div><br></div><div>==========DONE list_of_64byte=============================</div><div><br></div><div>==========RUNNING list_of_128byte=========================</div>
<div>*** Start</div><div>                 total  |        processes used  |                binary  </div><div>               35.56MB  |                0.78MB  |               26.78MB</div><div><br></div><div>*** After send data</div>
<div>                 total  |        processes used  |                binary  </div><div>               35.56MB  |                0.79MB  |               26.78MB</div><div><br></div><div>*** After run</div><div>                 total  |        processes used  |                binary  </div>
<div>               51.61MB  |               16.83MB  |               26.78MB</div><div><br></div><div>*** After GC</div><div>                 total  |        processes used  |                binary  </div><div>               51.61MB  |               16.83MB  |               26.79MB</div>
<div><br></div><div>*** After shutdown</div><div>                 total  |        processes used  |                binary  </div><div>               35.55MB  |                0.77MB  |               26.79MB</div><div><br>
</div><div>==========DONE list_of_128byte============================</div><div><br></div><div>==========RUNNING list_of_256byte=========================</div><div>*** Start</div><div>                 total  |        processes used  |                binary  </div>
<div>               35.55MB  |                0.78MB  |               26.79MB</div><div><br></div><div>*** After send data</div><div>                 total  |        processes used  |                binary  </div><div>               35.55MB  |                0.78MB  |               26.79MB</div>
<div><br></div><div>*** After run</div><div>                 total  |        processes used  |                binary  </div><div>               45.82MB  |               11.05MB  |               26.79MB</div><div><br></div>
<div>*** After GC</div><div>                 total  |        processes used  |                binary  </div><div>               45.83MB  |               11.06MB  |               26.79MB</div><div><br></div><div>*** After shutdown</div>
<div>                 total  |        processes used  |                binary  </div><div>               35.56MB  |                0.78MB  |               26.79MB</div><div><br></div><div>==========DONE list_of_256byte============================</div>
<div><br></div><div>==========RUNNING list_of_512byte=========================</div><div>*** Start</div><div>                 total  |        processes used  |                binary  </div><div>               35.54MB  |                0.77MB  |               26.79MB</div>
<div><br></div><div>*** After send data</div><div>                 total  |        processes used  |                binary  </div><div>               35.55MB  |                0.77MB  |               26.79MB</div><div><br>
</div><div>*** After run</div><div>                 total  |        processes used  |                binary  </div><div>               41.89MB  |                7.12MB  |               26.78MB</div><div><br></div><div>*** After GC</div>
<div>                 total  |        processes used  |                binary  </div><div>               41.89MB  |                7.13MB  |               26.78MB</div><div><br></div><div>*** After shutdown</div><div>                 total  |        processes used  |                binary  </div>
<div>               35.55MB  |                0.78MB  |               26.78MB</div><div><br></div><div>==========DONE list_of_512byte============================</div></div><div><br></div><div><br></div><div>I went ahead and did some _rough_ calculations on how much each sub-binary costs on the process heap by taking the difference in processes_used memory and dividing by the number of sub-binaries created.  I got the following numbers:</div>
<div><br></div><div>list_of_cols: 80B</div><div>list_of_lines: 86B</div><div>list_of_32byte: 73B</div><div>list_of_64byte: 75B</div><div>list_of_128byte: 76B</div><div>list_of_256byte: 99B</div><div>list_of_512byte: 122B</div>
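<div><br></div><div>The arithmetic behind the first two numbers can be sketched as follows, assuming 1MB = 1024 * 1024 bytes and that every line really has exactly 8 columns. The module and function names are made up for illustration.</div>

```erlang
-module(subbin_cost).
-export([bytes_per_item/3, examples/0]).

%% Hypothetical helper for the rough calculation above.
%% BeforeMB / AfterMB: "processes used" readings in MB (1MB = 2^20 bytes).
%% Count: number of sub-binaries created by the run.
bytes_per_item(BeforeMB, AfterMB, Count) ->
    round((AfterMB - BeforeMB) * 1024 * 1024 / Count).

%% Readings taken from the "After send data" and "After run" stages.
examples() ->
    Lines = 597971,
    [{list_of_cols,  bytes_per_item(0.77, 365.68, Lines * 8)},
     {list_of_lines, bytes_per_item(0.78, 49.75, Lines)}].
```

<div>subbin_cost:examples() returns [{list_of_cols,80},{list_of_lines,86}]. Keep in mind that these figures fold in everything the run left on the heap, including the list cells that hold the sub-binaries, so they are an upper bound on the cost of a sub-binary by itself.</div>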
<div><br></div><div>So does this mean a refc binary/sub-binary costs roughly 80B of process heap memory? Am I thinking about this too much? Probably.</div><div><br></div><div>Just to be clear, I'm now happy with the performance of my CSV routines. I'm writing this because I want to understand the underlying binary implementation better and because I spent too much time setting up my tests not to post this :). If you read this far, thank you.</div>
<div><br></div><div>-Ryan</div>