[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Tim Watson watson.timothy@REDACTED
Sun Mar 25 04:26:54 CEST 2012


On 25 Mar 2012, at 03:19, Tim Watson wrote:
> On 24 Mar 2012, at 22:32, Dmitry Kolesnikov wrote:
> 
>> Hello,
>> 
>> You have raised a freaking interesting issue!
>> 
>> I have decided to invest my time here and share the binary parsing techniques I use. 
>> In a nutshell, I was able to parse a line in 2.17 microseconds using a native Erlang implementation. I have tried to productize the code and put it here:
>> 
>> https://github.com/fogfish/csv
>> 
>> The algorithm is very simple:
>> 1. load the whole file into memory
>> 2. shard the file into N chunks
>> 3. parse each chunk in parallel; parsing is based on a binary scan (no binary:split, etc.)
>> 4. merge the results.
>> 
> 
> Thanks for sharing this, it sounds great. I was under the impression from reading the documentation that using the binary module to split everything up would be quicker than hand-coding it. 
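
For anyone following along, a rough sketch of those four steps in plain Erlang might look like the code below. This is my own untested illustration, not the code in fogfish/csv, and it ignores quoted fields entirely:

  -module(csv_shard_sketch).
  -export([run/2]).

  %% 1. load the whole file, 2. shard on newline boundaries,
  %% 3. parse each chunk in a separate process, 4. merge the results in order
  run(File, N) ->
      {ok, Bin} = file:read_file(File),
      Chunks = shard(Bin, byte_size(Bin) div N),
      Self = self(),
      Pids = [spawn_link(fun() -> Self ! {self(), parse_chunk(C)} end)
              || C <- Chunks],
      lists:append([receive {Pid, Rows} -> Rows end || Pid <- Pids]).

  %% cut the binary into ~Size pieces, extending each piece to the next newline
  shard(<<>>, _Size) -> [];
  shard(Bin, Size) when byte_size(Bin) =< Size -> [Bin];
  shard(Bin, Size) ->
      <<Head:Size/binary, Rest/binary>> = Bin,
      case binary:split(Rest, <<"\n">>) of
          [Line, Tail] -> [<<Head/binary, Line/binary, $\n>> | shard(Tail, Size)];
          [Line]       -> [<<Head/binary, Line/binary>>]
      end.

  %% naive per-chunk parse: rows of comma-separated fields, no quoting support
  parse_chunk(Chunk) ->
      [binary:split(Line, <<",">>, [global])
       || Line <- binary:split(Chunk, <<"\n">>, [global]), Line =/= <<>>].
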
> 
>> The test was run on a Mac Mini server with R15B:
>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0] [hipe] [kernel-poll:false]
>> 
>> Since you have not shared the details of your CSV file, I used the following setups:
>> * set.txt contains 300K rows, each row has 16 fields of 3 hex digits each
>> * set2.txt contains 300K rows, each row has 48 fields of 3 hex digits each
>> 
> 
> This is a bit different from the input Max is trying to work with, isn't it? Can you try it with his generated example.csv? I tried doing this and your output seems to be a small list of numbers, so I'm not sure what's going wrong. 
> 
> 13> csv_example:run("example.csv", 80).
> total parse time: 5897.321
> Result = 
> [3711,3750,3750,3751,3751,3750,3750,3749,3750,3751,3751,3751,3750,3752,3751,
> 3750,3750,3751,3751,3751,3750,3750,3751,3751,3750,3751,3751,3750,3753,3751,
> 3749,3750,3751,3751,3750,3751,3750,3750,3751,3751,3750,3751,3751,3750,3750,
> 3750,3751,3751,3751,3751,3751,3751,3750,3751,3751,3751,3749,3751,3749,3751,
> 3751,3750,3751,3751,3750,3752,3750,3751,3750,3751,3750,3751,3749,3750,3750,
> 3751,3750,3749,3750,3750]
> 

On further inspection, I can see that this happens because the fun you pass to the parser (to handle the {line, _} event) is simply incrementing a count. My mistake! 
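
Presumably the fix is just to hand the parser a fun that accumulates the rows instead of counting them. Assuming an event-fold entry point along the lines of csv:parse(Bin, Fun, Acc0) (I'm guessing at the exact signature, so it may not match the module), something like:

  %% hypothetical: collect every {line, Fields} event into a list of rows
  Collect = fun({line, Fields}, Acc) -> [Fields | Acc];
               (_Event, Acc)         -> Acc
            end,
  Rows = lists:reverse(csv:parse(Bin, Collect, [])).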

> 
>> Here are the results:
>> csv_example:run("priv/set.txt", 80).
>> {{lines,300000},
>> {size,18000000},
>> {read_ms,132.269},
>> {parse_ms,652.865},
>> {line_us,2.1762166666666665}}
>> 
>> 2> csv_example:run("priv/set2.txt", 80).
>> {{lines,300000},
>> {size,54000000},
>> {read_ms,204.259},
>> {parse_ms,1860.286},
>> {line_us,6.2009533333333335}}
>> 
> 
> I changed the example code a little, to account for the fact that we time the file:read_file and the parsing together in all our other examples. I get very different results, although I'm admittedly (probably?) on a less powerful machine:
> 
> 4@REDACTED:csv $ erl -pa src -pa priv
> Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
> 
> Eshell V5.8.2  (abort with ^G)
> 1> csv_example:run("example.csv", 80).
> {{lines,300001},
> {size,78599457},
> {parse_ms,5727.757},
> {line_us,19.09245969180103}}
> 2> 
> 
> 
>> - Dmitry
>> 
>> 
>> On Mar 25, 2012, at 12:14 AM, Tim Watson wrote:
>> 
>>> On 24 Mar 2012, at 16:45, Max Lapshin wrote:
>>> 
>>>> I'm trying to use file:read from Erlang combined with a compiled scanf pattern or some regex. There are two ideas:
>>>> 1) combining two binary:split calls means walking the memory twice. I think it is a good idea to parse the memory only once
>>> 
>>> Actually there's a double problem here. In a CSV file it is possible for a delimiter to appear 'inside' one of the fields, provided the contents of the cell are in quotes. For example:
>>> 
>>>>> 1,2,16,John Snow,"Flat A, 128, Watling Street", etc....
>>> 
>>> So just doing binary:split(Bin, binary:compile_pattern(<<",">>), [global]) isn't enough. I will take a look at how binary:split/3 is implemented, but I'm starting to think that you really do need to drop into C to make this fast enough. Maybe for your data set you don't care about this possibility (because you have some control over the inputs), but for a general purpose CSV parser this has to be handled, and it requires too much introspection of the data to be efficient over 70MiB and more if written in pure Erlang. 
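
To illustrate why a plain split isn't enough, a minimal quote-aware line splitter might look like the sketch below. This is my own untested illustration, not code from the thread, and it ignores escaped quotes ("") inside quoted fields:

  %% split one CSV line on commas, but not on commas inside double-quoted fields
  split_line(Line) -> split_line(Line, <<>>, []).

  split_line(<<$", Rest/binary>>, Acc, Fields) ->      %% opening quote
      in_quotes(Rest, Acc, Fields);
  split_line(<<$,, Rest/binary>>, Acc, Fields) ->      %% field boundary
      split_line(Rest, <<>>, [Acc | Fields]);
  split_line(<<C, Rest/binary>>, Acc, Fields) ->
      split_line(Rest, <<Acc/binary, C>>, Fields);
  split_line(<<>>, Acc, Fields) ->
      lists:reverse([Acc | Fields]).

  in_quotes(<<$", Rest/binary>>, Acc, Fields) ->       %% closing quote
      split_line(Rest, Acc, Fields);
  in_quotes(<<C, Rest/binary>>, Acc, Fields) ->
      in_quotes(Rest, <<Acc/binary, C>>, Fields);
  in_quotes(<<>>, Acc, Fields) ->                      %% tolerate an unterminated quote
      lists:reverse([Acc | Fields]).
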
>>> 
>>> With file:read_line and binary:split, I can get the data out (without float/int conversion) sequentially in about 10 seconds. It's not particularly useful to parallelise the binary:split work, as each call doesn't take long on its own; there is just a vast number of them, and that is what slows things down. Spawning a process for each line isn't really very smart either, as even in SMP mode the scheduling churn seems likely to be prohibitive. Even if you hit the sweet spot with the right number of worker processes (1 per CPU +/- 1, for example), you've got the additional work of reconstituting the data in the correct order, which takes up a lot of time.
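
For reference, the sequential baseline described above is roughly the loop below. This is a simplified sketch rather than my exact test code, and the read_ahead value is borrowed from the chunked test further down:

  %% baseline: sequential file:read_line/1 + binary:split/3, no type conversion
  read_lines(Path) ->
      {ok, Fd} = file:open(Path, [raw, binary, {read_ahead, 1024 * 1024}]),
      Comma = binary:compile_pattern(<<",">>),
      Rows = read_lines(Fd, Comma, []),
      ok = file:close(Fd),
      Rows.

  read_lines(Fd, Comma, Acc) ->
      case file:read_line(Fd) of
          {ok, Line} -> read_lines(Fd, Comma, [binary:split(Line, Comma, [global]) | Acc]);
          eof        -> lists:reverse(Acc)
      end.
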
>>> 
>>> Now if I manually chunk my way through the file, using an optimal segment size (on my mac, 16 * 1024 seems to be best) with a sizeable read_ahead of 1024 * 1024, then I can get through the whole file, calling binary:split(Bin, CommaPattern, [global]) on each chunk, in an average of 2300ms. Of course this is still too slow, it ignores newlines for now (unlike the other tests), and I've removed the code for dealing with fields that sit across segment boundaries (which also slows things down): 
>>> 
>>> read(P = #csv_parser{ io_device = Fd, buffer_size = Size }, start, Idx, Rem) ->
>>>  read(P, file:read(Fd, Size), Idx, Rem);
>>> read(P = #csv_parser{ io_device = Fd, buffer_size = Size,
>>>                  delimiter = Delim, collector = _Collector,
>>>                  supported_newlines = _NL },
>>>                  {ok, Chunk},
>>>                  Idx, _Rem) ->
>>>  %% split the chunk on the delimiter; the results are discarded and the
>>>  %% cross-segment remainder is ignored, since this only times the split
>>>  _ = binary:split(Chunk, Delim, [global]),
>>>  read(P, file:read(Fd, Size), Idx + 1, <<>>);
>>> read(_P, eof, _Idx, _Rem) ->
>>>  %% in strict mode, fail unless size(Rem) == 0
>>>  ok.
>>> 
>>> t4@REDACTED:csv_reader $ ./csv_reader.erl example.csv 
>>> Starting read....
>>> read_chunk time: 2376
>>> t4@REDACTED:csv_reader $ 
>>> 
>>> Now if I switch the split to use binary:matches instead, and look for either ',' or a newline, we get a big performance improvement:
>>> 
>>> t4@REDACTED:csv_reader $ ./csv_reader.erl example.csv 
>>> Starting read....
>>> read_chunk time: 953
>>> t4@REDACTED:csv_reader $ 
>>> 
>>> That's much faster, *but* we're not doing any of the hard work of parsing yet, and we're not dealing with the 'cross-segment' parts, which I'll look at next.
>>> 
>>>> 2) the intermediate creation of binaries for all those floats (12 million of them in this example) is also evil. Perhaps line parsing should be combined with float extraction, in the same manner as decode_packet does.
>>> 
>>> 
>>> Like I said, I think that with the approach I mentioned above (nicely segmented file:read calls plus binary:matches) we've got the I/O subsystem and the splitting of fields going about as fast as they're likely to go without dropping into C. Now we get to the more CPU-intensive parts: taking the {Location, Size} data from binary:matches and, combined with the segment Idx, working out the correct offset to pass to binary:part, plus reconstituting fields that sit across two segments, and finally converting the binaries to other data types for the final results. If I could make these parts happen in 2 seconds, for a combined throughput of 3 seconds, I'd be pleased, but I'm not sure it'll be possible. And even then, your desired time of 1 second isn't looking at all likely for even this 300k record file, let alone a larger 1 million line file.  
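
A rough sketch of that step, assuming chunk-local offsets and no cross-segment handling, might look like the following. This is my own untested illustration; it flattens rows (',' and '\n' are treated alike), and it converts via lists because binary_to_integer/1 and binary_to_float/1 aren't available on R14/R15:

  %% one binary:matches/2 pass over a chunk, then binary:part/3 per field
  fields(Chunk) ->
      Pattern = binary:compile_pattern([<<",">>, <<"\n">>]),
      Seps = binary:matches(Chunk, Pattern),
      fields(Chunk, 0, Seps, []).

  fields(Chunk, Start, [{Pos, Len} | Rest], Acc) ->
      Field = binary:part(Chunk, Start, Pos - Start),
      fields(Chunk, Pos + Len, Rest, [convert(Field) | Acc]);
  fields(Chunk, Start, [], Acc) ->
      Last = binary:part(Chunk, Start, byte_size(Chunk) - Start),
      lists:reverse([convert(Last) | Acc]).

  %% try integer, then float, otherwise keep the raw binary
  convert(Bin) ->
      S = binary_to_list(Bin),
      try list_to_integer(S)
      catch error:badarg ->
          try list_to_float(S) catch error:badarg -> Bin end
      end.
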
>>> 
>>> It is possible that applying some parallelism to this will help, but the timings we're after are so small that I'm not sure. Spawning per split is of course a waste of time, but it is possible that having a worker process per core doing the binary:matches and the parsing might help. The difficulty is that you have to deal with the data sitting across two segments before you can hand anything off to a worker, because you need the previous 'remainder', if any; I suppose you could pass this in along with the data. If this could be made to work efficiently, then you might also benefit from a parallel collector process taking the completed/parsed records from the workers and storing them so they're ready for retrieval. Again, I'm not convinced all of this could be done in pure Erlang in less than 3-4 seconds overall. Perhaps putting ets into the mix might provide some speedup, as I would guess that a shared table can be accessed atomically faster than processes can exchange data via a mailbox.
>>> 
>>> I'm going to see how fast this can be made to go in pure Erlang (allowing for ETS if it helps), just as a fun exercise. I don't know how much spare time I'll have for it though, and I'm convinced a solution written in C is the right answer for 1 million records per second; to achieve that, a segment-at-a-time approach that does almost all the work in C, including all the type conversion, is probably the way to go. In that case I would probably do the whole I/O part in C too, using a similar series of buffered read operations rather than line-oriented fgets calls or the like. 
>>> 
>>> I do not think that mmap or even reading the whole file into memory will be of any use; as you mentioned earlier, the problem is really CPU bound. 
>>> 
>> 
> 



