[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Michael Turner michael.eugene.turner@REDACTED
Mon Mar 26 05:00:14 CEST 2012


> The kind of lines in the generated example.csv BTW look like this:
>
> KEY1,20120201,12:15:44.543,34.28,54.74,16.74,88.51,32.48,15.7,54.19,71.52,69.5,55.35,3.9,20.08,33.83,63.43,12.4,9.66,0.29,59.61,52.94,82.49,78.96,70.52,55.73,79.37,61.25,54.19,49.31,14.14,40.18,21.39,82.26,40.79,36.57,86.14,39.58,28.3,20.1,24.07,51.35,8.38,zz

Which made me wonder: is Max actually trying to parse a CSV file of
*floating point* numbers, or just numbers that *happen* to be
expressed in floating point notation but that can be far more narrowly
characterized? If you have 1 million lines of positive 4-digit fixed
point numbers with only two digits past the decimal point (as the
sample above suggests), there are at most 10,000 distinct values, so
each value will repeat (on average) about 100 times in the file. You
could therefore save yourself at least a factor of 100 in the
text-to-float conversion part of the problem by caching the conversions.
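
To sketch the idea (the module and function names here are mine, purely
illustrative): keep an ETS table mapping each distinct field binary to
its float, so each of the ~10,000 distinct values is converted only once:

-module(float_cache).
-export([new/0, to_float/2]).

%% purely illustrative: cache text-to-float conversions so each
%% distinct field binary is converted at most once
new() ->
    ets:new(float_cache, [set, public]).

to_float(Tab, Bin) when is_binary(Bin) ->
    case ets:lookup(Tab, Bin) of
        [{_, F}] -> F;
        []       ->
            F = convert(Bin),
            true = ets:insert(Tab, {Bin, F}),
            F
    end.

%% binary_to_float/1 isn't available in R14/R15, so go via a list;
%% fall back to integer conversion for fields without a decimal point
convert(Bin) ->
    L = binary_to_list(Bin),
    case string:to_float(L) of
        {error, no_float} -> float(list_to_integer(L));
        {F, _Rest}        -> F
    end.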

This *general* problem of parsing CSV is important too, of course. But
the *general* solution could also (FTW) admit of such approaches in
its API.

-michael turner

On Sun, Mar 25, 2012 at 11:54 AM, Tim Watson <watson.timothy@REDACTED> wrote:
> On 25 Mar 2012, at 03:26, Tim Watson wrote:
>>>> The test was run on a Mac Mini server with R15B
>>>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0] [hipe] [kernel-poll:false]
>>>>
>>>> Since you have not shared the details of the csv file, I used the following setup:
>>>> * set.txt contains 300K rows, each row has 16 fields of 3 hex digits each
>>>> * set2.txt contains 300K rows, each row has 48 fields of 3 hex digits each
>>>>
>>>
>>> This is a bit different from the input Max is trying to work with, isn't it? Can you try it with his generated example.csv? I tried doing this and your output seems to be a small list of numbers, so I'm not sure what's going wrong.
>>>
>
> The kind of lines in the generated example.csv BTW look like this:
>
> KEY1,20120201,12:15:44.543,34.28,54.74,16.74,88.51,32.48,15.7,54.19,71.52,69.5,55.35,3.9,20.08,33.83,63.43,12.4,9.66,0.29,59.61,52.94,82.49,78.96,70.52,55.73,79.37,61.25,54.19,49.31,14.14,40.18,21.39,82.26,40.79,36.57,86.14,39.58,28.3,20.1,24.07,51.35,8.38,zz
>
>>> 13> csv_example:run("example.csv", 80).
>>> total parse time: 5897.321
>>> Result =
>>> [3711,3750,3750,3751,3751,3750,3750,3749,3750,3751,3751,3751,3750,3752,3751,
>>> 3750,3750,3751,3751,3751,3750,3750,3751,3751,3750,3751,3751,3750,3753,3751,
>>> 3749,3750,3751,3751,3750,3751,3750,3750,3751,3751,3750,3751,3751,3750,3750,
>>> 3750,3751,3751,3751,3751,3751,3751,3750,3751,3751,3751,3749,3751,3749,3751,
>>> 3751,3750,3751,3751,3750,3752,3750,3751,3750,3751,3750,3751,3749,3750,3750,
>>> 3751,3750,3749,3750,3750]
>>>
>>
>> On further inspection, I can see that this happens because the fun you pass to the parser (to handle the {line, _} event) is simply incrementing a count. My mistake!
>>
>>>
>>>> Here are the results:
>>>> csv_example:run("priv/set.txt", 80).
>>>> {{lines,300000},
>>>> {size,18000000},
>>>> {read_ms,132.269},
>>>> {parse_ms,652.865},
>>>> {line_us,2.1762166666666665}}
>>>>
>>>> 2> csv_example:run("priv/set2.txt", 80).
>>>> {{lines,300000},
>>>> {size,54000000},
>>>> {read_ms,204.259},
>>>> {parse_ms,1860.286},
>>>> {line_us,6.2009533333333335}}
>>>>
>>>
>>> I changed the example code a little bit to account for the fact that all our other examples time file:read_file and the parsing together. I get very different results, although I'm admittedly (probably?) on a less powerful machine:
>>>
>>> 4@REDACTED:csv $ erl -pa src -pa priv
>>> Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
>>>
>>> Eshell V5.8.2  (abort with ^G)
>>> 1> csv_example:run("example.csv", 80).
>>> {{lines,300001},
>>> {size,78599457},
>>> {parse_ms,5727.757},
>>> {line_us,19.09245969180103}}
>>> 2>
>
> Accumulating the results takes even longer than this, though. I have the line event handler simply accumulate the lines, take wall-clock measurements the way Max did (just for consistency), and try it with Wrk set to 80 and then to 200; performance starts to degrade somewhere between the two:
>
> test(Filename, Wrk) ->
>    T1 = erlang:now(),
>    {ok, Bin} = file:read_file(Filename),
>    R = csv:pparse(Bin, Wrk,
>                  fun({line, Line}, Acc) -> [Line|Acc] end, []),
>    T2 = erlang:now(),
>    io:format("total parse time: ~p~n", [timer:now_diff(T2, T1) div 1000]),
>    io:format("size(Results) = ~p~n", [length(R)]).
>
> And here's what I get:
>
> Eshell V5.8.2  (abort with ^G)
> 3> csv_example:test("example.csv", 80).
> total parse time: 15232
> size(Results) = 80
> ok
> 4>
> 5> csv_example:test("example.csv", 200).
> total parse time: 17086
> size(Results) = 200
> ok
> 6>
>
> So these appear to be taking between 15 and 20 seconds, and of course you've then got to work your way through the results in the data for each shard. This part has switched me on to the idea of having a stream-oriented API - I suspect that will work well, although it's worth bearing in mind that the ordering isn't going to be maintained once you start 'sharding' the data set rather than processing it sequentially.
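>
> Roughly what I have in mind for keeping the order (a sketch with made-up names - parse_shard/2 stands in for whatever does the per-shard work): tag each shard with its index before spawning, then keysort the tagged results.
>
> pparse_ordered(Shards, Fun) ->
>     Parent = self(),
>     Indexed = lists:zip(lists:seq(1, length(Shards)), Shards),
>     [spawn_link(fun() -> Parent ! {shard, Idx, parse_shard(Shard, Fun)} end)
>      || {Idx, Shard} <- Indexed],
>     Tagged = collect(length(Shards), []),
>     [Res || {_, Res} <- lists:keysort(1, Tagged)].
>
> collect(0, Acc) -> Acc;
> collect(N, Acc) ->
>     receive {shard, Idx, Res} -> collect(N - 1, [{Idx, Res} | Acc]) end.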
>
> So do we think the additional time is spent because of the comparative width of the lines in Max's inputs? I can't imagine that binary:split is doing anything vastly different, except that it uses binary:matches, which is implemented as a BIF - I am guessing that a BIF is going to beat hand-coded binary pattern matching every time, is it not? Please do also run this code on some other (faster) machines to see how much of the slowness is specific to my machine.
>
>>>
>>>> - Dmitry
>>>>
>>>>
>>>> On Mar 25, 2012, at 12:14 AM, Tim Watson wrote:
>>>>
>>>>> On 24 Mar 2012, at 16:45, Max Lapshin wrote:
>>>>>
>>>>>> I'm trying to use file:read from Erlang combined with a compiled scanf pattern or some regex. There are two ideas:
>>>>>> 1) combining two binary:split calls means walking the memory twice. I think it is a good idea to parse the memory only once
>>>>>
>>>>> Actually there are two problems here. In a CSV file a delimiter can appear 'inside' one of the fields, provided the contents of the cell are in quotes. For example:
>>>>>
>>>>>>> 1,2,16,John Snow,"Flat A, 128, Watling Street", etc....
>>>>>
>>>>> So just doing binary:split(Bin, binary:compile_pattern(<<",">>), [global]) isn't enough. I will take a look at how binary:split/3 is implemented, but I'm starting to think that you really do need to drop into C to make this fast enough. Maybe for your data set you don't care about this possibility (because you have some control over the inputs), but for a general-purpose CSV parser this has to be handled, and it requires too much introspection of the data to be efficient over 70MiB and more if written in pure Erlang.
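>>>>>
>>>>> Just to illustrate what that handling looks like (a naive sketch only, not something I'd use for performance): the splitter has to remember whether it is inside quotes, something like:
>>>>>
>>>>> %% naive illustration: split one line on commas, but not inside double quotes
>>>>> split_line(Bin) -> split_line(Bin, <<>>, []).
>>>>>
>>>>> split_line(<<$", Rest/binary>>, Field, Acc) ->
>>>>>     in_quotes(Rest, <<Field/binary, $">>, Acc);
>>>>> split_line(<<$,, Rest/binary>>, Field, Acc) ->
>>>>>     split_line(Rest, <<>>, [Field | Acc]);
>>>>> split_line(<<C, Rest/binary>>, Field, Acc) ->
>>>>>     split_line(Rest, <<Field/binary, C>>, Acc);
>>>>> split_line(<<>>, Field, Acc) ->
>>>>>     lists:reverse([Field | Acc]).
>>>>>
>>>>> in_quotes(<<$", Rest/binary>>, Field, Acc) ->
>>>>>     split_line(Rest, <<Field/binary, $">>, Acc);
>>>>> in_quotes(<<C, Rest/binary>>, Field, Acc) ->
>>>>>     in_quotes(Rest, <<Field/binary, C>>, Acc);
>>>>> in_quotes(<<>>, Field, Acc) ->
>>>>>     %% unterminated quote - a real parser would treat this as an error
>>>>>     lists:reverse([Field | Acc]).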
>>>>>
>>>>> With file:read_line and binary:split, I can get the data out (without float/int conversion) sequentially in about 10 seconds. It's not particularly useful to parallelise the binary:split work: each operation doesn't take long, but we perform a vast number of them, and that is what slows things down. Spawning a process per line isn't really very smart either, as even in SMP mode the amount of churn due to scheduling seems likely to be prohibitive. Even if you hit the sweet spot with the right number of worker processes (1 per CPU +/- 1, for example), you've got the additional work of reconstituting the data in the correct order, which takes up a lot of time.
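>>>>>
>>>>> For reference, a rough sketch of that sequential version (simplified; not the exact code I timed):
>>>>>
>>>>> %% sequential baseline: read_line + split on commas
>>>>> %% (the trailing newline is left on the last field in this sketch)
>>>>> read_lines(Filename) ->
>>>>>     {ok, Fd} = file:open(Filename, [raw, binary, {read_ahead, 1024 * 1024}]),
>>>>>     Comma = binary:compile_pattern(<<",">>),
>>>>>     read_lines(Fd, Comma, []).
>>>>>
>>>>> read_lines(Fd, Comma, Acc) ->
>>>>>     case file:read_line(Fd) of
>>>>>         {ok, Line} ->
>>>>>             Fields = binary:split(Line, Comma, [global]),
>>>>>             read_lines(Fd, Comma, [Fields | Acc]);
>>>>>         eof ->
>>>>>             file:close(Fd),
>>>>>             lists:reverse(Acc)
>>>>>     end.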
>>>>>
>>>>> Now if I manually chunk my way through the file, using an optimal segment size (on my Mac, 16 * 1024 seems to be best) with a sizeable read_ahead of 1024 * 1024, then I can get through the whole file, calling binary:split(Bin, CommaPattern, [global]) on each chunk, in an average of 2300ms. Of course this is still too slow, it ignores newlines at the moment (unlike the other tests), and I've removed the code for dealing with fields that sit across segment boundaries (which also slows things down):
>>>>>
>>>>> read(P = #csv_parser{ io_device = Fd, buffer_size = Size }, start, Idx, Rem) ->
>>>>>     read(P, file:read(Fd, Size), Idx, Rem);
>>>>> read(P = #csv_parser{ io_device = Fd, buffer_size = Size,
>>>>>                       delimiter = Delim, collector = _Collector,
>>>>>                       supported_newlines = _NL },
>>>>>      {ok, Chunks}, Idx, _Rem) ->
>>>>>     binary:split(Chunks, Delim, [global]),
>>>>>     read(P, file:read(Fd, Size), Idx + 1, <<>>);
>>>>> read(_P, eof, _Idx, _Rem) ->
>>>>>     %% in strict mode, fail unless size(Rem) == 0
>>>>>     ok.
>>>>>
>>>>> t4@REDACTED:csv_reader $ ./csv_reader.erl example.csv
>>>>> Starting read....
>>>>> read_chunk time: 2376
>>>>> t4@REDACTED:csv_reader $
>>>>>
>>>>> Now if I switch the split to use binary:matches instead, and look for either ',' or a newline, we get a big performance improvement:
>>>>>
>>>>> t4@REDACTED:csv_reader $ ./csv_reader.erl example.csv
>>>>> Starting read....
>>>>> read_chunk time: 953
>>>>> t4@REDACTED:csv_reader $
>>>>>
>>>>> That's getting much faster, *but* we're not doing any of the hard parsing work yet, and we're not dealing with the 'cross-segment' parts, which I'll look at next.
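>>>>>
>>>>> The change itself is small - something along these lines (a sketch; Chunk here is whatever file:read handed back):
>>>>>
>>>>> %% match commas and newlines in a single pass over the chunk
>>>>> Pattern = binary:compile_pattern([<<",">>, <<"\r\n">>, <<"\n">>]),
>>>>> Matches = binary:matches(Chunk, Pattern),
>>>>> %% Matches is a list of {Position, Length} pairs into Chunk, so no
>>>>> %% sub-binaries are built until we actually need the field data.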
>>>>>
>>>>>> 2) intermediate creation of binaries for all those floats (12 million of them in this example) is also evil. Perhaps line parsing should be combined with float extraction, in the same manner as decode_packet does.
>>>>>
>>>>>
>>>>> Like I said, I think that with the approach I mentioned above (using nicely segmented file:read calls and binary:matches) we've got the i/o subsystem and the splitting of fields going about as fast as they're likely to go without dropping into C. Now we get into the more CPU-intensive parts: taking the {Position, Length} data from binary:matches and, combined with the segment Idx, working out the correct offset to use when calling binary:part, plus reconstituting fields that sit across two segments, and finally converting the binaries to other data types for the final results. If I could make these parts happen in 2 seconds, for a combined throughput of 3 seconds, I'd be pleased, but I'm not sure it'll be possible. And even then, your desired time of 1 second isn't looking at all likely for even this 300k record file, let alone a larger 1 million line file.
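>>>>>
>>>>> To make that concrete, this is roughly the shape I have in mind for one segment (a sketch; offsets are within a single chunk, and the tail after the last delimiter is handed back as the remainder for the next segment):
>>>>>
>>>>> %% turn the {Position, Length} delimiter matches into field binaries
>>>>> fields(Chunk, Matches) ->
>>>>>     fields(Chunk, Matches, 0, []).
>>>>>
>>>>> fields(Chunk, [{Pos, Len} | Rest], Start, Acc) ->
>>>>>     Field = binary:part(Chunk, Start, Pos - Start),
>>>>>     fields(Chunk, Rest, Pos + Len, [Field | Acc]);
>>>>> fields(Chunk, [], Start, Acc) ->
>>>>>     %% whatever follows the last delimiter may continue in the next segment
>>>>>     Rem = binary:part(Chunk, Start, byte_size(Chunk) - Start),
>>>>>     {lists:reverse(Acc), Rem}.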
>>>>>
>>>>> It is possible that applying some parallelism to this will help, but the timings we're after are so small that I'm not sure. Spawning per split is of course a waste of time, but it is possible that having a worker process per core that does binary:matches and parses might help. The difficulty here is that you have to deal with the data sitting across two segments before you can pass it off to a worker, because you need the previous 'remainder' if any - I suppose you could 'pass this in' along with the data. If this could be made to work efficiently, then you might also benefit from a parallel collector process taking the completed/parsed records from the workers and storing them so they're ready for retrieval. Again, I'm not convinced all of this could be done in pure Erlang in less than 3 - 4 seconds overall. Perhaps putting ets into the mix might provide some speedup, as I would guess that a shared table can be atomically accessed faster than the processes can exchange data via a mailbox.
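>>>>>
>>>>> The shape of that would be roughly as follows (just a sketch - IndexedSegments and parse_segment/1 are placeholders, and I'm ignoring the per-core pooling and the remainder hand-off for brevity):
>>>>>
>>>>> %% workers write parsed records straight into a shared ETS table
>>>>> %% instead of sending them back through a mailbox
>>>>> Tab = ets:new(csv_results, [set, public]),
>>>>> Self = self(),
>>>>> Pids = [spawn_link(fun() ->
>>>>>             true = ets:insert(Tab, {Idx, parse_segment(Seg)}),
>>>>>             Self ! {done, self()}
>>>>>         end) || {Idx, Seg} <- IndexedSegments],
>>>>> [receive {done, P} -> ok end || P <- Pids],
>>>>> Records = [R || {_, Rs} <- lists:keysort(1, ets:tab2list(Tab)), R <- Rs].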
>>>>>
>>>>> I'm going to see how fast this can be made to go in pure Erlang (allowing for ETS if it helps), just as a fun exercise. I don't know how much spare time I'll have for it, though, and I'm convinced a solution written in C is the right answer for 1 million records per second; to get there, a segment-at-a-time approach that does almost all the work in C, including all the type conversion, is probably the way to go. In that case I would probably do the whole i/o part in C too, using a similar series of buffered read operations rather than line-oriented fgets calls or the like.
>>>>>
>>>>> I do not think that mmap or even reading the whole file into memory will be of any use; as you mentioned earlier, the problem is really CPU-bound.
>>>>>
>>>>
>>>
>>
>


