[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Tim Watson watson.timothy@REDACTED
Sat Mar 24 23:14:37 CET 2012


On 24 Mar 2012, at 16:45, Max Lapshin wrote:

> I'm trying to use file:read from erlang combined with a compiled scanf pattern or some regex. There are two ideas:
> 1) combining two binary:split calls means walking the memory twice. I think it is a good idea to parse the memory once

Actually there's a second problem here. In a CSV file a delimiter may appear 'inside' one of the fields, provided the contents of the cell are in quotes. For example:

>> 1,2,16,John Snow,"Flat A, 128, Watling Street", etc....

So just doing binary:split(Bin, binary:compile_pattern(<<",">>), [global]) isn't enough. I will take a look at how binary:split/3 is implemented, but I'm starting to think that you really do need to drop into C to make this fast enough. Maybe for your data set you don't care about this possibility (because you have some control over the inputs), but a general purpose CSV parser has to handle it, and that requires too much introspection of the data to be efficient over 70MiB and upwards if written in pure Erlang.
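
Just to illustrate the problem, a minimal quote-aware splitter (my own sketch, not the code being timed below) has to carry an 'in quotes' flag through the scan. Note it leaves the surrounding quotes in place and ignores escaped quotes, both of which a real parser would also have to deal with:

split_fields(Bin) ->
    split_fields(Bin, 0, 0, false, []).

%% Pos is the scan position, Start the offset of the current field
split_fields(Bin, Pos, Start, _InQ, Acc) when Pos >= byte_size(Bin) ->
    lists:reverse([binary:part(Bin, Start, Pos - Start) | Acc]);
split_fields(Bin, Pos, Start, InQ, Acc) ->
    case binary:at(Bin, Pos) of
        $" ->
            split_fields(Bin, Pos + 1, Start, not InQ, Acc);
        $, when not InQ ->
            Field = binary:part(Bin, Start, Pos - Start),
            split_fields(Bin, Pos + 1, Pos + 1, InQ, [Field | Acc]);
        _ ->
            split_fields(Bin, Pos + 1, Start, InQ, Acc)
    end.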

With file:read_line and binary:split, I can get the data out (without float/int conversion) sequentially in about 10 seconds. Parallelising the binary:split work isn't particularly useful: no single operation takes very long, but we perform a vast number of them, and that is what slows things down. Spawning a process per line isn't very smart either, as even in SMP mode the scheduling churn seems likely to be prohibitive. And even if you hit the sweet spot with the right number of worker processes (1 per cpu +/- 1, for example), you've got the additional work of reconstituting the data in the correct order, which takes up a lot of time.

Now if I manually chunk my way through the file, using an optimal segment size (on my mac, 16 * 1024 seems to be best) with a sizeable read_ahead of 1024 * 1024, then I can get through the whole file, running binary:split(Bin, CommaPattern, [global]) on each chunk, in an average of 2300ms. Of course this is still too slow, ignores newlines for the moment (unlike the other tests), and I've removed the code for dealing with fields that sit across segment boundaries (which also slows things down):

read(P = #csv_parser{ io_device = Fd, buffer_size = Size }, start, Idx, Rem) ->
    %% prime the loop with the first segment
    read(P, file:read(Fd, Size), Idx, Rem);
read(P = #csv_parser{ io_device = Fd, buffer_size = Size,
                      delimiter = Delim, collector = _Collector,
                      supported_newlines = _NL },
     {ok, Chunk}, Idx, _Rem) ->
    %% split the segment on the delimiter; the result is dropped since this
    %% stripped-down version only measures the split, and the cross-segment
    %% remainder handling has been removed
    _Fields = binary:split(Chunk, Delim, [global]),
    read(P, file:read(Fd, Size), Idx + 1, <<>>);
read(_P, eof, _Idx, _Rem) ->
    %% in strict mode, fail unless size(Rem) == 0
    ok.
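
For reference, the #csv_parser{} record being matched above would look something like this (reconstructed from the pattern matches, so the exact fields and defaults are illustrative):

-record(csv_parser, {
          io_device,                 %% handle from file:open/2
          buffer_size = 16 * 1024,   %% segment size per file:read/2
          delimiter,                 %% e.g. binary:compile_pattern(<<",">>)
          supported_newlines,        %% e.g. binary:compile_pattern([<<"\n">>, <<"\r\n">>])
          collector                  %% process that accumulates parsed records
        }).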
 
>>>>
t4@REDACTED:csv_reader $ ./csv_reader.erl example.csv 
Starting read....
read_chunk time: 2376
t4@REDACTED:csv_reader $ 

Now if I switch the split to use binary:matches instead, and look for either ',' or a newline, we get a big performance improvement:

t4@REDACTED:csv_reader $ ./csv_reader.erl example.csv 
Starting read....
read_chunk time: 953
t4@REDACTED:csv_reader $ 

That's much faster, *but* we're not doing any of the hard parsing work yet, and we're not dealing with the 'cross-segment' parts, which I'll look at next.
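
The matches-based variant is roughly this (a sketch with my own naming): one compiled pattern covers both the field delimiter and the newline variants, and binary:matches/2 returns {Offset, Length} pairs without copying any field data, which is where the win over binary:split comes from.

%% in real code, compile the pattern once per parser, not per chunk
scan_chunk(Chunk) ->
    Pattern = binary:compile_pattern([<<",">>, <<"\r\n">>, <<"\n">>]),
    binary:matches(Chunk, Pattern).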

> 2) intermediate creation of binaries for all those floats (12 million of them in this example) is also evil. Perhaps line parsing should be combined with float extraction in the same manner as decode_packet does.


Like I said, I think that with the approach mentioned above (nicely segmented file:read calls plus binary:matches) we've got the i/o subsystem and the splitting of fields going about as fast as they're likely to go without dropping into C. Now we get into the more cpu intensive parts: taking the {Location, Size} data from binary:matches, combining it with your segment Idx to work out the correct offset to pass to binary:part, reconstituting fields that sit across two segments, and finally converting binaries to other data types for the final results. If I could make these parts happen in 2 seconds, for a combined throughput of 3 seconds, I'd be pleased, but I'm not sure it'll be possible. And even then, your desired time of 1 second isn't looking at all likely even for this 300k record file, let alone a larger 1 million line file.
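
The shape of that step would be something like the following sketch (my own illustration; the segment-offset bookkeeping and the cross-segment stitching are elided). Each {Offset, Length} separator bounds the field that precedes it, so we slice with binary:part/3 and convert:

fields(Chunk, Matches) ->
    fields(Chunk, Matches, 0, []).

fields(Chunk, [], Start, Acc) ->
    Last = binary:part(Chunk, Start, byte_size(Chunk) - Start),
    lists:reverse([convert(Last) | Acc]);
fields(Chunk, [{Off, Len} | Rest], Start, Acc) ->
    Field = convert(binary:part(Chunk, Start, Off - Start)),
    fields(Chunk, Rest, Off + Len, [Field | Acc]).

%% naive conversion: floats have to go via a list (there is no
%% binary_to_float as of R15); anything else stays a binary
convert(Bin) ->
    case catch list_to_float(binary_to_list(Bin)) of
        {'EXIT', _} -> Bin;
        F -> F
    end.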

It is possible that applying some parallelism will help, but the timings we're after are so small that I'm not sure. Spawning per split is of course a waste of time, but having a worker process per core that does binary:matches and parses might help. The difficulty is that you have to deal with data sitting across two segments before you can hand a chunk off to a worker, because you need the previous 'remainder', if any - I suppose you could pass this in along with the data. If this could be made to work efficiently, then you might also benefit from a parallel collector process taking the completed/parsed records from the workers and storing them so they're ready for retrieval. Again, I'm not convinced all of this could be done in pure Erlang in less than 3 - 4 seconds overall. Perhaps putting ets into the mix might provide some speedup, as I would guess that a shared table can be accessed atomically faster than processes can exchange data via their mailboxes.
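
To make the shape of that concrete, a rough sketch (names and details are mine): the reader stitches the previous remainder onto each chunk before sending {chunk, Idx, Bin} to a worker, and results land in a shared ETS table keyed by chunk index, so ordering falls out of the ordered_set rather than being reconstituted by hand.

start() ->
    Tab = ets:new(csv_results, [public, ordered_set]),
    N = erlang:system_info(schedulers_online),
    Workers = [spawn_link(fun() -> worker(Tab) end) || _ <- lists:seq(1, N)],
    {Tab, Workers}.

worker(Tab) ->
    receive
        {chunk, Idx, Bin} ->
            ets:insert(Tab, {Idx, parse_chunk(Bin)}),
            worker(Tab);
        stop ->
            ok
    end.

%% placeholder for the real matches + binary:part work sketched above
parse_chunk(Bin) ->
    binary:split(Bin, <<",">>, [global]).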

I'm going to see how fast this can be made to go in pure Erlang (allowing for ETS if it helps), just as a fun exercise. I don't know how much spare time I'll have for it though, and I'm convinced a solution written in C is the right answer for 1 million records per second. To do that, a segment-at-a-time approach that does almost all the work in C, including all the type conversion, is probably the way to go. In that case I would probably do the whole i/o part in C too, using a similar series of buffered read operations rather than line oriented fgets calls or the like.

I do not think that mmap, or even reading the whole file into memory, will be of any use; as you mentioned earlier, the problem is really cpu bound.



