[erlang-questions] Running a regular expression on each line of a file

Hynek Vychodil <>
Tue Jan 11 13:14:45 CET 2011


If you are looking for the fastest line-oriented IO from STDIN, the
fastest approach known to me is in
http://shootout.alioth.debian.org/u32q/program.php?test=regexdna&lang=hipe&id=6

If you can read from a file, hand-crafted block IO using
file:open/2 in raw, binary mode should be better. Use the binary module
to split each block into lines, and don't forget to glue together lines
that span block boundaries. You can look at the Wide Finder project for
some inspiration.
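
A minimal sketch of the block IO idea above, under some assumptions: the
module name block_lines, the function fold_lines/4 and the helper
split_last/1 are made up for illustration, lines are assumed to end in
"\n" only (no "\r\n" handling), and a read error simply crashes the
caller. It is not taken from the shootout or Wide Finder code.

```erlang
-module(block_lines).
-export([fold_lines/4]).

%% Fold Fun(Line, Acc) over every line of the file at Path,
%% reading BlockSize bytes at a time. Lines are binaries without "\n".
fold_lines(Path, BlockSize, Fun, Acc0) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        loop(Fd, BlockSize, <<>>, Fun, Acc0)
    after
        file:close(Fd)
    end.

loop(Fd, BlockSize, Rest, Fun, Acc) ->
    case file:read(Fd, BlockSize) of
        {ok, Block} ->
            %% Glue the unfinished tail of the previous block onto this one,
            %% then split the result into lines.
            Data = <<Rest/binary, Block/binary>>,
            Parts = binary:split(Data, <<"\n">>, [global]),
            %% The last part has no "\n" after it yet, so carry it over.
            {Lines, NewRest} = split_last(Parts),
            loop(Fd, BlockSize, NewRest, Fun, lists:foldl(Fun, Acc, Lines));
        eof when Rest =:= <<>> ->
            Acc;
        eof ->
            %% Final line without a trailing newline.
            Fun(Rest, Acc)
    end.

%% Hypothetical helper: separate the last element from the ones before it.
split_last(Parts) ->
    [Last | RevLines] = lists:reverse(Parts),
    {lists:reverse(RevLines), Last}.
```

For example, block_lines:fold_lines("input.txt", 65536, fun(_Line, N) -> N + 1 end, 0)
would count the lines of a (hypothetical) file input.txt. The block size
is a guess and worth measuring on your own data.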

And of course, in both cases don't convert to lists; keep the data in binaries.
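
As a rough illustration of staying in binaries, re:run/3 can return its
captures as binaries instead of lists, and the pattern can be compiled
once with re:compile/1 rather than on every line. The module name
match_line and the three-field pattern below are assumptions for the
sake of the example; the real pattern is not shown in this thread.

```erlang
-module(match_line).
-export([compile/0, match/2]).

%% Compile the (hypothetical) pattern once and reuse it for every line.
compile() ->
    {ok, Re} = re:compile(<<"^(\\S+) (\\S+) (\\S+)$">>),
    Re.

%% Line is a binary; the captured groups come back as binaries too.
match(Line, Re) ->
    case re:run(Line, Re, [{capture, all_but_first, binary}]) of
        {match, Captured} -> {ok, Captured};
        nomatch -> nomatch
    end.
```

Calling match_line:match(<<"a b c">>, match_line:compile()) should give
{ok, [<<"a">>, <<"b">>, <<"c">>]}, with no list conversion along the way.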

On Tue, Jan 11, 2011 at 11:28 AM, Dave Challis <> wrote:
> Ah, thanks, io:get_line/1 makes more sense here.
>
> I solved the problem in the end by adding another clause to the case
> statement:
>
> ...
> Result = re:run(Text, Re, [{capture, all_but_first, list}]),
> case Result of
>    {match, Captured} ->
>        io:format("~p ~p ~p~n", Captured);
>    _False ->
>        false
> end,
> ...
>
> It's still pretty slow to run though (~10 minutes to parse ~1 million
> lines).  Interestingly enough, I tried removing the regex and just passing
> the input out unchanged, and the whole thing still takes ~9m30s to run, so
> the bottleneck wasn't in the regex as I'd assumed.
>
> I'm guessing that io:get_line is being pretty slow.  Is there a preferred
> method for faster I/O?
>
>
>
> On 10/01/11 17:42, Jesper Louis Andersen wrote:
>>
>> On Mon, Jan 10, 2011 at 17:02, Dave Challis<>  wrote:
>>
>>> parse(Re) ->
>>>    case io:get_chars('', 8192) of
>>
>> Since your data is line-oriented, try io:get_line/1 here. With your
>> approach I think you are not going to get one line of input at a time,
>> but rather 8K. So how simple are your small inputs?
>>
>
>
> --
> Dave Challis
> 
>



-- 
--Hynek (Pichi) Vychodil
