[erlang-questions] String handling & regexp

Mon Sep 15 05:51:33 CEST 2008

On 14 Sep 2008, at 7:04 pm, Olivier Pernet wrote:

Brian Marick has some lovely stickers; he was kind enough to send me a  
couple.
I have "to be less wrong than yesterday" on my door.
The one I want now is the one that reads

	An example would be useful about now.

> I'm reading a line that has a room name and a message, separated by
> one (or more) whitespace characters, and terminated by <crlf>.

I assume that this means
	<input> ::= <white space>* <room name> <white space>+
		    <message> <white space>* \r\n
where <message> may contain white space but not start or end
with it, and <room name> may not contain white space.

%  The following code assumes that white space characters are
%  precisely those with codes 0..32 inclusive.  This is right
%  for ASCII and Latin1 &c, but wrong for Unicode.   This was
%  done to concentrate on the main issues.

drop_while_white_space([C|Cs]) when C =< 32 ->
     drop_while_white_space(Cs);
drop_while_white_space(Cs) ->
     Cs.

take_until_white_space([C|Cs]) when C > 32 ->
     [C|take_until_white_space(Cs)];
take_until_white_space(_) ->
     [].

drop_until_white_space([C|Cs]) when C > 32 ->
     drop_until_white_space(Cs);
drop_until_white_space(Cs) ->
     Cs.

Now put them together.

room_and_message_from_line(Line) ->
     Line1 = drop_while_white_space(Line),
     Room = take_until_white_space(Line1),
     Line2 = drop_until_white_space(Line1),
     Message = lists:reverse(drop_while_white_space(
               lists:reverse(drop_while_white_space(
               Line2)))),
     {Room, Message}.
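
The same steps can also be sketched with the standard lists module
instead of hand-written recursion.  This is only an illustration: the
module name room_line and the helpers parse/1 and is_white/1 are made
up for the example, and it keeps the same simplifying assumption that
white space means codes 0..32.

```erlang
-module(room_line).
-export([parse/1]).

%% Assumption carried over from the code above: white space is any
%% character code in 0..32.  This is not correct for Unicode.
is_white(C) -> C =< 32.

%% Returns {Room, Message} for a line of the form
%%   <white space>* <room name> <white space>+ <message> <white space>* \r\n
parse(Line) ->
    Line1 = lists:dropwhile(fun is_white/1, Line),
    %% Room is the longest non-white prefix; Rest is whatever follows it.
    {Room, Rest} = lists:splitwith(fun(C) -> not is_white(C) end, Line1),
    %% Trim both ends of Rest; the trailing \r\n counts as white space here.
    Trimmed = lists:dropwhile(fun is_white/1, Rest),
    Message = lists:reverse(
                lists:dropwhile(fun is_white/1, lists:reverse(Trimmed))),
    {Room, Message}.
```

With this sketch, room_line:parse("foo  bar\rugh\nzoo\r\n") gives
{"foo","bar\rugh\nzoo"}: the embedded \r and \n stay in the message
and only the trailing \r\n is stripped.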

I think that for *fixed* patterns it is often
better to program at this level than to use regular
expressions, which really shine as run-time data.
For example, in the pattern
	"[ \t]*([^\s\t\r\n]+)[ \t]+([^\r\n]*)\r\n"

([^\r\n]*)\r\n does not mean what your *words* would
require it to mean.  According to your *words*,
if the input were "foo  bar\rugh\nzoo\r\n"
the embedded \r and \n should NOT end the message
(only the sequence \r\n was said to do that).
Your pattern will fail on that string: [^\r\n]* stops
at the lone \r after "bar", and \r\n cannot match there.
It was easier for me to write the Erlang code above
than to fix your regexp.
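
For comparison, here is one way the regexp might be repaired, as a
sketch only.  It assumes the PCRE-based re module from the standard
library rather than whatever regexp library the original call used.
The module name room_line_re and the {ok, Room, Message} | error
return shape are invented for the example.  Note that \s here is
PCRE's notion of white space, which is not the same set as codes
0..32.

```erlang
-module(room_line_re).
-export([parse/1]).

%% Returns {ok, Room, Message}, or error if the line does not have the
%% expected shape.
parse(Line) ->
    %% ^ ... \z  anchor the match to the whole line.
    %% (.*?)     is lazy and, with dotall, may cross a lone \r or \n,
    %%           so only the final \r\n terminates the message.
    Pattern = "^\\s*(\\S+)\\s+(.*?)\\s*\\r\\n\\z",
    case re:run(Line, Pattern, [dotall, {capture, [1, 2], list}]) of
        {match, [Room, Message]} -> {ok, Room, Message};
        nomatch -> error
    end.
```

With this sketch, room_line_re:parse("foo  bar\rugh\nzoo\r\n") gives
{ok,"foo","bar\rugh\nzoo"}.  Getting the anchoring, the laziness and
the dotall option all right takes some care, which is rather the point
made above about fixed patterns.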