[erlang-questions] parsing text

Fri Apr 30 07:07:37 CEST 2010

On Apr 30, 2010, at 6:39 AM, Wes James wrote:

> I have a function grabbing a page and I'm pulling text out of the
> result.  I can get the line:
>
> lists:nth(424,B).
> <<"<B>Page Counter</B></TD><TD>4880</TD></TR>">>
>
>
> but 4880 will eventually get to 10000, etc.

It's not clear exactly how much else about the data will
vary.  My take on this is that you want the stuff between
<TD> and </TD>.

I'm actually typing this on a machine that doesn't have
Erlang installed, so I'm going to do this from first principles.

%   begins_with(Prefix, String) :: (list(T), list(T)) -> list(T) | no

begins_with([X|Prefix], [X|String]) ->
     begins_with(Prefix, String);
begins_with([_|_], _) ->
     no;
begins_with([], String) ->
     String.

%   decat(B, ABC) :: (list(T), list(T)) -> {list(T),list(T)} | no
%   decat(B, ABC) -> {A,C} where ABC = A ++ B ++ C and B is the first
%   occurrence of the infix; anything else -> no.

decat(Infix, String) when Infix =/= [] ->
     decat_loop(Infix, String, []).

decat_loop(_, [], _) ->
     no;
decat_loop(Infix, String = [Head|Tail], Rev) ->
     case begins_with(Infix, String)
       of After when is_list(After) ->
              {lists:reverse(Rev),After}
        ; no ->
               decat_loop(Infix, Tail, [Head|Rev])
     end.

This is all we need now.

     Binary = <<"<B>Page Counter</B></TD><TD>4880</TD></TR>">>,
     String = binary_to_list(Binary),
     {_, After_TD} = decat("<TD>", String),
     {Wanted, _} = decat("</TD>", After_TD),
     list_to_integer(Wanted)

The next step for improving this would be to work directly on
the binary instead of turning it into a list.  It's not particularly
hard; a binary version of decat_loop (call it binary_decat_loop) is
closely analogous to C's strstr().
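
Here is a rough sketch of what that binary version might look like.
The module name bin_decat and the helper counter/1 are my own
inventions for illustration, not anything from a library, and this is
a naive scan rather than a tuned search.  It relies on the fact that
an already-bound variable inside a binary pattern is compared against
the data, so Infix acts as a literal at offset Pos.

```erlang
-module(bin_decat).
-export([decat/2, counter/1]).

%% decat(Infix, Bin) -> {Before, After} | no
%% Splits Bin around the first occurrence of the non-empty binary Infix.
decat(Infix, Bin)
  when is_binary(Infix), byte_size(Infix) > 0, is_binary(Bin) ->
    decat_loop(Infix, byte_size(Infix), Bin, 0).

%% Try each offset Pos in turn, much as strstr() tries each position.
decat_loop(Infix, Len, Bin, Pos) when Pos + Len =< byte_size(Bin) ->
    case Bin of
        <<Before:Pos/binary, Infix:Len/binary, After/binary>> ->
            {Before, After};
        _ ->
            decat_loop(Infix, Len, Bin, Pos + 1)
    end;
decat_loop(_, _, _, _) ->
    no.

%% Hypothetical helper: pull the integer out from between <TD> and </TD>.
counter(Bin) ->
    {_, AfterTD} = decat(<<"<TD>">>, Bin),
    {Wanted, _} = decat(<<"</TD>">>, AfterTD),
    list_to_integer(binary_to_list(Wanted)).
```

With the line from the original question,
bin_decat:counter(<<"<B>Page Counter</B></TD><TD>4880</TD></TR>">>)
should give 4880 without ever building the full list.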

Or of course you could use regular expressions.
If I were doing this in AWK I'd just do

     S = "<B>Page Counter</B></TD><TD>4880</TD></TR>"
     if (match(S, /<TD>[0-9]*<\/TD>/)) {   # match() sets RSTART, RLENGTH
         N = substr(S, RSTART+4, RLENGTH-9)+0
         ...
     }

Erlang has a couple of regular expression packages, including one
that works on strings (the re module accepts both strings and
binaries).
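
For what it's worth, here is how the regular expression route might
look with the re module.  The module name counter_re and the function
counter/1 are made up for this sketch; re:run/3 and its capture option
are the real interface.  I've used [0-9]+ so that an empty cell counts
as no match, which is an assumption about what you'd want.

```erlang
-module(counter_re).
-export([counter/1]).

%% counter(Line) -> integer() | no
%% Line may be a string or a binary; re:run/3 accepts either.
counter(Line) ->
    Opts = [{capture, all_but_first, list}],   % return only the group, as a string
    case re:run(Line, "<TD>([0-9]+)</TD>", Opts) of
        {match, [Digits]} ->
            list_to_integer(Digits);
        nomatch ->
            no
    end.
```

This returns no rather than crashing when the line doesn't have the
expected shape, in keeping with begins_with and decat above.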

Of course, the question is *how much* variation in the input
there can be.  XML tags may have any amount of white space after
the tag name (though not inside identifiers or strings), so if

"<B>Page Counter</B></TD><TD>4880</TD></TR>"

is legal input, then

"<B >Page Counter</B  ></TD   ><TD     >4880</TD      ></TR        
 >"

is also legal input with *exactly* the same meaning, and an XML-aware
application (at any rate a structure-controlled one) should not treat
them at all differently.  In that case you want an XML parser, and
then to match by tree position.  (There's a reason why XPath exists;
I'm not sure there's a reason why it's so horrible.)
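
As a sketch of the parse-then-match-by-position approach, something
along these lines could be done with xmerl, which ships with OTP.  The
module name counter_xml and function counter/1 are hypothetical, and I
am assuming you can get hold of the whole well-formed <TR>...</TR> row
rather than just the tail of it quoted above, because an XML parser
will not accept a fragment that starts in the middle of an element.

```erlang
-module(counter_xml).
-export([counter/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% counter(Row) -> integer()
%% Row is a string holding one complete, well-formed <TR> element.
counter(Row) when is_list(Row) ->
    {Doc, _Rest} = xmerl_scan:string(Row),
    %% Second <TD> child of the root <TR>, then its text node.
    [#xmlText{value = Digits}] =
        xmerl_xpath:string("/TR/TD[2]/text()", Doc),
    list_to_integer(lists:flatten(Digits)).
```

Because the match is on tree position, extra white space inside the
tags makes no difference to the answer.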