[erlang-questions] parsing text

Fri Apr 30 18:51:15 CEST 2010

On Fri, Apr 30, 2010 at 09:04:02AM -0600, Wes James wrote:
> On Thu, Apr 29, 2010 at 11:07 PM, Richard O'Keefe <ok@REDACTED> wrote:
> >
> > On Apr 30, 2010, at 6:39 AM, Wes James wrote:
> >
> >> I have a function grabbing a page and I'm pulling text out of the
> >> result.  I can get the line:
> >>
> >> lists:nth(424,B).
> >> <<"<B>Page Counter</B></TD><TD>4880</TD></TR>">>
> >>
> >>
> >> but 4880 will eventually get to 10000, etc.
> >
> > It's not clear exactly how much else about the data will
> > vary.  My take on this is that you want the stuff between
> > <TD> and </TD>.
> 
> <snip>
> 
> 
> Richard,
> 
> Thanks for your input on this.  I tested it and it worked.  I messed
> around with xmerl_scan:string, but
> "<B>Page Counter</B></TD><TD>4880</TD></TR>" doesn't seem to be
> well formed xml - I kept getting errors.
> 
> xmerl_scan:string("<foo>" ++
> 11>                       "<myelement myattribute=\"red\">x</myelement>" ++
> 11>                       "<myelement myattribute=\"blue\">x</myelement>" ++
> 11>                       "<myelement myattribute=\"blue\">y</myelement>" ++
> 11>                     "</foo>").
> 
> works, but
> 
> xmerl_scan:string("<B>Page&nbsp;Counter</B></TD><TD>4880</TD></TR>").
> 2711- fatal: {unknown_entity_ref,nbsp}
> 2621- fatal: error_scanning_entity_ref
> ** exception exit: {fatal,{error_scanning_entity_ref,{file,file_name_unknown},
>                                                      {line,1},
>                                                      {col,10}}}
>      in function  xmerl_scan:fatal/2
>      in call from xmerl_scan:scan_content/11
>      in call from xmerl_scan:scan_element/12
>      in call from xmerl_scan:scan_document/2
>      in call from xmerl_scan:string/2
> 
> not....

Your string contains the HTML entity &nbsp;, which is not a valid XML
entity (XML predefines only five of those; see
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references).

So if you tried

1> xmerl_scan:string("<B>Page&#160;Counter</B></TD><TD>4880</TD></TR>").
{{xmlElement,'B','B',[],
             {xmlNamespace,[],[]},
             [],1,[],
             [{xmlText,[{'B',1}],1,[],"Page",text},
              {xmlText,[{'B',1}],2,[]," Counter",text}],
             [],"/tmp",undeclared},
 "</TD><TD>4880</TD></TR>"}

You can see it does better, but it is still not what you want: xmerl parses
only the <B>...</B> element, then hits an end tag (</TD>) that has no matching
start tag and stops, returning the rest of the string unparsed.

Your best bet might be to parse the entire file rather than just part of it.
You would still need a way to escape HTML entities so that an XML parser can
handle them.
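
As a rough sketch of that idea (not something I have run against your page),
you could rewrite &nbsp; as the numeric character reference &#160; before
handing the text to xmerl, and then pull the <TD> text out with xmerl_xpath.
The module name entity_fix and both function names are made up for
illustration. It assumes OTP 20 or later for string:replace/4, handles only
&nbsp; (other HTML entities would need their own mappings), and assumes the
input is a complete, balanced fragment such as a whole <TR>...</TR> row.

```erlang
-module(entity_fix).
-export([nbsp_to_numeric/1, td_texts/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Replace every "&nbsp;" with the numeric reference "&#160;", which an
%% XML parser accepts without needing a DTD. Hypothetical helper: only
%% this one HTML entity is handled.
nbsp_to_numeric(Html) when is_list(Html) ->
    lists:flatten(string:replace(Html, "&nbsp;", "&#160;", all)).

%% Parse a balanced fragment and return the text that sits directly
%% inside each <TD> element. Text nested deeper (for example inside <B>)
%% is not returned by this XPath expression.
td_texts(Html) when is_list(Html) ->
    {Doc, _Rest} = xmerl_scan:string(nbsp_to_numeric(Html)),
    [T#xmlText.value || T <- xmerl_xpath:string("//TD/text()", Doc)].
```

You would call it on the whole row, for instance
entity_fix:td_texts("<TR><TD><B>Page&nbsp;Counter</B></TD><TD>4880</TD></TR>"),
rather than on the tail of a line. Real-world HTML is often not well-formed
XML in other ways too (unclosed tags, unquoted attributes), so this may still
fail on the full page.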

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <anthonym@REDACTED>