[erlang-questions] parsing text
Anthony Molinaro
anthonym@REDACTED
Fri Apr 30 18:51:15 CEST 2010
On Fri, Apr 30, 2010 at 09:04:02AM -0600, Wes James wrote:
> On Thu, Apr 29, 2010 at 11:07 PM, Richard O'Keefe <ok@REDACTED> wrote:
> >
> > On Apr 30, 2010, at 6:39 AM, Wes James wrote:
> >
> >> I have a function grabbing a page and I'm pulling text out of the
> >> result. I can get the line:
> >>
> >> lists:nth(424,B).
> >> <<"<B>Page Counter</B></TD><TD>4880</TD></TR>">>
> >>
> >>
> >> but 4880 will eventually get to 10000, etc.
> >
> > It's not clear exactly how much else about the data will
> > vary. My take on this is that you want the stuff between
> > <TD> and </TD>.
>
> <snip>
>
>
> Richard,
>
> Thanks for your input on this. I tested it and it worked. I messed
> around with xmerl_scan:string, but
> "<B>Page Counter</B></TD><TD>4880</TD></TR>" doesn't seem to be
> well formed xml - I kept getting errors.
>
> xmerl_scan:string("<foo>" ++
> 11> "<myelement myattribute=\"red\">x</myelement>" ++
> 11> "<myelement myattribute=\"blue\">x</myelement>" ++
> 11> "<myelement myattribute=\"blue\">y</myelement>" ++
> 11> "</foo>").
>
> works, but
>
> xmerl_scan:string("<B>Page Counter</B></TD><TD>4880</TD></TR>").
> 2711- fatal: {unknown_entity_ref,nbsp}
> 2621- fatal: error_scanning_entity_ref
> ** exception exit: {fatal,{error_scanning_entity_ref,{file,file_name_unknown},
> {line,1},
> {col,10}}}
> in function xmerl_scan:fatal/2
> in call from xmerl_scan:scan_content/11
> in call from xmerl_scan:scan_element/12
> in call from xmerl_scan:scan_document/2
> in call from xmerl_scan:string/2
>
> not....
Your string contains an HTML entity (nbsp, going by the unknown_entity_ref
error) that is not a valid XML entity. XML predefines only five of those
(amp, lt, gt, apos and quot); see
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
So if you tried
1> xmerl_scan:string("<B>Page Counter</B></TD><TD>4880</TD></TR>").
{{xmlElement,'B','B',[],
{xmlNamespace,[],[]},
[],1,[],
[{xmlText,[{'B',1}],1,[],"Page",text},
{xmlText,[{'B',1}],2,[]," Counter",text}],
[],"/tmp",undeclared},
"</TD><TD>4880</TD></TR>"}
You can see it does better, but it is still not what you want, because it can
only parse part of the structure: only <B>...</B> is parsed, then the parser
hits an end tag (</TD>) with no matching start tag and stops, handing back the
rest of the input unparsed.
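To illustrate the point about balanced tags: when the row still has its
opening <TR> and <TD> tags, xmerl parses all of it and the counter can be
picked out by pattern matching on the records. This is only a minimal sketch,
assuming a complete row with no HTML-only entities in it; the module and
function names are made up for illustration.

```erlang
-module(td_text).
-export([cell_texts/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: parse one complete, balanced <TR> row and
%% return the text of every <TD> whose only child is a text node.
%% Cells with nested markup (such as <TD><B>...</B></TD>) are skipped
%% because they do not match the generator pattern.
cell_texts(Row) ->
    {#xmlElement{name = 'TR', content = Cells}, _Rest} =
        xmerl_scan:string(Row),
    [Text || #xmlElement{name = 'TD',
                         content = [#xmlText{value = Text}]} <- Cells].
```

With that, td_text:cell_texts("<TR><TD><B>Page Counter</B></TD><TD>4880</TD></TR>")
should give ["4880"].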
Your best bet might be to parse the entire file rather than just part of it.
But you'd still need a way to escape the HTML entities so that an XML parser
can handle them.
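One way to do that escaping is to rewrite the named HTML entities as numeric
character references, which an XML parser accepts without a DTD. A rough
sketch, assuming the page is held as a binary (as in the original question);
the module name and the two-entry table are hypothetical and would need to be
extended with whatever entities the page really uses.

```erlang
-module(html_entities).
-export([to_xml/1]).

%% Hypothetical, partial table of HTML-only named entities and the
%% numeric character references that stand in for them.
entities() ->
    [{<<"&nbsp;">>, <<"&#160;">>},
     {<<"&copy;">>, <<"&#169;">>}].

%% Replace every occurrence of each named entity in the binary.
%% binary:replace/4 treats the replacement literally, so the "&" in
%% the replacement needs no escaping.
to_xml(Html) when is_binary(Html) ->
    lists:foldl(fun({Name, Ref}, Acc) ->
                        binary:replace(Acc, Name, Ref, [global])
                end, Html, entities()).
```

The result is still a binary, so it would need binary_to_list/1 before being
handed to xmerl_scan:string/1.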
-Anthony
--
------------------------------------------------------------------------
Anthony Molinaro <anthonym@REDACTED>