[erlang-questions] using xmerl on web docs
Kevin Scaldeferri
kevin@REDACTED
Sun Feb 10 06:49:45 CET 2008
Hi,
I wanted to do something fairly basic (to me, at least) but ran into
problems. Specifically, I would like to fetch some (X)HTML docs from
the web and parse and validate them. I started out optimistic. After
fetching a doc with http:request, I tried to validate it:
6> {XML, Rest} = xmerl_scan:string(Body, [{validation, dtd}]).
3290- fatal: {error,{error_missing_element_declaration_in_DTD,html}}
** exception exit: {fatal,
                    {{error,
                      {error_missing_element_declaration_in_DTD,html}},
                     {file,file_name_unknown},
                     {line,4},
                     {col,1}}}
     in function  xmerl_scan:fatal/2
     in call from xmerl_scan:scan_document/2
     in call from xmerl_scan:string/2
Hmm... well, I know this really is a valid XHTML doc, and I know the
DTD does declare the element "html". As best I can tell, the problem
is that xmerl doesn't actually fetch DTDs from the web. This seems to
be suggested by this bit in the source:
%%% Always assume an external resource can be found locally! Thus
%%% don't bother fetching with e.g. HTTP. Returns the path where the
%%% resource is found. The path to the external resource is given by
%%% URI directly or the option fetch_path (additional paths) or
%%% directory (base path to external resource)
fetch_URI(URI, S) -> ...
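For what it's worth, the fetch_path option named in that comment can be pointed at a directory that holds a local copy of the DTD. The following is only a minimal sketch under stated assumptions, not a recipe verified against real XHTML: the module name local_dtd_demo, the file note.dtd and its one-element DTD are all made up for illustration. The system identifier is deliberately a relative file name, since how an absolute http:// identifier interacts with fetch_path may differ between xmerl versions.

```erlang
-module(local_dtd_demo).
-export([run/1]).

%% Hypothetical demo: write a tiny DTD into Dir, then validate a document
%% against it, letting xmerl find the DTD through the fetch_path option.
run(Dir) ->
    %% ensure_dir/1 creates the parent directories of the given file name,
    %% so joining a dummy name onto Dir makes sure Dir itself exists.
    ok = filelib:ensure_dir(filename:join(Dir, "dummy")),
    %% Made-up DTD: one element and the "copy" entity.
    Dtd = "<!ELEMENT note (#PCDATA)>\n"
          "<!ENTITY copy \"&#169;\">\n",
    ok = file:write_file(filename:join(Dir, "note.dtd"), Dtd),
    %% The system identifier is a bare file name; xmerl looks it up in
    %% the directories listed in fetch_path.
    Doc = "<?xml version=\"1.0\"?>\n"
          "<!DOCTYPE note SYSTEM \"note.dtd\">\n"
          "<note>&copy; 2008</note>",
    {Root, _Rest} = xmerl_scan:string(Doc, [{validation, dtd},
                                            {fetch_path, [Dir]}]),
    Root.
```

Calling local_dtd_demo:run("/tmp/local_dtd_demo") should return the parsed root element. The same idea would apply to a locally saved copy of the XHTML DTD and its entity files, provided the file names match what the DOCTYPE refers to.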
So, I decided to give up on validating for the moment and just parse
the doc. Unfortunately, I didn't get any further:
8> {XML, Rest} = xmerl_scan:string(Body).
2692- fatal: {unknown_entity_ref,copy}
2602- fatal: error_scanning_entity_ref
** exception exit: {fatal,
                    {error_scanning_entity_ref,
                     {file,file_name_unknown},
                     {line,33},
                     {col,16}}}
     in function  xmerl_scan:fatal/2
     in call from xmerl_scan:scan_element/12
     in call from xmerl_scan:scan_content/11
     in call from xmerl_scan:scan_element/12
     in call from xmerl_scan:scan_content/11
     in call from xmerl_scan:scan_element/12
     in call from xmerl_scan:scan_content/11
     in call from xmerl_scan:scan_element/12
In this case, an &copy; entity, perfectly valid according to the DTD,
is rejected because we haven't parsed the DTD.
It seems like I'm stuck in a chicken-and-egg problem. If I could
parse without validating, I could extract the DTD location, fetch it
and make a local copy that could be used for validation. But it
seems that I can't parse the document unless xmerl already knows the
contents of the DTD (specifically, the entity declarations). Thus,
I'm stuck.
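One possible way around the chicken-and-egg part is to pull the DTD location out of the raw text before handing anything to xmerl. The sketch below does that with the standard re module. The module name doctype_probe is hypothetical, and the pattern is a simplification: it only recognises a DOCTYPE with a PUBLIC identifier and double-quoted literals, which is the form the XHTML doctypes use.

```erlang
-module(doctype_probe).
-export([ids/1]).

%% Extract the public and system identifiers from a DOCTYPE declaration
%% without parsing the document. Returns {ok, PublicId, SystemId}, or
%% not_found when no matching declaration is present.
ids(Body) ->
    %% <!DOCTYPE name PUBLIC "public-id" "system-id"
    Pattern = "<!DOCTYPE\\s+[^\\s>]+\\s+PUBLIC\\s+"
              "\"([^\"]*)\"\\s+\"([^\"]*)\"",
    case re:run(Body, Pattern, [{capture, all_but_first, list}]) of
        {match, [PublicId, SystemId]} -> {ok, PublicId, SystemId};
        nomatch -> not_found
    end.
```

The system identifier returned here could then be fetched separately and saved to a local directory for the parser to use. A SYSTEM-only or single-quoted DOCTYPE would need a broader pattern.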
Is there something that I'm doing stupidly wrong here? Surely someone
else has tried to parse documents off the web.
Thanks,
-kevin