[erlang-questions] using xmerl on web docs
Ulf Wiger
ulf@REDACTED
Sun Feb 10 19:03:56 CET 2008
The fetch_URI function in xmerl_scan can be replaced by
a user-defined function, {fetch_fun, F}.
This is described in http://www.erlang.org/doc/apps/xmerl/xmerl_examples.html
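
A minimal sketch of such a fetch function, for illustration only. The
module name web_dtd and the helpers fetch_fun/1, system_uri/1 and
http_get/1 are hypothetical. It assumes the fun is called with
{system, URI} or {public, PublicId, URI}, and that it may return
{ok, {string, Data}, State} or {ok, not_fetched, State}, which is how I
read the xmerl_scan source; check this against your OTP version. It uses
httpc, the current name of the inets HTTP client.

```erlang
-module(web_dtd).
-export([scan/1, fetch_fun/1, http_get/1]).

%% Parse Body (a string), letting xmerl fetch external DTDs over HTTP.
scan(Body) ->
    xmerl_scan:string(Body, [{fetch_fun, fetch_fun(fun http_get/1)}]).

%% Build a fetch fun around Get, a fun(URI) -> {ok, String} | error.
%% The return shapes are an assumption taken from the xmerl_scan source.
fetch_fun(Get) ->
    fun(ExtId, State) ->
            case Get(system_uri(ExtId)) of
                {ok, Data} -> {ok, {string, Data}, State};
                error      -> {ok, not_fetched, State}
            end
    end.

%% Extract the system identifier from the external ID xmerl passes in.
system_uri({system, URI}) -> URI;
system_uri({public, _PublicId, URI}) -> URI.

%% Fetch URI with the inets HTTP client; only a 200 reply counts.
http_get(URI) ->
    {ok, _} = application:ensure_all_started(inets),
    case httpc:request(URI) of
        {ok, {{_Version, 200, _Reason}, _Headers, Body}} -> {ok, Body};
        _ -> error
    end.
```

Passing the getter in as an argument keeps the fetch fun testable without
network access. Note that the XHTML DTDs pull in further entity files via
relative system identifiers; this sketch does not resolve those against
the DTD's URL, and https URLs would also need the ssl application started.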
I've sometimes been a bit annoyed that even though the default mode is
{validation, off}, xmerl still fails when it cannot find the DTD.
BR,
Ulf W
2008/2/10, Kevin Scaldeferri <kevin@REDACTED>:
> Hi,
>
> I wanted to do something fairly basic (to me, at least) but ran into
> problems. Specifically, I would like to fetch some (X)HTML docs from
> the web and parse and validate them. So, I started optimistic. After
> fetching a doc with http:request, I tried to validate it:
>
> 6> {XML, Rest} = xmerl_scan:string(Body, [{validation, dtd}]).
> 3290- fatal: {error,{error_missing_element_declaration_in_DTD,html}}
> ** exception exit: {fatal,
> {{error,
> {error_missing_element_declaration_in_DTD,
> html}},
> {file,
> file_name_unknown},
> {line,
> 4},
> {col,
> 1}}}
> in function xmerl_scan:fatal/2
> in call from xmerl_scan:scan_document/2
> in call from xmerl_scan:string/2
>
>
> hmm... well I know this is really a valid XHTML doc, and I know the
> dtd does declare the element "html". As best I can tell, the problem
> is that xmerl doesn't actually fetch dtds from the web. This seems to
> be suggested by this bit in the source:
>
> %%% Always assume an external resource can be found locally! Thus
> %%% don't bother fetching with e.g. HTTP. Returns the path where the
> %%% resource is found. The path to the external resource is given by
> %%% URI directly or the option fetch_path (additional paths) or
> %%% directory (base path to external resource)
> fetch_URI(URI, S) -> ...
>
>
> So, I decided to give up on validating for the moment, and just parse
> the doc for the time being. Unfortunately, I didn't get any further;
>
> 8> {XML, Rest} = xmerl_scan:string(Body).
> 2692- fatal: {unknown_entity_ref,copy}
> 2602- fatal: error_scanning_entity_ref
> ** exception exit: {fatal,
> {error_scanning_entity_ref,
> {file,
> file_name_unknown},
> {line,
> 33},
> {col,
> 16}}}
> in function xmerl_scan:fatal/2
> in call from xmerl_scan:scan_element/12
> in call from xmerl_scan:scan_content/11
> in call from xmerl_scan:scan_element/12
> in call from xmerl_scan:scan_content/11
> in call from xmerl_scan:scan_element/12
> in call from xmerl_scan:scan_content/11
> in call from xmerl_scan:scan_element/12
>
>
> In this case, an &copy; entity, perfectly valid according to the DTD,
> is rejected because we haven't parsed the DTD.
>
> It seems like I'm stuck in a chicken-and-egg problem. If I could
> parse without validating, I could extract the DTD location, fetch it
> and make a local copy that could be used for validation. But, it
> seems that I can't parse the document unless we already know the
> contents of the DTD (specifically, the entity declarations). Thus,
> I'm stuck.
>
> Is there something that I'm doing stupidly wrong here? Surely someone
> else has tried to parse documents off the web.
>
>
> Thanks,
>
> -kevin