[erlang-questions] using xmerl on web docs

Sun Feb 10 19:03:56 CET 2008

The fetch_URI function in xmerl_scan can be replaced by
a user-defined function, passed as the scanner option {fetch_fun, F}.

This is described in http://www.erlang.org/doc/apps/xmerl/xmerl_examples.html
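
A minimal sketch of such a function, assuming the DTD text has already
been fetched by other means (e.g. the same HTTP client used for the
document) and is kept in a list of {URI, DtdText} pairs. The module name,
function names and the lookup table are made up for illustration. The
{system, URI} / {public, Id, URI} argument shapes and the return tuples
are as I read them in the xmerl_scan source, so check them against your
version:

```erlang
-module(xmerl_fetch_sketch).
-export([scan/2, fetch_from_table/3]).

%% Scan Body (a string), resolving external resources from Dtds.
%% Dtds is a hypothetical lookup table: [{URI :: string(), DtdText :: string()}].
%% Returns whatever xmerl_scan:string/2 returns, i.e. {Document, Rest}.
scan(Body, Dtds) ->
    Fetch = fun(Spec, State) -> fetch_from_table(Spec, State, Dtds) end,
    xmerl_scan:string(Body, [{fetch_fun, Fetch}]).

%% Assumption: the scanner calls the fetch fun with {system, URI} or
%% {public, PublicId, URI} and its own state as the second argument.
fetch_from_table({system, URI}, State, Dtds) ->
    lookup(URI, State, Dtds);
fetch_from_table({public, _PublicId, URI}, State, Dtds) ->
    lookup(URI, State, Dtds).

%% Hand the DTD text back as a string if we have it; otherwise tell the
%% scanner that nothing was fetched (assumed return value, see above).
lookup(URI, State, Dtds) ->
    case lists:keyfind(URI, 1, Dtds) of
        {URI, Text} -> {ok, {string, Text}, State};
        false       -> {ok, not_fetched, State}
    end.
```

With something like this, xmerl_fetch_sketch:scan(Body, [{DtdUri, DtdText}])
takes the place of the plain xmerl_scan:string(Body) call. The XHTML DTDs
pull in further entity files, so the table would probably need entries for
those as well.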

I've sometimes been a bit annoyed that even though the default mode is
{validation, off}, xmerl still fails when it cannot find the DTD.

BR,
Ulf W

2008/2/10, Kevin Scaldeferri <kevin@REDACTED>:
> Hi,
>
> I wanted to do something fairly basic (to me, at least) but ran into
> problems.  Specifically, I would like to fetch some (X)HTML docs from
> the web and parse and validate them.  So, I started optimistic.  After
> fetching a doc with http:request, I tried to validate it:
>
> 6> {XML, Rest} = xmerl_scan:string(Body, [{validation, dtd}]).
> 3290- fatal: {error,{error_missing_element_declaration_in_DTD,html}}
> ** exception exit: {fatal,
>                      {{error,
>                        {error_missing_element_declaration_in_DTD,
>                         html}},
>                       {file,
>                        file_name_unknown},
>                       {line,
>                        4},
>                       {col,
>                        1}}}
>       in function  xmerl_scan:fatal/2
>       in call from xmerl_scan:scan_document/2
>       in call from xmerl_scan:string/2
>
>
> hmm... well I know this is really a valid XHTML doc, and I know the
> dtd does declare the element "html".  As best I can tell, the problem
> is that xmerl doesn't actually fetch dtds from the web.  This seems to
> be suggested by this bit in the source:
>
> %%% Always assume an external resource can be found locally! Thus
> %%% don't bother fetching with e.g. HTTP. Returns the path where the
> %%% resource is found.  The path to the external resource is given by
> %%% URI directly or the option fetch_path (additional paths) or
> %%% directory (base path to external resource)
> fetch_URI(URI, S) -> ...
>
>
> So, I decided to give up on validating for the moment, and just parse
> the doc for the time being.  Unfortunately, I didn't get any further:
>
> 8> {XML, Rest} = xmerl_scan:string(Body).
> 2692- fatal: {unknown_entity_ref,copy}
> 2602- fatal: error_scanning_entity_ref
> ** exception exit: {fatal,
>                      {error_scanning_entity_ref,
>                       {file,
>                        file_name_unknown},
>                       {line,
>                        33},
>                       {col,
>                        16}}}
>       in function  xmerl_scan:fatal/2
>       in call from xmerl_scan:scan_element/12
>       in call from xmerl_scan:scan_content/11
>       in call from xmerl_scan:scan_element/12
>       in call from xmerl_scan:scan_content/11
>       in call from xmerl_scan:scan_element/12
>       in call from xmerl_scan:scan_content/11
>       in call from xmerl_scan:scan_element/12
>
>
> In this case, an &copy; entity, perfectly valid according to the DTD,
> is rejected because we haven't parsed the DTD.
>
> It seems like I'm stuck in a chicken-and-egg problem.  If I could
> parse without validating, I could extract the DTD location, fetch it
> and make a local copy that could be used for validation.  But, it
> seems that I can't parse the document unless we already know the
> contents of the DTD (specifically, the entity declarations).  Thus,
> I'm stuck.
>
> Is there something that I'm doing stupidly wrong here?  Surely someone
> else has tried to parse documents off the web.
>
>
> Thanks,
>
> -kevin