[erlang-questions] using xmerl on web docs

Sun Feb 10 06:49:45 CET 2008

Hi,

I wanted to do something fairly basic (to me, at least) but ran into
problems.  Specifically, I would like to fetch some (X)HTML docs from
the web, then parse and validate them.  So I started out optimistic.
After fetching a doc with http:request, I tried to validate it:

6> {XML, Rest} = xmerl_scan:string(Body, [{validation, dtd}]).
3290- fatal: {error,{error_missing_element_declaration_in_DTD,html}}
** exception exit: {fatal,
                     {{error,
                       {error_missing_element_declaration_in_DTD,
                        html}},
                      {file,
                       file_name_unknown},
                      {line,
                       4},
                      {col,
                       1}}}
      in function  xmerl_scan:fatal/2
      in call from xmerl_scan:scan_document/2
      in call from xmerl_scan:string/2

Hmm... well, I know this really is a valid XHTML doc, and I know the
DTD does declare the element "html".  As best I can tell, the problem
is that xmerl doesn't actually fetch DTDs from the web.  This seems to
be suggested by this comment in the xmerl source:

%%% Always assume an external resource can be found locally! Thus
%%% don't bother fetching with e.g. HTTP. Returns the path where the
%%% resource is found.  The path to the external resource is given by
%%% URI directly or the option fetch_path (additional paths) or
%%% directory (base path to external resource)
fetch_URI(URI, S) -> ...
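
To make the fetch_path option mentioned in that comment concrete, here
is a minimal sketch of validating against a DTD that already sits in a
local directory.  The module name, the run/1 helper, and the tiny
"note" DTD are all made up for illustration.  It only covers a
relative system identifier; how an http:// system identifier like the
XHTML one is handled may differ between xmerl versions.

```erlang
-module(local_dtd_sketch).
-export([run/1]).

%% Hypothetical helper, not part of xmerl: writes a tiny DTD into Dir and
%% validates a document whose DOCTYPE names that DTD by a relative
%% system identifier.
run(Dir) ->
    DtdFile = filename:join(Dir, "note.dtd"),
    ok = filelib:ensure_dir(DtdFile),
    Dtd = "<!ELEMENT note (to, body)>\n"
          "<!ELEMENT to (#PCDATA)>\n"
          "<!ELEMENT body (#PCDATA)>\n",
    ok = file:write_file(DtdFile, Dtd),
    Doc = "<?xml version=\"1.0\"?>\n"
          "<!DOCTYPE note SYSTEM \"note.dtd\">\n"
          "<note><to>kevin</to><body>hello</body></note>",
    %% fetch_path lists additional directories that are searched for
    %% external resources such as the DTD named in the DOCTYPE.
    xmerl_scan:string(Doc, [{validation, dtd}, {fetch_path, [Dir]}]).
```

Calling local_dtd_sketch:run("/tmp/local_dtd_sketch") should return
the same {XML, Rest} tuple shape as the calls in this message, provided
the directory is writable.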

So, I decided to give up on validating for the moment, and just parse
the doc for the time being.  Unfortunately, I didn't get any further:

8> {XML, Rest} = xmerl_scan:string(Body).
2692- fatal: {unknown_entity_ref,copy}
2602- fatal: error_scanning_entity_ref
** exception exit: {fatal,
                     {error_scanning_entity_ref,
                      {file,
                       file_name_unknown},
                      {line,
                       33},
                      {col,
                       16}}}
      in function  xmerl_scan:fatal/2
      in call from xmerl_scan:scan_element/12
      in call from xmerl_scan:scan_content/11
      in call from xmerl_scan:scan_element/12
      in call from xmerl_scan:scan_content/11
      in call from xmerl_scan:scan_element/12
      in call from xmerl_scan:scan_content/11
      in call from xmerl_scan:scan_element/12

It seems like I'm stuck in a chicken-and-egg problem.  If I could
parse without validating, I could extract the DTD location, fetch it,
and make a local copy that could be used for validation.  But it
seems that xmerl can't parse the document unless it already knows the
contents of the DTD (specifically, the entity declarations, such as
the one for &copy; that triggers the error above).  Thus, I'm stuck.
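
For the "extract the DTD location" step, a rough sketch of one
possible approach is to read the system identifier out of the DOCTYPE
with a regular expression, so that no XML parse is needed first.  This
assumes the re module from newer OTP releases; the module and function
names are hypothetical, and the pattern is deliberately loose (it does
not handle an identifier that itself contains a quote character).

```erlang
-module(doctype_sketch).
-export([system_id/1]).

%% Hypothetical helper: returns {ok, SystemId} when Body contains a
%% DOCTYPE with a PUBLIC or SYSTEM external identifier, otherwise none.
system_id(Body) ->
    Pattern = "<!DOCTYPE\\s+[^\\s>]+\\s+"
              "(?:PUBLIC\\s+[\"'][^\"']*[\"']|SYSTEM)\\s+"
              "[\"']([^\"']*)[\"']",
    %% Capture only the single parenthesised group (the system identifier).
    case re:run(Body, Pattern, [{capture, all_but_first, list}]) of
        {match, [SystemId]} -> {ok, SystemId};
        nomatch -> none
    end.
```

The returned URL could then be fetched separately and saved as a local
copy of the DTD.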

Am I doing something obviously wrong here?  Surely someone else has
tried to parse documents off the web.

Thanks,

-kevin