[erlang-bugs] [erlang-questions] using xmerl on web docs

Mon Feb 11 22:44:29 CET 2008

Hi Kevin,

I just thought I'd point out that, while I wrote the original
version of xmerl some 8 years ago, I've hardly ever used
it since. (:

This does seem like a bug in xmerl_scan, though:

case DataRet of
    {file,F} ->
        {get_file(F,S),F};
    {string,Str} ->
        {binary_to_list(Str),file_name_unknown};

Str is clearly meant to be a string (i.e. a list), so calling
binary_to_list(Str) on it is bound to fail with badarg.

You could, of course, make your code bug-compatible and return
{string, list_to_binary(Body)}   ;-)
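
A minimal sketch of that workaround, for illustration only. The module
and function names are made up here and are not part of xmerl, and it
assumes the fetched body is a list of bytes. Note that, judging from the
trace quoted below, fetch_and_parse accepted a plain list for the DTD
itself, so returning a binary everywhere may not be right either:

```erlang
-module(fetch_compat).
-export([as_binary_string/1, demo/1]).

%% Hypothetical helper (not part of xmerl): wrap a fetched body as
%% {string, Binary}, the shape the quoted binary_to_list/1 clause expects.
as_binary_string(Body) when is_binary(Body) ->
    {string, Body};
as_binary_string(Body) when is_list(Body) ->
    %% list_to_binary/1 only accepts bytes (0..255), so this assumes the
    %% body has not been decoded into wider code points.
    {string, list_to_binary(Body)}.

%% Shows the reported failure and the round trip through a binary.
demo(Body) when is_list(Body) ->
    %% binary_to_list/1 on a list raises badarg, as in the reported trace.
    badarg = try binary_to_list(Body) catch error:Reason -> Reason end,
    {string, Bin} = as_binary_string(Body),
    Body = binary_to_list(Bin),
    ok.
```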

BR,
Ulf W

2008/2/11, Kevin Scaldeferri <kevin@REDACTED>:
> Hi Ulf,
>
> Well, I wrote my own fetch, but I'm still having problems with the
> XHTML1 DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd).  It
> includes a number of other fragments with entity declarations and
> such.  I had to do some gymnastics in my fetch function to deal with
> the relative paths, but now I get this:
>
> 49> spider:start("http://kevin.scaldeferri.com/").
> fetching http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
> fetching xhtml-lat1.ent
> resolved xhtml-lat1.ent to http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
> fetching http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
> ** exception error: bad argument
>       in function  binary_to_list/1
>          called as binary_to_list("<!-- Portions (C) ...")
>       in call from xmerl_scan:fetch_not_parse/2
>       in call from xmerl_scan:scan_decl_sep/2
>       in call from xmerl_scan:scan_ext_subset/2
>       in call from xmerl_scan:scan_decl/2
>       in call from xmerl_scan:fetch_and_parse/3
>       in call from xmerl_scan:fetch_DTD/2
>       in call from xmerl_scan:scan_doctype2/3
>
>
> I'm not sure why fetch_not_parse is calling binary_to_list in this
> place.  It seems like an inconsistency between that function and
> fetch_and_parse, in terms of what they expect to be returned from the
> fetch function.  So, is this a bug or is it intentional?  Should I be
> doing something different in my fetch function?  Here's what it looks
> like:
>
> fetchURI(URI, State) ->
>      io:format("fetching ~s~n", [URI]),
>      FetchState = xmerl_scan:fetch_state(State),
>      case URI of
>          "http:"++_ ->
>              {ok, {_,_,Body}} = http:request(URI),
>              {ok, {string, Body},
>               xmerl_scan:fetch_state(FetchState#state{last=URI}, State)};
>          Rel ->
>              Abs = resolve_relative(Rel, FetchState#state.last),
>              io:format("resolved ~s to ~s~n", [Rel, Abs]),
>              fetchURI(Abs, State)
>      end.
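
The resolve_relative/2 helper called above is not shown in the message.
A minimal sketch of what it might look like follows, assuming it only
needs to handle sibling files such as xhtml-lat1.ent. The module name is
hypothetical, and the sketch does not handle "../" segments, absolute
paths, queries, or fragments:

```erlang
-module(rel_uri).
-export([resolve_relative/2]).

%% Hypothetical stand-in for the resolve_relative/2 used in the fetch
%% function above. Keeps everything in Base up to and including the last
%% "/" and appends Rel. Both arguments are plain strings (lists).
resolve_relative(Rel, Base) ->
    Dir = lists:reverse(
            lists:dropwhile(fun(C) -> C =/= $/ end, lists:reverse(Base))),
    Dir ++ Rel.
```

With the DTD URL from the log as Base and "xhtml-lat1.ent" as Rel, this
yields the same resolved URL that the log prints.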
>
>
> Thanks,
>
> -kevin
>
>
> On Feb 10, 2008, at 10:03 AM, Ulf Wiger wrote:
>
> > The fetch_URI function in xmerl_scan can be replaced by
> > a user-defined function, {fetch_fun, F}.
> >
> > This is described in http://www.erlang.org/doc/apps/xmerl/xmerl_examples.html
> >
> > I've sometimes been a bit annoyed that even though the default mode is
> > {validation, off}, xmerl will not accept not being able to find the
> > DTD.
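
One way to use {fetch_fun, F} to sidestep the missing-DTD complaint is a
fetch function that declines to fetch anything. The sketch below assumes
that xmerl_scan accepts {ok, not_fetched, State} as a return value from
the fetch function; the module name is made up for illustration. It does
not help with entities that are only declared in the DTD, which will
still be reported as unknown:

```erlang
-module(skip_dtd).
-export([parse/1]).

%% Parse an XML string without fetching any external DTD.
%% Assumption: a fetch function may return {ok, not_fetched, State}
%% to tell xmerl_scan that nothing was retrieved.
parse(Xml) ->
    SkipFetch = fun(_DTDSpec, State) -> {ok, not_fetched, State} end,
    {Doc, _Rest} = xmerl_scan:string(Xml, [{fetch_fun, SkipFetch},
                                           {validation, off}]),
    Doc.
```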
> >
> >
> > BR,
> > Ulf W
> >
> > 2008/2/10, Kevin Scaldeferri <kevin@REDACTED>:
> >> Hi,
> >>
> >> I wanted to do something fairly basic (to me, at least) but ran into
> >> problems.  Specifically, I would like to fetch some (X)HTML docs from
> >> the web and parse and validate them.  So, I started optimistic.
> >> After fetching a doc with http:request, I tried to validate it:
> >>
> >> 6> {XML, Rest} = xmerl_scan:string(Body, [{validation, dtd}]).
> >> 3290- fatal: {error,{error_missing_element_declaration_in_DTD,html}}
> >> ** exception exit: {fatal,
> >>                     {{error,
> >>                       {error_missing_element_declaration_in_DTD,
> >>                        html}},
> >>                      {file,
> >>                       file_name_unknown},
> >>                      {line,
> >>                       4},
> >>                      {col,
> >>                       1}}}
> >>      in function  xmerl_scan:fatal/2
> >>      in call from xmerl_scan:scan_document/2
> >>      in call from xmerl_scan:string/2
> >>
> >>
> >> hmm... well I know this is really a valid XHTML doc, and I know the
> >> dtd does declare the element "html".  As best I can tell, the problem
> >> is that xmerl doesn't actually fetch dtds from the web.  This seems
> >> to be suggested by this bit in the source:
> >>
> >> %%% Always assume an external resource can be found locally! Thus
> >> %%% don't bother fetching with e.g. HTTP. Returns the path where the
> >> %%% resource is found.  The path to the external resource is given by
> >> %%% URI directly or the option fetch_path (additional paths) or
> >> %%% directory (base path to external resource)
> >> fetch_URI(URI, S) -> ...
> >>
> >>
> >> So, I decided to give up on validating for the moment, and just parse
> >> the doc for the time being.  Unfortunately, I didn't get any further:
> >>
> >> 8> {XML, Rest} = xmerl_scan:string(Body).
> >> 2692- fatal: {unknown_entity_ref,copy}
> >> 2602- fatal: error_scanning_entity_ref
> >> ** exception exit: {fatal,
> >>                     {error_scanning_entity_ref,
> >>                      {file,
> >>                       file_name_unknown},
> >>                      {line,
> >>                       33},
> >>                      {col,
> >>                       16}}}
> >>      in function  xmerl_scan:fatal/2
> >>      in call from xmerl_scan:scan_element/12
> >>      in call from xmerl_scan:scan_content/11
> >>      in call from xmerl_scan:scan_element/12
> >>      in call from xmerl_scan:scan_content/11
> >>      in call from xmerl_scan:scan_element/12
> >>      in call from xmerl_scan:scan_content/11
> >>      in call from xmerl_scan:scan_element/12
> >>
> >>
> >> In this case, a &copy; entity, perfectly valid according to the DTD,
> >> is rejected because we haven't parsed the DTD.
> >>
> >> It seems like I'm stuck in a chicken-and-egg problem.  If I could
> >> parse without validating, I could extract the DTD location, fetch it
> >> and make a local copy that could be used for validation.  But, it
> >> seems that I can't parse the document unless we already know the
> >> contents of the DTD (specifically, the entity declarations).  Thus,
> >> I'm stuck.
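
One possible way out of the circularity described above is to pull the
DTD location out of the DOCTYPE declaration with a regular expression
before handing the document to xmerl. This is only a sketch: the module
and function names are hypothetical, and the pattern assumes a
double-quoted system identifier and no internal subset:

```erlang
-module(doctype).
-export([system_id/1]).

%% Extract the system identifier (the DTD location) from a DOCTYPE
%% declaration without parsing the document.
%% Assumes double-quoted literals, as in the XHTML 1.0 DOCTYPE.
system_id(Doc) ->
    Pattern = "<!DOCTYPE\\s+\\S+\\s+"
              "(?:PUBLIC\\s+\"[^\"]*\"|SYSTEM)\\s+\"([^\"]*)\"",
    case re:run(Doc, Pattern, [{capture, all_but_first, list}]) of
        {match, [SystemId]} -> {ok, SystemId};
        nomatch -> not_found
    end.
```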
> >>
> >> Is there something that I'm doing stupidly wrong here?  Surely
> >> someone else has tried to parse documents off the web.
> >>
> >>
> >> Thanks,
> >>
> >> -kevin