[erlang-bugs] [erlang-questions] using xmerl on web docs

Tue Feb 12 00:32:16 CET 2008

Hi,

If you have the R12B-1 release you can try the undocumented and unsupported
function docb_main:validate_html/1, which does exactly what you are
trying to do, except that it uses the DTDs for XHTML, all of which are
available in the docbuilder-0.9.8/dtd directory (in the R12B-1 release,
that is).

Note that the function may be moved, removed or changed in an upcoming
release. Our intention, however, is that it should be possible to use
xmerl for validation of XHTML in R12B-2, preliminarily scheduled for April.

Also note that there is a bug in xmerl regarding DTD validation
(correction comes in R12B-2) that makes it complain about correct
xhtml if it contains <table>.

I can provide a source patch for that if someone wants it.

/Kenneth Erlang/OTP team at Ericsson

On 2/11/08, Ulf Wiger <ulf@REDACTED> wrote:
> Hi Kevin,
>
> I just thought I'd point out that, while I wrote the original
> version of xmerl some 8 years ago, I've hardly ever used
> it since. (:
>
> This does seem like a bug in xmerl_scan, though:
>
> case DataRet of
>    {file,F} ->
>        {get_file(F,S),F};
>    {string,Str} ->
>        {binary_to_list(Str),file_name_unknown};
>
> It is obvious that Str should be treated as a string, and calling
> binary_to_list(Str) is bound to fail.
>
> You could, of course, make your code bug-compatible and return
> {string, list_to_binary(Body)}   ;-)
>
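
A minimal sketch of the workaround Ulf describes above, for anyone who wants to try it. The module name fetch_compat and the helper string_result/2 are hypothetical, and this has not been verified against any xmerl release. Note that in Kevin's trace the DTD itself was fetched as a plain string without trouble; only the fetch_not_parse path is known from this thread to call binary_to_list/1, so the binary form may only be appropriate there.

```erlang
-module(fetch_compat).
-export([string_result/2]).

%% Hypothetical helper: build a fetch-fun return value whose payload is
%% a binary, so that a later binary_to_list/1 call on it succeeds.
%% iolist_to_binary/1 accepts either a byte list or a binary, so the
%% helper works whichever form the HTTP client handed back.
string_result(Body, State) ->
    {ok, {string, iolist_to_binary(Body)}, State}.
```

In a fetch function like Kevin's, the {ok, {string, Body}, NewState} tuple would be replaced by a call such as fetch_compat:string_result(Body, NewState).
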
> BR,
> Ulf W
>
> 2008/2/11, Kevin Scaldeferri <kevin@REDACTED>:
> > Hi Ulf,
> >
> > Well, I wrote my own fetch, but I'm still having problems with the
> > XHTML1 DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd).  It
> > includes a number of other fragments with entity declarations and
> > such.  I had to do some gymnastics in my fetch function to deal with
> > the relative paths, but now I get this:
> >
> > 49> spider:start("http://kevin.scaldeferri.com/").
> > fetching http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
> > fetching xhtml-lat1.ent
> > resolved xhtml-lat1.ent to http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
> > fetching http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
> > ** exception error: bad argument
> >       in function  binary_to_list/1
> >          called as binary_to_list("<!-- Portions (C) ...")
> >       in call from xmerl_scan:fetch_not_parse/2
> >       in call from xmerl_scan:scan_decl_sep/2
> >       in call from xmerl_scan:scan_ext_subset/2
> >       in call from xmerl_scan:scan_decl/2
> >       in call from xmerl_scan:fetch_and_parse/3
> >       in call from xmerl_scan:fetch_DTD/2
> >       in call from xmerl_scan:scan_doctype2/3
> >
> >
> > I'm not sure why fetch_not_parse is calling binary_to_list in this
> > place.  It seems like an inconsistency between that function and
> > fetch_and_parse, in terms of what they expect to be returned from the
> > fetch function.  So, is this a bug or is it intentional?  Should I be
> > doing something different in my fetch function?  Here's what it looks
> > like:
> >
> > fetchURI(URI, State) ->
> >      io:format("fetching ~s~n", [URI]),
> >      FetchState = xmerl_scan:fetch_state(State),
> >      case URI of
> >          "http:"++_ ->
> >              {ok, {_,_,Body}} = http:request(URI),
> >              {ok, {string, Body},
> >               xmerl_scan:fetch_state(FetchState#state{last=URI}, State)};
> >          Rel ->
> >              Abs = resolve_relative(Rel, FetchState#state.last),
> >              io:format("resolved ~s to ~s~n", [Rel, Abs]),
> >              fetchURI(Abs, State)
> >      end.
> >
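
The resolve_relative/2 helper called above is not shown in the thread. A rough sketch of what such a helper might look like follows, assuming the simplest case only: the relative reference is a bare file name and the base is an absolute URL with a path. The module name resolve_sketch is made up, and "../" segments, references starting with "/", and bases without a path are not handled.

```erlang
-module(resolve_sketch).
-export([resolve_relative/2]).

%% Replace everything after the last "/" in Base with Rel.
%% Assumes Base contains a path component ending in a file name.
resolve_relative(Rel, Base) ->
    NotSlash = fun(C) -> C =/= $/ end,
    %% Drop the trailing file name by scanning from the end of Base.
    Dir = lists:reverse(lists:dropwhile(NotSlash, lists:reverse(Base))),
    Dir ++ Rel.
```

With the URLs from the log above, resolve_sketch:resolve_relative("xhtml-lat1.ent", "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd") gives "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent".
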
> >
> > Thanks,
> >
> > -kevin
> >
> >
> > On Feb 10, 2008, at 10:03 AM, Ulf Wiger wrote:
> >
> > > The fetch_URI function in xmerl_scan can be replaced by
> > > a user-defined function, {fetch_fun, F}.
> > >
> > > This is described in http://www.erlang.org/doc/apps/xmerl/xmerl_examples.html
> > >
> > > I've sometimes been a bit annoyed that even though the default mode is
> > > {validation, off}, xmerl will not accept not being able to find the
> > > DTD.
> > >
> > >
> > > BR,
> > > Ulf W
> > >
> > > 2008/2/10, Kevin Scaldeferri <kevin@REDACTED>:
> > >> Hi,
> > >>
> > >> I wanted to do something fairly basic (to me, at least) but ran into
> > >> problems.  Specifically, I would like to fetch some (X)HTML docs from
> > >> the web and parse and validate them.  So, I started optimistic.
> > >> After fetching a doc with http:request, I tried to validate it:
> > >>
> > >> 6> {XML, Rest} = xmerl_scan:string(Body, [{validation, dtd}]).
> > >> 3290- fatal: {error,{error_missing_element_declaration_in_DTD,html}}
> > >> ** exception exit: {fatal,
> > >>                     {{error,
> > >>                       {error_missing_element_declaration_in_DTD,
> > >>                        html}},
> > >>                      {file,
> > >>                       file_name_unknown},
> > >>                      {line,
> > >>                       4},
> > >>                      {col,
> > >>                       1}}}
> > >>      in function  xmerl_scan:fatal/2
> > >>      in call from xmerl_scan:scan_document/2
> > >>      in call from xmerl_scan:string/2
> > >>
> > >>
> > >> hmm... well I know this is really a valid XHTML doc, and I know the
> > >> dtd does declare the element "html".  As best I can tell, the problem
> > >> is that xmerl doesn't actually fetch dtds from the web.  This seems
> > >> to be suggested by this bit in the source:
> > >>
> > >> %%% Always assume an external resource can be found locally! Thus
> > >> %%% don't bother fetching with e.g. HTTP. Returns the path where the
> > >> %%% resource is found.  The path to the external resource is given by
> > >> %%% URI directly or the option fetch_path (additional paths) or
> > >> %%% directory (base path to external resource)
> > >> fetch_URI(URI, S) -> ...
> > >>
> > >>
> > >> So, I decided to give up on validating for the moment, and just parse
> > >> the doc for the time being.  Unfortunately, I didn't get any further:
> > >>
> > >> 8> {XML, Rest} = xmerl_scan:string(Body).
> > >> 2692- fatal: {unknown_entity_ref,copy}
> > >> 2602- fatal: error_scanning_entity_ref
> > >> ** exception exit: {fatal,
> > >>                     {error_scanning_entity_ref,
> > >>                      {file,
> > >>                       file_name_unknown},
> > >>                      {line,
> > >>                       33},
> > >>                      {col,
> > >>                       16}}}
> > >>      in function  xmerl_scan:fatal/2
> > >>      in call from xmerl_scan:scan_element/12
> > >>      in call from xmerl_scan:scan_content/11
> > >>      in call from xmerl_scan:scan_element/12
> > >>      in call from xmerl_scan:scan_content/11
> > >>      in call from xmerl_scan:scan_element/12
> > >>      in call from xmerl_scan:scan_content/11
> > >>      in call from xmerl_scan:scan_element/12
> > >>
> > >>
> > >> In this case, a &copy; entity, perfectly valid according to the DTD,
> > >> is rejected because we haven't parsed the DTD.
> > >>
> > >> It seems like I'm stuck in a chicken-and-egg problem.  If I could
> > >> parse without validating, I could extract the DTD location, fetch it
> > >> and make a local copy that could be used for validation.  But, it
> > >> seems that I can't parse the document unless we already know the
> > >> contents of the DTD (specifically, the entity declarations).  Thus,
> > >> I'm stuck.
> > >>
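
One way around the chicken-and-egg problem described above might be to pull the DTD location out of the DOCTYPE declaration with a regular expression before handing the document to xmerl. The sketch below assumes an Erlang/OTP release that ships the re module (it may not be present in R12B-1). The module name doctype_sketch is made up, and only double-quoted identifiers are handled.

```erlang
-module(doctype_sketch).
-export([system_id/1]).

%% Return {ok, SystemId} for a declaration of the form
%%   <!DOCTYPE name PUBLIC "pubid" "sysid">  or
%%   <!DOCTYPE name SYSTEM "sysid">
%% and none if no such declaration is found.
system_id(Doc) ->
    Pattern = "<!DOCTYPE\\s+[^\\s>]+\\s+"
              "(?:PUBLIC\\s+\"[^\"]*\"|SYSTEM)\\s+\"([^\"]*)\"",
    %% Capture only the system identifier, returned as a list (string).
    case re:run(Doc, Pattern, [{capture, all_but_first, list}]) of
        {match, [SystemId]} -> {ok, SystemId};
        nomatch -> none
    end.
```

The returned URL could then be fetched and stored locally, or passed to a custom fetch function, before validating. This is a text-level shortcut, not an XML parse, so a DOCTYPE-like string inside a comment would also match.
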
> > >> Is there something that I'm doing stupidly wrong here?  Surely someone
> > >> else has tried to parse documents off the web.
> > >>
> > >>
> > >> Thanks,
> > >>
> > >> -kevin