[erlang-bugs] [erlang-questions] using xmerl on web docs
Ulf Wiger
ulf@REDACTED
Mon Feb 11 22:44:29 CET 2008
Hi Kevin,
I just thought I'd point out that, while I wrote the original
version of xmerl some 8 years ago, I've hardly ever used
it since. (:
This does seem like a bug in xmerl_scan, though:
    case DataRet of
        {file,F} ->
            {get_file(F,S),F};
        {string,Str} ->
            {binary_to_list(Str),file_name_unknown};
The {string,Str} tag says that Str is already a string (a list), so
calling binary_to_list(Str) on it is bound to fail with badarg, which
is the exception in your trace.
You could, of course, make your code bug-compatible and return
{string, list_to_binary(Body)} ;-)
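
A minimal sketch of that workaround, assuming the fetch fun returns
{ok, {string, Data}, State} as in the quoted fetchURI below. The module
and function names here are hypothetical, not part of xmerl. It only
satisfies the branch that calls binary_to_list/1; the inconsistency
Kevin describes suggests fetch_and_parse may expect a different shape,
so treat it as a stopgap rather than a fix.

```erlang
-module(fetch_compat).
-export([string_result/2]).

%% Hypothetical helper (not part of xmerl): wraps a fetched body in the
%% {string, Binary} form that the binary_to_list/1 call quoted above
%% accepts. iolist_to_binary/1 takes lists and binaries alike, so a body
%% that is already a binary passes through unchanged. It raises badarg
%% for lists containing code points above 255.
string_result(Body, State) ->
    {ok, {string, iolist_to_binary(Body)}, State}.
```

In a fetch fun, the {ok, {string, Body}, NewState} return would become
fetch_compat:string_result(Body, NewState).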
BR,
Ulf W
2008/2/11, Kevin Scaldeferri <kevin@REDACTED>:
> Hi Ulf,
>
> Well, I wrote my own fetch, but I'm still having problems with the
> XHTML1 DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd). It
> includes a number of other fragments with entity declarations and
> such. I had to do some gymnastics in my fetch function to deal with
> the relative paths, but now I get this:
>
> 49> spider:start("http://kevin.scaldeferri.com/").
> fetching http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
> fetching xhtml-lat1.ent
> resolved xhtml-lat1.ent to http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
> fetching http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
> ** exception error: bad argument
> in function binary_to_list/1
> called as binary_to_list("<!-- Portions (C) ...")
> in call from xmerl_scan:fetch_not_parse/2
> in call from xmerl_scan:scan_decl_sep/2
> in call from xmerl_scan:scan_ext_subset/2
> in call from xmerl_scan:scan_decl/2
> in call from xmerl_scan:fetch_and_parse/3
> in call from xmerl_scan:fetch_DTD/2
> in call from xmerl_scan:scan_doctype2/3
>
>
> I'm not sure why fetch_not_parse is calling binary_to_list in this
> place. It seems like an inconsistency between that function and
> fetch_and_parse, in terms of what they expect to be returned from the
> fetch function. So, is this a bug or is it intentional? Should I be
> doing something different in my fetch function? Here's what it looks
> like:
>
> fetchURI(URI, State) ->
>     io:format("fetching ~s~n", [URI]),
>     FetchState = xmerl_scan:fetch_state(State),
>     case URI of
>         "http:"++_ ->
>             {ok, {_,_,Body}} = http:request(URI),
>             {ok, {string, Body},
>              xmerl_scan:fetch_state(FetchState#state{last=URI}, State)};
>         Rel ->
>             Abs = resolve_relative(Rel, FetchState#state.last),
>             io:format("resolved ~s to ~s~n", [Rel, Abs]),
>             fetchURI(Abs, State)
>     end.
>
>
> Thanks,
>
> -kevin
>
>
> On Feb 10, 2008, at 10:03 AM, Ulf Wiger wrote:
>
> > The fetch_URI function in xmerl_scan can be replaced by
> > a user-defined function, {fetch_fun, F}.
> >
> > This is described in http://www.erlang.org/doc/apps/xmerl/xmerl_examples.html
> >
> > I've sometimes been a bit annoyed that even though the default mode is
> > {validation, off}, xmerl will not accept not being able to find the
> > DTD.
> >
> >
> > BR,
> > Ulf W
> >
> > 2008/2/10, Kevin Scaldeferri <kevin@REDACTED>:
> >> Hi,
> >>
> >> I wanted to do something fairly basic (to me, at least) but ran into
> >> problems. Specifically, I would like to fetch some (X)HTML docs from
> >> the web and parse and validate them. So, I started optimistic. After
> >> fetching a doc with http:request, I tried to validate it:
> >>
> >> 6> {XML, Rest} = xmerl_scan:string(Body, [{validation, dtd}]).
> >> 3290- fatal: {error,{error_missing_element_declaration_in_DTD,html}}
> >> ** exception exit: {fatal,
> >> {{error,
> >> {error_missing_element_declaration_in_DTD,
> >> html}},
> >> {file,
> >> file_name_unknown},
> >> {line,
> >> 4},
> >> {col,
> >> 1}}}
> >> in function xmerl_scan:fatal/2
> >> in call from xmerl_scan:scan_document/2
> >> in call from xmerl_scan:string/2
> >>
> >>
> >> hmm... well I know this is really a valid XHTML doc, and I know the
> >> dtd does declare the element "html". As best I can tell, the problem
> >> is that xmerl doesn't actually fetch dtds from the web. This seems to
> >> be suggested by this bit in the source:
> >>
> >> %%% Always assume an external resource can be found locally! Thus
> >> %%% don't bother fetching with e.g. HTTP. Returns the path where the
> >> %%% resource is found. The path to the external resource is given by
> >> %%% URI directly or the option fetch_path (additional paths) or
> >> %%% directory (base path to external resource)
> >> fetch_URI(URI, S) -> ...
> >>
> >>
> >> So, I decided to give up on validating for the moment, and just parse
> >> the doc for the time being. Unfortunately, I didn't get any further;
> >>
> >> 8> {XML, Rest} = xmerl_scan:string(Body).
> >> 2692- fatal: {unknown_entity_ref,copy}
> >> 2602- fatal: error_scanning_entity_ref
> >> ** exception exit: {fatal,
> >> {error_scanning_entity_ref,
> >> {file,
> >> file_name_unknown},
> >> {line,
> >> 33},
> >> {col,
> >> 16}}}
> >> in function xmerl_scan:fatal/2
> >> in call from xmerl_scan:scan_element/12
> >> in call from xmerl_scan:scan_content/11
> >> in call from xmerl_scan:scan_element/12
> >> in call from xmerl_scan:scan_content/11
> >> in call from xmerl_scan:scan_element/12
> >> in call from xmerl_scan:scan_content/11
> >> in call from xmerl_scan:scan_element/12
> >>
> >>
> >> In this case, an &copy; entity, perfectly valid according to the DTD,
> >> is rejected because we haven't parsed the DTD.
> >>
> >> It seems like I'm stuck in a chicken-and-egg problem. If I could
> >> parse without validating, I could extract the DTD location, fetch it
> >> and make a local copy that could be used for validation. But, it
> >> seems that I can't parse the document unless we already know the
> >> contents of the DTD (specifically, the entity declarations). Thus,
> >> I'm stuck.
> >>
> >> Is there something that I'm doing stupidly wrong here? Surely someone
> >> else has tried to parse documents off the web.
> >>
> >>
> >> Thanks,
> >>
> >> -kevin
>
>