[erlang-questions] using xmerl on web docs

Kevin Scaldeferri kevin@REDACTED
Mon Feb 11 00:19:51 CET 2008


Hmm... do you happen to know a well-debugged implementation?  I'm
starting with something fairly naive, and the next problem I hit is:

fetching http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
fetching xhtml-lat1.ent
** exception error: no match of right hand side value {error,
                                                        no_scheme}

Which I sort of expected, having had exactly the same problem in other  
languages with immature XML libs :-)

I guess I will use the GlobalState to store the last fetched URI and  
then use it to resolve the relative reference, but really I'd love not  
to have to reinvent all this myself.  So, if there's a good  
implementation someone has to share, I'd love to just use it.
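
A minimal sketch of the resolution step in that plan, assuming a recent
OTP where uri_string:resolve/2 is available (22.3 or later; it did not
exist when this was written). The module name rel_uri and the functions
resolve/2 and track/2 are hypothetical names for illustration, not part
of xmerl. The sketch only covers turning a relative reference such as
xhtml-lat1.ent into an absolute URI given the last fetched URI; wiring
it into a fetch_fun and keeping the last URI in the scanner state is
left out.

```erlang
-module(rel_uri).
-export([resolve/2, track/2]).

%% Resolve Ref against Base following RFC 3986.
%% An absolute Ref comes back as an absolute URI regardless of Base.
%% Both arguments are plain strings.
resolve(Ref, Base) ->
    case uri_string:resolve(Ref, Base) of
        {error, Reason, _Detail} -> {error, Reason};
        Absolute -> {ok, Absolute}
    end.

%% Thread the "last fetched URI" through successive fetches.
%% LastURI is 'undefined' before the first fetch, in which case Ref is
%% assumed to be absolute already and is used as-is.
%% Returns {ok, AbsoluteURI, NewLastURI} or {error, Reason}.
track(Ref, undefined) ->
    {ok, Ref, Ref};
track(Ref, LastURI) ->
    case resolve(Ref, LastURI) of
        {ok, Absolute} -> {ok, Absolute, Absolute};
        {error, _Reason} = Err -> Err
    end.
```

With the two fetches shown above, track/2 would first be called with the
DTD's absolute URI and 'undefined', and then with "xhtml-lat1.ent" and
the DTD's URI as the last fetched one. Strictly speaking the base for a
relative reference is the URI of the entity that contains it, so "last
fetched" is only an approximation once entities nest more deeply.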


Thanks,

-kevin


On Feb 10, 2008, at 10:03 AM, Ulf Wiger wrote:

> The fetch_URI function in xmerl_scan can be replaced by
> a user-defined function, {fetch_fun, F}.
>
> This is described in http://www.erlang.org/doc/apps/xmerl/xmerl_examples.html
>
> I've sometimes been a bit annoyed that even though the default mode is
> {validation, off}, xmerl will not accept not being able to find the
> DTD.
>
>
> BR,
> Ulf W
>
> 2008/2/10, Kevin Scaldeferri <kevin@REDACTED>:
>> Hi,
>>
>> I wanted to do something fairly basic (to me, at least) but ran into
>> problems.  Specifically, I would like to fetch some (X)HTML docs from
>> the web and parse and validate them.  So, I started optimistic.  After
>> fetching a doc with http:request, I tried to validate it:
>>
>> 6> {XML, Rest} = xmerl_scan:string(Body, [{validation, dtd}]).
>> 3290- fatal: {error,{error_missing_element_declaration_in_DTD,html}}
>> ** exception exit: {fatal,
>>                     {{error,
>>                       {error_missing_element_declaration_in_DTD,
>>                        html}},
>>                      {file,
>>                       file_name_unknown},
>>                      {line,
>>                       4},
>>                      {col,
>>                       1}}}
>>      in function  xmerl_scan:fatal/2
>>      in call from xmerl_scan:scan_document/2
>>      in call from xmerl_scan:string/2
>>
>>
>> hmm... well I know this is really a valid XHTML doc, and I know the
>> dtd does declare the element "html".  As best I can tell, the problem
>> is that xmerl doesn't actually fetch dtds from the web.  This seems to
>> be suggested by this bit in the source:
>>
>> %%% Always assume an external resource can be found locally! Thus
>> %%% don't bother fetching with e.g. HTTP. Returns the path where the
>> %%% resource is found.  The path to the external resource is given by
>> %%% URI directly or the option fetch_path (additional paths) or
>> %%% directory (base path to external resource)
>> fetch_URI(URI, S) -> ...
>>
>>
>> So, I decided to give up on validating for the moment, and just parse
>> the doc for the time being.  Unfortunately, I didn't get any further:
>>
>> 8> {XML, Rest} = xmerl_scan:string(Body).
>> 2692- fatal: {unknown_entity_ref,copy}
>> 2602- fatal: error_scanning_entity_ref
>> ** exception exit: {fatal,
>>                     {error_scanning_entity_ref,
>>                      {file,
>>                       file_name_unknown},
>>                      {line,
>>                       33},
>>                      {col,
>>                       16}}}
>>      in function  xmerl_scan:fatal/2
>>      in call from xmerl_scan:scan_element/12
>>      in call from xmerl_scan:scan_content/11
>>      in call from xmerl_scan:scan_element/12
>>      in call from xmerl_scan:scan_content/11
>>      in call from xmerl_scan:scan_element/12
>>      in call from xmerl_scan:scan_content/11
>>      in call from xmerl_scan:scan_element/12
>>
>>
>> In this case, an &copy; entity, perfectly valid according to the DTD,
>> is rejected because we haven't parsed the DTD.
>>
>> It seems like I'm stuck in a chicken-and-egg problem.  If I could
>> parse without validating, I could extract the DTD location, fetch it
>> and make a local copy that could be used for validation.  But, it
>> seems that I can't parse the document unless we already know the
>> contents of the DTD (specifically, the entity declarations).  Thus,
>> I'm stuck.
>>
>> Is there something that I'm doing stupidly wrong here?  Surely someone
>> else has tried to parse documents off the web.
>>
>>
>> Thanks,
>>
>> -kevin
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>



