xmerl and fetching docs from the net

Wed Oct 27 10:57:13 CEST 2004

OTOH, in xmerl-0.18.1, the changes.txt document contained
the list of known issues quoted below.

From a quick look, it appears as if (please correct
me if I'm mistaken):

- The handling of entity references has been improved.
- The issue about xmlText below could probably be handled
  by a custom 'acc_fun', but doesn't seem to be solved by
  default.
- The pre-processing requirement for stream oriented data
  can only be solved by a total re-write of the parser,
  I think. (Joe has a parser which I think has a sounder
  structure. Perhaps it could be merged with the xmerl
  programming framework? How about Unicode? Joe?)
- Attribute normalization is now done properly.
- AFAICT, merging of attributes is not done, but could
  probably be done in a custom 'hook_fun'.
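
To illustrate the xmlText point above: a rough, language-neutral sketch in
Python, not xmerl code. The tuple shapes and the names acc and accumulate are
made up for the example and do not reflect the real 'acc_fun' signature. The
idea is only that an accumulator can fold a text fragment into the text
fragment that precedes it, instead of storing the two separately:

```python
def acc(item, acc_list):
    """Append item to the accumulator, merging adjacent text fragments.

    Items are hypothetical (kind, value) tuples; this is not xmerl's format.
    """
    if item[0] == "text" and acc_list and acc_list[-1][0] == "text":
        # Previous item is also text: extend it rather than add a new node.
        acc_list[-1] = ("text", acc_list[-1][1] + item[1])
    else:
        acc_list.append(item)
    return acc_list


def accumulate(items):
    """Run acc over a sequence of parsed items, as a scanner might."""
    out = []
    for item in items:
        out = acc(item, out)
    return out
```

With this, text that was split around an entity reference, such as "a ", "&"
and " b", ends up in a single text node.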

I didn't check the remaining issues.

/Uffe

KNOWN ISSUES:
- When scanning entity references, a sublist is created (see scan_entity_ref/2)
  to distinguish such references. The default behaviour is then to flatten this
  in the acc/3 function to keep the application level unaware of this
  implementation detail. However, entity references may also occur elsewhere
  (attribute values). This information might sometimes also be useful for the
  application to know about (e.g. when representing characters not supported by
  the current charset).
- Instead of splitting up text into several xmlText records when entities are
  found, it might be better for the application if the text were kept in a
  single record, given that the application is not interested in knowing
  whether any entities have been referenced.
- As reported by Oleg Kiselyov:
  + Stream-oriented parsing needs preprocessing. For example, in function
    scan_cdata/5: suppose the current string to parse contains "<![CDATA[aaa]".
    Suppose the continuation_fun, when invoked, will supply the rest of the
    stream: "]>...". In other words, the character combination "]]>" happens to
    be split across two chunks of the input stream. As the logic of
    scan_cdata/5 indicates, scan_cdata will not recognize that ']' in one chunk
    and ']>' in the next chunk actually form a single token. The scanner will
    misidentify the end of the CDATA section, and consequently fail to parse
    the rest of the document stream.
    See xmerl_eventp.erl for an example of how a preprocessor can solve this.
  + General entity references
    * detect recursion in general entity references
    * scan_entity_value/5 expands general entities even in the DTD, which it
      must not do.
  + Attribute value normalization
    * scan_att_value/2 and scan_att_chars/4 don't do value normalization
  + Merging of attributes declared in DTD with those specified in an
    element
- The {rules,Rules} option requires an ets table but {rules, Read, Write, Rules}
  does not! This causes problems with recursive calls when expanding DTDs.
- In <!ENTITY entityref SYSTEM "file.dtd"> entityref is not expanded with the
  declarations in file.dtd
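
The chunk-boundary problem reported for scan_cdata/5 can be reproduced in a
few lines. The following is a minimal sketch in Python, not xmerl code and not
necessarily how xmerl_eventp.erl handles it. The function names are
hypothetical, and the chunks are assumed to hold only the text after
"<![CDATA[". A scanner that searches each chunk on its own misses a "]]>"
that straddles two chunks; carrying the last two characters of each chunk
over into the next search is one way to avoid that:

```python
TERMINATOR = "]]>"


def find_end_naive(chunks):
    """Search each chunk in isolation; misses a terminator split across chunks."""
    offset = 0
    for chunk in chunks:
        i = chunk.find(TERMINATOR)
        if i != -1:
            return offset + i
        offset += len(chunk)
    return None


def find_end_buffered(chunks):
    """Keep the last len(TERMINATOR) - 1 characters for the next search."""
    keep = len(TERMINATOR) - 1
    tail = ""
    offset = 0  # absolute position of the first character of `tail`
    for chunk in chunks:
        window = tail + chunk
        i = window.find(TERMINATOR)
        if i != -1:
            return offset + i
        # Drop everything except a possible prefix of the terminator.
        cut = max(len(window) - keep, 0)
        offset += cut
        tail = window[cut:]
    return None
```

For the chunks "aaa]" and "]>rest", find_end_naive finds no terminator at all,
while find_end_buffered reports it at position 3, the same position as in the
unsplit string "aaa]]>rest".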