xmerl and fetching docs from the net
Ulf Wiger (AL/EAB)
ulf.wiger@REDACTED
Wed Oct 27 10:57:13 CEST 2004
OTOH, in xmerl-0.18.1, the changes.txt document contained
the following list of known issues (below).
From a quick look, it appears as if (please correct
me if I'm mistaken):
- the handling of entity references has been improved.
- The issue about xmlText below could probably be handled
by a custom 'acc_fun', but doesn't seem to be solved by
default.
- The pre-processing requirement for stream-oriented data
can only be solved by a total re-write of the parser,
I think. (Joe has a parser which I think has a sounder
structure. Perhaps it could be merged with the xmerl
programming framework? How about Unicode? Joe?)
- Attribute normalization is now done properly.
- AFAICT, merging of attributes is not done, but could
probably be done in a custom 'hook_fun'.
I didn't check the remaining issues.
/Uffe
KNOWN ISSUES:
- When scanning entity references, a sublist is created (see scan_entity_ref/2)
to distinguish such references. The default behaviour is then to flatten this
in the acc/3 function to keep the application level unaware of this
implementation detail. However, entity references may also occur elsewhere
(in attribute values). This information might sometimes also be useful for the
application to know about (e.g. when representing characters not supported by
the current charset).
- Instead of splitting text into several xmlText records when entities are
found, it might be better for the application if the text were kept in a
single record, assuming the application does not need to know whether any
entities have been referenced.
- As reported by Oleg Kiselyov:
+ Stream-oriented parsing needs preprocessing. For example, in function
scan_cdata/5: suppose the current string to parse contains "<![CDATA[aaa]".
Suppose the continuation_fun, when invoked, will supply the rest of the
stream: "]>..." In other words, the character combination "]]>" happens to
be split across two chunks of the input stream. As the logic of
scan_cdata/5 indicates, scan_cdata will not recognize that ']' in one chunk
and ']>' in the next chunk actually form a single token. The scanner will
misidentify the end of the CDATA section, and consequently fail to parse
the rest of the document stream.
See xmerl_eventp.erl for an example of how a preprocessor can solve this.
+ General entity references
* detect recursion in general entity references
* scan_entity_value/5 expands general entities even in the DTD, which it
must not do.
+ Attribute value normalization
* scan_att_value/2 and scan_att_chars/4 don't do value normalization
+ Merging of attributes declared in DTD with those specified in an
element
- The {rules,Rules} option requires an ets table, but {rules, Read, Write, Rules}
does not! This causes problems with recursive calls when expanding DTDs.
- In <!ENTITY entityref SYSTEM "file.dtd"> entityref is not expanded with the
declarations in file.dtd
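
The chunk-boundary problem reported above for scan_cdata/5 can be
illustrated with a small language-neutral sketch. This is Python, not
xmerl code, and it does not reproduce what xmerl_eventp.erl does; the
function names find_end_naive and find_end_buffered are hypothetical.
It assumes the input arrives as a sequence of string chunks. The first
function searches each chunk in isolation and so misses a "]]>" that
straddles two chunks; the second carries the last two characters of
each chunk over to the next one, which is one possible way a
preprocessor could avoid the problem.

```python
TERMINATOR = "]]>"


def find_end_naive(chunks):
    """Search each chunk on its own.

    Returns the stream offset of the first "]]>" found inside a single
    chunk, or None. A terminator split across two chunks is missed.
    """
    offset = 0
    for chunk in chunks:
        idx = chunk.find(TERMINATOR)
        if idx != -1:
            return offset + idx
        offset += len(chunk)
    return None


def find_end_buffered(chunks):
    """Search with a carry-over of len(TERMINATOR) - 1 characters.

    The tail of each window is kept and prepended to the next chunk, so
    a terminator split across chunk boundaries is still seen whole.
    """
    keep = len(TERMINATOR) - 1
    tail = ""
    consumed = 0  # characters of the stream that precede `tail`
    for chunk in chunks:
        window = tail + chunk
        idx = window.find(TERMINATOR)
        if idx != -1:
            return consumed + idx
        cut = max(len(window) - keep, 0)
        consumed += cut
        tail = window[cut:]
    return None


if __name__ == "__main__":
    # The split from the report: ']' ends one chunk, ']>' starts the next.
    split = ["<![CDATA[aaa]", "]>rest"]
    print(find_end_naive(split))
    print(find_end_buffered(split))
```

With the split input from the report, the naive search finds no
terminator at all, while the buffered search reports the same offset it
would for the unsplit string.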