xmerl newlines

Wed Jun 18 11:29:01 CEST 2003

On Wed, 18 Jun 2003, Erik Reitsma (ETM) wrote:

>I think that that was James' issue. My "problem" was, that
>a pretty-printed XML document is not parsed into the same
>data structure as a non-pretty-printed. So:

Yes, apologies for referring to James's problems in a reply
to your mail.

>Still, it would be easier if all xmlText records with empty
>text would be removed during parsing, especially if it has
>"siblings" that are xmlElement records. An option
>{space,remove} would be nice...

Such an option would be non-standard, ;), and perhaps
difficult to implement well.

However, there's a non-standard option called {acc_fun,F}
with which one can do cute things. It would allow you to
remove empty text segments today, with the slight side
effect that the pos attribute will be wrong.
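
As a minimal sketch of that approach with the unpatched scanner, the
acc fun below uses only the old two-tuple return format. The module
name acc_skip and the function scan/1 are made up for illustration, and
the include_lib path assumes xmerl is installed as a standard OTP
application. Because the position counter still advances for every
dropped text segment, the surviving siblings should end up with gaps in
their pos values.

```erlang
-module(acc_skip).

%% Assumption: xmerl is available as an OTP application.
-include_lib("xmerl/include/xmerl.hrl").

-export([scan/1]).

%% Scan an XML string, dropping text segments that normalize to " ".
scan(Xml) ->
    Skip = fun(#xmlText{value = " "}, Acc, S) ->
                   {Acc, S};        % old format: X dropped, Pos still +1
              (X, Acc, S) ->
                   {[X|Acc], S}
           end,
    {Doc, _Rest} = xmerl_scan:string(Xml, [{space,normalize},
                                           {acc_fun, Skip}]),
    Doc.
```

Calling acc_skip:scan("<a>\n  <b/>\n  <c/>\n</a>") should return an
xmlElement for a whose content holds only the b and c elements.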

I have attached a diff to xmerl_scan.erl (the 0.18 version,
since the latest version in CVS is much altered). This diff
includes a backward-compatible extension to the acc_fun
semantics, allowing you to cleanly skip elements and
attributes that you do not want to include in the output
from the scanner. The change allows the acc function to
return a new value for the position counter, which is
otherwise automatically incremented by 1 (old semantics).
This change also makes it possible to split an object into
multiple smaller objects (for whatever reason) while keeping
the numbering sequential. I wanted to avoid having to call
length(NewAcc) each time, as it's an O(N) operation.
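
To illustrate the splitting case, here is a sketch of an acc fun that
breaks a comma-separated text segment into one xmlText per token and
hands back the next free position, so length/1 is never called. It
assumes a scanner with the attached patch applied. The module name
acc_split and the comma convention are invented for the example.

```erlang
-module(acc_split).

%% Assumption: xmerl is available as an OTP application.
-include_lib("xmerl/include/xmerl.hrl").

-export([acc_fun/0]).

%% Returns an acc fun suitable for the {acc_fun, F} scanner option.
acc_fun() ->
    fun(#xmlText{value = V, pos = P} = T, Acc, S) ->
            Parts = string:tokens(V, ","),
            %% Number the pieces P, P+1, ... while pushing them on Acc.
            {Acc1, NextPos} =
                lists:foldl(
                  fun(Part, {A, N}) ->
                          {[T#xmlText{value = Part, pos = N} | A], N + 1}
                  end, {Acc, P}, Parts),
            {Acc1, NextPos, S};     % new format: explicit next position
       (X, Acc, S) ->
            {[X|Acc], S}            % old format: Pos is incremented by 1
    end.
```

It would be passed to the scanner as {acc_fun, acc_split:acc_fun()}. A
segment that yields no tokens degenerates into a plain skip, since
NextPos is then equal to P.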

The following sample program illustrates three ways of
scanning a document:
1) the default scan, which leaves whitespace untouched;
2) a scan that normalizes spaces;
3) a scan that normalizes spaces, then removes text elements that
   consist of a single space.

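-module(tmp).

%% On a standard OTP installation, use
%% -include_lib("xmerl/include/xmerl.hrl"). instead.
-include("xmerl.hrl").

-export([file1/1,
         file2/1,
         file3/1]).

%% 1) Default scan: whitespace is left untouched.
file1(F) ->
    xmerl_scan:file(F).

%% 2) Runs of whitespace are normalized to a single space.
file2(F) ->
    xmerl_scan:file(F, [{space,normalize}]).

%% 3) Normalize, then drop text segments that are a single space.
file3(F) ->
    AccFun = fun(#xmlText{value = " ", pos = P}, Acc, S) ->
                     {Acc, P, S};   % new return format: skip X, keep Pos
                (X, Acc, S) ->
                     {[X|Acc], S}   % old format: Pos is incremented by 1
             end,
    xmerl_scan:file(F, [{space,normalize}, {acc_fun, AccFun}]).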

I do not have developer access to the sowap project, so I
have not updated the CVS version.

/Uffe
-- 
Ulf Wiger, Senior Specialist,
   / / /   Architecture & Design of Carrier-Class Software
  / / /    Strategic Product & System Management
 / / /     Ericsson AB, Connectivity and Control Nodes
-------------- next part --------------
333,335c333,339
< %% The acc/3 function must return either {[X'|Acc], S'} or {Acc, S'}.
< %% It is not allowed to make significant changes to X, such as altering
< %% the object type.
---
> %% The acc/3 function can return either {Acc', S'} or {Acc', Pos', S'},
> %% where Pos' can be derived from X#xmlElement.pos, X#xmlText.pos, or
> %% X#xmlAttribute.pos (whichever is the current object type.)
> %% The acc/3 function is not allowed to redefine the type of object
> %% being defined, but _is_ allowed to either ignore it or split it 
> %% into multiple objects (in which case {Acc',Pos',S'} should be returned.)
> %% If {Acc',S'} is returned, Pos will be incremented by 1 by default.
1430,1431c1434,1441
<     {NewAcc, S2} = AccF(Markup, Acc, S1),
<     scan_content(T1, S2, Pos+1, Name, Attrs, Space, Lang, Parents, NS, NewAcc);
---
>     {NewAcc, NewPos, NewS} = case AccF(Markup, Acc, S1) of
> 				 {Acc2, S2} ->
> 				     {Acc2, Pos+1, S2};
> 				 {Acc2, Pos2, S2} ->
> 				     {Acc2, Pos2, S2}
> 			     end,
>     scan_content(T1, NewS, NewPos, Name, Attrs, Space, Lang,
> 		 Parents, NS, NewAcc);
1446,1447c1456,1463
<     {NewAcc, S4} = F(Text, Acc, S3),
<     scan_content(T1, S4, Pos+1, Name, Attrs, Space, Lang, Parents, NS, NewAcc).
---
>     {NewAcc, NewPos, NewS} = case F(Text, Acc, S3) of
> 				 {Acc4, S4} ->
> 				     {Acc4, Pos+1, S4};
> 				 {Acc4, Pos4, S4} ->
> 				     {Acc4, Pos4, S4}
> 			     end,
>     scan_content(T1, NewS, NewPos, Name, Attrs, Space, Lang,
> 		 Parents, NS, NewAcc).