[erlang-questions] How to extract string between XML tags

Eckard Brauer eckard.brauer@REDACTED
Sat Sep 29 20:29:41 CEST 2018


Thanks for the response.

Yes, indeed the first idea was to check balancing, and I'm aware of
xmerl and the regex issue. Thanks for the hints. The reason is that
"[^<]" wouldn't match nested tags, as you wrote below.

So my first idea was to try

X ++ Y ++ X = re:replace(...)

which of course didn't work. I know the limitations of REs; I just
wanted to check whether there's a way to go as with Prolog's append/2
(append([X,Y,X], [a,b,a]).)
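
Erlang only accepts ++ in a pattern when the left operand is a literal
prefix, so X ++ Y ++ X with unbound X cannot work as a match. One way to
get the "same X on both sides" constraint is a backreference inside the
regex itself. The following is only a minimal sketch under that
assumption; the module name tag_extract and the function extract/1 are
hypothetical names chosen for illustration, and like the regex discussed
below it only handles a single, non-nested element:

```erlang
-module(tag_extract).
-export([extract/1]).

%% Returns {ok, Tag, Text} if Input contains <Tag>Text</Tag> where the
%% closing tag name equals the opening one and Text contains no "<".
%% Returns nomatch otherwise. Illustrative helper, not a real XML parser.
extract(Input) ->
    %% "\\1" in the Erlang string is the regex backreference \1, i.e.
    %% "whatever the first group captured" (the opening tag name).
    Pattern = "<([^>]+)>([^<]+)</\\1>",
    case re:run(Input, Pattern, [{capture, all_but_first, list}]) of
        {match, [Tag, Text]} -> {ok, Tag, Text};
        nomatch -> nomatch
    end.
```

With this, tag_extract:extract("<th>title</th>") should give
{ok,"th","title"}, while the unbalanced "<th>title</b>" gives nomatch.
For anything beyond such simple input, a real parser like xmerl remains
the safer choice.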

As I wrote, I'm just playing around with Erlang. Most of my work
consists of either C(++) or shell programming with only limited
practice of other languages (e.g. Prolog, very little Lisp, Python,
PHP). So I'm here only for learning so far, and glad for any help I can
get.

Thanks again, best regards
Eckard


On Sat, 29 Sep 2018 17:59:34 +0200, PAILLEAU Eric
<eric.pailleau@REDACTED> wrote:

> Hello,
> if the question is to be sure that tags are correctly balanced, it is 
> better to use xmerl parser like Fred proposed.
> 
> I see an issue in your regexp
> "<\([^>]\+\)>\(.*\)</\([^>]\+\)>"   (.*) will catch anything
> including tags (I mean also < )
> 
> use instead
> "<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>"  i.e. anything that is not a tag
> start.
> 
> but for instance it will not work on nested tags :
> 
> 1> re:replace("<th>title   
> <b>bold</b></th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2 
> \\3",[global,{return, list}]).
> "<th>title b bold b</th>"
> 
> Note that it could also be rewritten as:
>  [...]  
> <b>bold</b></th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2 
> \\1",[global,{return, list}]).
> "<th>title b bold b</th>"
> 
> As \\1 MUST BE equal to \\3
>  [...]  
> should be ok.
> 
> Example with a single tag
> 
>  [...]  
> re:replace("<th>title</th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 
> \\2 \\3",[global,{return, list}]).
> "th title th"
> 
> But with an unbalanced tag it fails:
> 
>  [...]  
> re:replace("<th>title</b>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1
> \\2 \\3",[global,{return, list}]).
> "th title b"
>  [...]  
> re:replace("<th>title</b>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1
> \\2 \\1",[global,{return, list}]).
> "th title th"
>  [...]  
> ** exception error: no match of right hand side value "th title th"
> 
> Regards
> 
> 
> On 29/09/2018 at 13:30, Eckard Brauer wrote:
>  [...]  
>  [...]  
>  [...]  
> 



-- 
:)