[erlang-questions] How to extract string between XML tags

PAILLEAU Eric eric.pailleau@REDACTED
Sat Sep 29 17:59:34 CEST 2018


Hello,
if the goal is to make sure that the tags are correctly balanced, it is
better to use the xmerl parser, as Fred proposed.
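
For reference, a minimal sketch of the xmerl route, assuming the input is a
small document held in a string. The module name xml_text and the functions
inner_text/1 and text_of/1 are made up for illustration; xmerl_scan:string/1
and the records from xmerl.hrl are the standard xmerl ones.

```erlang
-module(xml_text).
-export([inner_text/1]).

-include_lib("xmerl/include/xmerl.hrl").

%% Parse an XML string and return its text nodes in document order.
%% xmerl_scan:string/1 raises on malformed input (for example a closing
%% tag that does not match), so the call is wrapped in a try and the
%% failure is returned as an error tuple.
inner_text(Xml) ->
    try xmerl_scan:string(Xml) of
        {Doc, _Rest} ->
            {ok, text_of(Doc)}
    catch
        _Class:Reason ->
            {error, Reason}
    end.

%% Collect the text values by walking the parsed tree.
text_of(#xmlText{value = Value}) ->
    [Value];
text_of(#xmlElement{content = Content}) ->
    lists:flatmap(fun text_of/1, Content);
text_of(_Other) ->
    [].
```

With this sketch, nested input such as "<th>title <b>bold</b></th>" is
handled by the parser itself, and unbalanced input such as "<th>title</b>"
comes back as {error, Reason} instead of a wrong string.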

I see an issue in your regexp
"<\([^>]\+\)>\(.*\)</\([^>]\+\)>"
(.*) is greedy and will match anything, including tags (I mean it also
matches "<").
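
To make the greedy behaviour visible, here is a small sketch that captures
only group 2 with re:run/3, once with (.*) and once with a negated character
class. The module name tag_greedy, the function demo/0 and the sample input
are assumptions for illustration. The patterns are written without the
backslashes, since "\(" in an Erlang string is just "(".

```erlang
-module(tag_greedy).
-export([demo/0]).

%% Returns {Greedy, Strict}: what group 2 captures with each pattern.
demo() ->
    Input = "<a>one</a><b>two</b>",
    %% Ask re:run/3 to return only capture group 2, as a list (string).
    Opts = [{capture, [2], list}],
    %% (.*) runs up to the last "</" in the input, so group 2 swallows
    %% the tags that sit between the first and the last element.
    {match, [Greedy]} = re:run(Input, "<([^>]+)>(.*)</([^>]+)>", Opts),
    %% ([^<]+) stops at the first "<", so group 2 is only the text of
    %% the first element.
    {match, [Strict]} = re:run(Input, "<([^>]+)>([^<]+)</([^>]+)>", Opts),
    {Greedy, Strict}.
```

Calling tag_greedy:demo() returns {"one</a><b>two", "one"}.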

Use instead
"<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>"
i.e. anything that is not the start of a tag.

However, it will not work on nested tags. For instance, only the inner
element is matched here:

1> re:replace("<th>title <b>bold</b></th>",
              "<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>",
              "\\1 \\2 \\3", [global,{return, list}]).
"<th>title b bold b</th>"

Note that the replacement could also be written with \\1 in place of
\\3. Bind the \\3 version to A and the \\1 version to B:

2> A = re:replace("<th>title <b>bold</b></th>",
                  "<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>",
                  "\\1 \\2 \\3", [global,{return, list}]).
"<th>title b bold b</th>"

3> B = re:replace("<th>title <b>bold</b></th>",
                  "<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>",
                  "\\1 \\2 \\1", [global,{return, list}]).
"<th>title b bold b</th>"

Since \\1 must be equal to \\3 when the tags are balanced, the match
4> A = B.
should succeed.

Example with a single tag:

43> A = re:replace("<th>title</th>",
                   "<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>",
                   "\\1 \\2 \\3", [global,{return, list}]).
"th title th"
44> B = re:replace("<th>title</th>",
                   "<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>",
                   "\\1 \\2 \\1", [global,{return, list}]).
"th title th"
45> A = B.
"th title th"

But with an unbalanced tag, the match A = B fails:

48> A = re:replace("<th>title</b>",
                   "<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>",
                   "\\1 \\2 \\3", [global,{return, list}]).
"th title b"
49> B = re:replace("<th>title</b>",
                   "<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>",
                   "\\1 \\2 \\1", [global,{return, list}]).
"th title th"
50> A = B.
** exception error: no match of right hand side value "th title th"
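
The same check can probably be done in one step by putting a backreference
in the pattern itself, so that the closing tag has to repeat whatever group 1
matched. This is a different technique from comparing A and B, shown only as
a sketch: the module name tag_check and the function extract/1 are
hypothetical, and it assumes a single element without attributes and
without nested tags.

```erlang
-module(tag_check).
-export([extract/1]).

%% Return {ok, Tag, Text} when Input is exactly <Tag>Text</Tag>,
%% and nomatch otherwise.
extract(Input) ->
    %% "\\1" in the Erlang string is the regexp backreference \1: the
    %% closing tag name must be identical to capture group 1.
    Pattern = "^<([^>]+)>([^<]+)</\\1>$",
    case re:run(Input, Pattern, [{capture, [1, 2], list}]) of
        {match, [Tag, Text]} -> {ok, Tag, Text};
        nomatch -> nomatch
    end.
```

tag_check:extract("<th>title</th>") returns {ok, "th", "title"}, while
tag_check:extract("<th>title</b>") returns nomatch. An opening tag with
attributes would also give nomatch, since group 1 would then contain more
than the tag name.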

Regards


On 29/09/2018 at 13:30, Eckard Brauer wrote:
> Hello,
> 
> just another (a beginner's) question probably leading away from the
> initial point:
> 
> If I use
> 
> T = re:replace("<th>title <b>bold</b></th>",
> 	"<\([^>]\+\)>\(.*\)</\([^>]\+\)>",
> 	"\\1 \\2 \\3",
> 	[global,{return, list}]).
> 
> how could I check that T is of the form "X Y X"?
> 
> 
> 
> On Sat, 29 Sep 2018 11:18:14 +0200,
> PAILLEAU Eric <eric.pailleau@REDACTED> wrote:
> 
>> Hello,
>> BTW  "</?[^>]{1,}>"  works too, no need to escape /  (Perl
>> reflex :) ...)
>>
>>
>> On 29/09/2018 at 11:13, PAILLEAU Eric wrote:
>>   [...]
>>   [...]
>>   [...]
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
> 



