[erlang-questions] How to extract string between XML tags
Eckard Brauer
eckard.brauer@REDACTED
Sat Sep 29 20:29:41 CEST 2018
Thanks for the response.
Yes, indeed, my first idea was to check balancing, and I'm aware of
xmerl and of the regex issue. Thanks for the hints; the catch is that
"[^<]" won't match nested tags, as you wrote below.
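
For reference, a minimal sketch of the xmerl route mentioned above. The
module name xml_text and the function text_of/1 are made up for
illustration; only xmerl_scan:string/1 and the records from xmerl.hrl
come from OTP. It parses the input and collects the text nodes, so
nested tags are handled by the parser rather than by a regexp.

```erlang
-module(xml_text).
-export([text_of/1]).

%% Record definitions (#xmlElement{}, #xmlText{}) shipped with xmerl.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: parse Xml (a string) and return the values of
%% all text nodes in document order.
text_of(Xml) ->
    {Root, _Rest} = xmerl_scan:string(Xml),
    collect(Root).

%% Walk the parsed tree, keeping only the text nodes.
collect(#xmlElement{content = Content}) ->
    lists:flatmap(fun collect/1, Content);
collect(#xmlText{value = Value}) ->
    [Value];
collect(_Other) ->
    [].
```

With badly balanced input such as "<th>title</b>", xmerl_scan:string/1
does not return a tree at all; it exits with a parse error, which is the
balancing check for free.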
So my first idea was to try
X ++ Y ++ X = re:replace(...)
which of course didn't work. I know the limitations of REs; I just
wanted to check whether there's a way to go as with Prolog's append/2
(append([X,Y,X], [a,b,a]).)
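
A sketch of how the same "X, Y, X" idea can be expressed in Erlang.
The ++ operator is only allowed in a pattern when its left operand is a
literal string, so X ++ Y ++ X cannot work as a pattern. A list pattern
that repeats a variable, however, does require both positions to be
equal. The module name tag_extract and the function between/1 are
hypothetical; the regexp is the one discussed in this thread, written
without the redundant backslashes.

```erlang
-module(tag_extract).
-export([between/1]).

%% Hypothetical helper: find the first <open>text</close> triple and
%% check that the opening and closing tag names are the same.
between(Xml) ->
    RE = "<([^>]+)>([^<]+)</([^>]+)>",
    %% all_but_first drops the whole match and keeps the three groups.
    case re:run(Xml, RE, [{capture, all_but_first, list}]) of
        %% Tag occurs twice, so this clause only matches balanced tags.
        {match, [Tag, Body, Tag]} ->
            {ok, Tag, Body};
        {match, [Open, _Body, Close]} ->
            {error, {unbalanced, Open, Close}};
        nomatch ->
            nomatch
    end.
```

For example, tag_extract:between("<th>title</th>") gives
{ok,"th","title"}, while tag_extract:between("<th>title</b>") gives
{error,{unbalanced,"th","b"}}. Like the regexp itself, this only sees the
innermost pair of a nested structure.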
As I wrote, I'm just playing around with Erlang. Most of my work
consists of either C(++) or shell programming with only limited
practice of other languages (e.g. Prolog, very little Lisp, Python,
PHP). So I'm here only for learning so far, and glad for any help I can
get.
Thanks again, best regards
Eckard
On Sat, 29 Sep 2018 17:59:34 +0200, PAILLEAU Eric
<eric.pailleau@REDACTED> wrote:
> Hello,
> if the question is to be sure that tags are correctly balanced, it is
> better to use xmerl parser like Fred proposed.
>
> I see an issue in your regexp
> "<\([^>]\+\)>\(.*\)</\([^>]\+\)>" (.*) will catch anything
> including tags (I mean also < )
>
> use instead
> "<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>" i.e anything that is not a tag
> start.
>
> but for instance it will not work on nested tags :
>
> 1> re:replace("<th>title
> <b>bold</b></th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2
> \\3",[global,{return, list}]).
> "<th>title b bold b</th>"
>
> Note that this could also be rewritten as:
> [...]
> <b>bold</b></th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2
> \\1",[global,{return, list}]).
> "<th>title b bold b</th>"
>
> As \\1 MUST BE equal to \\3
> [...]
> should be ok.
>
> Example with a single tag:
>
> [...]
> re:replace("<th>title</th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1
> \\2 \\3",[global,{return, list}]).
> "th title th"
> [...]
> "th title th"
>
> But with an unbalanced tag it fails:
>
> [...]
> re:replace("<th>title</b>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1
> \\2 \\3",[global,{return, list}]).
> "th title b"
> [...]
> re:replace("<th>title</b>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1
> \\2 \\1",[global,{return, list}]).
> "th title th"
> [...]
> ** exception error: no match of right hand side value "th title th"
>
> Regards
>
>
> On 29/09/2018 at 13:30, Eckard Brauer wrote:
> [...]
> [...]
> [...]
>
--
:)