[erlang-questions] How to extract string between XML tags
Fred Hebert
mononcqc@REDACTED
Wed Sep 26 00:20:31 CEST 2018
On 09/25, lloyd@REDACTED wrote:
>Hello,
>
>By now I should know how to do this. But I've fumbled for more time than I have to find an elegant solution.
>
>Can anyone show a better way?
>
>Example string: "<th>Firstname</th>" % NOTE: could be any valid tag
>
>My kludge:
>
>extract_text(TaggedText) ->
> Split = re:split(TaggedText, "<"),
> Split2 = lists:nth(2, Split),
> Split3 = binary_to_list(Split2),
> Split4 = re:split(Split3, ">"),
> Split5 = lists:nth(2, Split4),
> binary_to_list(Split5).
>
>Surely there's a better way.
>
The classic answer: https://stackoverflow.com/a/1732454/35344
The nice, non-ridiculous one: you will want to use an XML parser to
parse XML. Regular expressions are usually not the right tool, since
tags can nest to any depth.
Let's take this as an example:
1> Str = "<a>aaaa<b>bbbbb<c>ccccc<d>ddd</d>ccc</c>bbb</b>bbbb</a>".
"<a>aaaa<b>bbbbb<c>ccccc<d>ddd</d>ccc</c>bbb</b>bbbb</a>"
2> rr(xmerl).
[xmerl_event,xmerl_fun_states,xmerl_scanner,xmlAttribute,
xmlComment,xmlContext,xmlDecl,xmlDocument,xmlElement,
xmlNamespace,xmlNode,xmlNsNode,xmlObj,xmlPI,xmlText]
3> {XML, _} = xmerl_scan:string(Str).
{#xmlElement{
name = a,expanded_name = a,nsinfo = [],
namespace = #xmlNamespace{default = [],nodes = []},
parents = [],pos = 1,attributes = [],
content =
[#xmlText{
parents = [{a,1}],
pos = 1,language = [],value = "aaaa",type = text},
#xmlElement{
name = b,expanded_name = b,nsinfo = [],
namespace = #xmlNamespace{default = [],nodes = []},
parents = [{a,1}],
pos = 2,attributes = [],
content =
[#xmlText{
parents = [{b,2},{a,1}],
pos = 1,language = [],value = "bbbbb",type = text},
#xmlElement{
name = c,expanded_name = c,nsinfo = [],
namespace = #xmlNamespace{...},
parents = [...],...},
#xmlText{
parents = [{b,2},{a,...}],
pos = 3,language = [],
value = [...],...}],
language = [],xmlbase = "/Users/ferd",
elementdef = undeclared},
#xmlText{
parents = [{a,1}],
pos = 3,language = [],value = "bbbb",type = text}],
language = [],xmlbase = "/Users/ferd",
elementdef = undeclared},
[]}
This gives you a parsed XML document. You can use xpath to access nodes,
if you want. XPath defines a string syntax for querying the insides of
XML documents: https://en.wikipedia.org/wiki/XPath
For example, the /a/b/c string would mean 'within the root document /,
find the node a, and then go find node b in there, and go find node c'.
This gives something like this:
8> xmerl_xpath:string("/a/b/c", XML).
[#xmlElement{
name = c,expanded_name = c,nsinfo = [],
namespace = #xmlNamespace{default = [],nodes = []},
parents = [{b,2},{a,1}],
pos = 2,attributes = [],
content =
[#xmlText{
parents = [{c,2},{b,2},{a,1}],
pos = 1,language = [],value = "ccccc",type = text},
#xmlElement{
name = d,expanded_name = d,nsinfo = [],
namespace = #xmlNamespace{default = [],nodes = []},
parents = [{c,2},{b,2},{a,1}],
pos = 2,attributes = [],
content =
[#xmlText{
parents = [{d,2},{c,2},{b,2},{a,...}],
pos = 1,language = [],value = "ddd",type = text}],
language = [],xmlbase = "/Users/ferd",
elementdef = undeclared},
#xmlText{
parents = [{c,2},{b,2},{a,1}],
pos = 3,language = [],value = "ccc",type = text}],
language = [],xmlbase = "/Users/ferd",
elementdef = undeclared}]
You can see that the <c> node has 3 entries in its content: a text node
(#xmlText with the value "ccccc"), an #xmlElement which has the name 'd'
(so the <d> tag), and another text node.
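If it helps to see that shape without the record noise, here is a
minimal sketch that summarizes the content of an element. The module
name `xml_describe` and the function `describe_content/1` are made up
for this example; only the #xmlElement and #xmlText records come from
xmerl.

```erlang
-module(xml_describe).
-export([describe_content/1]).

%% Record definitions (#xmlElement{}, #xmlText{}) shipped with xmerl.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: summarize the direct children of an element.
describe_content(#xmlElement{content = Content}) ->
    [describe(Node) || Node <- Content].

%% Text nodes keep their string, elements keep only their tag name.
describe(#xmlText{value = Value}) -> {text, lists:flatten(Value)};
describe(#xmlElement{name = Name}) -> {element, Name};
%% Comments, processing instructions and so on.
describe(_Other) -> other.
```

With the document above, `[C] = xmerl_xpath:string("/a/b/c", XML)`
followed by `xml_describe:describe_content(C)` returns
`[{text,"ccccc"},{element,d},{text,"ccc"}]`.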
You can then go and dig within `<d>` by adding to the xpath:
9> xmerl_xpath:string("/a/b/c/d", XML).
[#xmlElement{name = d,expanded_name = d,nsinfo = [],
namespace = #xmlNamespace{default = [],nodes = []},
parents = [{c,2},{b,2},{a,1}],
pos = 2,attributes = [],
content = [#xmlText{parents = [{d,2},{c,2},{b,2},{a,1}],
pos = 1,language = [],value = "ddd",type = text}],
language = [],xmlbase = "/Users/ferd",
elementdef = undeclared}]
And the sole node contained there is the one with the content that is
#xmlText.value = "ddd". If you want to extract the text, you can use
the `text()` xpath node test:
18> xmerl_xpath:string("/a/b/c/text()", XML).
[#xmlText{parents = [{c,2},{b,2},{a,1}],
pos = 1,language = [],value = "ccccc",type = text},
#xmlText{parents = [{c,2},{b,2},{a,1}],
pos = 3,language = [],value = "ccc",type = text}]
19> xmerl_xpath:string("/a/b/c/d/text()", XML).
[#xmlText{parents = [{d,2},{c,2},{b,2},{a,1}],
pos = 1,language = [],value = "ddd",type = text}]
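To get back to the original question, here is a minimal sketch that
turns those #xmlText records into plain strings. The module name
`xml_text` and the functions `text_at/2` and `extract_text/1` are made
up for this example; the xmerl calls are the same ones used above. It
assumes the input is well-formed XML with a single root element.

```erlang
-module(xml_text).
-export([extract_text/1, text_at/2]).

%% Record definitions (#xmlElement{}, #xmlText{}) shipped with xmerl.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: text found directly inside the root element,
%% whatever its tag is. Text inside nested child elements is skipped.
extract_text(TaggedText) ->
    {#xmlElement{content = Content}, _Rest} = xmerl_scan:string(TaggedText),
    lists:flatten([Value || #xmlText{value = Value} <- Content]).

%% Hypothetical helper: run an xpath query and keep only the strings of
%% the text nodes it returns, in document order.
text_at(Path, Doc) ->
    [lists:flatten(Value)
     || #xmlText{value = Value} <- xmerl_xpath:string(Path, Doc)].
```

`xml_text:extract_text("<th>Firstname</th>")` returns "Firstname", and
`xml_text:text_at("/a/b/c/text()", XML)` returns `["ccccc","ccc"]` for
the document above. As far as I know, xmerl_scan exits on malformed
input, so wrap the call if the strings come from an untrusted source.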
The xmerl structure is kind of cumbersome, but when you have to handle
more complex documents, a real parser with niceties like xpath can do
wonders to handle documents as a logical structure rather than as a
group of tokens to wrangle.
Regards,
Fred.