[erlang-questions] How to extract string between XML tags

Wed Sep 26 00:20:31 CEST 2018

On 09/25, lloyd@REDACTED wrote:
>Hello,
>
>By now I should know how to do this. But I've fumbled for more time than I have to find an elegant solution.
>
>Can anyone show a better way?
>
>Example string: "<th>Firstname</th>"  % NOTE: could be any valid tag
>
>My kludge:
>
>extract_text(TaggedText) ->
>  Split = re:split(TaggedText, "<"),
>  Split2 = lists:nth(2, Split),
>  Split3 = binary_to_list(Split2),
>  Split4 = re:split(Split3, ">"),
>  Split5 = lists:nth(2, Split4),
>  binary_to_list(Split5).
>
>Surely there's a better way.
>

The classic answer: https://stackoverflow.com/a/1732454/35344

The nice non-ridiculous one: You will want to use an XML parser to parse
XML. Regular expressions are usually not the right tool for nested
structures.

Let's take this as an example:

1> Str = "<a>aaaa<b>bbbbb<c>ccccc<d>ddd</d>ccc</c>bbb</b>bbbb</a>".
"<a>aaaa<b>bbbbb<c>ccccc<d>ddd</d>ccc</c>bbb</b>bbbb</a>"
2> rr(xmerl).
[xmerl_event,xmerl_fun_states,xmerl_scanner,xmlAttribute,
 xmlComment,xmlContext,xmlDecl,xmlDocument,xmlElement,
 xmlNamespace,xmlNode,xmlNsNode,xmlObj,xmlPI,xmlText]
3> {XML, _} = xmerl_scan:string(Str).
{#xmlElement{
     name = a,expanded_name = a,nsinfo = [],
     namespace = #xmlNamespace{default = [],nodes = []},
     parents = [],pos = 1,attributes = [],
     content =
         [#xmlText{
              parents = [{a,1}],
              pos = 1,language = [],value = "aaaa",type = text},
          #xmlElement{
              name = b,expanded_name = b,nsinfo = [],
              namespace = #xmlNamespace{default = [],nodes = []},
              parents = [{a,1}],
              pos = 2,attributes = [],
              content =
                  [#xmlText{
                       parents = [{b,2},{a,1}],
                       pos = 1,language = [],value = "bbbbb",type = text},
                   #xmlElement{
                       name = c,expanded_name = c,nsinfo = [],
                       namespace = #xmlNamespace{...},
                       parents = [...],...},
                   #xmlText{
                       parents = [{b,2},{a,...}],
                       pos = 3,language = [],
                       value = [...],...}],
              language = [],xmlbase = "/Users/ferd",
              elementdef = undeclared},
          #xmlText{
              parents = [{a,1}],
              pos = 3,language = [],value = "bbbb",type = text}],
     language = [],xmlbase = "/Users/ferd",
     elementdef = undeclared},
 []}

This gives you a parsed XML document. You can use xpath to access nodes,
if you want. XPath defines a query syntax, written as strings, for
selecting nodes inside XML documents: https://en.wikipedia.org/wiki/XPath

For example, the /a/b/c string would mean 'within the root document /, 
find the node a, and then go find node b in there, and go find node c'.

This gives something like this:

8> xmerl_xpath:string("/a/b/c", XML).
[#xmlElement{
     name = c,expanded_name = c,nsinfo = [],
     namespace = #xmlNamespace{default = [],nodes = []},
     parents = [{b,2},{a,1}],
     pos = 2,attributes = [],
     content =
         [#xmlText{
              parents = [{c,2},{b,2},{a,1}],
              pos = 1,language = [],value = "ccccc",type = text},
          #xmlElement{
              name = d,expanded_name = d,nsinfo = [],
              namespace = #xmlNamespace{default = [],nodes = []},
              parents = [{c,2},{b,2},{a,1}],
              pos = 2,attributes = [],
              content =
                  [#xmlText{
                       parents = [{d,2},{c,2},{b,2},{a,...}],
                       pos = 1,language = [],value = "ddd",type = text}],
              language = [],xmlbase = "/Users/ferd",
              elementdef = undeclared},
          #xmlText{
              parents = [{c,2},{b,2},{a,1}],
              pos = 3,language = [],value = "ccc",type = text}],
     language = [],xmlbase = "/Users/ferd",
     elementdef = undeclared}]

You can see that the returned <c> node has 3 entries in its content: a
text node (#xmlText with the value "ccccc"), an #xmlElement with the
name 'd' (so the <d> tag), and another text node (value "ccc").

You can then go and dig within `<d>` by adding to the xpath:

9> xmerl_xpath:string("/a/b/c/d", XML).
[#xmlElement{name = d,expanded_name = d,nsinfo = [],
             namespace = #xmlNamespace{default = [],nodes = []},
             parents = [{c,2},{b,2},{a,1}],
             pos = 2,attributes = [],
             content = [#xmlText{parents = [{d,2},{c,2},{b,2},{a,1}],
                                 pos = 1,language = [],value = "ddd",type = text}],
             language = [],xmlbase = "/Users/ferd",
             elementdef = undeclared}]

And the sole node contained there is the #xmlText record whose value
field is "ddd" (#xmlText.value). If you want to extract the text, you
can use the `text()` xpath node test:

18> xmerl_xpath:string("/a/b/c/text()", XML).
[#xmlText{parents = [{c,2},{b,2},{a,1}],
          pos = 1,language = [],value = "ccccc",type = text},
 #xmlText{parents = [{c,2},{b,2},{a,1}],
          pos = 3,language = [],value = "ccc",type = text}]

19> xmerl_xpath:string("/a/b/c/d/text()", XML).
[#xmlText{parents = [{d,2},{c,2},{b,2},{a,1}],
          pos = 1,language = [],value = "ddd",type = text}]
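
To get plain strings rather than records, the value field can be pulled
out of each #xmlText. A minimal sketch, assuming a hypothetical module
xml_text with a helper text_at/2 (neither is part of xmerl); the record
definitions come from xmerl's xmerl.hrl header:

```erlang
-module(xml_text).
-export([text_at/2]).

%% Record definitions (#xmlText{} and friends) shipped with xmerl.
-include_lib("xmerl/include/xmerl.hrl").

%% text_at/2 is a hypothetical helper, not an xmerl function.
%% It parses XmlString, runs the XPath query against the parsed
%% document, and returns the value of every text node that matched.
text_at(XPath, XmlString) ->
    {Doc, _Rest} = xmerl_scan:string(XmlString),
    %% The generator pattern skips anything that is not an #xmlText.
    [Value || #xmlText{value = Value} <- xmerl_xpath:string(XPath, Doc)].
```

With the Str from above, xml_text:text_at("/a/b/c/text()", Str) should
return ["ccccc","ccc"], matching the two text nodes shown in the shell
output.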

The xmerl structure is kind of cumbersome, but when you have to handle 
more complex documents, a real parser with niceties like xpath can do 
wonders to handle documents as a logical structure rather than as a 
group of tokens to wrangle.
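
For the original question (one element, any tag name), the root element
returned by the scanner can be inspected directly, without xpath. This
is only a rough sketch: the module name tag_text is made up, it assumes
the input is a single well-formed element, and it collects only the
text that sits directly inside that element (text in nested child tags
is skipped). As far as I know, xmerl_scan:string/1 exits on malformed
input rather than returning an error tuple, so wrap the call if the
input is untrusted.

```erlang
-module(tag_text).
-export([extract_text/1]).

%% Record definitions (#xmlElement{}, #xmlText{}) shipped with xmerl.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical replacement for the regex-based extract_text/1.
%% Parses the tagged string, takes the content of the root element
%% whatever its name is, and joins the direct text children.
extract_text(TaggedText) ->
    {#xmlElement{content = Content}, _Rest} = xmerl_scan:string(TaggedText),
    %% Keep only #xmlText children; flatten joins their values
    %% into one plain string.
    lists:flatten([Value || #xmlText{value = Value} <- Content]).
```

For example, tag_text:extract_text("<th>Firstname</th>") should give
"Firstname", and the same call works for <td>, <p>, or any other tag.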

Regards,
Fred.