[erlang-questions] How to extract string between XML tags

Wed Sep 26 00:57:30 CEST 2018

Thanks all,

This is definitely useful info for another aspect of my project.

But perhaps there's a simpler solution to my immediate problem if I pose my question more specifically. 

Goal: Add tables to erlpress_core.

Problem: Extract cell info from html tables:

table() ->
   "<table style=\"width:100%\"> 
       <tr>
          <th>Firstname</th>
          <th>Lastname</th>
          <th>Age</th>
       </tr>
       <tr>
          <td>Jill</td>
          <td>Smith</td>
          <td>50</td>
       </tr>
       <tr>
          <td>Eve</td>
          <td>Jackson</td>
          <td>94</td>
       </tr>
     </table>".

My function ep_parse_table/1 gives me:

[[["<th>Firstname</th>"],
  ["<th>Lastname</th>"],
  ["<th>Age</th>"]],
 [["<tr>Jill</tr>"],["<tr>Smith</tr>"],["<tr>50</tr>"]],
 [["<tr>Eve</tr>"],["<tr>Jackson</tr>"],["<tr>94</tr>"]]]

Is it reasonable to require xmerl as a dependency of erlpress_core and go through the many transformations required to extract cell data for every cell in a, perhaps, large table?
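
For scale, here is a minimal sketch of what the xmerl route might look like for the table above. The module name table_cells and the functions cells/1, row_cells/1 and cell_text/1 are hypothetical names for illustration, not part of erlpress_core. It assumes the table is well-formed XML, that the <tr> rows are direct children of <table> (no <tbody>), and that each cell holds plain text only.

```erlang
-module(table_cells).
-export([cells/1]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse the table string and return one list of cell strings per <tr> row.
cells(Html) ->
    {Doc, _Rest} = xmerl_scan:string(Html),
    [row_cells(Row) || Row <- xmerl_xpath:string("/table/tr", Doc)].

%% Keep only the <th>/<td> children of a row; the whitespace text nodes
%% between them do not match the #xmlElement{} pattern and are skipped.
row_cells(#xmlElement{content = Content}) ->
    [cell_text(Cell) || #xmlElement{name = Name} = Cell <- Content,
                        Name =:= th orelse Name =:= td].

%% Concatenate the direct text children of a cell. Nested markup inside
%% a cell is ignored by this sketch.
cell_text(#xmlElement{content = Content}) ->
    lists:append([Value || #xmlText{value = Value} <- Content]).
```

Calling table_cells:cells(table()) would then yield a list of rows, each a list of plain strings. A cell that contains nested tags would need a recursive walk instead of cell_text/1.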

How likely is it that a user will include full-fledged XML data in table cells?

If it's likely, then maybe we need to suck up the pain.

Or, maybe we just specify that XML data is not permitted in tables submitted to erlpress_core.

Any thoughts?

All the best,

L. 

-----Original Message-----
From: "Fred Hebert" <mononcqc@REDACTED>
Sent: Tuesday, September 25, 2018 6:20pm
To: lloyd@REDACTED
Cc: "Erlang/OTP discussions" <erlang-questions@REDACTED>
Subject: Re: [erlang-questions] How to extract string between XML tags

On 09/25, lloyd@REDACTED wrote:
>Hello,
>
>By now I should know how to do this. But I've fumbled for more time than I have to find an elegant solution.
>
>Can anyone show a better way?
>
>Example string: "<th>Firstname</th>"  % NOTE: could be any valid tag
>
>My kludge:
>
>extract_text(TaggedText) ->
>  Split = re:split(TaggedText, "<"),
>  Split2 = lists:nth(2, Split),
>  Split3 = binary_to_list(Split2),
>  Split4 = re:split(Split3, ">"),
>  Split5 = lists:nth(2, Split4),
>  binary_to_list(Split5).
>
>Surely there's a better way.
>

The classic answer: https://stackoverflow.com/a/1732454/35344

The nice non-ridiculous one: You will want to use an XML parser to parse 
XML. Regular expressions are usually not the proper tool.

Let's take this as an example:

1> Str = "<a>aaaa<b>bbbbb<c>ccccc<d>ddd</d>ccc</c>bbb</b>bbbb</a>".
"<a>aaaa<b>bbbbb<c>ccccc<d>ddd</d>ccc</c>bbb</b>bbbb</a>"
2> rr(xmerl).
[xmerl_event,xmerl_fun_states,xmerl_scanner,xmlAttribute,
 xmlComment,xmlContext,xmlDecl,xmlDocument,xmlElement,
 xmlNamespace,xmlNode,xmlNsNode,xmlObj,xmlPI,xmlText]
3> {XML, _} = xmerl_scan:string(Str).
{#xmlElement{
     name = a,expanded_name = a,nsinfo = [],
     namespace = #xmlNamespace{default = [],nodes = []},
     parents = [],pos = 1,attributes = [],
     content =
         [#xmlText{
              parents = [{a,1}],
              pos = 1,language = [],value = "aaaa",type = text},
          #xmlElement{
              name = b,expanded_name = b,nsinfo = [],
              namespace = #xmlNamespace{default = [],nodes = []},
              parents = [{a,1}],
              pos = 2,attributes = [],
              content =
                  [#xmlText{
                       parents = [{b,2},{a,1}],
                       pos = 1,language = [],value = "bbbbb",type = text},
                   #xmlElement{
                       name = c,expanded_name = c,nsinfo = [],
                       namespace = #xmlNamespace{...},
                       parents = [...],...},
                   #xmlText{
                       parents = [{b,2},{a,...}],
                       pos = 3,language = [],
                       value = [...],...}],
              language = [],xmlbase = "/Users/ferd",
              elementdef = undeclared},
          #xmlText{
              parents = [{a,1}],
              pos = 3,language = [],value = "bbbb",type = text}],
     language = [],xmlbase = "/Users/ferd",
     elementdef = undeclared},
 []}

This gives you a parsed XML document. You can use xpath to access nodes, 
if you want. XPath defines a syntax to query the insides of XML 
documents as strings: https://en.wikipedia.org/wiki/XPath

For example, the /a/b/c string would mean 'within the root document /, 
find the node a, and then go find node b in there, and go find node c'.

This gives something like this:

8> xmerl_xpath:string("/a/b/c", XML).
[#xmlElement{
     name = c,expanded_name = c,nsinfo = [],
     namespace = #xmlNamespace{default = [],nodes = []},
     parents = [{b,2},{a,1}],
     pos = 2,attributes = [],
     content =
         [#xmlText{
              parents = [{c,2},{b,2},{a,1}],
              pos = 1,language = [],value = "ccccc",type = text},
          #xmlElement{
              name = d,expanded_name = d,nsinfo = [],
              namespace = #xmlNamespace{default = [],nodes = []},
              parents = [{c,2},{b,2},{a,1}],
              pos = 2,attributes = [],
              content =
                  [#xmlText{
                       parents = [{d,2},{c,2},{b,2},{a,...}],
                       pos = 1,language = [],value = "ddd",type = text}],
              language = [],xmlbase = "/Users/ferd",
              elementdef = undeclared},
          #xmlText{
              parents = [{c,2},{b,2},{a,1}],
              pos = 3,language = [],value = "ccc",type = text}],
     language = [],xmlbase = "/Users/ferd",
     elementdef = undeclared}]

You can see that the XML node has 3 entries in it: a text node (#xmlText 
with the value "ccccc"), an #xmlElement which has the name 'd' (so the <d> 
tag), and another text node.

You can then go and dig within `<d>` by adding to the xpath:

9> xmerl_xpath:string("/a/b/c/d", XML).
[#xmlElement{name = d,expanded_name = d,nsinfo = [],
             namespace = #xmlNamespace{default = [],nodes = []},
             parents = [{c,2},{b,2},{a,1}],
             pos = 2,attributes = [],
             content = [#xmlText{parents = [{d,2},{c,2},{b,2},{a,1}],
                                 pos = 1,language = [],value = "ddd",type = text}],
             language = [],xmlbase = "/Users/ferd",
             elementdef = undeclared}]

And the sole node contained there is the #xmlText whose value is "ddd" 
(#xmlText.value = "ddd"). If you want to extract the text, you can use 
the `text()` xpath node test:

18> xmerl_xpath:string("/a/b/c/text()", XML).
[#xmlText{parents = [{c,2},{b,2},{a,1}],
          pos = 1,language = [],value = "ccccc",type = text},
 #xmlText{parents = [{c,2},{b,2},{a,1}],
          pos = 3,language = [],value = "ccc",type = text}]

19> xmerl_xpath:string("/a/b/c/d/text()", XML).
[#xmlText{parents = [{d,2},{c,2},{b,2},{a,1}],
          pos = 1,language = [],value = "ddd",type = text}]
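
To get from these #xmlText records to plain strings, one option is to pattern match on the value field. The following is a minimal sketch; the module name xml_text and the function texts/2 are hypothetical, and in a module the record definitions come from xmerl.hrl rather than from rr(xmerl).

```erlang
-module(xml_text).
-export([texts/2]).

%% Record definitions for #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse XmlString, run the XPath query, and return the value of every
%% #xmlText node in the result. Non-text nodes are silently skipped.
texts(XPath, XmlString) ->
    {Doc, _Rest} = xmerl_scan:string(XmlString),
    [Value || #xmlText{value = Value} <- xmerl_xpath:string(XPath, Doc)].
```

With the example string above, xml_text:texts("/a/b/c/text()", Str) returns ["ccccc","ccc"], and xml_text:texts("/th/text()", "<th>Firstname</th>") returns ["Firstname"].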

The xmerl structure is kind of cumbersome, but when you have to handle 
more complex documents, a real parser with niceties like xpath can do 
wonders to handle documents as a logical structure rather than as a 
group of tokens to wrangle.

Regards,
Fred.