[erlang-questions] How to extract string between XML tags

Fred Hebert mononcqc@REDACTED
Wed Sep 26 01:38:33 CEST 2018


On 09/25, lloyd@REDACTED wrote:
>Thanks all,
>
>This is definitely useful info for another aspect of my project.
>
>But perhaps there's a simpler solution to my immediate problem if I pose my question more specifically.
>
>Goal: Add tables to erlpress_core.
>
>Problem: Extract cell info from html tables:
>
>table() ->
>  ...
>
>My function ep_parse_table/1 gives me:
>
>[[["<th>Firstname</th>"],
>  ["<th>Lastname</th>"],
>  ["<th>Age</th>"]],
> [["<tr>Jill</tr>"],["<tr>Smith</tr>"],["<tr>50</tr>"]],
> [["<tr>Eve</tr>"],["<tr>Jackson</tr>"],["<tr>94</tr>"]]]
>

{XML, _} = xmerl_scan:string(table()),
Rows = xmerl_xpath:string("/table/tr", XML),


From there:

[[[ Text || #xmlText{value=Text} <- Col]
           || #xmlElement{content=Col} <- Cols]
             || #xmlElement{content=Cols} <- Rows].

Gives:

[[["Firstname"],["Lastname"],["Age"]],
 [["Jill"],["Smith"],["50"]],
 [["Eve"],["Jackson"],["94"]]]

If you want to keep the node's type:

[[[ {Name,Text} || #xmlText{value=Text} <- Col]
           || #xmlElement{content=Col, name=Name} <- Cols]
             || #xmlElement{content=Cols} <- Rows].

Gives:

[[[{th,"Firstname"}],[{th,"Lastname"}],[{th,"Age"}]],
 [[{td,"Jill"}],[{td,"Smith"}],[{td,"50"}]],
 [[{td,"Eve"}],[{td,"Jackson"}],[{td,"94"}]]]

This is a bit obtuse because of the nested list comprehensions; I 
haven't taken the time to clean the code up.
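
For what it's worth, here is one way the comprehensions above could be 
split into named helpers. This is only a sketch: the module name 
table_cells and the helper names rows/1, row_cells/1 and cell_text/1 are 
made up for illustration, and it assumes the <tr> elements are direct 
children of <table>, as in the XPath used above. The -include_lib line 
is what makes the #xmlElement{} and #xmlText{} records available.

```erlang
-module(table_cells).
-export([cells/1, tagged_cells/1]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% One list per row, one list per cell, one string per text node.
cells(Html) ->
    [[cell_text(Cell) || Cell <- row_cells(Row)] || Row <- rows(Html)].

%% Same shape, but each text node is paired with its cell's tag (th or td).
tagged_cells(Html) ->
    [[[{Name, Text} || Text <- cell_text(Cell)]
      || #xmlElement{name = Name} = Cell <- row_cells(Row)]
     || Row <- rows(Html)].

%% Parse the string and select the rows with XPath.
rows(Html) ->
    {Doc, _Rest} = xmerl_scan:string(Html),
    xmerl_xpath:string("/table/tr", Doc).

%% Keep only element children; this skips whitespace text between tags.
row_cells(#xmlElement{content = Content}) ->
    [E || #xmlElement{} = E <- Content].

%% Keep only the direct text children of a cell.
cell_text(#xmlElement{content = Content}) ->
    [Text || #xmlText{value = Text} <- Content].
```

Assuming table() returns the HTML string from the original question, 
table_cells:cells(table()) and table_cells:tagged_cells(table()) should 
give results of the same shape as the two shown above.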

>Is it reasonable to require xmerl as a dependency of erlpress_core and 
>go through the many transformations required to extract cell data for 
>every cell in a, perhaps, large table?
>

It's not that bad, considering xmerl is part of the standard library, 
but that's a fair concern anyway.

>How likely is that a user will include full-fledged XML data in table cells?
>

The sad thing is that XML is _simpler_ to parse than HTML and all its 
variants (because they are less strict, they allow for more stuff to 
happen).

The question though is what is the syntax you aim to support? Should 
people be able to style text using tags like <strong>, <em>, <code>, 
<tt>, and so on? Or do you expect literal text always? What you accept 
or refuse defines what you can deal with.
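
If you do decide to accept inline tags, one option is to walk each cell 
recursively and keep only the text. The sketch below assumes that 
dropping the styling is acceptable; the module name cell_text and the 
function flatten/1 are hypothetical, not part of xmerl.

```erlang
-module(cell_text).
-export([flatten/1]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Concatenate all text found below a node, ignoring the tags themselves.
flatten(#xmlText{value = Text}) ->
    Text;
flatten(#xmlElement{content = Content}) ->
    lists:append([flatten(Node) || Node <- Content]);
flatten(_Other) ->
    %% Comments, processing instructions and the like carry no cell text.
    "".
```

With this, a cell parsed from "<td>Jill <strong>Smith</strong></td>" 
by xmerl_scan:string/1 should flatten to "Jill Smith". If you would 
rather keep the styling, the same traversal could return tagged tuples 
instead of bare strings.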

>If so then maybe we need to suck up the pain.
>
>Or, maybe we just specify that XML data is not permitted in tables 
>submitted to erlpress_core.
>
>Any thoughts?
>

You could say XML data is not supported. That does not prevent you from 
using the XML parser rather than writing your own.

For example, what does someone do when they want to use the literal 
string '<td>' inside the table without breaking your own parser? What's 
the escape sequence?  Using XML as a parser, you get it for free: &gt; 
is > and &lt; is <:

72> xmerl_scan:string("<td> bf&lt;td&gt;aaa</td>").
{#xmlElement{...
             content = [#xmlText{value = " bf<td>aaa" ...}],
             ...},
 []}

You can see the resulting string being " bf<td>aaa" despite already 
being in a <td> element. No confusion to be had.

If you don't use the parser, you have to come up with these rules 
yourself, and implement them properly. That's a lot of work :)
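
Going the other way, if erlpress_core ever has to build table strings 
from user text, the same XML rules tell you what to escape. The 
following is a rough sketch only: cell_escape, escape/1 and td/1 are 
names I made up, and it handles just the three characters that matter 
inside element content.

```erlang
-module(cell_escape).
-export([escape/1, td/1]).

%% Replace the characters that are special in XML element content.
escape(Text) ->
    lists:flatmap(fun escape_char/1, Text).

escape_char($<) -> "&lt;";
escape_char($>) -> "&gt;";
escape_char($&) -> "&amp;";
escape_char(C)  -> [C].

%% Wrap already-escaped text in a <td> element.
td(Text) ->
    "<td>" ++ escape(Text) ++ "</td>".
```

Feeding cell_escape:td(" bf<td>aaa") back into xmerl_scan:string/1 
should produce the same #xmlText{} value as in the shell session above.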




