[erlang-questions] How to extract string between XML tags

Lloyd R. Prentice lloyd@REDACTED
Wed Sep 26 05:17:51 CEST 2018


Hi Fred,

This is gorgeous! In it will go.

Thank you so much. 

All the best,

Lloyd


> On Sep 25, 2018, at 7:38 PM, Fred Hebert <mononcqc@REDACTED> wrote:
> 
>> On 09/25, lloyd@REDACTED wrote:
>> Thanks all,
>> 
>> This is definitely useful info for another aspect of my project.
>> 
>> But perhaps there's a simpler solution to my immediate problem if I pose my question more specifically.
>> 
>> Goal: Add tables to erlpress_core.
>> 
>> Problem: Extract cell info from html tables:
>> 
>> table() ->
>> ...
>> 
>> My function ep_parse_table/1 gives me:
>> 
>> [[["<th>Firstname</th>"],
>> ["<th>Lastname</th>"],
>> ["<th>Age</th>"]],
>> [["<tr>Jill</tr>"],["<tr>Smith</tr>"],["<tr>50</tr>"]],
>> [["<tr>Eve</tr>"],["<tr>Jackson</tr>"],["<tr>94</tr>"]]]
>> 
> 
> {XML, _} = xmerl_scan:string(table()),
> Rows = xmerl_xpath:string("/table/tr", XML),
> 
> 
> From there:
> 
> [[[ Text || #xmlText{value=Text} <- Col]
>          || #xmlElement{content=Col} <- Cols]
>            || #xmlElement{content=Cols} <- Rows].
> 
> Gives:
> 
> [[["Firstname"],["Lastname"],["Age"]],
> [["Jill"],["Smith"],["50"]],
> [["Eve"],["Jackson"],["94"]]]
> 
> If you want to keep the node's type:
> 
> [[[ {Name,Text} || #xmlText{value=Text} <- Col]
>          || #xmlElement{content=Col, name=Name} <- Cols]
>            || #xmlElement{content=Cols} <- Rows].
> 
> Gives:
> 
> [[[{th,"Firstname"}],[{th,"Lastname"}],[{th,"Age"}]],
> [[{td,"Jill"}],[{td,"Smith"}],[{td,"50"}]],
> [[{td,"Eve"}],[{td,"Jackson"}],[{td,"94"}]]]
> 
> This is a bit obtuse because of the nested list comprehensions; I haven't taken the time to clean the code up.
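
For anyone who wants to try the comprehensions above outside the shell, here is a minimal, self-contained sketch. The module name table_cells and the sample_table/0 function are hypothetical; sample_table/0 only stands in for the table() function elided earlier in the thread, and it assumes a table with no whitespace between tags. The one piece the snippets leave implicit is the include_lib line, which brings in the #xmlElement{} and #xmlText{} record definitions.

```erlang
-module(table_cells).
-export([sample_table/0, cells/1, tagged_cells/1]).

%% The #xmlElement{} and #xmlText{} records are defined in xmerl's header.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical stand-in for the table() function elided in the thread.
sample_table() ->
    "<table>"
    "<tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>"
    "<tr><td>Jill</td><td>Smith</td><td>50</td></tr>"
    "<tr><td>Eve</td><td>Jackson</td><td>94</td></tr>"
    "</table>".

%% Parse the markup and select every <tr> directly under <table>.
rows(Html) ->
    {Xml, _Rest} = xmerl_scan:string(Html),
    xmerl_xpath:string("/table/tr", Xml).

%% One list per row, one list per cell, one string per text node.
%% Nodes that do not match a generator pattern are silently skipped.
cells(Html) ->
    [[[Text || #xmlText{value = Text} <- Col]
      || #xmlElement{content = Col} <- Cols]
     || #xmlElement{content = Cols} <- rows(Html)].

%% Same traversal, but each text is paired with its cell's tag name.
tagged_cells(Html) ->
    [[[{Name, Text} || #xmlText{value = Text} <- Col]
      || #xmlElement{content = Col, name = Name} <- Cols]
     || #xmlElement{content = Cols} <- rows(Html)].
```

Calling table_cells:cells(table_cells:sample_table()) should give the plain nested lists shown above, and tagged_cells/1 the {th, ...} and {td, ...} variant.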
> 
>> Is it reasonable to require xmerl as a dependency of erlpress_core and go through the many transformations required to extract cell data for every cell in a, perhaps, large table?
>> 
> 
> It's not that bad, considering xmerl is part of the standard library, but that's a fair concern anyway.
> 
>> How likely is it that a user will include full-fledged XML data in table cells?
>> 
> 
> The sad thing is that XML is _simpler_ to parse than HTML and all its variants (because they are less strict, they allow for more stuff to happen).
> 
> The question, though, is what syntax you aim to support. Should people be able to style text using tags like <strong>, <em>, <code>, <tt>, and so on? Or do you expect literal text always? What you accept or refuse defines what you can deal with.
> 
>> If so then maybe we need to suck up the pain.
>> 
>> Or, maybe we just specify that XML data is not permitted in tables submitted to erlpress_core.
>> 
>> Any thoughts?
>> 
> 
> You could say XML data is not supported. That does not prevent you from using the XML parser rather than writing your own.
> 
> For example, what does someone do when they want to use the literal '<td>' string inside a table cell without breaking your own parser? What's the escape sequence?  Using an XML parser, you get it for free: &gt; is > and &lt; is <:
> 
> 72> xmerl_scan:string("<td> bf&lt;td&gt;aaa</td>").
> {#xmlElement{...
>            content = [#xmlText{value = " bf<td>aaa" ...}],
>            ...},
> []}
> 
> You can see the resulting string being " bf<td>aaa" despite already being in a <td> element. No confusion to be had.
> 
> If you don't use the parser, you have to come up with these rules yourself, and implement them properly. That's a lot of work :)
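
The escaping point can be checked with a small sketch. The module name escape_demo and the function cell_text/1 are hypothetical helpers, not part of xmerl; the only library call used is xmerl_scan:string/1. It assumes the input is a single well-formed element whose content is text only.

```erlang
-module(escape_demo).
-export([cell_text/1]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse one element and return its text content as a flat string.
%% The scanner decodes &lt; and &gt; while parsing, so escaped markup
%% comes back as ordinary characters instead of nested elements.
cell_text(Xml) ->
    {#xmlElement{content = Content}, _Rest} = xmerl_scan:string(Xml),
    lists:flatten([Value || #xmlText{value = Value} <- Content]).
```

With this in place, escape_demo:cell_text("<td> bf&lt;td&gt;aaa</td>") should return " bf<td>aaa", matching the shell session quoted above.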
> 
> 



