[erlang-questions] How to extract string between XML tags
Lloyd R. Prentice
lloyd@REDACTED
Wed Sep 26 05:17:51 CEST 2018
Hi Fred,
This is gorgeous! In it will go.
Thank you so much.
All the best,
Lloyd
> On Sep 25, 2018, at 7:38 PM, Fred Hebert <mononcqc@REDACTED> wrote:
>
>> On 09/25, lloyd@REDACTED wrote:
>> Thanks all,
>>
>> This is definitely useful info for another aspect of my project.
>>
>> But perhaps there's a simpler solution to my immediate problem if I pose my question more specifically.
>>
>> Goal: Add tables to erlpress_core.
>>
>> Problem: Extract cell info from html tables:
>>
>> table() ->
>> ...
>>
>> My function ep_parse_table/1 gives me:
>>
>> [[["<th>Firstname</th>"],
>> ["<th>Lastname</th>"],
>> ["<th>Age</th>"]],
>> [["<td>Jill</td>"],["<td>Smith</td>"],["<td>50</td>"]],
>> [["<td>Eve</td>"],["<td>Jackson</td>"],["<td>94</td>"]]]
>>
>
> %% In a module, -include_lib("xmerl/include/xmerl.hrl"). provides the records used below.
> {XML, _} = xmerl_scan:string(table()),
> Rows = xmerl_xpath:string("/table/tr", XML),
>
>
> From there:
>
> [[[ Text || #xmlText{value=Text} <- Col]
> || #xmlElement{content=Col} <- Cols]
> || #xmlElement{content=Cols} <- Rows].
>
> Gives:
>
> [[["Firstname"],["Lastname"],["Age"]],
> [["Jill"],["Smith"],["50"]],
> [["Eve"],["Jackson"],["94"]]]
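The snippet above can be wrapped into a small module so that it compiles on its own. This is only a sketch: the module name `table_cells`, the function names `cells/1` and `demo/0`, and the sample table are assumptions made for illustration, since the original `table()` body is elided. It assumes the input is well-formed XML with `<tr>` elements directly under `<table>`.

```erlang
-module(table_cells).
-export([cells/1, demo/0]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse a table given as a string and return one list per row,
%% each holding one list of text fragments per cell.
cells(Html) ->
    {Doc, _Rest} = xmerl_scan:string(Html),
    Rows = xmerl_xpath:string("/table/tr", Doc),
    %% Nodes that do not match the record pattern in a generator
    %% (for example whitespace text between cells) are skipped.
    [[[Text || #xmlText{value = Text} <- Cell]
      || #xmlElement{content = Cell} <- Cells]
     || #xmlElement{content = Cells} <- Rows].

%% Hypothetical sample input, standing in for the elided table().
demo() ->
    cells("<table>"
          "<tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>"
          "<tr><td>Jill</td><td>Smith</td><td>50</td></tr>"
          "</table>").
```

After `c(table_cells).` in the shell, `table_cells:demo().` should return the nested lists of cell text in the same shape as the result quoted above.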
>
> If you want to keep the node's type:
>
> [[[ {Name,Text} || #xmlText{value=Text} <- Col]
> || #xmlElement{content=Col, name=Name} <- Cols]
> || #xmlElement{content=Cols} <- Rows].
>
> Gives:
>
> [[[{th,"Firstname"}],[{th,"Lastname"}],[{th,"Age"}]],
> [[{td,"Jill"}],[{td,"Smith"}],[{td,"50"}]],
> [[{td,"Eve"}],[{td,"Jackson"}],[{td,"94"}]]]
>
> This is a bit hard to read because of the nested list comprehensions; I haven't taken the time to clean the code up.
>
>> Is it reasonable to require xmerl as a dependency of erlpress_core and go through the many transformations required to extract cell data for every cell in a, perhaps, large table?
>>
>
> It's not that bad, considering xmerl is part of the standard library, but that's a fair concern anyway.
>
>> How likely is it that a user will include full-fledged XML data in table cells?
>>
>
> The sad thing is that XML is _simpler_ to parse than HTML and all its variants (because they are less strict, they allow for more stuff to happen).
>
> The question, though, is what syntax you aim to support. Should people be able to style text using tags like <strong>, <em>, <code>, <tt>, and so on? Or do you expect literal text always? What you accept or refuse defines what you can deal with.
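If inline styling tags were accepted inside cells, one option would be to flatten each cell to plain text by walking its children. The following is a minimal sketch under that assumption, not part of the thread's code; the module name `cell_text` and the function `text/1` are hypothetical.

```erlang
-module(cell_text).
-export([text/1]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Concatenate all text found under a node, dropping the inline tags
%% themselves (e.g. <em>, <strong>) but keeping the text they wrap.
text(#xmlText{value = Value}) ->
    Value;
text(#xmlElement{content = Content}) ->
    lists:flatmap(fun text/1, Content);
text(_Other) ->
    %% Comments, processing instructions, etc. contribute no text.
    "".
```

For example, after `{E, _} = xmerl_scan:string("<td>Jill <em>Smith</em></td>").`, calling `cell_text:text(E).` yields the cell's text with the `<em>` tag removed. Keeping the styling instead would mean returning the tag names alongside the text, as in the `{Name,Text}` variant shown earlier in the thread.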
>
>> If so then maybe we need to suck up the pain.
>>
>> Or, maybe we just specify that XML data is not permitted in tables submitted to erlpress_core.
>>
>> Any thoughts?
>>
>
> You could say XML data is not supported. That does not prevent you from using the XML parser rather than writing your own.
>
> For example, what does someone do when they want to use the '<td>' string from within the table to avoid breaking your own parser? What's the escape sequence? Using XML as a parser, you get it for free: &gt; is > and &lt; is <:
>
> 72> xmerl_scan:string("<td> bf&lt;td&gt;aaa</td>").
> {#xmlElement{...
> content = [#xmlText{value = " bf<td>aaa" ...}],
> ...},
> []}
>
> You can see the resulting string being " bf<td>aaa" despite already being in a <td> element. No confusion to be had.
>
> If you don't use the parser, you have to come up with these rules yourself, and implement them properly. That's a lot of work :)
>
>