[erlang-questions] How can I break this string into a list of strings?

Sun Dec 25 06:21:00 CET 2016

Hi Lloyd!

On 2016年12月24日 土曜日 16:34:22 lloyd@REDACTED wrote:
> Hello,
> 
> A naive question. 
> 
> But every time I look into the re module my heart starts pounding. I spend uncountable time figuring out what I need to do. But this time I failed.
> 
> Suppose I have the following string:
> 
>     "<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>"
> 
> I would like to break it into a list:
> 
>     ["<h1>Hello!</h1>", "<h2>How are you?</h2>", "<p>Some text\n and more text.</p>"]
> 
> string:token(MyString, "$\n") doesn't work because it would break the paragraph. 
> 
> So I try:
> 
>     re:replace(Copy, [$>,$\n], [$>,$",$,,$"], [global, {return, list}]).
> 
> But I get:
> 
>    "<h1>Hello!</h1>\",\"     <h2>How are you?</h2>\",\"    <p>This is....
> 
> I don't want those pesky escaped quotes at the end of the heads. I want the real deal. 
> 
> But how can I make it happen?

Maybe we can elevate this to a more general problem, like "how can I interpret a list of angle-brackety notation and return a list that contains the parts of it in a way that is meaningful to my Erlang program?"

Fortunately the days of "its OK if you have a closing tag, or not, but maybe just in case, sometimes" are over. So these days we know that, at least by HTML5 and XML type rules, we will always have an
explicit closing tag, a slash, or a very few tags that are "self-closing" meaning that when we encounter them we can know right away that we won't encounter an ending one.

We could get into some listy wizardry with lists of regexes operating over your list of charcters that flip it between binaries and lists and lists of binaries and other nonsense... but I've never seen that lead to a happy place, an understandable place, and it always winds up having dramatically worse performance characteristics (by every criteria, which is remarkable).

So let's instead decide that we have an accumulator, which is a deep list (so it mimics the way the elements in the HTML look, like a tree of tags), your original list, and the idea that we need to have a mode for recording label of the tag (the "h1" or "h2" or whatever), recording the data enclosed by the tag ("hello!" and other text we actually care about), and a matching mode to find previous tags that are being closed. 3 modes, basically.

To start with I want to move through the list, passing whatever I encounter to an accumulator. Most simple explicit recursion we can do -- so let's write a tinkering module for it:

  -module(htmler).
  -export([consume/1]).

  -spec consume(String) -> Result
      when String :: unicode:chardata(),
           Result :: {ok, Structure :: list()}
                   | {error, Reason :: term()}.

  consume(String) ->
      consume(String, "").

  consume([Char | Rest], Acc) ->
      consume(Rest, [Char | Acc]);
  consume([], Acc) ->
      {ok, lists:reverse(Acc)}.

...and now give it a try...

  1> c(htmler).
  {ok,htmler}
  2> htmler:consume("<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>").
  {ok, "<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>"}

OK, works as expected.

If I encounter a $< then I want to start recording the tag label. If I hit whitespace I want to just pass by it until I hit the first meaningful character of the label. I add the label characters to an accumulator for the label. When I encounter whitespace I want to stop recording the tag label, but still move through the next characters until the closing $>.

  -module(htmler).
  -export([consume/1]).

  -spec consume(String) -> Result
      when String :: unicode:chardata(),
           Result :: {ok, Structure :: list()}
                   | {error, Reason :: term()}.

  consume(String) ->
      consume(String, "").

  consume([$< | Rest], Acc) ->
      case consume_label(Rest, "") of
          {ok, Label, Remainder} ->
              consume(Remainder, [Label | Acc]);
          {error, Label} ->
              {error, {bad_tag, Label}}
      end;
  consume([Char | Rest], Acc) ->
      consume(Rest, [Char | Acc]);
  consume([], Acc) ->
      {ok, lists:reverse(Acc)}.

  consume_label([$\s | Rest], Acc) ->
      consume_label(Rest, Acc);
  consume_label([$\t | Rest], Acc) ->
      consume_label(Rest, Acc);
  consume_label(String, Acc) ->
      consume_label_text(String, Acc).

  consume_label_text([$> | Rest], Acc) ->
      Label = lists:reverse(Acc),
      {ok, Label, Rest};
  consume_label_text([$\s | Rest], Acc) ->
      consume_label_close(Rest, Acc);
  consume_label_text([$\t | Rest], Acc) ->
      consume_label_close(Rest, Acc);
  consume_label_text([Char | Rest], Acc) ->
      consume_label_text(Rest, [Char | Acc]);
  consume_label_text([], Acc) ->
      Label = lists:reverse(Acc),
      {error, Label}.

  consume_label_close([$> | Rest], Acc) ->
      Label = lists:reverse(Acc),
      {ok, Label, Rest};
  consume_label_close([_ | Rest], Acc) ->
      consume_label_close(Rest, Acc).

And now let's give that a try...

  3> c(htmler).
  {ok,htmler}
  4> io:format("~tp~n", [htmler:consume("<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>")]).
  {ok,["h1",72,101,108,108,111,33,"/h1",10,32,32,32,32,32,"h2",72,111,119,32,97,
       114,101,32,121,111,117,63,"/h2",10,32,32,32,32,"p",83,111,109,101,32,116,
       101,120,116,10,32,97,110,100,32,109,111,114,101,32,116,101,120,116,46,
       "/p"]}
  ok

Ah, so we're on the right track, but not quite where we want to be. Looking at this I can tell that we probably don't want to just split things up, but extract some meaning. Maybe what we really want is something like a list of tuples with a label that tells us the kind of thing it is, and then has a list containing the data that the label applies to. Like:

  [{"h1", "Hello!"}, {"h2", "How are you?"}, {"p", "Some text\n and more text."}]

I think we can change this a bit to get to that point, but it will require that we remember the tag we are currently in case, for example, there is an <em> tag inside the <p> component or whatever else -- with these kinds of tags its not about finding a closing tag, it is about finding out which closing tag you've found. Maybe we want some part of this to actually be non-tail recursive -- so we can work on just the tag in which we are parsing at the moment and not remember everything else.

  -module(htmler).
  -export([consume/1]).

  -spec consume(String) -> Result
      when String :: unicode:chardata(),
           Result :: {ok, Structure :: list()}
                   | {error, Reason :: term()}.

  consume(String) ->
      consume(String, []).

  consume([$< | Rest], Acc) ->
      case consume_element(Rest) of
          {ok, Element, Remainder} ->
              consume(Remainder, [Element | Acc]);
          {error, Element} ->
              {error, {bad_tag, Element}}
      end;
  consume([Char | Rest], Acc) ->
      consume(Rest, [Char | Acc]);
  consume([], Acc) ->
      {ok, lists:reverse(Acc)}.

  consume_element(String) ->
      case consume_label(String, "") of
          {ok, Label, Rest} ->
              {ok, Contents, Remainder} = consume_contents(Label, Rest, ""),
              {ok, {Label, Contents}, Remainder};
          Error ->
              Error
      end.

  consume_label([$\s | Rest], Acc) ->
      consume_label(Rest, Acc);
  consume_label([$\t | Rest], Acc) ->
      consume_label(Rest, Acc);
  consume_label(String, Acc) ->
      consume_label_text(String, Acc).

  consume_label_text([$> | Rest], Acc) ->
      Label = lists:reverse(Acc),
      {ok, Label, Rest};
  consume_label_text([$\s | Rest], Acc) ->
      consume_label_end(Rest, Acc);
  consume_label_text([$\t | Rest], Acc) ->
      consume_label_end(Rest, Acc);
  consume_label_text([Char | Rest], Acc) ->
      consume_label_text(Rest, [Char | Acc]);
  consume_label_text([], Acc) ->
      Label = lists:reverse(Acc),
      {error, Label}.

  consume_label_end([$/, $> | Rest], Acc) ->
      Label = lists:reverse(Acc),
      {ok, Label, Rest};
  consume_label_end([$> | Rest], Acc) ->
      Label = lists:reverse(Acc),
      {ok, Label, Rest};
  consume_label_end([_ | Rest], Acc) ->
      consume_label_end(Rest, Acc).

  consume_contents(Label, [$<, $/ | Rest], Acc) ->
      {ok, Label, Remainder} = consume_label(Rest, ""),
      Contents = lists:reverse(Acc),
      {ok, Contents, Remainder};
  consume_contents(Label, [$< | Rest], Acc) ->
      {ok, Element, Remainder} = consume_element(Rest),
      consume_contents(Label, Remainder, [Element | Acc]);
  consume_contents(Label, [Char | Rest], Acc) ->
      consume_contents(Label, Rest, [Char | Acc]);
  consume_contents(_, "", Acc) ->
      Contents = lists:reverse(Acc),
      {ok, Contents}.

And how will that work out, I wonder...

  13> io:format("~tp~n", [htmler:consume("<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>")]).
  {ok,[{"h1","Hello!"},
       10,32,32,32,32,32,
       {"h2","How are you?"},
       10,32,32,32,32,
       {"p","Some text\n and more text."}]}
  ok

How about with that nested tag case I was worried about earlier?

  14> io:format("~tp~n", [htmler:consume("<h1>Hello!</h1>\n     <h2>How are <em>you?</em></h2>\n    <p>Some text\n and more text.</p>")]).
  {ok,[{"h1","Hello!"},
       10,32,32,32,32,32,
       {"h2",[72,111,119,32,97,114,101,32,{"em","you?"}]},
       10,32,32,32,32,
       {"p","Some text\n and more text."}]}
  ok

Ah! Nice. So it really does nest the "contents" list.

So now we have a more semantically correct split happening, and it lets us descend down a tree of HTML tags -- assuming some really huge things that we can't normally assume on the big steaming pile of errors, exceptions and malformed output we call the web. For example:

  2> htmler:consume("<body><h1>Hello</h1><p>foo</body>").
  ** exception error: no match of right hand side value {ok,"body",[]}
       in function  htmler:consume_contents/3 (htmler.erl, line 66)
       in call from htmler:consume_element/1 (htmler.erl, line 28)
       in call from htmler:consume_contents/3 (htmler.erl, line 70)
       in call from htmler:consume_element/1 (htmler.erl, line 28)
       in call from htmler:consume/2 (htmler.erl, line 13)

Whoops! Forgot the closing </p> tag, so everything catches fire.

  3> htmler:consume("<body><h1>Hello</h1><p>foo</p></body>").
  {ok,[{"body",[{"h1","Hello"},{"p","foo"}]}]}

Ah, much better.

Anyway, handling bad input in HTML is an art, not a science. This could be made dramatically more complex in the interest of accepting as much crap in the tubes as Firefox does -- or left simple and made to return explicit error messages with an indication what part of consuming the input failed (hint: this is a great way to get a headstart on creating a version that accepts bad input). (Also, of course, we would need to put in some cases to catch when the label is one of the implicitly "self-closing" tags, and explicitly self-closed tag, and a few other things to be complete.)

Anyway, this doesn't exactly address the problem you have, but hopefully it gives you some ideas about how to parse angle-brackety syntaxes that manage to be both dramatically more complex and profoundly less useful than S-expressions.

Regexs just don't cut it in this particular case.

Merry Christmas! I hope you are having a good holiday with the family!

-Craig