[erlang-questions] List Question

Tue Aug 8 00:24:06 CEST 2017

On 8/08/17 1:46 AM, Andrew McIntyre wrote:
> Hello Craig,
>
> Thanks for your help.
>
> I am trying to store the data as efficiently as possible.
What do you mean "efficiently"?

Do you mean "in the least SPACE possible"?
In that case you should consider using binaries instead
of strings.

Do you mean "taking the least TIME possible for operations"?
In that case, you need to tag your data so that the code
that carries out the operations KNOWS right away what it is
dealing with.

> Its HL7
> natively and this is my test:

hl7.org lists a lot of standards; being unfamiliar with
them all I don't know which you are referring to.  And
you have to sign up in order to read any of them, and I
dislike signing up for things when I don't know the
consequences of signing up.

>
> OBX|17|FT~TEST|8265-1^^LN&SUBCOMP|1&2&3&4|\H\Spot Image 2\N\||||||F
>
> |~^& are delimiters.

I take it that "|" is the first level delimiter (why the heck
are they not using the Information Separator characters ASCII
has specifically for this purpose?), ~ is the second level,
^ is the third level, and & is the fourth level.

Ah, wait:

<hl7 message> ::= <hl7 segment> {"\r" <hl7 segment>}...

<hl7 segment> ::= <hl7 field> {"|" <hl7 field>}...

<hl7 field> ::= <hl7 subfield> {"^" <hl7 subfield>}...

<hl7 subfield> ::= <hl7 subsubfield> {"&" <subsubfield>}...

<hl7 subsubfield> ::= <hl7 repeating> {"~" <hl7 repeating>}

<hl7 repeating> ::= {[^\r|^&~\\]|\.]}...

The first field of a segment appears to be a
message type dentifier.

One data structure for this would be
-type message()     = list(segment()).
-type segment()     = list(field()).
-type field()       = list(subfield()).
-type subfield()    = list(subsubfield())
-type subsubfield() = list(repeating()).
-type repeating()   = string().

[[[["OBX"]]], [[["17"]]], [[["FT", "TEST"]]], [[["8265-1"]], [[[]]], 
[["LN"], ["SUBCOMP"]]], [[["1"], ["2"], ["3"], ["4"]]], [[["\\H\\Spot 
Image 2\\N\\"]]], [[[""]]], [[[""]]], [[[""]]], [[[""]]], [[[""]]], 
[[["F"]]]]

This is completely unambiguous.

Now let's suppose we use a slightly different data structure.
-type hier() = string() | {many,list(hier())}.

We get
{many,["OBX","17",{many,["FT","TEST"]},{many,["8265-1","",{many,["LN","SUBCOMP"]}]},{many,["1","2","3","4"]},"\\H\\Spot 
Image 2\\N\\","","","","","","F"]}

Note that this is unambiguous but pays the price of a
{many,_} wrapper only when there is more than one subpart,
so the relative overhead is not so large.

first({many,[H|_]) -> first(H);
first(X)           -> X.

> Currently a typical system might have 12 million of these records so
> want to keep format as small as possible in the erlang format,

The way to do that would be to compress each segment, keep the entire
segment as a binary, and only decompress and parse it when you need
to look at that particular segment.

A rough estimate of the size of this particular example is 108 words,
which is actually less than the string took.  Let's say that an
average message takes 200 words in the {many,_} | string() form.
32-bit:  9.6 GB, not going to fit.
64-bit: 19.2 GB, this machine has 16 GB, oh dear.

Even if we assume the average segment is 100 bytes,
12 million of those is 1.2 GB (excluding the cost
of a binary for each segment.  That's definitely
doable.

So we're looking at three options, I think:
(1) Keeping segments as binaries, and parsing them every time
     you need to look inside.
(2) Keeping segments as Erlang data structures in a Mnesia
     table (that is, on disc).
(3) Streaming messages through an Erlang program instead of
     storing them all in memory at once.

  hence
> reluctant to tag 2 much, but know how to get value of interest. Maybe
> that is my non erlang background showing up? Traversing 4 small lists
> by index should be fast??
>
> I guess I could save strings as binary in the lists then is_binary
> should work?? Is that the case. I gather on 64bit system especially
> binary is more space efficient.
>
> Monday, August 7, 2017, 10:53:11 PM, you wrote:
>
> z> On 2017年08月07日 月曜日 22:29:31 you wrote:
>>> Hello zxq9,
>>>
>>> Thanks, Unfortunately I do not know the value of the string that will
>>> be there. Its an extensible hierarchy that can be several lists deep -
>>> or not. Might need to revise the data structure
>
> z> In this case it can be useful to consider a way of tagging values.
>
> z> Imagine we want to represent a directory tree structure and have a
> z> descent-first traversal function recurse over it while creating the
> z> tree. We have two things that can happen, there is a flat list of
> z> new directories that need to be created, and there is the
> z> possibility that the tree depth extends deeper at each node.
>
> z> The naive version would look like what you have:
>
> z> ["top_dir_1",
> z>  "top_dir_2",
> z>  ["next_level_1",
> z>   "next_level_2"]]
>
> z> This leaves a bit to be desired, not only because of the problem
> z> you have pointed out that makes it difficult to know what is deep
> z> and what is shallow, but also because we don't really have a good
> z> way to represent a full tree (what would be the name of a directory containing other directories?).
>
> z> So consider instead something like this:
>
> z> [{"top_dir_1", []},
> z>  {"top_dir_2", []},
> z>  {"top_dir_3",
> z>   [{"next_level_1", []},
> z>    {"next_level_2", []}]}]
>
> z> Now we have a representation of each directory's name AND its contents.
>
> z> We can traverse this laterally AND in depth without any ambiguity
> z> or need for carrying around a record of where we have been (by
> z> using depth recursion and tail-call recursion):
>
>
> z> make_tree([{Dir, Contents} | Rest]) ->
> z>     ok =
> z>         case filelib:is_dir(Dir) of
> z>             true ->
> z>                 ok;
> z>             false ->
> z>                 ok = log(info, "Creating dir: ~p", [Dir]),
> z>                 file:make_dir(Dir)
> z>         end,
> z>     ok = file:set_cwd(Dir),
> z>     ok = make_tree(Contents),
> z>     ok = file:set_cwd(".."),
> z>     make_tree(Rest);
> make_tree([]) ->>
> z>     ok.
>
>
> z> Not so bad.
>
> z> In your case we could represent things perhaps a bit better by
> z> separating the types and tagging them. Instead of just "FT" and
> z> whatever other string labels you might want, you could either use
> z> atoms (totally unambiguous) or tuples as we have in the example
> z> able (also totally unambiguous). I prefer tuples, though, because they are easier to read.
>
> z> [{value, "foo"},
> z>  {tree,
> z>   [{value, "bar"},
> z>    {value, "foo"}]},
> z>  {value, "baz"}]
>
>
> z> So then we do something like:
>
>
> z> traverse([{value, Value} | Rest]) ->
> z>    ok = do_thing(Value),
> z>    traverse(Rest);
> z> traverse([{tree, Contents} | Rest]) ->
> z>    ok = traverse(Contents),
> z>    traverse(Rest);
> traverse([]) ->>
> z>    ok.
>
>
> z> Anyway, don't be afraid of varying your value types to say exactly
> z> what you mean. If your strings like "FT" only had meaning within
> z> your system consider NOT USING STRINGS, and using atoms instead. That makes it even easier:
>
>
> z> [foo,
> z>  bar,
> z>  [foo,
> z>   bar],
> z>  foo]
>
>
> z> So then we can do:
>
>
> z> traverse([foo | Rest]) ->
> z>     ok = do_foo(),
> z>     traverse(Rest);
> z> traverse([bar | Rest]) ->
> z>     ok = do_bar(),
> z>     traverse(Rest);
> z> traverse([Value | Rest]) when is_list(Value) ->
> z>     ok = traverse(Value),
> z>     traverse(Rest);
> traverse([]) ->>
> z>     ok.
>
>
> z> And of course, you can not use a guard if you want to match on a
> z> list shape in the listy clause there, but that is a minor detail.
> z> The point is to make your data types MEAN SOMETHING REASONABLE
> z> within your system. Use atoms when your values are meaningful only
> z> within your system. Strings are for the birds.
>
> z> -Craig
> z> _______________________________________________
> z> erlang-questions mailing list
> z> erlang-questions@REDACTED
> z> http://erlang.org/mailman/listinfo/erlang-questions
>
>
>