[erlang-questions] List Question

Mon Aug 7 16:39:05 CEST 2017

On 2017年08月07日 月曜日 23:46:52 you wrote:
> Hello Craig,
> 
> Thanks for your help.
> 
> I am trying to store the data as efficiently as possible. Its HL7
> natively and this is my test:
> 
> OBX|17|FT~TEST|8265-1^^LN&SUBCOMP|1&2&3&4|\H\Spot Image 2\N\||||||F
> 
> |~^& are delimiters. The hierarchy is only so deep and using lists of
> lists to provide a tree like way to access the data eg Field 3, repeat
> 1 component 2 subcomponent1
> 
> Parsed it looks like this:
> 
> [["OBX","17",
>   ["FT","TEST"],
>   [["8265-1",[],["LN","SUBCOMP"]]],
>   [[["1","2","3","4"]]],
>   "\\H\\Spot Image 2\\N\\",[],[],[],[],[],"F"]]
> 
> As the format evolves over time the hierarchy can be extended, but
> older clients can still read the value they are expecting if they
> follow the rules, like reading the first value in the list when you
> only expect one value to be there.
> 
> Currently a typical system might have 12 million of these records so
> want to keep format as small as possible in the erlang format, hence
> reluctant to tag 2 much, but know how to get value of interest. Maybe
> that is my non erlang background showing up? Traversing 4 small lists
> by index should be fast??
> 
> I guess I could save strings as binary in the lists then is_binary
> should work?? Is that the case. I gather on 64bit system especially
> binary is more space efficient.

We would really want to know a bit about the data's natural semantics before leaping to conclusions regarding "most efficient" or especially "best" internal representation.

On the note of "internal representation"... do you need 12 million records in memory at a given time, or are these things you can store as strings and then pull the ones you want as you go? Do you need random access to records? To be able to query them? Are you performing aggregate operations over them? All of these might change the answer of "what is best".

For example, if you need a few tens of millions of records in memory at a given time AND the short strings like "FT" and "TEST" are always members of a smallish, finite set you know beforehand then atoms will be much more space efficient if that is your main concern. Atoms are tiny in memory (essentially like using integer constants in a C program), but they serialize back to strings in storage so that can change things. For example, the atom 'test' (or 'TEST') is smaller in memory than either the string "TEST" or the binary <<"TEST">>.

In any case, just to get the code right first so you have a platform to change things around I would recommend TOTALLY FORGETTING ABOUT THIS to start with and jump straight to tagged pairs. This will keep your semantics straight and dramatically reduce the mental overhead of writing traversal functions. You can switch that stuff around later -- but first you need things to work at least partially before you can start measuring stuff.

For example:

> [["OBX","17",
>   ["FT","TEST"],
>   [["8265-1",[],["LN","SUBCOMP"]]],
>   [[["1","2","3","4"]]],
>   "\\H\\Spot Image 2\\N\\",[],[],[],[],[],"F"]]

I would parse that to something like this:

[{foo, "OBX"},
 {bar, 17},
 {baz, {"FT", "TEST"}},
 [[{biz, "8265-1"}
   {boz, []},
   {ballz, {"LN", "SUBCOMP"}}]],
 [[{bonz, [1, 2, 3, 4]}]],
 {coz, "\\H\\Spot Image 2\\N\\"},
 {fuzz, "F"}]

The leading tag should always disambiguate the MEANING of the element you are encountering in your lists. Here I used junk terms, but I think you get the idea. Whatever atom labels you use should be unambiguous and mean something to the humans reading the code (or crash dumps).

The overhead here VS the list version is the atoms and enclosing tuple (a few bytes for both) -- which is negligible compared to the cost of using actual strings directly in your code. Remember, an atom is basically an alias for a constant, so don't fret over it too much -- just don't start doing insane things like dynamically generating an arbitrary number of unique atoms!

But again, remember, we don't care how big this is in memory yet. What we care about is that it is easy to traverse and know what you mean when you do so, and also that it is easy to unambiguously cast the data from an old format to a new one (which is inherently easier with tagged data than with semantically amibugous lists of lists).

Obviously giving further advice would be a bit easier if I knew more about the meaning of the data itself, but anyway, the primary concern here is getting the logic right first. You can crunch things down later -- and most of the time that usually has much less to do with finding a "small" internal representation for those moments while you are actively handling some bits of data in memory and a LOT more to do with how you are going to index serialized data on disk (ETS/DETS/Mnesia may be a good route for that, or Postgres, or maps, whatever -- depending on what you're up to).

Hopefully I explained more than confused.

-Craig