[erlang-questions] Update with complicated data structure

Wed Nov 25 05:13:23 CET 2015

On 24/11/2015, at 7:51 pm, YuanZhiqian <on-your-mark@REDACTED> wrote:

> Hi Richard,
> 
>   Thanks for your elaborative comments. Some of the advices are helpful and I appreciate a lot for that; but some of them are not quite on the track ...

Well, if I'm not _shown_ the track it's hard to stay on it.
> 
>   For example, regarding the array size issue, you see, it's just a sample where my point is focused on the assignment expression, so I use an array instead of a hash table which I actually have used in my project, for that would make the code messed up with less important parts. I am certainly not a C language tard = =

"Tard"?  What does that mean?
https://en.wikipedia.org/wiki/Tardigrade ?

> 
>   And please allow me to clarify my purpose of this question, this is not an erlang v.s. c&c++ debate,

That was understood; no clarification needed.

> I am just facing an actual problem in real scenario, and need some practical advice.

While it's not a C vs Erlang debate, it *is* a matter
of trying to do things the C way in an Erlang program.
> 
>   So it is not that I insist in making erlang behave the same as imperial language, but that it is a real requirement in the project that I'm trying to solve.

I doubt that very much.  Requirements talk about what a program
does, as seen from outside.  How the collection of companies is
held, whether as a list, as a gb tree, as a map, as an ETS
table, or even in the process dictionary (for safety, that
had better be the process dictionary of a separate process)
should be entirely up to you.  I have serious trouble
believing in requirements that say
 "You must use a data structure that makes search
  and update take a long time."

Your requirements may well say
  "Update the consumption of company X by D"
but of course you haven't shown us the requirements.

One thing may have struck you already.
If you are using a relational data base,
changing one field of one record at a time is
a good way to run quite slowly.  One way to get
good performance out of RDBs is *batching*.

Now if you kept the list of N companies *sorted* on increasing
company_id, and if you had a *batch* of M companies to be
updated, also in a list sorted on increasing company_id,
then you can *merge* those two lists and produce a new list
in O(N+M) time.  (This is what old-timers used to do with
tape drives before even CODASYL data bases.)  But if you do
it one company at a time, it's going to take O(N*M) time.
If you use a gb_tree, O(lg(N)*M).

So there is a really serious question about whether you really
*HAVE* to update one field of one record, or whether you have
a *batch* of fields, and possibly multiple updates.
> 
>   As for the naming issue ... ok you are right, but does it really matter?

Yes, it does.

> At least not as important in my case ...
> 
>   And you are asking me why I am using size_t instead of int, I don't know if I am right, but size_t seems an intended type for array indices,

(a) It cannot possibly be.  C existed long before size_t did.
(b) The name *tells* you what it's meant for: it is meant for
    the *sizes* of objects.
(c) Kernighan & Ritchie, "The C programming language", 2nd edition
    consistently uses int for array indexing.  But what do they
    know?  They only invented the language.

> since it is unassigned,

I think you mean 'unsigned', and that's actually bad.

> and codes from google use size_t as well.

Google invented Go.  They did not invent C and are
not authorities on how it was meant to be used.

> Try to compile using int, but configure your error level to the most strict mode, turn on "warning as error" option, then you won't get the code compiled.

I have no idea what you are talking about.
Code that indexes arrays using int gets through gcc, clang,
lint, and splint at the highest levels with no warnings.

Good C compilers warn about using *char* as an array index,
because the programmer has no way of telling beforehand
whether char is signed or unsigned, so
a[(char)255] might or might not involve a negative subscript.
But invalidating int would mean breaking almost all the C
code ever written for Unix.

I just checked the source code for (original) vi, and yep,
it uses int subscripts, including "tp[-1]", which is not
actually an unusual thing to do in C (tp being a pointer
after the beginning of an array), and no usable C compiler is
going to treat that as an error.

>   So the point is, thanks again for your help, but unfortunately I am getting more confused with too much irrelevant comments.

How about telling us what the actual requirements really
actually are?
How about explaining why you are using a list?
I advised you to use a gb_tree (or something of that sort).
Someone else suggested a new-fangled 'map'.
Both are good suggestions.  What's wrong with them?
How many companies are you carrying information about?