optimization of list comprehensions
Richard A. O'Keefe
ok@REDACTED
Fri Mar 10 02:24:00 CET 2006
Mats Cronqvist <mats.cronqvist@REDACTED> wrote:
> not to be outdone by wiger (and since o'keefe said please), i ran a few greps
> on 1,748,162 lines of erlang sources (comments included). i estimate that some
> 3-400 people have contributed to the code sample.
> i counted how many times each of the patterns below appeared per module.
>
>             "[ (]fun[ (]"   "lists:fold"   "lists:map"   "lists:foreach"
> never           64%             89%            84%            85%
> <3 times        81%             97%            94%            94%
The pattern illustrated for "fun", if it is to be taken seriously as a
grep-style pattern, is guaranteed to miss many of the funs in the OTP
sources. I have a program called 'm2h' (for Many To Html -- and other
output formats) which can be told to extract selected tokens.
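
To make the undercounting concrete, here is a minimal Python sketch that applies the quoted pattern line by line, as grep would. The Erlang lines are invented for illustration and are not taken from the OTP sources; the names PATTERN, SAMPLES, COMMENT and matched are likewise hypothetical.

```python
import re

# The grep-style pattern quoted above: 'fun' with a space or '(' on each side.
PATTERN = re.compile(r"[ (]fun[ (]")

# Invented Erlang lines; every one of them uses the keyword 'fun'.
SAMPLES = [
    "    lists:map(fun(X) -> X + 1 end, L),",  # '(' before, '(' after
    "    F = fun (X) -> X end,",               # ' ' before, ' ' after
    "\tfun(X) -> X end",                       # tab before
    "    Fs = [fun(X) -> X end],",             # '[' before
    "    {ok,fun(X) -> X end}",                # ',' before
    "fun(X) -> X end",                         # nothing before (start of line)
    "    spawn(fun",                           # nothing after (end of line)
]

# An invented comment line: the word appears, but no fun is defined.
COMMENT = "%% apply the fun (see below)"


def matched(line):
    """True if the quoted pattern occurs anywhere in the line."""
    return PATTERN.search(line) is not None


hits = [line for line in SAMPLES if matched(line)]
misses = [line for line in SAMPLES if not matched(line)]

print(f"{len(hits)} of {len(SAMPLES)} fun lines matched")
for line in misses:
    print("missed:", repr(line))
print("comment line matched:", matched(COMMENT))
```

Counting keyword tokens, rather than matching on the surrounding characters, avoids both the misses and the match on the comment line.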
So

    find . -name '*.[ehy]rl' -exec m2h -ik "{}" ";" | \
        grep fun | wc

tells me how many lines contain the keyword 'fun' in the OTP sources.
The answer is 3300 funs in 412598 SLOC or about one fun every 125 lines;
3300 funs in 1359 modules or about 2.4 funs per module.
There are more *.[ehy]rl files (1675) than modules; unsurprisingly
.hrl files tend not to contain funs.

    > summary(f)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0.000   0.000   0.000   1.971   1.000 105.000
    > table(f)
       0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
    1233  112   65   39   41   31   13   15   12    6   14    6   13   10    6    6
      16   17   18   19   20   21   22   23   24   26   27   29   30   32   34   36
      10    5    4    2    1    1    1    3    2    1    2    2    1    1    3    3
      38   39   46   47   56   62   70   78   92  105
       1    1    2    1    1    1    1    1    1    1

So about 74% of the OTP R9 source files did not contain even one 'fun'.
(For .hrl and .yrl files this is completely unsurprising.)
But files that _do_ contain 'funs' tend to contain a lot of them:

    > summary(f[f>0])
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1.000   1.000   4.000   7.468   9.000 105.000

> 36% of the modules defined at least one fun. however, 15% of all modules had
> at least one call to mnesia:transaction. so i feel pretty confident stating that
> ~80% of all modules have no (non-mnesia related) funs.
This is not that far from the roughly 74% of OTP source files that have no 'fun's.
> another observation is that although there were ~2 fun definitions per module,
> 4% of the modules account for 50% of the fun definitions.
In the OTP R9 sources, files with more than 15 funs accounted for just under
50% of the fun definitions. That's just 53 files, or about 4% of modules.
So again, a similar figure.
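
The figures drawn from the R session can be cross-checked from the frequency table alone. The Python sketch below is not part of the original analysis (which used R); the names TABLE and top_heavy are illustrative. It recomputes the share of files without funs, the two means, and how concentrated the funs are. The table as printed sums to 3301 funs, consistent with the roughly 3300 lines counted by grep, and it gives 53 files at a threshold of 16 funs and 59 files at a threshold of 15.

```python
# Frequency table from the R session: funs per file -> number of such files.
TABLE = {
    0: 1233, 1: 112, 2: 65, 3: 39, 4: 41, 5: 31, 6: 13, 7: 15,
    8: 12, 9: 6, 10: 14, 11: 6, 12: 13, 13: 10, 14: 6, 15: 6,
    16: 10, 17: 5, 18: 4, 19: 2, 20: 1, 21: 1, 22: 1, 23: 3,
    24: 2, 26: 1, 27: 2, 29: 2, 30: 1, 32: 1, 34: 3, 36: 3,
    38: 1, 39: 1, 46: 2, 47: 1, 56: 1, 62: 1, 70: 1, 78: 1,
    92: 1, 105: 1,
}

files = sum(TABLE.values())                   # all *.[ehy]rl files
funs = sum(n * c for n, c in TABLE.items())   # all lines containing 'fun'
files_with_funs = files - TABLE[0]

share_without = TABLE[0] / files              # fraction of files with no fun
mean_all = funs / files                       # 'Mean' in summary(f)
mean_nonzero = funs / files_with_funs         # 'Mean' in summary(f[f>0])


def top_heavy(threshold):
    """Return (file count, share of all funs) for files with >= threshold funs."""
    count = sum(c for n, c in TABLE.items() if n >= threshold)
    held = sum(n * c for n, c in TABLE.items() if n >= threshold)
    return count, held / funs


print(f"{files} files, {funs} funs")
print(f"files without funs: {share_without:.1%}")
print(f"mean per file: {mean_all:.3f}; per file with funs: {mean_nonzero:.3f}")
for k in (15, 16):
    count, share = top_heavy(k)
    print(f"files with >= {k} funs: {count} files holding {share:.1%} of funs")
```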
For what it's worth, I find only 906 occurrences of "||" in the
OTP R9 sources, so funs are nearly four times as common as list
comprehensions.
> one possible interpretation is that most of this code was produced by
> programmers that rarely use funs if they don't have to.
> obviously, the methodology is much too weak to *prove* this.
> i'm not sure it can be proven, since there's no way to establish
> who wrote what. most modules have been touched by at least a
> handful of people, and everyone has worked on more than one module.
When it comes to version control systems, I'm a troglodyte. I still use
SCCS whenever I can. One of the things SCCS has been able to do since I
first used it about 25 years ago is tell you who wrote each line. Surely
any decent version control system can do this?
> note also that most of this code sample has been used for
> years in a very challenging environment,
And this is surely the explanation. Old code does not contain funs
because old Erlang didn't *have* funs. They are not described in the
classic Erlang book.
> it does seem safe to say that one can write well working Erlang code while
> rarely, if ever, using funs (excluding mnesia use, as always).
The common characteristics of these two code bodies probably reflect
similar histories: old code written before various language features
were added, and new features used only in new or changed code.
The language has changed. What *was* 'perfectly fine' Erlang isn't any
more. It's perfectly fine in that it still works, but new code should
not be written that way.
For an analogy, consider Fortran. I consider myself an expert Fortran 77
programmer. I have written, and can still write, 'perfectly fine'
Fortran 77 code. I wrote some only two months ago. However, that code
is *not* 'perfectly fine' Fortran 90, let alone Fortran 95. It uses
COMMON blocks instead of MODULEs, the occasional statement function
instead of nested procedures, has more GOTOs than modern Fortran needs,
and makes no use at all of array expressions, even when they are
appropriate (which they often are).
To decide whether programmers are using the new language features
effectively, we have to look at code written *after* the new language
features were described in training material and *after* programmers
could trust that their code would not need to run under old releases
that didn't support these features.
In fact, given that we are talking about source code that has been
around for a while and not fixed when it wasn't broken, I draw the
opposite conclusion to Mats from his own figures.
Hypothesis:
The proportion of 'funs' (and list comprehensions) in files
is increasing with time, so that 'industrial programmers'
*are* taking up the new features, they just aren't rewriting
old code for the fun of it.