optimization of list comprehensions

Fri Mar 10 02:24:00 CET 2006

Mats Cronqvist <mats.cronqvist@REDACTED> wrote:
	not to be outdone by wiger (and since o'keefe said please), i ran a few greps 
	on 1,748,162 lines of erlang sources (comments included). i estimate that some 
	3-400 people have contributed to the code sample.
	   i counted how many times each of the patterns below appeared per module.

	         "[ (]fun[ (]"  "lists:fold" "lists:map" "lists:foreach"
	never          64%            89%          84%          85%
	<3 times       81%            97%          94%          94%

The pattern illustrated for "fun", if it is to be taken seriously as a
grep-style pattern, is guaranteed to miss many of the funs in the OTP
sources.   I have a program called 'm2h' (for Many To Html -- and other
output formats) which can be told to extract selected tokens.
So
    find . -name '*.[ehy]rl' -exec m2h -ik "{}" ";" | \
    grep fun | wc
tells me how many lines contain the keyword 'fun' in the OTP sources. 
The answer is 3300 funs in 412598 SLOC or about one fun every 125 lines;
3300 funs in 1359 modules or about 2.4 funs per module.
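
To make the objection to the quoted pattern concrete, here is a small
Python sketch that applies it to a handful of Erlang-looking lines.  The
sample lines are invented for illustration, not taken from OTP, and the
word-boundary pattern is only a crude stand-in for real token
extraction; it is not how m2h works.

```python
import re

# The pattern as quoted: a space or "(" on both sides of "fun".
QUOTED = re.compile(r"[ (]fun[ (]")
# Stand-in for keyword extraction: "fun" as a whole word.  Still naive,
# since it would also match inside comments, strings and quoted atoms.
WORD = re.compile(r"\bfun\b")

# Invented sample lines (assumption: typical layouts, not real OTP code).
SAMPLES = [
    "    F = fun(X) -> X + 1 end,",            # space before, "(" after
    "    lists:map(fun(X) -> X * 2 end, L),",  # "(" before, "(" after
    "    G = fun lists:reverse/1,",            # space before and after
    "\tfun(X) -> X end",                       # tab before
    "fun(X) -> X end",                         # start of line
    "    H =fun(X) -> X end,",                 # "=" before
    "    spawn(Node,fun() -> ok end),",        # "," before
    "    Handler = fun",                       # end of line after
]

def count_matching(pattern, lines):
    """Number of lines on which the pattern matches at least once."""
    return sum(1 for line in lines if pattern.search(line))

quoted_hits = count_matching(QUOTED, SAMPLES)
word_hits = count_matching(WORD, SAMPLES)
print(quoted_hits, "of", len(SAMPLES), "lines match the quoted pattern")
print(word_hits, "of", len(SAMPLES), "lines contain the word 'fun'")
```

On these eight samples the quoted pattern matches only the first three,
although every line contains the keyword.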

There are more *.[ehy]rl files (1675) than modules; unsurprisingly
.hrl files tend not to contain funs.

> summary(f)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   1.971   1.000 105.000 

> table(f)
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
1233  112   65   39   41   31   13   15   12    6   14    6   13   10    6    6 

  16   17   18   19   20   21   22   23   24   26   27   29   30   32   34   36 
  10    5    4    2    1    1    1    3    2    1    2    2    1    1    3    3 

  38   39   46   47   56   62   70   78   92  105 
   1    1    2    1    1    1    1    1    1    1 

So about 74% of the OTP R9 source files did not contain even one 'fun'.
(For .hrl and .yrl files this is completely unsurprising.)
But files that _do_ contain 'funs' tend to contain a lot of them:

> summary(f[f>0])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   4.000   7.468   9.000 105.000 
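
The summary figures can be re-derived from the table(f) output alone.
The Python sketch below is only a cross-check; the dictionary is
transcribed by hand from the table above, and the variable names are my
own.

```python
# Transcribed from the table(f) output: {funs in file: number of files}.
TABLE = {
    0: 1233, 1: 112, 2: 65, 3: 39, 4: 41, 5: 31, 6: 13, 7: 15, 8: 12,
    9: 6, 10: 14, 11: 6, 12: 13, 13: 10, 14: 6, 15: 6, 16: 10, 17: 5,
    18: 4, 19: 2, 20: 1, 21: 1, 22: 1, 23: 3, 24: 2, 26: 1, 27: 2,
    29: 2, 30: 1, 32: 1, 34: 3, 36: 3, 38: 1, 39: 1, 46: 2, 47: 1,
    56: 1, 62: 1, 70: 1, 78: 1, 92: 1, 105: 1,
}

files = sum(TABLE.values())                        # all source files
funs = sum(n * count for n, count in TABLE.items())
files_with_funs = files - TABLE[0]

share_without = TABLE[0] / files                   # files with no fun
mean_all = funs / files                            # mean over all files
mean_nonzero = funs / files_with_funs              # mean over f > 0

print(files, "files,", funs, "funs")
print("no fun at all: %.1f%%" % (100 * share_without))
print("mean: %.3f   mean where f > 0: %.3f" % (mean_all, mean_nonzero))
```

The table sums to 1675 files and 3301 funs (against the round 3300
quoted earlier), with 73.6% of files containing none, and the two means
agree with the summary() output.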

	   36% of the modules defined at least one fun. however, 15% of all modules had 
	at least one call to mnesia:transaction. so i feel pretty confident stating that 
	  ~80% of all modules have no (non-mnesia related) funs.

This is not that far from the 74% of OTP source files that have no 'fun's.

	   another observation is that although there were ~2 fun definitions per module, 
	4% of the modules account for 50% of the fun definitions.

In the OTP R9 sources, files with more than 15 funs accounted for just under
50% of the fun definitions.  That's just 53 files, or about 4% of modules.
So again, a similar figure.
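
The concentration figure can be checked against the same table(f)
output.  This Python sketch tallies the files at or above a given fun
count; the table is transcribed by hand, the module count of 1359 is
the one quoted earlier, and the helper name `tail` is my own.

```python
# Transcribed from the table(f) output: {funs in file: number of files}.
TABLE = {
    0: 1233, 1: 112, 2: 65, 3: 39, 4: 41, 5: 31, 6: 13, 7: 15, 8: 12,
    9: 6, 10: 14, 11: 6, 12: 13, 13: 10, 14: 6, 15: 6, 16: 10, 17: 5,
    18: 4, 19: 2, 20: 1, 21: 1, 22: 1, 23: 3, 24: 2, 26: 1, 27: 2,
    29: 2, 30: 1, 32: 1, 34: 3, 36: 3, 38: 1, 39: 1, 46: 2, 47: 1,
    56: 1, 62: 1, 70: 1, 78: 1, 92: 1, 105: 1,
}
MODULES = 1359  # module count quoted earlier in the post

def tail(threshold):
    """Return (files, funs) for files holding at least `threshold` funs."""
    files = sum(c for n, c in TABLE.items() if n >= threshold)
    funs = sum(n * c for n, c in TABLE.items() if n >= threshold)
    return files, funs

total_funs = sum(n * c for n, c in TABLE.items())
for threshold in (15, 16):
    files, funs = tail(threshold)
    print("%d+ funs: %d files (%.1f%% of modules), %.1f%% of all funs"
          % (threshold, files, 100 * files / MODULES,
             100 * funs / total_funs))
```

With the threshold at 16 the tally is 53 files holding 1624 of the 3301
funs, which is just under half; with the threshold at 15 it is 59 files
and slightly over half.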

For what it's worth, I find only 906 occurrences of "||" in the
OTP R9 sources, so funs are about 3.6 times as common as list
comprehensions.

	one possible interpretation is that most of this code was produced by 
	programmers that rarely use funs if they don't have to.
	obviously, the methodology is much too weak to *prove* this.
	i'm not sure it can be proven, since there's no way to establish
	who wrote what.  most modules have been touched by at least a
	handful of people, and everyone has worked on more than one module.

When it comes to version control systems, I'm a troglodyte.  I still use
SCCS whenever I can.  One of the things SCCS has been able to do since I
first used it about 25 years ago is tell you who wrote each line.  Surely
any decent version control system can do this?

	   note also that most of this code sample has been used for
	years in a very challenging environment,

And this is surely the explanation.  Old code does not contain funs
because old Erlang didn't *have* funs.  They are not described in the
classic Erlang book.

	   it does seem safe to say that one can write well working Erlang code while 
	rarely, if ever, using funs (excluding mnesia use, as always).

The common characteristics of these two code bodies probably reflect
similar histories:  old code written before various language features
were added, and new features used only in new or changed code.

The language has changed.  What *was* 'perfectly fine' Erlang isn't any
more.  It's perfectly fine in that it still works, but new code should
not be written that way.

For an analogy, consider Fortran.  I consider myself an expert Fortran 77
programmer.  I have written, and can still write, 'perfectly fine'
Fortran 77 code.  I wrote some only two months ago.  However, that code
is *not* 'perfectly fine' Fortran 90, let alone Fortran 95.  It uses
COMMON blocks instead of MODULEs, the occasional statement function
instead of nested procedures, has more GOTOs than modern Fortran needs,
and makes no use at all of array expressions, even when they are
appropriate (which they often are).

To decide whether programmers are using the new language features
effectively, we have to look at code written *after* the new language
features were described in training material and *after* programmers
could trust that their code would not need to run under old releases
that didn't support these features.

In fact, given that we are talking about source code that has been
around for a while and not fixed when it wasn't broken, I draw the
opposite conclusion to Mats from his own figures.

    Hypothesis:
        The proportion of 'funs' (and list comprehensions) in files
        is increasing with time, so that 'industrial programmers'
        *are* taking up the new features, they just aren't rewriting
        old code for the fun of it.
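
One way the hypothesis might be checked, sketched in Python.  It assumes
something neither code sample provides here: a per-file record of the
year the file was written, its line count, and its fun count (the year
could in principle come from version-control history).  The function
name and the record layout are my own invention.

```python
from collections import defaultdict

def funs_per_kloc_by_year(records):
    """Pool files by year and return {year: funs per 1000 source lines}.

    `records` is an iterable of (year, sloc, funs) tuples, one per file.
    Years whose files total zero lines are left out.
    """
    sloc = defaultdict(int)
    funs = defaultdict(int)
    for year, lines, count in records:
        sloc[year] += lines
        funs[year] += count
    return {y: 1000 * funs[y] / sloc[y] for y in sorted(sloc) if sloc[y] > 0}

# Made-up records, only to exercise the function; NOT measurements.
toy = [(1996, 2000, 0), (1996, 1000, 3), (2004, 500, 6), (2004, 1500, 10)]
density = funs_per_kloc_by_year(toy)
for year, per_kloc in density.items():
    print(year, per_kloc)
```

If the hypothesis holds, the density should rise with the year for real
data; pooling lines per year rather than averaging per-file ratios keeps
a few tiny files from dominating the figure.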