[erlang-questions] disk merging

Joe Armstrong <>
Fri Oct 26 20:26:17 CEST 2007

To answer my own question: I think I have found a good algorithm.

1) Start with an empty tree on machine 1 (the master tree).
   The goal is to move all files into the master tree (or delete them).

2) Write code that adds directories to the master tree, subject to a
   number of rules:

   a) The system is told where in the tree to add the files.

   b) The system suggests places where the new directory tree should be
      added. It might offer several alternatives, and the user selects
      the correct one.

   Nothing is added if any files in the new directory are already in
   the tree (use MD5 checksums to check this).

   If files in the new directory are very similar to files in the master
   tree (same name, small difference), then the user is asked to choose
   which is the "best" file.
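
The "already in the tree" rule could be sketched roughly like this. This
is only an illustration in Python, not code from the actual system; the
function names (md5_of_file, index_tree, duplicates_in) are made up for
the example.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_tree(root):
    """Map MD5 digest -> first path seen, for every file under root."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index.setdefault(md5_of_file(path), path)
    return index


def duplicates_in(new_dir, master_index):
    """Return (new_path, master_path) pairs for files already in the tree."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(new_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            existing = master_index.get(md5_of_file(path))
            if existing is not None:
                found.append((path, existing))
    return found
```

If duplicates_in returns a non-empty list for a candidate directory, the
rule above says nothing is added. Because the lookup is by content, a
file that is in the master tree under a different name is still found.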

Having moved the files into the master tree on machine 1, I can
replicate the entire tree on machine 2.

The key to this is a good algorithm to guess where in the master tree
new directories and individual files should be placed.

For images this is easy (I think): extract the date from the metadata
and move them to /tree/pics/year/ ... where year comes from the image
metadata.
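
A minimal sketch of the year-based placement, in Python. It assumes the
date has already been read from the image as an EXIF-style string
("YYYY:MM:DD HH:MM:SS"); reading the metadata itself needs an image
library and is not shown. The function names are hypothetical, and only
the /tree/pics/year directory is computed, nothing below it.

```python
import posixpath
from datetime import datetime


def year_from_exif_datetime(value):
    """Return the year from an EXIF-style date string, or None if unparsable."""
    try:
        return datetime.strptime(value, "%Y:%m:%d %H:%M:%S").year
    except (TypeError, ValueError):
        # Missing or malformed metadata: the caller has to ask the user.
        return None


def pictures_dir(tree_root, year):
    """Return the directory under the master tree for pictures from `year`."""
    return posixpath.join(tree_root, "pics", str(year))
```

When year_from_exif_datetime returns None, there is no automatic
placement and the system would have to fall back to asking the user.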

Most of my images have names like /some/funny/path/holidayInOozamba/xxx.jpg
If I know the year of xxx.jpg is 2001 (from the metadata) then I can
move these to

I'd also use the MD5 sum of xxx.jpg to make sure the file is not
already in the master tree under some different name.

Music? I need to guess the artist/album/track from a combination of
the filename and any embedded ID3 tags. I know that about 5-10% of all
mp3 files have incorrect ID3 tags (I found this out while writing the
Erlang book).
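
One way to combine the two sources, sketched in Python. This assumes a
path layout of .../artist/album/NN - track.mp3, which is only a guess at
a common convention, and it takes the ID3 tags as an already-parsed dict
(tag reading is not shown). All names here are made up for the example.

```python
import re
from pathlib import PurePosixPath

# Optional leading track number and separator, e.g. "03 - " or "03_".
TRACK_PATTERN = re.compile(r"^(?:\d+\s*[-._ ]\s*)?(?P<title>.+)$")


def guess_from_path(path):
    """Guess artist/album/track from an assumed artist/album/track layout."""
    pure = PurePosixPath(path)
    parts = pure.parts
    match = TRACK_PATTERN.match(pure.stem)
    return {
        "artist": parts[-3] if len(parts) >= 3 else None,
        "album": parts[-2] if len(parts) >= 2 else None,
        "track": match.group("title").strip() if match else None,
    }


def reconcile(path_guess, tags):
    """Split fields into chosen values and conflicts needing a human."""
    chosen, conflicts = {}, {}
    for field in ("artist", "album", "track"):
        from_path = path_guess.get(field)
        from_tag = tags.get(field)
        if from_path and from_tag:
            if from_path.casefold() == from_tag.casefold():
                chosen[field] = from_tag
            else:
                # The two sources disagree: do not guess, report both.
                conflicts[field] = (from_path, from_tag)
        elif from_path or from_tag:
            chosen[field] = from_path or from_tag
    return chosen, conflicts
```

Since some tags are wrong, the sketch does not trust either source on
its own: a field is accepted only when both agree or only one is
present, and any disagreement is handed back for the user to decide.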

Are there any plain-text listings of (artist, album, track) tuples for
the 10^6 most popular songs? This would be useful.

If all else fails (I have a file xxx.mp3 with no decent tags and no
file or directory name hints), is there some program that can predict
the information I need? Is there a web service that can do this
automatically?


On 10/26/07, Robert Raschke <> wrote:
> > Does anybody know of a good algorithm to consolidate/merge all this
> > data or do I have
> > to write my own? One immediate thought is to compute the MD5 sums of
> > all files on all
> > disk and thus find all duplicates - then create a master copy of all
> > unique files
> > but the file names will be wrong and this might result in a big mess.
> >
> > This cannot be an uncommon problem - any ideas how to solve it?
> >
> > /Joe
> For a low-level (i.e., file system) approach, have a look at Plan 9's Venti (consolidates on block level, not file):
> http://cm.bell-labs.com/sys/doc/venti/venti.html
> I believe there's an implementation that can run as a user level program under Unix in http://swtch.com/plan9port/
> Robby
