[erlang-questions] Erlang Exercise 19-3

Richard O'Keefe raoknz@REDACTED
Wed Oct 31 09:29:10 CET 2018


The question means something like this.
You are to construct and then use a table
   source : checksum -> set of filename
by doing
   for each filename given
      open the file
      for each 40-character window of the contents of the file
         compute the checksum of the window
         add filename to source{checkum}
      close the file

   s := {}
   open the test data
      for each 40-character window of the contents of the testdata
         compute the checksum of the window
         let s' be source{checksum} with default {}
         s := s U s'
      close the test data
   report the set of possible source file names s

Now the test is to accomplish, in Erlang, the same
effect as that would, without necessarily doing it
that way.

This is actually a fairly simplistic version of an
Information Retrieval task called Passage Retrieval.
With a little more work, you can not only report
that the test data may contain portions of various
files, but *which* portions.  With a bit more work
than that, you can rank them, distinguishing files
that contribute many passages from ones that contribute
few.  The neat thing here is that, done with care,
you can handle very large document collections in
Erlang without as much work as you might have to do in
languages lacking a native-data-structures persistent store.

When you look at this functionally, you see that this is
basically a Map (over files) Reduce (the set unions) and
that the index building could be done in a distributed
way, again with relatively little effort.  (The set
unions I'm talking about are the "add ... to ..." ones.
This is idempotent, so indexing the same document on two
nodes is wasteful but not dangerous.)


On Wed, 31 Oct 2018 at 18:55, カカキ <heturing@REDACTED> wrote:

> Hello, everyone.
>
> I'm a little confused with an Exercise, and I cannot understand the
> question.
>
> Here is the question.
>
> Write a program to detect plagiarisms in text. To do this, use a two-pass
> algorithm. In pass 1, break the text into 40-character blocks and compute a
> checksum for each 40-character block. Store the checksum and filename in an
> ETS table. In pass 2, compute the checksum of each 40-character block in
> the data and compare with the checksums in the ETS table.
>
> Hint: You will need to compute a  "rolling checksum" to do this. For
> example, if C1 = B1 + B2 + ... + B40 and C2 = B2 + B3+ ... + B41, then C2
> can be quickly computed by obversing that C2 = C1 + B41 - B1.
>
> What I've done is finding two files, say file1.txt and file2.txt. I want
> to check whether file2 plagiarizes file1. So I get first 40-characters of
> file1 and calculate the checksum of it. And then I move the block one
> character to the right and calculate the checksum. At last, I will get a
> list of checksum of file1, and I store them into an ETS table.
>
> On pass two, I do the same thing with file2 and get a list of checksum
> [C1, C2, ..., Cn]. For each element, I check whether it exists in ETS table
> or not. If exists, it means that there is a block with the same content in
> file 1. So it is plagiarism. Finally, I count the number of these blocks
> and output it.
>
> However, it seems that I haven't use the filename. So I am wondering that
> there is something wrong with my opinion.
>
> Thanks.
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20181031/16c9e7a8/attachment.htm>


More information about the erlang-questions mailing list