[erlang-questions] Erlang Exercise 19-3

Wed Oct 31 11:10:44 CET 2018

I'd have two types of entries in  the ets table
first a set of files
{{file,1},"/path/to/file1"}
{{file,2},"/path/to/file2"}
and a set of hashes
{{<Hash>, {<FileNumber>,<ByteOffset>}}

<Hash> is your 40 character Hash, <FileNumber> is 1,... (ie one of the
numbers in the table)
<ByteOffset> is the offset in the file that has this hash.

You compute this table in pass one iterating over all files in your collection

In pass 2 you compute the rolling hash and look at each value in the
ets table. Don't bother to store all the rolling hashes in a list. If
you get a hit try to extend the match to see how much more data
matches.

The sum of bytes hash is actually rather bad and will give a lot of
false hits (ie the hash agrees but the
 text differs) If you want a better rolling hash algorithm use one of
the methods from this reference
https://en.wikipedia.org/wiki/Rolling_hash

Cheers

/Joe
On Wed, Oct 31, 2018 at 9:29 AM Richard O'Keefe <raoknz@REDACTED> wrote:
>
> The question means something like this.
> You are to construct and then use a table
>    source : checksum -> set of filename
> by doing
>    for each filename given
>       open the file
>       for each 40-character window of the contents of the file
>          compute the checksum of the window
>          add filename to source{checkum}
>       close the file
>
>    s := {}
>    open the test data
>       for each 40-character window of the contents of the testdata
>          compute the checksum of the window
>          let s' be source{checksum} with default {}
>          s := s U s'
>       close the test data
>    report the set of possible source file names s
>
> Now the test is to accomplish, in Erlang, the same
> effect as that would, without necessarily doing it
> that way.
>
> This is actually a fairly simplistic version of an
> Information Retrieval task called Passage Retrieval.
> With a little more work, you can not only report
> that the test data may contain portions of various
> files, but *which* portions.  With a bit more work
> than that, you can rank them, distinguishing files
> that contribute many passages from ones that contribute
> few.  The neat thing here is that, done with care,
> you can handle very large document collections in
> Erlang without as much work as you might have to do in
> languages lacking a native-data-structures persistent store.
>
> When you look at this functionally, you see that this is
> basically a Map (over files) Reduce (the set unions) and
> that the index building could be done in a distributed
> way, again with relatively little effort.  (The set
> unions I'm talking about are the "add ... to ..." ones.
> This is idempotent, so indexing the same document on two
> nodes is wasteful but not dangerous.)
>
>
> On Wed, 31 Oct 2018 at 18:55, カカキ <heturing@REDACTED> wrote:
>>
>> Hello, everyone.
>>
>> I'm a little confused with an Exercise, and I cannot understand the question.
>>
>> Here is the question.
>>
>> Write a program to detect plagiarisms in text. To do this, use a two-pass algorithm. In pass 1, break the text into 40-character blocks and compute a checksum for each 40-character block. Store the checksum and filename in an ETS table. In pass 2, compute the checksum of each 40-character block in the data and compare with the checksums in the ETS table.
>>
>> Hint: You will need to compute a  "rolling checksum" to do this. For example, if C1 = B1 + B2 + ... + B40 and C2 = B2 + B3+ ... + B41, then C2 can be quickly computed by obversing that C2 = C1 + B41 - B1.
>>
>> What I've done is finding two files, say file1.txt and file2.txt. I want to check whether file2 plagiarizes file1. So I get first 40-characters of file1 and calculate the checksum of it. And then I move the block one character to the right and calculate the checksum. At last, I will get a list of checksum of file1, and I store them into an ETS table.
>>
>> On pass two, I do the same thing with file2 and get a list of checksum [C1, C2, ..., Cn]. For each element, I check whether it exists in ETS table or not. If exists, it means that there is a block with the same content in file 1. So it is plagiarism. Finally, I count the number of these blocks and output it.
>>
>> However, it seems that I haven't use the filename. So I am wondering that there is something wrong with my opinion.
>>
>> Thanks.
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions