add jplag doc
fanghon committed Dec 3, 2019
1 parent 04d9f43 commit 20dd9a8
Showing 30 changed files with 633 additions and 91 deletions.
19 changes: 13 additions & 6 deletions README.md
@@ -1,5 +1,5 @@
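# antiplag: similarity checker for program code and document assignments
The software checks and compares the similarity between electronic assignments submitted by students. It can compare and analyze source code in several programming languages (such as java, c/c++, python) and the text similarity between documents (such as lab reports) in several formats (txt, doc, docx, pdf, html), in Chinese or English, simplified or traditional. It outputs the highly similar documents, which helps uncover copying between students.
The software checks and compares the similarity between electronic assignments submitted by students. It can compare and analyze source code in several programming languages (such as java, c/c++, python) and the text similarity between documents (such as lab reports) in several formats (txt, doc, docx, pdf, etc.), in Chinese or English, simplified or traditional. It outputs the highly similar documents, which helps uncover copying between students.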

## Requirements
[jdk11](https://www.oracle.com/technetwork/java/javase/downloads/jdk11-downloads-5066655.html)
@@ -14,28 +14,31 @@
![Main program window](./maingui.png)

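## How it works
The main technique used by the system is text similarity computation from natural language processing (nlp). Similarity comparison of program text is based on 3 open systems:
The main technique used by the system is text similarity computation from string processing and natural language processing (nlp).

Similarity comparison of program text is based on 3 open systems:
* First, the web-service-based [MOSS system](http://theory.stanford.edu/~aiken/moss/) (a system made available by Stanford University that supports code similarity comparison for many programming languages);
* Second, the locally executed [sim system](https://dickgrune.com/Programs/similarity_tester/) (supports text similarity comparison for java, c and other languages);
* Third, the locally executed [jplag system](https://github.com/jplag/jplag/) (supports text similarity comparison for java, c/c++, python and other languages).

This system builds on them with further development and wrapping. For moss, it provides a client access module that implements code file submission, result retrieval and parsing, and result sorting. sim and jplag are integrated into the system and can serve as substitutes when moss is unavailable, for example because of a network failure.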

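Similarity comparison of Chinese and English document assignments is based on the [shinglecloud algorithm](https://www.kom.tu-darmstadt.de/de/research-results/0/1/shinglecloud/) (a fast, language-independent similarity computation method based on text fingerprints). The main document processing steps are:
Two algorithms are provided for comparing the similarity of Chinese and English document assignments:

The first is based on the [shinglecloud algorithm](https://www.kom.tu-darmstadt.de/de/research-results/0/1/shinglecloud/) (a fast, language-independent similarity computation method based on text fingerprints). The main document processing steps are:
1. Use tika to read the text content of files in different formats (txt, doc, docx, pdf, html, etc.) and different encodings, and convert it into text that can be processed uniformly;
2. Use hanlp to preprocess and segment the text;
3. Use the shinglecloud algorithm to compute the similarity between texts;
4. Sort by similarity and output the comparison results.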
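
The core of steps 3 and 4 can be sketched in a much simplified form. The example below is an illustration only, not the project's implementation: it skips text extraction (tika) and segmentation (hanlp), splits on whitespace instead, and uses plain Jaccard similarity over word n-gram shingles as a stand-in for the actual shinglecloud algorithm. The names `ShingleSketch`, `shingles` and `similarity` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: word n-gram shingles compared with Jaccard similarity.
// This is NOT shinglecloud itself and it assumes already-extracted, whitespace-separated text.
class ShingleSketch {

    // Build the set of n-token shingles of a text (lower-cased, split on whitespace).
    static Set<String> shingles(String text, int n) {
        Set<String> result = new HashSet<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, in the range 0..1.
    static double similarity(String a, String b, int n) {
        Set<String> sa = shingles(a, n);
        Set<String> sb = shingles(b, n);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) intersection.size() / union.size();
    }
}
```

Computing `similarity` for every pair of documents and sorting the pairs in descending order would correspond to step 4. For Chinese text, the whitespace split would have to be replaced by a real segmenter, which is the role hanlp plays in the steps above.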

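The second is based on jplag's GST algorithm, whose functionality has been extended: the added "doc" language type can compute similarity for various kinds of documents and provides web-page-based visual comparison.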
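
For readers unfamiliar with GST (Greedy String Tiling), here is a minimal sketch of the basic idea over two sequences of token ids. It is a plain quadratic version written for illustration, under the assumption that `minMatch` is at least 1. It is not jplag's optimized implementation, and the names `GstSketch`, `coveredTokens` and `similarity` are hypothetical. The similarity formula, twice the covered tokens divided by the combined length, follows the definition in the JPlag paper listed in the references.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of Greedy String Tiling on integer token sequences.
class GstSketch {

    // Number of tokens covered by non-overlapping tiles of length >= minMatch.
    static int coveredTokens(int[] a, int[] b, int minMatch) {
        boolean[] markedA = new boolean[a.length];
        boolean[] markedB = new boolean[b.length];
        int covered = 0;
        int maxMatch;
        do {
            maxMatch = minMatch;
            List<int[]> matches = new ArrayList<>(); // each entry: {startA, startB, length}
            for (int i = 0; i < a.length; i++) {
                if (markedA[i]) continue;
                for (int j = 0; j < b.length; j++) {
                    if (markedB[j]) continue;
                    int k = 0;
                    // Extend the match while tokens agree and are not yet part of a tile.
                    while (i + k < a.length && j + k < b.length
                            && a[i + k] == b[j + k]
                            && !markedA[i + k] && !markedB[j + k]) {
                        k++;
                    }
                    if (k > maxMatch) {
                        matches.clear();
                        maxMatch = k;
                        matches.add(new int[] {i, j, k});
                    } else if (k == maxMatch) {
                        matches.add(new int[] {i, j, k});
                    }
                }
            }
            for (int[] m : matches) {
                // Skip matches that overlap a tile created earlier in this round.
                boolean free = true;
                for (int k = 0; k < m[2]; k++) {
                    if (markedA[m[0] + k] || markedB[m[1] + k]) {
                        free = false;
                        break;
                    }
                }
                if (!free) continue;
                for (int k = 0; k < m[2]; k++) {
                    markedA[m[0] + k] = true;
                    markedB[m[1] + k] = true;
                }
                covered += m[2];
            }
        } while (maxMatch > minMatch);
        return covered;
    }

    // Similarity in the range 0..1: 2 * covered / (|a| + |b|).
    static double similarity(int[] a, int[] b, int minMatch) {
        if (a.length + b.length == 0) {
            return 0.0;
        }
        return 2.0 * coveredTokens(a, b, minMatch) / (a.length + b.length);
    }
}
```

The `minMatch` parameter suppresses short coincidental matches: a larger value makes the comparison stricter, a smaller one more sensitive.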

### References:
1. [Software Plagiarism Detection Techniques: A Comparative Study](http://www.ijcsit.com/docs/Volume%205/vol5issue04/ijcsit2014050441.pdf)
2. [JPlag: Finding plagiarisms among a set of programs](http://page.mi.fu-berlin.de/prechelt/Biblio/jplagTR.pdf)
3. [Winnowing: Local Algorithms for Document Fingerprinting](http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf) (the core algorithm used by the moss system)
4. [软件抄袭检测研究综述 (a survey of software plagiarism detection research)](https://faculty.ist.psu.edu/wu/papers/spd-survey-16.pdf)

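## Changelog
1. 2019.12.1 Switched to hanlp as the word segmentation component, added duplicate checking for text in pdf and html files, fixed several bugs, released v2.8.6.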

## TODO
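1. Integrate jplag into the system. Done.
2. Support duplicate checking of code in html and jsp files.
@@ -45,4 +48,8 @@

Born from open source and given back to open source. Open source is a virtue, and so is starring :smile: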

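## Changelog
1. 2019.12.1 Switched to hanlp as the word segmentation component, added duplicate checking for text in pdf and html files, fixed several bugs, released v2.8.6.
2. 2019.12.3 Extended jplag with a "doc" language type, implementing similarity computation and visual comparison for document text in multiple formats. Updated the usage help and test data, released v2.8.8.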


3 changes: 3 additions & 0 deletions bin/.gitignore
@@ -1,2 +1,5 @@
/preprocess/
/gui/
/jplag/
/utils/
/shingle/
Binary file modified bin/gui/plag/edu/PlagGUI$5.class
Binary file not shown.
Binary file modified bin/gui/plag/edu/PlagGUI$6.class
Binary file not shown.
Binary file modified bin/gui/plag/edu/PlagGUI$7.class
Binary file not shown.
Binary file modified bin/gui/plag/edu/PlagGUI$8.class
Binary file not shown.
Binary file modified bin/gui/plag/edu/PlagGUI$9.class
Binary file not shown.
Binary file modified bin/gui/plag/edu/PlagGUI.class
Binary file not shown.
Binary file modified bin/preprocess/plag/edu/TextExtractor.class
Binary file not shown.
Binary file modified bin/shingle/plag/edu/ShingleSim$Fileter.class
Binary file not shown.
Binary file modified bin/shingle/plag/edu/ShingleSim.class
Binary file not shown.
Binary file modified bin/utils/edu/AntFile.class
Binary file not shown.
Binary file modified bin/utils/edu/FileIO.class
Binary file not shown.
Binary file modified bin/utils/edu/StreamGobbler.class
Binary file not shown.
Binary file modified bin/utils/edu/WinCMD.class
Binary file not shown.
19 changes: 10 additions & 9 deletions help.txt
@@ -7,7 +7,7 @@
��������������£�
��1��ѡ�񱻼���ļ�Ŀ¼�������ѡ����ҵ����ť�����ļ�ѡ��Ի����У�ѡ��testdata
Ŀ¼�µ���Ŀ¼���硰javaabctograde����Ŀ¼��
��2��ȷ����������ȷ����ҵ�����ǡ�������ҵ�������ƶ���ֵ30����⹤��moss������
��2��ȷ����������ȷ����ҵ�����ǡ�������ҵ�������ƶ���ֵ�ʵ�����⹤��moss������
��������java��
��3��ִ�бȽϡ������ִ�бȽϡ���ť���ȴ�ϵͳȥstanford��mossϵͳ��վ���ύ
��javaabctograde���µ���չ��Ϊjava�Ĵ����ļ������������صĽ����������Ϻ󣬻ᵯ
@@ -18,14 +18,15 @@
��ϵͳ��testdataĿ¼���ṩ�˳������Ķ������������ĵ�����ѧϰ������ʹ�á�

2 �ĵ��ı������ƶȼ��
�ĵ����IJ�������ͳ������ļ�ⲽ�����һ�£�ֻ�ǡ�ѡ����ҵ��ʱ��ѡ������ĵ���ҵ��
�磺testdata/doccn�µ���ҵ���ĵ��ļ�����չ��������txt��doc��docx��pdf��html
���е�һ�֡�
��ҵ�����ǡ��ı���ҵ����Ȼ������ִ�бȽϡ���ť���ȴ���ȷ�ϴ�����������������鿴�����
��ť��ϵͳ��򿪡��ȽϽ�������ڣ����Բ鿴�ȽϽ����
�ı��ĵ��ıȽ�Ŀǰ�ݲ�֧��ͨ���������ҳ���п��ӻ��Աȡ�
����ֻ���Ӣ���ĵ��������������ѡ��jplag�����µġ�text�����ͷ�ʽ�����Զ�Ӣ���ĵ���
sql���롢html��jsp�ļ�(���ǵ��ļ���չ������ij�txt)�Ƚ��м򵥿��ӻ��ȶԡ�
Ŀǰ֧�����ַ�ʽ��
(1)ʹ��shinglecloud�㷨�Ƚϡ������������ͳ������ļ�ⲽ�����һ�£�ֻ��ѡ����ҵ����ʱ��
ѡ����ǡ��ı���ҵ������(�磺testdata/doccn�µ��ļ�)���ĵ��ļ�����չ��������txt��doc��
docx��pdf��html��
���ı���ҵ����ʽ�µıȽ�Ŀǰ�ݲ�֧��ͨ���������ҳ���п��ӻ��Աȡ�
(2)ʹ��jplag��GST�㷨�Ƚϡ�ϵͳ��չ��ԭJplag�Ĺ��ܣ������ˡ�doc���������ͣ����Լ����ָ�
ʽ�ĵ������ƶȣ�����֧�ֻ�����ҳ�Ŀ��ӻ��ȶԡ���������롰������롱���IJ�����ͬ����⹤��ѡ��
Jplag����������ѡ��doc�����ɡ�Jplag�µ�text�������͸��ʺϼ�ⴿӢ���ĵ������ƶȡ�


�� ����׼��
ϵͳ֧��2�����������ʽ.
112 changes: 59 additions & 53 deletions out.txt
@@ -2,56 +2,62 @@
2 92.47312% gumingzhu-2.doc zhucuiyun_2.doc
3 91.408936% wangmeng-2.doc zhucuiyun_2.doc
4 87.63636% dongxiao-2.docx dongxiaoutf8-2.txt
5 84.717606% gumingzhu-2.doc wangmeng-2.doc
6 84.310844% dongxiao-2.doc dongxiao-2.pdf
7 84.168015% dongxiao-2.doc dongxiaoutf8-2.txt
8 83.870964% dongxiao-2.pdf dongxiaogbk.txt
9 83.68336% dongxiaogbk.txt dongxiaoutf8-2.txt
10 82.954544% dongxiao-2.docx dongxiaogbk.txt
11 82.552505% dongxiao-2.doc dongxiao-2.docx
12 75.74404% lijie-2.doc wangmeng-2.doc
13 74.96063% gumingzhu-2.doc wuchangqing-2.doc
14 71.703705% dongxiao-2.pdf dongxiaoutf8-2.txt
15 71.49254% dongxiao-2.docx dongxiao-2.pdf
16 69.92366% wuchangqing-2.doc zhucuiyun_2.doc
17 68.584076% lijie-2.doc zhucuiyun_2.doc
18 65.61151% wangmeng-2.doc wuchangqing-2.doc
19 65.12301% gumingzhu-2.doc lijie-2.doc
20 57.454544% dongxiaogbk.txt meitao-2.doc
21 57.246376% dongxiao-2.doc meitao-2.doc
22 52.258064% lijie-2.doc wuchangqing-2.doc
23 50.757576% dongxiao-2.docx meitao-2.doc
24 50.284416% dongxiao-2.pdf meitao-2.doc
25 48.87218% makai��2.doc wangxuan_2.doc.doc
26 48.45869% dongxiaoutf8-2.txt meitao-2.doc
27 46.67074% liuchuanyang-2.doc tangwenpeng-2.doc
28 41.64096% heliwen_2.doc liufan_2.doc
29 40.54834% liufan_2.doc wangchunming_2.doc
30 38.75061% gechunlong-2.doc hanchao_2.doc
31 36.930233% luxiang-2.doc tangwenpeng-2.doc
32 36.89095% jiangfeng-2.doc lijie-2.doc
33 35.925926% weixiao-2.doc yinxu-2.doc
34 35.424637% liuchuanyang-2.doc wuliangchao-2.doc
35 35.039577% gechunlong-2.doc yinxu-2.doc
36 34.839073% gechunlong-2.doc weixiao-2.doc
37 34.325184% wangmeng-2.doc wuliangchao-2.doc
38 34.069096% guozhiquan -2.doc wuliangchao-2.doc
39 33.98907% wuliangchao-2.doc zhucuiyun_2.doc
40 32.858547% tangwenpeng-2.doc xuqiwei-2.doc
41 32.557137% tangwenpeng-2.doc wangchen-2.doc
42 32.296955% liuchuanyang-2.doc yinxu-2.doc
43 32.073547% lijie-2.doc wuliangchao-2.doc
44 32.070206% gechunlong-2.doc wangchen-2.doc
45 32.058823% jiangfeng-2.doc yinpeiyan_2.doc
46 31.946404% sunxiaolei-2.doc wangchunming_2.doc
47 31.471535% gumingzhu-2.doc wuliangchao-2.doc
48 30.698889% sunxiaolei-2.doc yinxu-2.doc
49 30.651136% liuchuanyang-2.doc xuqiwei-2.doc
50 30.63007% heliwen_2.doc wangchunming_2.doc
51 30.559345% liuchuanyang-2.doc weixiao-2.doc
52 30.494392% wangchen-2.doc xuqiwei-2.doc
53 30.429863% tangwenming-2.doc xuqiwei-2.doc
54 30.424183% tangwenming-2.doc wangchen-2.doc
55 30.095451% sunxiaolei-2.doc tangwenpeng-2.doc
56 30.065361% guozhiquan -2.doc liuchuanyang-2.doc
from fh Sun Dec 01 18:57:44 CST 2019
5 84.765625% dongxiao-2.docx dongxiao-2.html
6 84.717606% gumingzhu-2.doc wangmeng-2.doc
7 84.310844% dongxiao-2.doc dongxiao-2.pdf
8 84.168015% dongxiao-2.doc dongxiaoutf8-2.txt
9 83.870964% dongxiao-2.pdf dongxiaogbk.txt
10 83.68336% dongxiaogbk.txt dongxiaoutf8-2.txt
11 83.14176% dongxiao-2.html dongxiaoutf8-2.txt
12 82.954544% dongxiao-2.docx dongxiaogbk.txt
13 82.552505% dongxiao-2.doc dongxiao-2.docx
14 75.74404% lijie-2.doc wangmeng-2.doc
15 74.96063% gumingzhu-2.doc wuchangqing-2.doc
16 71.703705% dongxiao-2.pdf dongxiaoutf8-2.txt
17 71.49254% dongxiao-2.docx dongxiao-2.pdf
18 70.34036% dongxiao-2.html dongxiaogbk.txt
19 70.0% dongxiao-2.doc dongxiao-2.html
20 69.92366% wuchangqing-2.doc zhucuiyun_2.doc
21 68.584076% lijie-2.doc zhucuiyun_2.doc
22 65.61151% wangmeng-2.doc wuchangqing-2.doc
23 65.12301% gumingzhu-2.doc lijie-2.doc
24 60.869564% dongxiao-2.html dongxiao-2.pdf
25 57.454544% dongxiaogbk.txt meitao-2.doc
26 57.246376% dongxiao-2.doc meitao-2.doc
27 52.258064% lijie-2.doc wuchangqing-2.doc
28 50.757576% dongxiao-2.docx meitao-2.doc
29 50.284416% dongxiao-2.pdf meitao-2.doc
30 48.87218% makai��2.doc wangxuan_2.doc.doc
31 48.45869% dongxiaoutf8-2.txt meitao-2.doc
32 46.67074% liuchuanyang-2.doc tangwenpeng-2.doc
33 41.878174% dongxiao-2.html meitao-2.doc
34 41.64096% heliwen_2.doc liufan_2.doc
35 40.54834% liufan_2.doc wangchunming_2.doc
36 38.75061% gechunlong-2.doc hanchao_2.doc
37 36.930233% luxiang-2.doc tangwenpeng-2.doc
38 36.89095% jiangfeng-2.doc lijie-2.doc
39 35.925926% weixiao-2.doc yinxu-2.doc
40 35.424637% liuchuanyang-2.doc wuliangchao-2.doc
41 35.039577% gechunlong-2.doc yinxu-2.doc
42 34.839073% gechunlong-2.doc weixiao-2.doc
43 34.325184% wangmeng-2.doc wuliangchao-2.doc
44 34.069096% guozhiquan -2.doc wuliangchao-2.doc
45 33.98907% wuliangchao-2.doc zhucuiyun_2.doc
46 32.858547% tangwenpeng-2.doc xuqiwei-2.doc
47 32.557137% tangwenpeng-2.doc wangchen-2.doc
48 32.296955% liuchuanyang-2.doc yinxu-2.doc
49 32.073547% lijie-2.doc wuliangchao-2.doc
50 32.070206% gechunlong-2.doc wangchen-2.doc
51 32.058823% jiangfeng-2.doc yinpeiyan_2.doc
52 31.946404% sunxiaolei-2.doc wangchunming_2.doc
53 31.471535% gumingzhu-2.doc wuliangchao-2.doc
54 30.698889% sunxiaolei-2.doc yinxu-2.doc
55 30.651136% liuchuanyang-2.doc xuqiwei-2.doc
56 30.63007% heliwen_2.doc wangchunming_2.doc
57 30.559345% liuchuanyang-2.doc weixiao-2.doc
58 30.494392% wangchen-2.doc xuqiwei-2.doc
59 30.429863% tangwenming-2.doc xuqiwei-2.doc
60 30.424183% tangwenming-2.doc wangchen-2.doc
61 30.095451% sunxiaolei-2.doc tangwenpeng-2.doc
62 30.065361% guozhiquan -2.doc liuchuanyang-2.doc
from fh Mon Dec 02 19:18:34 CST 2019
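
The rows of out.txt above appear to follow a "rank, percentage, file pair" layout, which makes them easy to post-process. The following is a small sketch that filters rows by a similarity threshold. The line format is inferred only from the rows shown here, and the names `OutTxtSketch` and `aboveThreshold` are hypothetical. Because file names may contain spaces (for example `guozhiquan -2.doc`), the file pair is kept as one string rather than split further.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: filter result rows by similarity threshold.
class OutTxtSketch {

    // Assumed row layout: rank, percentage with a trailing '%', then the two file names.
    private static final Pattern ROW =
            Pattern.compile("^\\s*(\\d+)\\s+([0-9]+(?:\\.[0-9]+)?)%\\s+(.+)$");

    // Returns the file-pair part of every row whose score is >= threshold (0-100).
    static List<String> aboveThreshold(List<String> lines, double threshold) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (!m.matches()) {
                continue; // e.g. the trailing "from fh ..." line
            }
            double score = Double.parseDouble(m.group(2));
            if (score >= threshold) {
                hits.add(m.group(3).trim());
            }
        }
        return hits;
    }
}
```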
3 changes: 2 additions & 1 deletion src/gui/plag/edu/PlagGUI.java
@@ -161,7 +161,7 @@ public void stateChanged(ChangeEvent arg0) {
panel_1.add(label);

txtThreshold = new JTextField();
txtThreshold.setText("30");
txtThreshold.setText("50");
txtThreshold.setToolTipText("\u8BF7\u8F93\u51650-100\u4E4B\u95F4\u7684\u503C");
txtThreshold.setBounds(80, 26, 70, 21);
panel_1.add(txtThreshold);
@@ -202,6 +202,7 @@ public void itemStateChanged(ItemEvent arg0) {
combLang.addItem("c/c++");
combLang.addItem("python3");
combLang.addItem("text");
combLang.addItem("doc");
}
}
});
80 changes: 80 additions & 0 deletions src/jplag/doc/DocToken.java
@@ -0,0 +1,80 @@
package jplag.doc;


public class DocToken extends jplag.Token {

    private static final long serialVersionUID = 3800987170521573780L;


    // Returns the serial number of the lower-cased text, assigning a new one on first sight.
    public static int getSerial(String text, Parser parser) {
        text = text.toLowerCase();
        Integer obj = (Integer) parser.tokenStructure.table.get(text);
        if (obj == null) {
            // Integer.valueOf replaces the deprecated Integer constructor.
            obj = Integer.valueOf(parser.tokenStructure.serial);
            if (parser.tokenStructure.serial == Integer.MAX_VALUE)
                parser.outOfSerials();
            else
                parser.tokenStructure.serial++;
            parser.tokenStructure.table.put(text, obj);
            // A cached reverse mapping is stale once a new entry has been added.
            if (parser.tokenStructure.reverseMapping != null)
                parser.tokenStructure.reverseMapping = null;
        }
        return obj.intValue();
    }

    // throw away this method soon:

    public static String type2string(int i, TokenStructure tokenStructure) {
        if (tokenStructure.reverseMapping == null)
            tokenStructure.createReverseMapping();
        return tokenStructure.reverseMapping[i];
    }

    // ///////////////////// END OF STATIC MEMBERS

    private int line, column, length;
    private String text;

    public DocToken(int type, String file, Parser parser) {
        super(type, file, -1, -1, -1);
    }

    public DocToken(String text, String file, int line, int column,
            int length, Parser parser) {
        super(-1, file, line, column, length);
        this.type = getSerial(text, parser);
        this.text = text.toLowerCase();
    }

    public int getLine() {
        return line;
    }

    public int getColumn() {
        return column;
    }

    public int getLength() {
        return length;
    }

    public void setLine(int line) {
        this.line = line;
    }

    public void setColumn(int column) {
        this.column = column;
    }

    public void setLength(int length) {
        this.length = length;
    }

    public String getText() {
        return this.text;
    }

    public static int numberOfTokens(TokenStructure tokenStructure) {
        return tokenStructure.table.size();
    }
}
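
The central idea in `getSerial` above is that every distinct lower-cased word is mapped to a stable integer id, so that documents can later be compared as sequences of ids. A self-contained sketch of that idea follows. It uses a plain `HashMap` instead of the `Parser` and `TokenStructure` classes (which are not shown in this diff), leaves out the overflow handling and the reverse mapping, and assumes a starting serial of 1. The name `SerialTableSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: case-insensitive interning of words to integer ids.
class SerialTableSketch {

    private final Map<String, Integer> table = new HashMap<>();
    private int serial = 1; // assumed starting value

    // Returns the id of the word, assigning the next free id on first sight.
    int getSerial(String text) {
        String key = text.toLowerCase();
        Integer id = table.get(key);
        if (id == null) {
            id = serial++;
            table.put(key, id);
        }
        return id;
    }

    // Number of distinct words seen so far.
    int size() {
        return table.size();
    }
}
```

Under this scheme "Hello" and "hello" share one id, which is why the comparison ignores letter case.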
72 changes: 72 additions & 0 deletions src/jplag/doc/Language.java
@@ -0,0 +1,72 @@

package jplag.doc;

import java.io.File;

import jplag.ProgramI;

/**
 * @Changed by fanghong 2019.12.1
 *
 */
public class Language implements jplag.Language {

    private ProgramI program;

    private jplag.doc.Parser parser = new jplag.doc.Parser();

    public Language(ProgramI program) {
        this.program = program;
        this.parser.setProgram(this.program);
    }

    public int errorsCount() {
        return this.parser.errorsCount();
    }

    public String[] suffixes() {
        String[] res = { ".txt", ".doc", ".docx", ".pdf", ".html" };
        return res;
    }

    public String name() {
        return "Doc Parser";
    }

    public String getShortName() {
        return "doc";
    }

    public int min_token_match() {
        return 12;
    }

    public jplag.Structure parse(File dir, String[] files) {
        return this.parser.parse(dir, files);
    }

    public boolean errors() {
        return this.parser.getErrors();
    }

    public boolean supportsColumns() {
        return true;
    }

    public boolean isPreformated() {
        return false;
    }

    public boolean usesIndex() {
        return false;
    }

    public int noOfTokens() {
        return parser.tokenStructure.serial;
        // return jplag.text.TextToken.numberOfTokens(); // always returns 1 ....
    }

    public String type2string(int type) {
        return jplag.text.TextToken.type2string(type);
    }
}