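diff --git a/README.md b/README.md
index 5ec8d2d..4bfcde5 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 # antiplag: similarity checker for program code and document assignments
-The software checks and compares the similarity between electronic assignments submitted by students. It can compare and analyze text similarity across multiple programming languages (such as java, c/c++, and python) and across Chinese and English, Simplified and Traditional Chinese documents (such as lab reports) in multiple formats (txt, doc, docx, pdf, html), and it outputs the documents with high similarity to help detect plagiarism among students.
+The software checks and compares the similarity between electronic assignments submitted by students. It can compare and analyze text similarity across multiple programming languages (such as java, c/c++, and python) and across Chinese and English, Simplified and Traditional Chinese documents (such as lab reports) in multiple formats (txt, doc, docx, pdf, etc.), and it outputs the documents with high similarity to help detect plagiarism among students.
 ## Requirements
 [jdk11](https://www.oracle.com/technetwork/java/javase/downloads/jdk11-downloads-5066655.html)
@@ -14,28 +14,31 @@
 ![Main program window](./maingui.png)
 ## How it works
-The main technique the system uses is text similarity computation from natural language processing (nlp). Similarity comparison for program code is based on 3 open systems:
+The main technique the system uses is text similarity computation from string processing and natural language processing (nlp).
+
+Similarity comparison for program code is based on 3 open systems:
 * The first is the web-service-based [MOSS system](http://theory.stanford.edu/~aiken/moss/), a system opened by Stanford University that compares code similarity for many programming languages;
 * The second is the locally executed [sim system](https://dickgrune.com/Programs/similarity_tester/), which supports text similarity comparison for java, c, and other languages;
 * The third is the locally executed [jplag system](https://github.com/jplag/jplag/), which supports text similarity comparison for java, c/c++, python, and other languages.
 This system extends and wraps them. For moss, it provides a client access module that submits code files, fetches and parses the results, and sorts them. sim and jplag are integrated into the system and can serve as substitutes when moss is unavailable, for example because of a network failure.
-Similarity comparison for Chinese and English document assignments is based on the [shinglecloud algorithm](https://www.kom.tu-darmstadt.de/de/research-results/0/1/shinglecloud/), a fast, language-independent similarity method based on text fingerprints. Documents are processed as follows:
+Two algorithms are provided for comparing Chinese and English document assignments:
+
+The first is based on the [shinglecloud algorithm](https://www.kom.tu-darmstadt.de/de/research-results/0/1/shinglecloud/), a fast, language-independent similarity method based on text fingerprints. Documents are processed as follows:
 1. Use tika to read the text content of files in different formats (txt, doc, docx, pdf, html, etc.) and encodings, and convert it to text that can be processed uniformly;
 2. Use hanlp to preprocess the text and segment it into words;
 3. Use the shinglecloud algorithm to compute the similarity between texts;
 4. Sort by similarity and output the comparison results.
+The second is based on jplag's GST algorithm. Its functionality has been extended with a new "doc" language type, which computes similarity for documents in various formats and provides web-based visual comparison.
+
 ### References:
 1. [Software Plagiarism Detection Techniques:A Comparative Study](http://www.ijcsit.com/docs/Volume%205/vol5issue04/ijcsit2014050441.pdf)
 2. [JPlag: Finding plagiarisms among a set of programs](http://page.mi.fu-berlin.de/prechelt/Biblio/jplagTR.pdf)
 3. [Winnowing: Local Algorithms for Document Fingerprinting](http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf), the core algorithm used by the moss system
 4. [软件抄袭检测研究综述](https://faculty.ist.psu.edu/wu/papers/spd-survey-16.pdf) (a survey of software plagiarism detection research)
-## Changelog
-1. 2019.12.1 Use hanlp as the word segmentation component, support duplicate checking of text in pdf and html files, fix several bugs, release v2.8.6.
-
 ## TODO
 1. Integrate jplag into the system. Done.
 2. Support duplicate checking of code in html and jsp files.
@@ -45,4 +48,8 @@
 From open source, back to open source. Open source is a virtue, and so is starring :smile:.
+## Changelog
+1. 2019.12.1 Use hanlp as the word segmentation component, support duplicate checking of text in pdf and html files, fix several bugs, release v2.8.6.
+2. 2019.12.3 Extend jplag with a "doc" language type, implementing similarity computation and visual comparison for document text in multiple formats. Update the help and test data, release v2.8.8.
+
\ No newline at end of file
diff --git a/bin/.gitignore b/bin/.gitignore
index e960251..67c6176 100644
--- a/bin/.gitignore
+++ b/bin/.gitignore
@@ -1,2 +1,5 @@
 /preprocess/
 /gui/
+/jplag/
+/utils/
+/shingle/
diff --git a/bin/gui/plag/edu/PlagGUI$5.class b/bin/gui/plag/edu/PlagGUI$5.class
index 0351610..41d3178 100644
Binary files a/bin/gui/plag/edu/PlagGUI$5.class and b/bin/gui/plag/edu/PlagGUI$5.class differ
diff --git a/bin/gui/plag/edu/PlagGUI$6.class b/bin/gui/plag/edu/PlagGUI$6.class
index c45e147..1ebf030 100644
Binary files a/bin/gui/plag/edu/PlagGUI$6.class and b/bin/gui/plag/edu/PlagGUI$6.class differ
diff --git a/bin/gui/plag/edu/PlagGUI$7.class b/bin/gui/plag/edu/PlagGUI$7.class
index d7fcfd1..4690a5d 100644
Binary files a/bin/gui/plag/edu/PlagGUI$7.class and b/bin/gui/plag/edu/PlagGUI$7.class differ
diff --git a/bin/gui/plag/edu/PlagGUI$8.class b/bin/gui/plag/edu/PlagGUI$8.class
index cb9a61d..97f46d1 100644
Binary files a/bin/gui/plag/edu/PlagGUI$8.class and b/bin/gui/plag/edu/PlagGUI$8.class differ
diff --git a/bin/gui/plag/edu/PlagGUI$9.class b/bin/gui/plag/edu/PlagGUI$9.class
index f96ccbb..dbb0182 100644
Binary files a/bin/gui/plag/edu/PlagGUI$9.class and b/bin/gui/plag/edu/PlagGUI$9.class differ
diff --git a/bin/gui/plag/edu/PlagGUI.class b/bin/gui/plag/edu/PlagGUI.class
index af337f3..52073d0 100644
Binary files a/bin/gui/plag/edu/PlagGUI.class and b/bin/gui/plag/edu/PlagGUI.class differ
diff --git a/bin/preprocess/plag/edu/TextExtractor.class b/bin/preprocess/plag/edu/TextExtractor.class
index 9a1ac00..50c1ebf 100644
Binary files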
a/bin/preprocess/plag/edu/TextExtractor.class and b/bin/preprocess/plag/edu/TextExtractor.class differ diff --git a/bin/shingle/plag/edu/ShingleSim$Fileter.class b/bin/shingle/plag/edu/ShingleSim$Fileter.class index 89c83ef..1e38153 100644 Binary files a/bin/shingle/plag/edu/ShingleSim$Fileter.class and b/bin/shingle/plag/edu/ShingleSim$Fileter.class differ diff --git a/bin/shingle/plag/edu/ShingleSim.class b/bin/shingle/plag/edu/ShingleSim.class index 8eef08c..504264b 100644 Binary files a/bin/shingle/plag/edu/ShingleSim.class and b/bin/shingle/plag/edu/ShingleSim.class differ diff --git a/bin/utils/edu/AntFile.class b/bin/utils/edu/AntFile.class index 311eb1c..328c848 100644 Binary files a/bin/utils/edu/AntFile.class and b/bin/utils/edu/AntFile.class differ diff --git a/bin/utils/edu/FileIO.class b/bin/utils/edu/FileIO.class index 749b5a8..9650dc3 100644 Binary files a/bin/utils/edu/FileIO.class and b/bin/utils/edu/FileIO.class differ diff --git a/bin/utils/edu/StreamGobbler.class b/bin/utils/edu/StreamGobbler.class index 3e662e4..7bda5e8 100644 Binary files a/bin/utils/edu/StreamGobbler.class and b/bin/utils/edu/StreamGobbler.class differ diff --git a/bin/utils/edu/WinCMD.class b/bin/utils/edu/WinCMD.class index a0c58dc..f642d82 100644 Binary files a/bin/utils/edu/WinCMD.class and b/bin/utils/edu/WinCMD.class differ diff --git a/help.txt b/help.txt index f87fe3c..3149762 100644 --- a/help.txt +++ b/help.txt @@ -7,7 +7,7 @@ £ 1ѡ񱻼ļĿ¼ѡҵťļѡԻУѡtestdata Ŀ¼µĿ¼硰javaabctogradeĿ¼ -2ȷȷҵǡҵƶֵ30⹤moss +2ȷȷҵǡҵƶֵʵ⹤moss java 3ִбȽϡִбȽϡťȴϵͳȥstanfordmossϵͳվύ javaabctogradeµչΪjavaĴļصĽϺ󣬻ᵯ @@ -18,14 +18,15 @@ ϵͳtestdataĿ¼ṩ˳Ķĵѧϰʹá 2 ĵıƶȼ -ĵIJͳļⲽһ£ֻǡѡҵʱѡĵҵ -磺testdata/doccnµҵĵļչtxtdocdocxpdfhtml -еһ֡ -ҵǡıҵȻִбȽϡťȴȷϴ鿴 -ťϵͳ򿪡ȽϽڣԲ鿴ȽϽ -ıĵıȽĿǰݲ֧ͨҳпӻԱȡ -ֻӢĵѡjplagµġtextͷʽԶӢĵ -sql롢htmljspļ(ǵļչijtxt)Ƚм򵥿ӻȶԡ +Ŀǰַ֧ʽ +(1)ʹshinglecloud㷨Ƚϡͳļⲽһ£ֻѡҵʱ +ѡǡıҵ(磺testdata/doccnµļ)ĵļչtxtdoc +docxpdfhtml +ıҵʽµıȽĿǰݲ֧ͨҳпӻԱȡ +(2)ʹjplagGST㷨ȽϡϵͳչԭJplagĹܣˡdocͣԼָ +ʽĵƶȣֻ֧ҳĿӻȶԡ롰롱IJͬ⹤ѡ +JplagѡdocɡJplagµtext͸ʺϼⴿӢĵƶȡ + ׼ 
ϵͳ֧2ʽ.
diff --git a/out.txt b/out.txt
index d3dfe10..a703264 100644
--- a/out.txt
+++ b/out.txt
@@ -2,56 +2,62 @@
 2 92.47312% gumingzhu-2.doc zhucuiyun_2.doc
 3 91.408936% wangmeng-2.doc zhucuiyun_2.doc
 4 87.63636% dongxiao-2.docx dongxiaoutf8-2.txt
-5 84.717606% gumingzhu-2.doc wangmeng-2.doc
-6 84.310844% dongxiao-2.doc dongxiao-2.pdf
-7 84.168015% dongxiao-2.doc dongxiaoutf8-2.txt
-8 83.870964% dongxiao-2.pdf dongxiaogbk.txt
-9 83.68336% dongxiaogbk.txt dongxiaoutf8-2.txt
-10 82.954544% dongxiao-2.docx dongxiaogbk.txt
-11 82.552505% dongxiao-2.doc dongxiao-2.docx
-12 75.74404% lijie-2.doc wangmeng-2.doc
-13 74.96063% gumingzhu-2.doc wuchangqing-2.doc
-14 71.703705% dongxiao-2.pdf dongxiaoutf8-2.txt
-15 71.49254% dongxiao-2.docx dongxiao-2.pdf
-16 69.92366% wuchangqing-2.doc zhucuiyun_2.doc
-17 68.584076% lijie-2.doc zhucuiyun_2.doc
-18 65.61151% wangmeng-2.doc wuchangqing-2.doc
-19 65.12301% gumingzhu-2.doc lijie-2.doc
-20 57.454544% dongxiaogbk.txt meitao-2.doc
-21 57.246376% dongxiao-2.doc meitao-2.doc
-22 52.258064% lijie-2.doc wuchangqing-2.doc
-23 50.757576% dongxiao-2.docx meitao-2.doc
-24 50.284416% dongxiao-2.pdf meitao-2.doc
-25 48.87218% makai2.doc wangxuan_2.doc.doc
-26 48.45869% dongxiaoutf8-2.txt meitao-2.doc
-27 46.67074% liuchuanyang-2.doc tangwenpeng-2.doc
-28 41.64096% heliwen_2.doc liufan_2.doc
-29 40.54834% liufan_2.doc wangchunming_2.doc
-30 38.75061% gechunlong-2.doc hanchao_2.doc
-31 36.930233% luxiang-2.doc tangwenpeng-2.doc
-32 36.89095% jiangfeng-2.doc lijie-2.doc
-33 35.925926% weixiao-2.doc yinxu-2.doc
-34 35.424637% liuchuanyang-2.doc wuliangchao-2.doc
-35 35.039577% gechunlong-2.doc yinxu-2.doc
-36 34.839073% gechunlong-2.doc weixiao-2.doc
-37 34.325184% wangmeng-2.doc wuliangchao-2.doc
-38 34.069096% guozhiquan -2.doc wuliangchao-2.doc
-39 33.98907% wuliangchao-2.doc zhucuiyun_2.doc
-40 32.858547% tangwenpeng-2.doc xuqiwei-2.doc
-41 32.557137% tangwenpeng-2.doc wangchen-2.doc
-42 32.296955% liuchuanyang-2.doc yinxu-2.doc
-43 32.073547% lijie-2.doc wuliangchao-2.doc
-44 32.070206% gechunlong-2.doc wangchen-2.doc
-45 32.058823% jiangfeng-2.doc yinpeiyan_2.doc
-46 31.946404% sunxiaolei-2.doc wangchunming_2.doc
-47 31.471535% gumingzhu-2.doc wuliangchao-2.doc
-48 30.698889% sunxiaolei-2.doc yinxu-2.doc
-49 30.651136% liuchuanyang-2.doc xuqiwei-2.doc
-50 30.63007% heliwen_2.doc wangchunming_2.doc
-51 30.559345% liuchuanyang-2.doc weixiao-2.doc
-52 30.494392% wangchen-2.doc xuqiwei-2.doc
-53 30.429863% tangwenming-2.doc xuqiwei-2.doc
-54 30.424183% tangwenming-2.doc wangchen-2.doc
-55 30.095451% sunxiaolei-2.doc tangwenpeng-2.doc
-56 30.065361% guozhiquan -2.doc liuchuanyang-2.doc
-from fh Sun Dec 01 18:57:44 CST 2019
\ No newline at end of file
+5 84.765625% dongxiao-2.docx dongxiao-2.html
+6 84.717606% gumingzhu-2.doc wangmeng-2.doc
+7 84.310844% dongxiao-2.doc dongxiao-2.pdf
+8 84.168015% dongxiao-2.doc dongxiaoutf8-2.txt
+9 83.870964% dongxiao-2.pdf dongxiaogbk.txt
+10 83.68336% dongxiaogbk.txt dongxiaoutf8-2.txt
+11 83.14176% dongxiao-2.html dongxiaoutf8-2.txt
+12 82.954544% dongxiao-2.docx dongxiaogbk.txt
+13 82.552505% dongxiao-2.doc dongxiao-2.docx
+14 75.74404% lijie-2.doc wangmeng-2.doc
+15 74.96063% gumingzhu-2.doc wuchangqing-2.doc
+16 71.703705% dongxiao-2.pdf dongxiaoutf8-2.txt
+17 71.49254% dongxiao-2.docx dongxiao-2.pdf
+18 70.34036% dongxiao-2.html dongxiaogbk.txt
+19 70.0% dongxiao-2.doc dongxiao-2.html
+20 69.92366% wuchangqing-2.doc zhucuiyun_2.doc
+21 68.584076% lijie-2.doc zhucuiyun_2.doc
+22 65.61151% wangmeng-2.doc wuchangqing-2.doc
+23 65.12301% gumingzhu-2.doc lijie-2.doc
+24 60.869564% dongxiao-2.html dongxiao-2.pdf
+25 57.454544% dongxiaogbk.txt meitao-2.doc
+26 57.246376% dongxiao-2.doc meitao-2.doc
+27 52.258064% lijie-2.doc wuchangqing-2.doc
+28 50.757576% dongxiao-2.docx meitao-2.doc
+29 50.284416% dongxiao-2.pdf meitao-2.doc
+30 48.87218% makai2.doc wangxuan_2.doc.doc
+31 48.45869% dongxiaoutf8-2.txt meitao-2.doc
+32 46.67074% liuchuanyang-2.doc tangwenpeng-2.doc
+33 41.878174% dongxiao-2.html meitao-2.doc
+34 41.64096% heliwen_2.doc liufan_2.doc
+35 40.54834% liufan_2.doc wangchunming_2.doc
+36 38.75061% gechunlong-2.doc hanchao_2.doc
+37 36.930233% luxiang-2.doc tangwenpeng-2.doc
+38 36.89095% jiangfeng-2.doc lijie-2.doc
+39 35.925926% weixiao-2.doc yinxu-2.doc
+40 35.424637% liuchuanyang-2.doc wuliangchao-2.doc
+41 35.039577% gechunlong-2.doc yinxu-2.doc
+42 34.839073% gechunlong-2.doc weixiao-2.doc
+43 34.325184% wangmeng-2.doc wuliangchao-2.doc
+44 34.069096% guozhiquan -2.doc wuliangchao-2.doc
+45 33.98907% wuliangchao-2.doc zhucuiyun_2.doc
+46 32.858547% tangwenpeng-2.doc xuqiwei-2.doc
+47 32.557137% tangwenpeng-2.doc wangchen-2.doc
+48 32.296955% liuchuanyang-2.doc yinxu-2.doc
+49 32.073547% lijie-2.doc wuliangchao-2.doc
+50 32.070206% gechunlong-2.doc wangchen-2.doc
+51 32.058823% jiangfeng-2.doc yinpeiyan_2.doc
+52 31.946404% sunxiaolei-2.doc wangchunming_2.doc
+53 31.471535% gumingzhu-2.doc wuliangchao-2.doc
+54 30.698889% sunxiaolei-2.doc yinxu-2.doc
+55 30.651136% liuchuanyang-2.doc xuqiwei-2.doc
+56 30.63007% heliwen_2.doc wangchunming_2.doc
+57 30.559345% liuchuanyang-2.doc weixiao-2.doc
+58 30.494392% wangchen-2.doc xuqiwei-2.doc
+59 30.429863% tangwenming-2.doc xuqiwei-2.doc
+60 30.424183% tangwenming-2.doc wangchen-2.doc
+61 30.095451% sunxiaolei-2.doc tangwenpeng-2.doc
+62 30.065361% guozhiquan -2.doc liuchuanyang-2.doc
+from fh Mon Dec 02 19:18:34 CST 2019
\ No newline at end of file
diff --git a/src/gui/plag/edu/PlagGUI.java b/src/gui/plag/edu/PlagGUI.java
index d3d9b60..697d8cf 100644
--- a/src/gui/plag/edu/PlagGUI.java
+++ b/src/gui/plag/edu/PlagGUI.java
@@ -161,7 +161,7 @@ public void stateChanged(ChangeEvent arg0) {
         panel_1.add(label);
         txtThreshold = new JTextField();
-        txtThreshold.setText("30");
+        txtThreshold.setText("50");
         txtThreshold.setToolTipText("\u8BF7\u8F93\u51650-100\u4E4B\u95F4\u7684\u503C");
         txtThreshold.setBounds(80, 26, 70, 21);
         panel_1.add(txtThreshold);
@@ -202,6
+202,7 @@ public void itemStateChanged(ItemEvent arg0) {
                     combLang.addItem("c/c++");
                     combLang.addItem("python3");
                     combLang.addItem("text");
+                    combLang.addItem("doc");
                 }
             }
         });
diff --git a/src/jplag/doc/DocToken.java b/src/jplag/doc/DocToken.java
new file mode 100644
index 0000000..b4b1d60
--- /dev/null
+++ b/src/jplag/doc/DocToken.java
@@ -0,0 +1,80 @@
+package jplag.doc;
+
+
+public class DocToken extends jplag.Token {
+
+    private static final long serialVersionUID = 3800987170521573780L;
+
+
+    public static int getSerial(String text, Parser parser) {
+        text = text.toLowerCase();
+        Integer obj = (Integer) parser.tokenStructure.table.get(text);
+        if(obj == null) {
+            obj = new Integer(parser.tokenStructure.serial);
+            if(parser.tokenStructure.serial == Integer.MAX_VALUE)
+                parser.outOfSerials();
+            else
+                parser.tokenStructure.serial++;
+            parser.tokenStructure.table.put(text, obj);
+            if(parser.tokenStructure.reverseMapping != null)
+                parser.tokenStructure.reverseMapping = null;
+        }
+        return obj.intValue();
+    }
+
+    // throw away this method soon:
+
+    public static String type2string(int i, TokenStructure tokenStructure) {
+        if(tokenStructure.reverseMapping == null)
+            tokenStructure.createReverseMapping();
+        return tokenStructure.reverseMapping[i];
+    }
+
+    // ///////////////////// END OF STATIC MEMBERS
+
+    private int line, column, length;
+    private String text;
+
+    public DocToken(int type, String file, Parser parser) {
+        super(type, file, -1, -1, -1);
+    }
+
+    public DocToken(String text, String file, int line, int column,
+            int length, Parser parser) {
+        super(-1, file, line, column, length);
+        this.type = getSerial(text, parser);
+        this.text = text.toLowerCase();
+    }
+
+    public int getLine() {
+        return line;
+    }
+
+    public int getColumn() {
+        return column;
+    }
+
+    public int getLength() {
+        return length;
+    }
+
+    public void setLine(int line) {
+        this.line = line;
+    }
+
+    public void setColumn(int column) {
+        this.column = column;
+    }
+
+    public void setLength(int length) {
+        this.length = length;
+    }
+
+    public String getText() {
+        return this.text;
+    }
+
+    public static int numberOfTokens(TokenStructure tokenStructure) {
+        return tokenStructure.table.size();
+    }
+}
diff --git a/src/jplag/doc/Language.java b/src/jplag/doc/Language.java
new file mode 100644
index 0000000..0ab0dce
--- /dev/null
+++ b/src/jplag/doc/Language.java
@@ -0,0 +1,72 @@
+
+package jplag.doc;
+
+import java.io.File;
+
+import jplag.ProgramI;
+
+/**
+ * @Changed by fanghong 2019.12.1
+ *
+ */
+public class Language implements jplag.Language {
+
+    private ProgramI program;
+
+    private jplag.doc.Parser parser = new jplag.doc.Parser();
+
+    public Language(ProgramI program) {
+        this.program = program;
+        this.parser.setProgram(this.program);
+    }
+
+    public int errorsCount() {
+        return this.parser.errorsCount();
+    }
+
+    public String[] suffixes() {
+        String[] res = { ".txt", ".doc", ".docx", ".pdf", ".html" };
+        return res;
+    }
+
+    public String name() {
+        return "Doc Parser";
+    }
+
+    public String getShortName() {
+        return "doc";
+    }
+
+    public int min_token_match() {
+        return 12;
+    }
+
+    public jplag.Structure parse(File dir, String[] files) {
+        return this.parser.parse(dir, files);
+    }
+
+    public boolean errors() {
+        return this.parser.getErrors();
+    }
+
+    public boolean supportsColumns() {
+        return true;
+    }
+
+    public boolean isPreformated() {
+        return false;
+    }
+
+    public boolean usesIndex() {
+        return false;
+    }
+
+    public int noOfTokens() {
+        return parser.tokenStructure.serial;
+//        return jplag.text.TextToken.numberOfTokens(); // always returns 1 ....
+    }
+
+    public String type2string(int type) {
+        return jplag.text.TextToken.type2string(type);
+    }
+}
diff --git a/src/jplag/doc/Parser.java b/src/jplag/doc/Parser.java
new file mode 100644
index 0000000..875bac4
--- /dev/null
+++ b/src/jplag/doc/Parser.java
@@ -0,0 +1,127 @@
+package jplag.doc;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileNotFoundException;
+import java.io.FileReader;
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.HashSet;
+import java.util.List;
+
+import com.hankcs.hanlp.HanLP;
+import com.hankcs.hanlp.seg.common.Term;
+
+import jplag.InputState;
+import jplag.ParserToken;
+import jplag.Structure;
+import preprocess.plag.edu.TextExtractor;
+import utils.edu.FileIO;
+
+/**
+ * @Changed by Emeric Kwemou 29.01.2005
+ *
+ */
+public class Parser extends jplag.Parser implements jplag.TokenConstants {
+
+    protected TokenStructure tokenStructure = new TokenStructure();
+
+    private Structure struct;
+
+    private String currentFile;
+
+    public jplag.Structure parse(File dir, String files[]) {
+        struct = new Structure();
+        errors = 0;
+        for (int i = 0; i < files.length; i++) {
+            //getProgram().print("", "Parsing file " + files[i] + "\n");
+            if (!parseFile(dir, files[i]))
+                errors++;
+            struct.addToken(new DocToken(FILE_END, files[i], this));
+        }
+
+        Structure tmp = struct;
+        struct = null;
+        this.parseEnd();
+        return tmp;
+    }
+
+    public boolean parseFile(File dir, String file) {
+
+        try {
+            currentFile = file;
+            String[] strs = FileIO.readFile(new File(dir, file),"utf-8");
+            for(int line=0;line<strs.length;line++){
+                List<Term> tokens = HanLP.segment(strs[line]);
+                int col = 1;
+                for(int j=0;j table = new Hashtable();
+    protected String[] reverseMapping = null;
+    protected int serial = 1; // 0 is FILE_END token
+
+    protected void createReverseMapping() {
+        if(this.reverseMapping == null) {
+            this.reverseMapping = new String[this.table.size() + 1];
+            for (Entry<String, Integer> entry : table.entrySet()) {
+                int type = (entry.getValue()).intValue();
+                String text = entry.getKey();
+                this.reverseMapping[type] = text;
+            }
+        }
+    }
+
+    public Set<Entry<String, Integer>> entrySet() {
+        return this.table.entrySet();
+    }
+
+    public String tableStatus() {
+        return "Size of table: " + this.table.size();
+    }
+}
diff --git a/src/jplag/options/CommandLineOptionsExt.java b/src/jplag/options/CommandLineOptionsExt.java
new file mode 100644
index 0000000..0af03c1
--- /dev/null
+++ b/src/jplag/options/CommandLineOptionsExt.java
@@ -0,0 +1,46 @@
+package jplag.options;
+
+import jplag.ExitException;
+
+public class CommandLineOptionsExt extends CommandLineOptions {
+
+    public CommandLineOptionsExt(String[] args) throws ExitException {
+        super(args);
+        initLangs();
+    }
+
+    public CommandLineOptionsExt(String[] args, String cmdInString) throws ExitException {
+        super(args,cmdInString);
+        initLangs();
+    }
+    //ʼԼϣڴֵ֧
+    void initLangs() {
+        String[] langs= {"doc","jplag.doc.Language"};
+        addLanguages(langs);
+    }
+    public String[] getLanguages() {
+        return this.languages;
+    }
+
+    public void addLanguages(String[] langs) {
+        String[] strs = new String[languages.length+langs.length];
+        System.arraycopy(languages, 0, strs, 0, languages.length);
+        System.arraycopy(langs, 0, strs, languages.length, langs.length);
+        this.languages = strs ;
+    }
+    public static void main(String[] args) {
+        // TODO Auto-generated method stub
+        String[] langs= {"doc","jplag.doc.Language"};
+        try {
+            CommandLineOptionsExt cmdop = new CommandLineOptionsExt(langs);
+            cmdop.addLanguages(langs);
+            for(String str:cmdop.getLanguages()) {
+                System.out.print(str+",");
+            }
+        } catch (ExitException e) {
+            // TODO Auto-generated catch block
+            e.printStackTrace();
+        }
+    }
+
+}
diff --git a/src/preprocess/plag/edu/TextExtractor.java b/src/preprocess/plag/edu/TextExtractor.java
index bbd33ea..3bc3ebd 100644
--- a/src/preprocess/plag/edu/TextExtractor.java
+++ b/src/preprocess/plag/edu/TextExtractor.java
@@ -105,9 +105,10 @@ public static String fileToTxt(File
f,Metadata metadata) { public static void main(String[] args) { // TODO Auto-generated method stub // File f = new File("./testdata/doccn/dongxiao-2.doc"); - File f = new File("./testdata/doccn/dongxiao-2.pdf"); + // File f = new File("./testdata/doccn/dongxiao-2.pdf"); // File f = new File("./testdata/doccn/dongxiaogbk.txt"); // File f = new File("./testdata/doccn/dongxiaoutf8-2.txt"); + File f = new File("./testdata/doccn/dongxiao-2.html"); System.out.println(TextExtractor.getTxt(f)); Metadata metadata = new Metadata(); System.out.println(TextExtractor.fileToTxt(f,metadata)); diff --git a/src/preprocess/plag/edu/Tokenizer.java b/src/preprocess/plag/edu/Tokenizer.java index 7d9037b..0c44e60 100644 --- a/src/preprocess/plag/edu/Tokenizer.java +++ b/src/preprocess/plag/edu/Tokenizer.java @@ -5,7 +5,9 @@ import com.hankcs.hanlp.HanLP; import com.hankcs.hanlp.dictionary.CustomDictionary; import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.tokenizer.IndexTokenizer; import com.hankcs.hanlp.tokenizer.NotionalTokenizer; +import com.hankcs.hanlp.tokenizer.StandardTokenizer; public class Tokenizer { //ַתָָķִʹַ @@ -23,7 +25,8 @@ public static void main(String[] args) { // TODO Auto-generated method stub HanLP.Config.Normalization = true; //->壬ȫ->ǣд->Сд CustomDictionary.insert("4G", "nz 1000"); - String text = "i am from china.Сеķιèеľȴ޳ɡιЩС,i will go back HomeҐ "; + String text = "i am from china." 
+ + "Сеķιèеľȴ޳ɡιЩС,i will go back HomeҐ "; System.out.println(text); //ȷִ List tokens = HanLP.segment(text); @@ -33,6 +36,24 @@ public static void main(String[] args) { } System.out.println(); + System.out.println("ȷִ"); + //׼ִ + tokens = StandardTokenizer.segment("Ʒͷ"); + System.out.println(tokens); + for (Term token : tokens) { + System.out.print("("+token.word+","+token.offset+","+token.length()+")"); + + } + System.out.println(); + System.out.println("ִʣ"); + //ִ + List termList = IndexTokenizer.segment("ʳƷ"); + for (Term term : termList) + { + System.out.println(term + " [" + term.offset + ":" + (term.offset + term.word.length()) + "]"); + } + System.out.println(); + System.out.println("ȥͣôʡŷִʣ"); // Զȥͣô,ᶪʧԭļеλϢ tokens = NotionalTokenizer.segment(text); System.out.println(tokens); // ͣôʵλdata/dictionary/stopwords.txt޸ diff --git a/src/shingle/plag/edu/ShingleSim.java b/src/shingle/plag/edu/ShingleSim.java index 5b1c776..9a283f7 100644 --- a/src/shingle/plag/edu/ShingleSim.java +++ b/src/shingle/plag/edu/ShingleSim.java @@ -43,7 +43,7 @@ public void explore(File file) { } } - // ʵļ˽ӿڣڲ෽ʽ,ֻdoctxtdocxpdfļĿ¼ + // ʵļ˽ӿڣڲ෽ʽ,ֻdoctxtdocxpdfhtmlļĿ¼ class Fileter implements FileFilter { @Override public boolean accept(File arg0) { @@ -53,6 +53,8 @@ public boolean accept(File arg0) { || fn.endsWith(".txt") || fn.endsWith(".docx") || fn.endsWith(".pdf") + || fn.endsWith(".html") + || fn.endsWith(".htm") || arg0.isDirectory()) return true; return false; diff --git a/src/utils/edu/AntFile.java b/src/utils/edu/AntFile.java index 4bf3d5d..b4c6efd 100644 --- a/src/utils/edu/AntFile.java +++ b/src/utils/edu/AntFile.java @@ -184,13 +184,13 @@ public static void copy(File srcdir,File desdir,String match){ public static void main(String[] args){ File src =new File("./demo/7/Selenium.zip"); //֧rarļĽѹ - File dest=new File("./demo/7/"); - AntFile.unzip(src, dest); + File dest=new File("./testdata/doccn/"); + // AntFile.unzip(src, dest); //AntFile.deleteFile(src); //pass test 
//AntFile.deleteDir(new File(dest.getAbsoluteFile()+"\\zhengchaota_atm")); //ȡָĿ¼µjavaļĿ¼µ - String[] filter={"**/*.java"}; //"*.zip" + String[] filter={"**/*.doc"}; //"*.zip" String[] files = AntFile.scanFiles(dest, filter); if(files!=null){ for(String str:files){ @@ -199,7 +199,7 @@ public static void main(String[] args){ } //ڵǰ·´һĿ¼ - AntFile.makeDir(new File("./temp")); + // AntFile.makeDir(new File("./temp")); } diff --git a/src/utils/edu/FileIO.java b/src/utils/edu/FileIO.java index beef207..ae24fd3 100644 --- a/src/utils/edu/FileIO.java +++ b/src/utils/edu/FileIO.java @@ -1,8 +1,16 @@ package utils.edu; +import java.io.BufferedReader; +import java.io.BufferedWriter; import java.io.File; +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.FileOutputStream; import java.io.FileWriter; import java.io.IOException; +import java.io.InputStreamReader; +import java.io.OutputStreamWriter; +import java.util.ArrayList; import java.util.Date; import java.util.List; @@ -42,13 +50,67 @@ public static void saveFile(File outfile,List listsd,int type,String la } } } + //strָ뷽ʽдļ + public static void saveFile(File outfile,String str,String encode){ + BufferedWriter fr = null; + try { + fr = new BufferedWriter (new OutputStreamWriter (new FileOutputStream (outfile,true),encode));; + + fr.write(str); + } catch (Exception e) { + // TODO Auto-generated catch block + e.printStackTrace(); + }finally{ + try { + if(fr!=null) + fr.close(); + } catch (IOException e) { + // TODO Auto-generated catch block + e.printStackTrace(); + } + } + + } + + public static String[] readFile(File infile,String encode){ + BufferedReader in = null; + String str = null ; + ArrayList list = new ArrayList(); + String[] res = null; + try { + in = new BufferedReader(new InputStreamReader(new FileInputStream(infile), encode)); + while ((str = in.readLine()) != null) { + list.add(str); + } + res = new String[list.size()]; + for(int i=0;i lists){ int res = -1; + File tmpf = 
null; + long t = System.currentTimeMillis(); + try { String INPUT_FILE_FOLDER_NAME=files ; //ļĿ¼ + + if("doc".equals(lang)) { + tmpf = preJplag(files); + INPUT_FILE_FOLDER_NAME=tmpf.getAbsolutePath() ; //ļĿ¼ + } + String jplagResultsFolderName="./jplagresult/"; //ĿĿ¼ + // AntFile.deleteDir(new File(jplagResultsFolderName )); //ɾĿ¼ + float MINIMUM_FILE_SIMILARITY = threshold ; String EXCLUDE_FILES = null ; ArrayList args = new ArrayList(); @@ -156,24 +190,27 @@ public int execJplag(String lang,float threshold,String files,List list args.add("-x"); args.add(EXCLUDE_FILES); } + // args.add("-clustertype"); //Խֳ࣬鳭Ϯ + // args.add("avr"); + args.add(INPUT_FILE_FOLDER_NAME); String[] toPass = new String[args.size()]; toPass = args.toArray(toPass); - // System.out.println(toPass.toString()); - // JPlag.main(toPass); - try { - CommandLineOptions options = new CommandLineOptions(toPass, null); - Program program = new Program(options); - - System.out.println("jplag initialize ok "+program.get_commandLine()); - program.run(); - res = 0; //ִгɹ - } - catch(ExitException ex) { - System.out.println("Error: "+ex.getReport()); + + CommandLineOptionsExt options = new CommandLineOptionsExt(toPass, null); - } - + Program program = new Program(options); + + System.out.println("jplag initialize ok "+program.get_commandLine()); + program.run(); + res = 0; //ִгɹ + + } catch(Exception e) { + e.printStackTrace(); + }finally { + postJplag(tmpf); + } + System.out.println("time:"+(System.currentTimeMillis()-t)+"ms"); return res ; } diff --git a/testdata/doccn/dongxiao-2.html b/testdata/doccn/dongxiao-2.html new file mode 100644 index 0000000..197775b --- /dev/null +++ b/testdata/doccn/dongxiao-2.html @@ -0,0 +1,40 @@ + + + + +Insert title here + + +
+2. 单元测试: unit testing 出处:http://www.igsgroup.com.cn/common/ISTQB%E8%BD%AF%E4%BB%B6%E6%B5%8B%E8%AF%95%E4%B8%93%E4%B8%9A%E6%9C%AF%E8%AF%AD%E5%AF%B9%E7%85%A7%E8%A1%A8v2.1.pdf + 定义:依据详细设计规格说明书,对模块内所有重要控制路径设计测试用例,来发现模块内部错误 P94 + 3. 集成测试: integration testing 出处同上 定义:在单元测试的基础上,将所有程序模块进行有序、递增的测试,检验程序单元或部件的接口关系,使之符合要求 P25 + 4. 系统测试:system testing 出处同上 定义:对集成的软件和硬件系统进行的测试 P26 + 5. 验收测试: acceptance testing 出处同上 定义:按照项目要求和合同,供需双方签订的验收文档进行的测试和评审 P26 + 6. 功能测试:functional testing 出处同上 定义:功能测试就是对产品的各功能进行验证,根据功能测试用例,逐项测试,检查产品是否达到用户要求的功能。 出处: http://baike.baidu.com/view/651435.htm + 7. 黑盒测试:black-box testing 出处同上 定义:未知程序内部结构进行的测试 P26 + 8. 白盒测试:white-box testing 出处同上 定义:已知程序内部结构进行的测试 P26 + 9. 性能测试:performance testing 出处同上 定义:用来测试软件在集成系统中的运行性能。P135 + 10. α测试:αtesting 定义:对即将面市的软件产品进行测试 P158 + 11.CMM:Capability Maturity Model for Software 能力成熟度模型 http://baike.baidu.com/view/8110.htm 定义:对于软件组织在定义、实施、度量、控制和改善其软件过程的实践中各个发展阶段的描述 http://baike.baidu.com/view/8110.htm + 12. ISO9000:质量管理体系标准 定义:由TC176(质量管理体系技术委员会)制定的所有国际标准。 http://baike.baidu.com/view/9486.htm +简答题:(2x12) +1 黑盒测试和白盒测试的区别?哪些错误使用黑盒测试更容易发现?哪些错误使用白盒测试更容易发现?各举2例。 +黑盒测试是不知道软件程序内部结构,白盒测试是知道软件程序内部结构。 +黑盒测试便于发现1、是否有不正确或遗漏的功能?2、在接口上,输入是否能正确的接受?能否输出正确的结果? +白盒测试易于发现:1、对所有的逻辑判定,取“真”与取“假”的两种情况都能至少测一遍。2、在循环的边界和运行的界限内执行循环体 +http://zhidao.baidu.com/question/13988876.html +2 集成测试和系统测试的区别和联系? +P132 集成测试对象是模块间的接口,系统测试对象是整个系统。集成测试和系统测试都用到黑盒测试 +问答题:(52) +1 (10)描述软件开发的瀑布模型,并结合自己参与的具体项目,回答以下问题: + 瀑布模型:可行性研究和计划—需求分析—设计—编码—测试—运行维护 http://baike.baidu.com/view/551037.htm +(1) 实际项目开发经历了哪些阶段?(先简单阐述所做的项目) + 做一个航空售票系统。一开始老师说要求(可行性研究和分析),同学们听见后,开始分析老师想要什么东西(需求分析),然后脑子里大概有个相应的想法(设计),开始打代码(编码),最后检查有没有报错,看能不能运行(测试) +(2) 作为程序员,依次写出你认为最重要的3个阶段,并说明原因? + 需求分析,设计,编码。需求分析,只有知道自己想要什么,才知道自己要做成什么东西;设计,有个大体的模子,才能知道该怎么弄;编码,既然是程序员,不编码能叫程序员吗。 + +
+百度 + + \ No newline at end of file
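
Note on the "doc" language type added in this change: `DocToken.getSerial` maps each distinct lower-cased word to an integer serial, and `Language.min_token_match()` returns 12 as the shortest run of matching tokens that counts. The sketch below illustrates that idea with a simplified Greedy String Tiling pass. It is a minimal sketch under stated assumptions, not jplag's actual implementation: the names `GstSketch`, `toSerials`, `coveredTokens` and `similarity` are hypothetical, the word lists stand in for HanLP segmentation output, only one tile is marked per pass, and the percentage formula (twice the covered tokens over the combined length) is assumed here for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of jplag or antiplag.
class GstSketch {

    /** Maps each distinct lower-cased word to an integer id starting at 1 (0 is left free, as FILE_END is in the diff). */
    static int[] toSerials(List<String> words, Map<String, Integer> table) {
        int[] out = new int[words.size()];
        for (int i = 0; i < out.length; i++) {
            String key = words.get(i).toLowerCase();
            Integer id = table.get(key);
            if (id == null) {
                id = table.size() + 1;
                table.put(key, id);
            }
            out[i] = id;
        }
        return out;
    }

    /** Greedily marks the longest unmarked common run, one tile per pass, until no run reaches minMatch tokens. */
    static int coveredTokens(int[] a, int[] b, int minMatch) {
        boolean[] markA = new boolean[a.length];
        boolean[] markB = new boolean[b.length];
        int covered = 0;
        while (true) {
            int best = 0, bestI = -1, bestJ = -1;
            for (int i = 0; i < a.length; i++) {
                for (int j = 0; j < b.length; j++) {
                    int k = 0;
                    while (i + k < a.length && j + k < b.length
                            && a[i + k] == b[j + k]
                            && !markA[i + k] && !markB[j + k]) {
                        k++;
                    }
                    if (k > best) {
                        best = k;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            if (best < minMatch) {
                break; // remaining matches are too short to count
            }
            for (int k = 0; k < best; k++) {
                markA[bestI + k] = true;
                markB[bestJ + k] = true;
            }
            covered += best;
        }
        return covered;
    }

    /** Similarity in percent: twice the covered tokens over the combined length (assumed formula). */
    static double similarity(int[] a, int[] b, int minMatch) {
        if (a.length + b.length == 0) {
            return 0.0;
        }
        return 200.0 * coveredTokens(a, b, minMatch) / (a.length + b.length);
    }

    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();
        int[] a = toSerials(Arrays.asList("unit", "testing", "finds", "module", "errors"), table);
        int[] b = toSerials(Arrays.asList("Unit", "Testing", "finds", "interface", "errors"), table);
        System.out.println(similarity(a, b, 3));
    }
}
```

Sharing one table across both documents matters: the same word must receive the same serial in every file, which is why the diff keeps the table on the parser's `TokenStructure` rather than on each token.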