diff --git a/.classpath b/.classpath
index 054cdd5..c1667f6 100644
--- a/.classpath
+++ b/.classpath
@@ -6,11 +6,11 @@
-
+
diff --git a/.settings/org.eclipse.core.resources.prefs b/.settings/org.eclipse.core.resources.prefs
new file mode 100644
index 0000000..d2e9a5a
--- /dev/null
+++ b/.settings/org.eclipse.core.resources.prefs
@@ -0,0 +1,2 @@
+eclipse.preferences.version=1
+encoding//testdata/doccn/dongxiaoutf8-2.txt=UTF-8
diff --git a/README.md b/README.md
index 9ec83ad..5155277 100644
--- a/README.md
+++ b/README.md
@@ -22,8 +22,8 @@
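This system builds on them through secondary development and wrapping. For moss, it provides a client access module that submits code files, fetches and parses the results, and sorts them. sim and jplag are integrated into the system and serve as fallbacks when moss is unavailable because of network failures or other reasons.
Similarity comparison of Chinese and English document assignments is based on the [shinglecloud algorithm](https://www.kom.tu-darmstadt.de/de/research-results/0/1/shinglecloud/) (a fast, language-independent similarity measure based on text fingerprints). Documents are processed as follows:
-1. Use tika to read documents in different formats (txt, doc, docx, etc.) and convert them into text that can be processed uniformly;
-2. Use ikanalyzer to preprocess the text and segment it into words precisely;
+1. Use tika to read the text content of files in different formats (txt, doc, docx, pdf, html, etc.) and different encodings, and convert it into text that can be processed uniformly;
+2. Use hanlp to preprocess the text and segment it into words;
3. Use the shinglecloud algorithm to compute the similarity between texts;
4. Sort by similarity and output the comparison results.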
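Steps 3 and 4 can be illustrated with a minimal, self-contained sketch. It assumes the text has already been extracted and segmented into space-separated tokens (steps 1 and 2), and it uses plain Jaccard similarity over word n-grams as a simplified stand-in; it is not the shinglecloud algorithm itself. The class name `ShingleSketch`, the file names, and the sample texts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: Jaccard similarity over word n-grams,
// standing in for the shinglecloud computation used by the real system.
class ShingleSketch {

    // Build the set of word n-grams ("shingles") from space-separated tokens.
    static Set<String> shingles(String tokenized, int n) {
        Set<String> result = new HashSet<>();
        String trimmed = tokenized.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    // Jaccard similarity of the two shingle sets: |A ∩ B| / |A ∪ B|.
    static double similarity(String a, String b, int n) {
        Set<String> sa = shingles(a, n);
        Set<String> sb = shingles(b, n);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 0.0;
        }
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Hypothetical, already-segmented inputs.
        String[] names = {"a.txt", "b.txt", "c.txt"};
        String[] texts = {
            "unit test finds errors inside a module",
            "unit test finds errors inside one module",
            "system test checks the whole integrated system"
        };

        // Compare every pair of files once.
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < texts.length; i++) {
            for (int j = i + 1; j < texts.length; j++) {
                double s = similarity(texts[i], texts[j], 2);
                rows.add(new String[] {names[i], names[j], String.valueOf(s)});
            }
        }

        // Sort by similarity, highest first, and print the ranking.
        rows.sort(Comparator.comparingDouble((String[] r) -> Double.parseDouble(r[2])).reversed());
        int rank = 1;
        for (String[] r : rows) {
            System.out.println(rank++ + " " + r[2] + " " + r[0] + " " + r[1]);
        }
    }
}
```

The n-gram size (2 here) is an arbitrary choice for the sketch; larger values make the measure stricter about word order.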
@@ -33,6 +33,9 @@
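3. [Winnowing: Local Algorithms for Document Fingerprinting](http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf), the core algorithm used by the moss system
4. [A survey of research on software plagiarism detection](https://faculty.ist.psu.edu/wu/papers/spd-survey-16.pdf)
+## Updates
+1. 2019.12.1 Switched to hanlp as the word segmentation component, added duplicate checking for text in pdf and html files, fixed several bugs, and released v2.8.6.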
+
## TODO
1. Integrate jplag into the system. Done.
2. Support duplicate checking of html and jsp files.
diff --git a/bin/.gitignore b/bin/.gitignore
index 6debac6..e960251 100644
--- a/bin/.gitignore
+++ b/bin/.gitignore
@@ -1,2 +1,2 @@
-/utils/
+/preprocess/
/gui/
diff --git a/bin/gui/plag/edu/FileConvertFrame$4.class b/bin/gui/plag/edu/FileConvertFrame$4.class
index 3d5219e..b66aa2f 100644
Binary files a/bin/gui/plag/edu/FileConvertFrame$4.class and b/bin/gui/plag/edu/FileConvertFrame$4.class differ
diff --git a/bin/gui/plag/edu/FileConvertFrame.class b/bin/gui/plag/edu/FileConvertFrame.class
index a5bb7c8..51e4e7c 100644
Binary files a/bin/gui/plag/edu/FileConvertFrame.class and b/bin/gui/plag/edu/FileConvertFrame.class differ
diff --git a/bin/gui/plag/edu/PlagGUI.class b/bin/gui/plag/edu/PlagGUI.class
index 964a44c..af337f3 100644
Binary files a/bin/gui/plag/edu/PlagGUI.class and b/bin/gui/plag/edu/PlagGUI.class differ
diff --git a/bin/preprocess/plag/edu/IKAnalyzer.class b/bin/preprocess/plag/edu/IKAnalyzer.class
deleted file mode 100644
index 1d3e3cf..0000000
Binary files a/bin/preprocess/plag/edu/IKAnalyzer.class and /dev/null differ
diff --git a/bin/preprocess/plag/edu/TextExtractor.class b/bin/preprocess/plag/edu/TextExtractor.class
index 0297362..9a1ac00 100644
Binary files a/bin/preprocess/plag/edu/TextExtractor.class and b/bin/preprocess/plag/edu/TextExtractor.class differ
diff --git a/bin/shingle/plag/edu/ShingleSim$Fileter.class b/bin/shingle/plag/edu/ShingleSim$Fileter.class
index f3c2ba1..89c83ef 100644
Binary files a/bin/shingle/plag/edu/ShingleSim$Fileter.class and b/bin/shingle/plag/edu/ShingleSim$Fileter.class differ
diff --git a/bin/shingle/plag/edu/ShingleSim.class b/bin/shingle/plag/edu/ShingleSim.class
index d882a38..8eef08c 100644
Binary files a/bin/shingle/plag/edu/ShingleSim.class and b/bin/shingle/plag/edu/ShingleSim.class differ
diff --git a/help.txt b/help.txt
index 9966e05..f87fe3c 100644
--- a/help.txt
+++ b/help.txt
@@ -1,4 +1,4 @@
-ҵϵͳʹð(v2.8.2)
+ҵϵͳʹð(v2.8.6)
һ
ϵͳwindow10jdk11 64λвͨԡ
@@ -19,7 +19,8 @@
2 ĵıƶȼ
ĵIJͳļⲽһ£ֻǡѡҵʱѡĵҵ
-磺testdata/doccnµҵĵļչtxtdocdocxеһ֣
+磺testdata/doccnµҵĵļչtxtdocdocxpdfhtml
+еһ֡
ҵǡıҵȻִбȽϡťȴȷϴ鿴
ťϵͳȽϽڣԲ鿴ȽϽ
ıĵıȽĿǰݲ֧ͨҳпӻԱȡ
diff --git a/lib/IKAnalyzer2012_u6.jar b/lib/IKAnalyzer2012_u6.jar
deleted file mode 100644
index e3d9aa6..0000000
Binary files a/lib/IKAnalyzer2012_u6.jar and /dev/null differ
diff --git a/lib/hanlp-portable-1.7.5.jar b/lib/hanlp-portable-1.7.5.jar
new file mode 100644
index 0000000..a1b9db2
Binary files /dev/null and b/lib/hanlp-portable-1.7.5.jar differ
diff --git a/out.txt b/out.txt
index b28741f..d3dfe10 100644
--- a/out.txt
+++ b/out.txt
@@ -1,2 +1,57 @@
-1 8.0% testdata\python\stu1_demo.py testdata\python\stu1_lprcmd.py
-from stanford:http://moss.stanford.edu/results/874773796 Fri Oct 25 19:19:17 CST 2019
\ No newline at end of file
+1 99.51535% dongxiao-2.doc dongxiaogbk.txt
+2 92.47312% gumingzhu-2.doc zhucuiyun_2.doc
+3 91.408936% wangmeng-2.doc zhucuiyun_2.doc
+4 87.63636% dongxiao-2.docx dongxiaoutf8-2.txt
+5 84.717606% gumingzhu-2.doc wangmeng-2.doc
+6 84.310844% dongxiao-2.doc dongxiao-2.pdf
+7 84.168015% dongxiao-2.doc dongxiaoutf8-2.txt
+8 83.870964% dongxiao-2.pdf dongxiaogbk.txt
+9 83.68336% dongxiaogbk.txt dongxiaoutf8-2.txt
+10 82.954544% dongxiao-2.docx dongxiaogbk.txt
+11 82.552505% dongxiao-2.doc dongxiao-2.docx
+12 75.74404% lijie-2.doc wangmeng-2.doc
+13 74.96063% gumingzhu-2.doc wuchangqing-2.doc
+14 71.703705% dongxiao-2.pdf dongxiaoutf8-2.txt
+15 71.49254% dongxiao-2.docx dongxiao-2.pdf
+16 69.92366% wuchangqing-2.doc zhucuiyun_2.doc
+17 68.584076% lijie-2.doc zhucuiyun_2.doc
+18 65.61151% wangmeng-2.doc wuchangqing-2.doc
+19 65.12301% gumingzhu-2.doc lijie-2.doc
+20 57.454544% dongxiaogbk.txt meitao-2.doc
+21 57.246376% dongxiao-2.doc meitao-2.doc
+22 52.258064% lijie-2.doc wuchangqing-2.doc
+23 50.757576% dongxiao-2.docx meitao-2.doc
+24 50.284416% dongxiao-2.pdf meitao-2.doc
+25 48.87218% makai2.doc wangxuan_2.doc.doc
+26 48.45869% dongxiaoutf8-2.txt meitao-2.doc
+27 46.67074% liuchuanyang-2.doc tangwenpeng-2.doc
+28 41.64096% heliwen_2.doc liufan_2.doc
+29 40.54834% liufan_2.doc wangchunming_2.doc
+30 38.75061% gechunlong-2.doc hanchao_2.doc
+31 36.930233% luxiang-2.doc tangwenpeng-2.doc
+32 36.89095% jiangfeng-2.doc lijie-2.doc
+33 35.925926% weixiao-2.doc yinxu-2.doc
+34 35.424637% liuchuanyang-2.doc wuliangchao-2.doc
+35 35.039577% gechunlong-2.doc yinxu-2.doc
+36 34.839073% gechunlong-2.doc weixiao-2.doc
+37 34.325184% wangmeng-2.doc wuliangchao-2.doc
+38 34.069096% guozhiquan -2.doc wuliangchao-2.doc
+39 33.98907% wuliangchao-2.doc zhucuiyun_2.doc
+40 32.858547% tangwenpeng-2.doc xuqiwei-2.doc
+41 32.557137% tangwenpeng-2.doc wangchen-2.doc
+42 32.296955% liuchuanyang-2.doc yinxu-2.doc
+43 32.073547% lijie-2.doc wuliangchao-2.doc
+44 32.070206% gechunlong-2.doc wangchen-2.doc
+45 32.058823% jiangfeng-2.doc yinpeiyan_2.doc
+46 31.946404% sunxiaolei-2.doc wangchunming_2.doc
+47 31.471535% gumingzhu-2.doc wuliangchao-2.doc
+48 30.698889% sunxiaolei-2.doc yinxu-2.doc
+49 30.651136% liuchuanyang-2.doc xuqiwei-2.doc
+50 30.63007% heliwen_2.doc wangchunming_2.doc
+51 30.559345% liuchuanyang-2.doc weixiao-2.doc
+52 30.494392% wangchen-2.doc xuqiwei-2.doc
+53 30.429863% tangwenming-2.doc xuqiwei-2.doc
+54 30.424183% tangwenming-2.doc wangchen-2.doc
+55 30.095451% sunxiaolei-2.doc tangwenpeng-2.doc
+56 30.065361% guozhiquan -2.doc liuchuanyang-2.doc
+from fh Sun Dec 01 18:57:44 CST 2019
\ No newline at end of file
diff --git a/src/gui/plag/edu/FileConvertFrame.java b/src/gui/plag/edu/FileConvertFrame.java
index 7777095..180d6ff 100644
--- a/src/gui/plag/edu/FileConvertFrame.java
+++ b/src/gui/plag/edu/FileConvertFrame.java
@@ -121,11 +121,14 @@ public void actionPerformed(ActionEvent arg0) {
if("python".equals(type)) {
filter[0]="**/*.py";
}
- if("doc".equals(type)){ // supported document types: doc txt docx
- filter = new String[3];
+ if("doc".equals(type)){ // supported document types: doc txt docx pdf html
+ filter = new String[6];
filter[0] = "**/*.doc";
filter[1] = "**/*.txt";
filter[2] = "**/*.docx";
+ filter[3] = "**/*.pdf";
+ filter[4] = "**/*.html";
+ filter[5] = "**/*.htm";
}
String[] filestrs = AntFile.scanFiles(srcf, filter); // returns the matching files under the directory
diff --git a/src/gui/plag/edu/PlagGUI.java b/src/gui/plag/edu/PlagGUI.java
index 311f6a1..d3d9b60 100644
--- a/src/gui/plag/edu/PlagGUI.java
+++ b/src/gui/plag/edu/PlagGUI.java
@@ -133,7 +133,7 @@ public void stateChanged(ChangeEvent arg0) {
panel.add(radBntProgram);
radBntText = new JRadioButton("\u6587\u672C\u4F5C\u4E1A");
- radBntText.setToolTipText("\u652F\u6301\u6587\u6863\u7C7B\u578B\uFF1Adoc docx txt");
+ radBntText.setToolTipText("\u652F\u6301\u6587\u6863\u7C7B\u578B\uFF1Adoc docx txt pdf html\u7B49");
radBntText.addChangeListener(new ChangeListener() {
public void stateChanged(ChangeEvent arg0) {
// the text-assignment radio button is selected
diff --git a/src/preprocess/plag/edu/IKAnalyzer.java b/src/preprocess/plag/edu/IKAnalyzer.java
deleted file mode 100644
index b99cdc2..0000000
--- a/src/preprocess/plag/edu/IKAnalyzer.java
+++ /dev/null
@@ -1,70 +0,0 @@
-package preprocess.plag.edu;
-
-import java.io.IOException;
-import java.io.StringReader;
-
-import org.wltea.analyzer.core.IKSegmenter;
-import org.wltea.analyzer.cfg.*;
-import org.wltea.analyzer.core.Lexeme;
-/**
- * 2013.7.25 ʹֲ
- * 1 IKAnalyzer2012_u6.jar ,jar ѾԴֵ
- * 2 IKAnalyzer.cfg.xmlstopword.dicĿ·
- * 3
- * òûȥͣô a
- * ܣcpuɼڴռò
- * һչʿ⡢ͣô
- * IKAnalyzer.cfg.xmlstopword.dic\binĿ¼
- * ԭҪֵַƥдʣܷʽǾȷִʣȥֱ,ӢͳһСд
- */
-public class IKAnalyzer {
-
- /**
- * @param args
- */
- public static void main(String[] args) {
- // TODO Auto-generated method stub
- Configuration cfg = DefaultConfig.getInstance();
- System.out.println("main dic:"+cfg.getMainDictionary());
- System.out.println("ext dic:"+cfg.getExtDictionarys());
- System.out.println("stopword dic:"+cfg.getExtStopWordDictionarys());
-
-
- IKSegmenter ik = new IKSegmenter(new StringReader("a Hello " +
- " л 'world java('"2013꣨,: 19:28 " +
- "Ansjķִһictʵ.ҼԼһЩݽṹ㷨ķִ.ʵ˸Чʺȷʵ!" ),true);
- Lexeme le = null;
-
- try {
- while((le=ik.next())!=null){
- System.out.print(le.getLexemeText()+"|" );
- }
- } catch (IOException e) {
- // TODO Auto-generated catch block
- e.printStackTrace();
- }
-
- System.out.println(ik.toString());
- }
- public static String segment(String str,boolean bsmart){
-
- return segment(str,bsmart,"");
- }
- public static String segment(String str,boolean bsmart,String split){
- if(str!=null){
- IKSegmenter ik = new IKSegmenter(new StringReader(str),bsmart);
- Lexeme le = null;
- StringBuffer sb = new StringBuffer();
- try {
- while((le=ik.next())!=null){
- sb.append(le.getLexemeText()+split);
- }
- } catch (IOException e) {
- // TODO Auto-generated catch block
- e.printStackTrace();
- }
- return sb.toString();
- }
- return null;
- }
-}
diff --git a/src/preprocess/plag/edu/TextExtractor.java b/src/preprocess/plag/edu/TextExtractor.java
index 4f566e5..bbd33ea 100644
--- a/src/preprocess/plag/edu/TextExtractor.java
+++ b/src/preprocess/plag/edu/TextExtractor.java
@@ -37,7 +37,11 @@ public static String getTxt(File f) {
try {
is = new FileInputStream(f);
Tika tika = new Tika();
- String str = tika.parseToString(new FileInputStream(f));
+ Metadata metadata = new Metadata();
+ metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName()); // file-name hint so that GBK-encoded txt files are read correctly
+ String str = tika.parseToString(new FileInputStream(f),metadata);
+ // System.out.println(f.getName());
+ // System.out.println(str);
return str;
} catch (FileNotFoundException e) {
@@ -100,7 +104,10 @@ public static String fileToTxt(File f,Metadata metadata) {
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
- File f = new File("D:\\fh\\ѧ\\201302\\\\ѧύҵ\\һҵ\\sunxiaolei-1.doc");
+ // File f = new File("./testdata/doccn/dongxiao-2.doc");
+ File f = new File("./testdata/doccn/dongxiao-2.pdf");
+ // File f = new File("./testdata/doccn/dongxiaogbk.txt");
+ // File f = new File("./testdata/doccn/dongxiaoutf8-2.txt");
System.out.println(TextExtractor.getTxt(f));
Metadata metadata = new Metadata();
System.out.println(TextExtractor.fileToTxt(f,metadata));
diff --git a/src/preprocess/plag/edu/Tokenizer.java b/src/preprocess/plag/edu/Tokenizer.java
new file mode 100644
index 0000000..7d9037b
--- /dev/null
+++ b/src/preprocess/plag/edu/Tokenizer.java
@@ -0,0 +1,54 @@
+package preprocess.plag.edu;
+
+import java.util.List;
+
+import com.hankcs.hanlp.HanLP;
+import com.hankcs.hanlp.dictionary.CustomDictionary;
+import com.hankcs.hanlp.seg.common.Term;
+import com.hankcs.hanlp.tokenizer.NotionalTokenizer;
+
+public class Tokenizer {
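+ // Segment the string and return the tokens joined by the given separator
+ public static String segment(String text,String sep) {
+ StringBuilder sb = new StringBuilder();
+ HanLP.Config.Normalization = true; // traditional->simplified, full-width->half-width, uppercase->lowercase
+ List<Term> tokens = NotionalTokenizer.segment(text); // segment and drop stop words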
+ for(Term token : tokens) {
+ sb.append(token.word+sep);
+ }
+ return sb.toString();
+ }
+
+ public static void main(String[] args) {
+ // TODO Auto-generated method stub
+ HanLP.Config.Normalization = true; // traditional->simplified, full-width->half-width, uppercase->lowercase
+ CustomDictionary.insert("4G", "nz 1000");
+ String text = "i am from china.СеķιèеľȴɡιЩС,i will go back HomeҐ ";
+ System.out.println(text);
+ // precise segmentation
+ List<Term> tokens = HanLP.segment(text);
+ System.out.println(tokens); // the stop-word dictionary is located at data/dictionary/stopwords.txt
+ for (Term token : tokens) {
+ System.out.print("("+token.word+","+token.offset+","+token.length()+")");
+
+ }
+ System.out.println();
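+ // automatically removes stop words; position information from the original text is lost
+ tokens = NotionalTokenizer.segment(text);
+ System.out.println(tokens); // the stop-word dictionary is located at data/dictionary/stopwords.txt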
+ for (Term token : tokens) {
+ System.out.print("("+token.word+","+token.offset+","+token.length()+")");
+
+ }
+ System.out.println();
+ // automatic sentence splitting + stop-word removal
+ for (List<Term> sentence : NotionalTokenizer.seg2sentence(text))
+ {
+ System.out.println(sentence);
+ }
+ // stop words in English text are removed as well
+ String str = Tokenizer.segment(text," ");
+ System.out.println(str);
+ }
+
+}
diff --git a/src/shingle/plag/edu/ShingleSim.java b/src/shingle/plag/edu/ShingleSim.java
index c9424bd..5b1c776 100644
--- a/src/shingle/plag/edu/ShingleSim.java
+++ b/src/shingle/plag/edu/ShingleSim.java
@@ -14,26 +14,16 @@
import java.util.Collections;
import java.util.List;
-import preprocess.plag.edu.IKAnalyzer;
import preprocess.plag.edu.TextExtractor;
+import preprocess.plag.edu.Tokenizer;
import utils.edu.FileIO;
-//import sim.edu.TestWinnowing.Fileter;
-//import preprocess.plag.edu;
import data.plag.edu.SimData;
-import de.tud.kom.stringmatching.gst.GST;
-import de.tud.kom.stringmatching.gst.GSTTile;
-import de.tud.kom.stringmatching.gst.utils.GSTHighlighter;
import de.tud.kom.stringmatching.shinglecloud.ShingleCloud;
-import de.tud.kom.stringmatching.shinglecloud.ShingleCloudMatch;
-import de.tud.kom.stringutils.preprocessing.WhiteSpaceRemovalPreprocessing;
-import de.tud.kom.stringutils.tokenization.CharacterTokenizer;
-import de.tud.kom.stringutils.tokenization.WordTokenizer;
-//import fengci.edu.IKAnalyzer;
public class ShingleSim {
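String dic = null; // path of the assignment directory
- float threshold = 0.3f; // threshold: 0.3
+ float threshold = 0.3f; // threshold, default 0.3
List filels = new ArrayList(); // files to compare
List listsd = new ArrayList(); // results of the file comparisons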
@@ -53,14 +43,17 @@ public void explore(File file) {
}
}
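- // File filter implemented as an inner class; accepts only doc, txt and docx files plus directories
+ // File filter implemented as an inner class; accepts only doc, txt, docx and pdf files plus directories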
class Fileter implements FileFilter {
@Override
public boolean accept(File arg0) {
// TODO Auto-generated method stub
- if (arg0.getName().endsWith(".doc") //
- || arg0.getName().endsWith(".txt")
- || arg0.getName().endsWith(".docx") || arg0.isDirectory())
+ String fn = arg0.getName().toLowerCase();
+ if (fn.endsWith(".doc") //
+ || fn.endsWith(".txt")
+ || fn.endsWith(".docx")
+ || fn.endsWith(".pdf")
+ || arg0.isDirectory())
return true;
return false;
}
@@ -70,7 +63,8 @@ public String processZHText(File file){
String resstr=null;
try {
String str = TextExtractor.getTxt(file);
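- resstr = IKAnalyzer.segment(str,true," "); // smart segmentation with stop-word filtering, tokens separated by spaces
+ //resstr = IKAnalyzer.segment(str,true," "); // smart segmentation with stop-word filtering, tokens separated by spaces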
+ resstr = Tokenizer.segment(str," ");
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
diff --git a/testdata/doccn/dongxiao-2.docx b/testdata/doccn/dongxiao-2.docx
new file mode 100644
index 0000000..f858587
Binary files /dev/null and b/testdata/doccn/dongxiao-2.docx differ
diff --git a/testdata/doccn/dongxiao-2.pdf b/testdata/doccn/dongxiao-2.pdf
new file mode 100644
index 0000000..dbed521
Binary files /dev/null and b/testdata/doccn/dongxiao-2.pdf differ
diff --git a/testdata/doccn/dongxiaogbk.txt b/testdata/doccn/dongxiaogbk.txt
new file mode 100644
index 0000000..e5c91dc
--- /dev/null
+++ b/testdata/doccn/dongxiaogbk.txt
@@ -0,0 +1,41 @@
+ʶ壨ƸӢļij12x2
+1.ԣSoftware Testinghttps://zh.wikipedia.org/wiki/%E8%BD%AF%E4%BB%B6%E6%B5%8B%E8%AF%95 壺ҲΪ˶ж P18
+ 2. Ԫ: unit testing http://www.igsgroup.com.cn/common/ISTQB%E8%BD%AF%E4%BB%B6%E6%B5%8B%E8%AF%95%E4%B8%93%E4%B8%9A%E6%9C%AF%E8%AF%AD%E5%AF%B9%E7%85%A7%E8%A1%A8v2.1.pdf
+ 壺ϸƹ˵飬ģҪ·Ʋģڲ P94
+ 3. ɲ: integration testing ͬ 壺ڵԪԵĻϣгģIJԣԪĽӿڹϵʹ֮Ҫ P25
+ 4. ϵͳԣsystem testing ͬ 壺ԼɵӲϵͳеIJ P26
+ 5. ղ: acceptance testing ͬ 壺ĿҪͺͬ˫ǩĵеIJԺ P26
+ 6. ܲԣfunctional testing ͬ 壺ܲԾǶԲƷĸ֤ܽݹܲԣƷǷﵽûҪĹܡ http://baike.baidu.com/view/651435.htm
+ 7. ںвԣblack-box testing ͬ 壺δ֪ڲṹеIJ P26
+ 8. вԣwhite-box testing ͬ 壺֪ڲṹеIJ P26
+ 9. ܲԣperformance testing ͬ 壺ڼϵͳеܡP135
+ 10. ԣtesting 壺ԼеƷв P158
+ 11.CMMCapability Maturity Model for Software ģ http://baike.baidu.com/view/8110.htm 壺֯ڶ塢ʵʩƺ̵ʵиչε http://baike.baidu.com/view/8110.htm
+ 12. ISO9000ϵ 壺TC176ϵίԱᣩƶйʱ http://baike.baidu.com/view/9486.htm
+⣺2x12
+1 ںвԺͰвԵЩʹúںвԸ?ЩʹðвԸ֣2
+ںвDz֪ڲṹв֪ڲṹ
+ںвԱڷ1Ƿвȷ©Ĺܣ2ڽӿϣǷȷĽܣܷȷĽ
+вڷ֣1ежȡ桱ȡ١ٲһ顣2ѭı߽еĽִѭ
+http://zhidao.baidu.com/question/13988876.html
+2 ɲԺϵͳԵϵ
+P132 ɲԶģĽӿڣϵͳԶϵͳɲԺϵͳԶõںв
+ʴ⣺(52)
+1 (10)ٲģͣԼľĿش⣺
+ ٲģͣоͼƻơ롪ԡά http://baike.baidu.com/view/551037.htm
+1 ʵĿЩΣȼĿ
+ һƱϵͳһʼʦ˵ҪоͷͬѧʼʦҪʲôȻиӦ뷨ƣʼ루룩ûбܲУԣ
+2 ΪԱдΪҪ3Σ˵ԭ
+ ƣ롣ֻ֪ԼҪʲô֪ԼҪʲôƣиģӣ֪ôŪ룬ȻdzԱܽгԱ
+
+2 (12)дԵ2ֲͬ壬ָǵϲһ֣Ϊʲô
+һ֣P18 Bill Hetzel ԵĿIJΪ˷ȱݺʹҲǶж
+ڶ֣ P18 Grenford J.Myers Ϊ֤д֤
+ڶƬ㡣ԭһȫ㣬Ϊǿ϶дģûbugģҸϲڶ
+
+3 (30)Vģͣ˵ԹǴĸοʼģϾĿʵĿоЩԽΣЩ͵IJԣ繦ܡڰеȣΪĸԽҪΪʲôP30
+ ûҪϸ롪Ԫԡɲԡϵͳԡղ
+ Ʊϵͳÿ࣬Ū֮϶ȼûдлᱨԪԣһЩࡢĵãܱܲãɲԣһԺƱIJԣܲܳɹϵͳԣʦղԣ
+ вԣ֮һûʲô
+ ںвԣʦʱûЧ
+ ҾûͺҪΪԽ緢֣ʧԽС
diff --git a/testdata/doccn/dongxiaoutf8-2.txt b/testdata/doccn/dongxiaoutf8-2.txt
new file mode 100644
index 0000000..952f2ff
--- /dev/null
+++ b/testdata/doccn/dongxiaoutf8-2.txt
@@ -0,0 +1,36 @@
+姓名:董晓 学号:112127130103
+
+名词定义(中文名称给出英文及定义的出处12x2)
+1.软件测试:Software Testing——出处:https://zh.wikipedia.org/wiki/%E8%BD%AF%E4%BB%B6%E6%B5%8B%E8%AF%95 定义:发现软件错误,也是为了对软件质量进行度量和评估 出处: P18
+ 2. 单元测试: unit testing 出处:http://www.igsgroup.com.cn/common/ISTQB%E8%BD%AF%E4%BB%B6%E6%B5%8B%E8%AF%95%E4%B8%93%E4%B8%9A%E6%9C%AF%E8%AF%AD%E5%AF%B9%E7%85%A7%E8%A1%A8v2.1.pdf
+ 定义:依据详细设计规格说明书,对模块内所有重要控制路径设计测试用例,来发现模块内部错误 P94
+ 3. 集成测试: integration testing 出处同上 定义:在单元测试的基础上,将所有程序模块进行有序、递增的测试,检验程序单元或部件的接口关系,使之符合要求 P25
+ 4. 系统测试:system testing 出处同上 定义:对集成的软件和硬件系统进行的测试 P26
+ 5. 验收测试: acceptance testing 出处同上 定义:按照项目要求和合同,供需双方签订的验收文档进行的测试和评审 P26
+ 6. 功能测试:functional testing 出处同上 定义:功能测试就是对产品的各功能进行验证,根据功能测试用例,逐项测试,检查产品是否达到用户要求的功能。 出处: http://baike.baidu.com/view/651435.htm
+ 7. 黑盒测试:black-box testing 出处同上 定义:未知程序内部结构进行的测试 P26
+ 8. 白盒测试:white-box testing 出处同上 定义:已知程序内部结构进行的测试 P26
+ 9. 性能测试:performance testing 出处同上 定义:用来测试软件在集成系统中的运行性能。P135
+ 10. α测试:αtesting 定义:对即将面市的软件产品进行测试 P158
+ 11.CMM:Capability Maturity Model for Software 能力成熟度模型 http://baike.baidu.com/view/8110.htm 定义:对于软件组织在定义、实施、度量、控制和改善其软件过程的实践中各个发展阶段的描述 http://baike.baidu.com/view/8110.htm
+ 12. ISO9000:质量管理体系标准 定义:由TC176(质量管理体系技术委员会)制定的所有国际标准。 http://baike.baidu.com/view/9486.htm
+简答题:(2x12)
+1 黑盒测试和白盒测试的区别?哪些错误使用黑盒测试更容易发现?哪些错误使用白盒测试更容易发现?各举2例。
+黑盒测试是不知道软件程序内部结构,白盒测试是知道软件程序内部结构。
+黑盒测试便于发现1、是否有不正确或遗漏的功能?2、在接口上,输入是否能正确的接受?能否输出正确的结果?
+白盒测试易于发现:1、对所有的逻辑判定,取“真”与取“假”的两种情况都能至少测一遍。2、在循环的边界和运行的界限内执行循环体
+http://zhidao.baidu.com/question/13988876.html
+2 集成测试和系统测试的区别和联系?
+P132 集成测试对象是模块间的接口,系统测试对象是整个系统。集成测试和系统测试都用到黑盒测试
+问答题:(52)
+1 (10)描述软件开发的瀑布模型,并结合自己参与的具体项目,回答以下问题:
+ 瀑布模型:可行性研究和计划—需求分析—设计—编码—测试—运行维护 http://baike.baidu.com/view/551037.htm
+(1) 实际项目开发经历了哪些阶段?(先简单阐述所做的项目)
+ 做一个航空售票系统。一开始老师说要求(可行性研究和分析),同学们听见后,开始分析老师想要什么东西(需求分析),然后脑子里大概有个相应的想法(设计),开始打代码(编码),最后检查有没有报错,看能不能运行(测试)
+(2) 作为程序员,依次写出你认为最重要的3个阶段,并说明原因?
+ 需求分析,设计,编码。需求分析,只有知道自己想要什么,才知道自己要做成什么东西;设计,有个大体的模子,才能知道该怎么弄;编码,既然是程序员,不编码能叫程序员吗。
+
+2 (12)写出软件测试的2种不同定义,指出它们的区别,你喜欢哪一种?为什么?
+第一种:P18 Bill Hetzel 提出测试的目的不仅仅是为了发现软件缺陷和错误,也是对软件质量进行度量和评估。以提高软件质量。
+第二种: P18 Grenford J.Myers 测试是为了证明程序有错,而不是证明程序无错误
+第二种片面点。原因:第一种提出更加全面点,因为软件是肯定有错的,不可能软件是没bug的,所以我更喜欢第二种