-
Notifications
You must be signed in to change notification settings - Fork 102
搜索算法
Birdy edited this page Oct 1, 2019
·
9 revisions
- 1.搜索引擎解决什么问题
- 2.搜索的分类
- 3.搜索的过程
- 4.搜索的算法
- 顺序查找
- 分块查找
- 二分查找
- 插值查找
- 斐波那契查找
- 树表查找
- 哈希查找
- 顺序查找
- 5.文档与字段
- 6.分词
- 7.索引
- 反向索引
- 正向索引
- 8.查询
专门解决搜索领域的海量结构化,半结构化,非结构化数据的实时搜索问题.
索引分为三类:
-
基于树的索引(tree-based index)
-
基于哈希的索引(hash-based index)
-
基于倒排的索引(inverted-based index)
索引原理:对列值排序存储,数据结构={列值,行地址}.利用二分查找找到行地址,再根据行地址获取行数据.
-
顺序扫描
-
倒排索引
构建文档->分析文档->索引文档
构建查询->分析查询->检索查询
分词(Tokenization)->词元(Token)->语言处理->词项(Term)
对于英语,语言处理分为以下两步:
将单词缩减为词根形式(drivers->driver),称为Stemming;
将单词转变为词根形式(drove->drive),称为:Lemmatization;
Stemming和Lemmatization的异同:
* 相同之处
Stemming和Lemmatization都要使词汇成为词根形式.
* 不同之处:
Stemming基于规则(drivers->driver,driving->driver)
Lemmatization基于词典(drove->drive,driving->driver)
名称 | 说明 |
---|---|
IntPoint | |
FloatPoint | |
LongPoint | |
DoublePoint | |
BinaryDocValuesField | 单值 |
SortedDocValues | 单值 |
SortedSetDocValues | 多值 |
NumericDocValuesField | 单值 |
SortedNumericDocValuesField | 多值 |
StringField | |
TextField | |
StoredField |
评估资源icwb2-data中包含了 (Academia Sinica, City University, Peking University, Microsoft Research)四家单位的训练集,测试集.
在统一环境下评估若干分词软件,使用的模型/词库为各分词软件自带.
名称 | 说明 |
---|---|
// 全部匹配查询
Query query = new MatchAllDocsQuery();
TopDocs search = searcher.search(query, 1000000);
Assert.assertEquals(1681, search.totalHits.value);
// 全不匹配查询
Query query = new MatchNoDocsQuery();
TopDocs search = searcher.search(query, 1000);
Assert.assertEquals(0, search.totalHits.value);
名称 | 说明 |
---|---|
// 词项查询
Query query = new TermQuery(new Term("title", "Toy"));
TopDocs search = searcher.search(query, 1000);
Assert.assertEquals(1, search.totalHits.value);
// 范围查询
Query query = new TermRangeQuery("title", new BytesRef("Toa"), new BytesRef("Toz"), true, true);
TopDocs search = searcher.search(query, 1000);
Assert.assertEquals(22, search.totalHits.value);
// 前缀查询
PrefixQuery query = new PrefixQuery(new Term("title", "Touc"));
TopDocs search = searcher.search(query, 1000);
Assert.assertEquals(2, search.totalHits.value);
// 通配符查询
{
// *代表0个或者多个字母
Query query = new WildcardQuery(new Term("title", "*ouc*"));
TopDocs search = searcher.search(query, 1000);
Assert.assertEquals(2, search.totalHits.value);
}
{
// ?代表0个或者1个字母
Query query = new WildcardQuery(new Term("title", "?ouc?"));
TopDocs search = searcher.search(query, 1000);
Assert.assertEquals(2, search.totalHits.value);
}
// 正则查询
RegexpQuery query = new RegexpQuery(new Term("title", "To[a-z]"));
TopDocs search = searcher.search(query, 1000);
Assert.assertEquals(7, search.totalHits.value);
// 模糊查询
Query query = new FuzzyQuery(new Term("title", "Stori"));
TopDocs search = searcher.search(query, 1000);
Assert.assertEquals(32, search.totalHits.value);
名称 | 说明 |
---|---|
// 短语查询
// 设置短语之间的跨度为2,也就是说Story和The之间的短语小于等于均可检索
PhraseQuery build = new PhraseQuery.Builder().setSlop(2).add(new Term("title", "Story")).add(new Term("title", "The")).build();
TopDocs search = searcher.search(build, 1000);
Assert.assertEquals(2, search.totalHits.value);
// 多短语查询
Term[] terms = new Term[] { new Term("title", "NeverEnding"), new Term("title", "Xinghua,") };
Term term = new Term("title", "The");
// add之间认为是OR操作,即"NeverEnding", "Xinghua,"和"The"之间的slop不大于3
MultiPhraseQuery multiPhraseQuery = new MultiPhraseQuery.Builder().add(terms).add(term).setSlop(3).build();
TopDocs search = searcher.search(multiPhraseQuery, 1000);
Assert.assertEquals(2, search.totalHits.value);
// 相当于TermQuery,区别是使用SpanTermQuery可以得到词项的跨度信息
Query query = new SpanTermQuery(new Term("title", "Toy"));
TopDocs search = searcher.search(query, 1000);
Assert.assertEquals(1, search.totalHits.value);
// 临近查询(匹配域中[0,n]范围内的词项)
SpanQuery spanQuery = new SpanTermQuery(new Term("title", "Story"));
SpanFirstQuery firstQuery = new SpanFirstQuery(spanQuery, 5);
TopDocs search = searcher.search(firstQuery, 1000);
Assert.assertEquals(5, search.totalHits.value);
// 跨度查询
SpanQuery[] spanQueries = new SpanQuery[] { new SpanTermQuery(new Term("title", "The")), new SpanTermQuery(new Term("title", "Story")) };
{
// 不考虑顺序的情况
SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, false);
TopDocs search = searcher.search(nearQuery, 1000);
Assert.assertEquals(3, search.totalHits.value);
}
{
// 考虑顺序的情况
SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, true);
TopDocs search = searcher.search(nearQuery, 1000);
Assert.assertEquals(1, search.totalHits.value);
}
{
// 考虑间隔的情况
{
SpanNearQuery.Builder builder = SpanNearQuery.newOrderedNearQuery("title");
builder.addClause(spanQueries[0]).addGap(2).setSlop(5).addClause(spanQueries[1]);
SpanNearQuery nearQuery = builder.build();
TopDocs search = searcher.search(nearQuery, 1000);
Assert.assertEquals(1, search.totalHits.value);
}
{
SpanNearQuery.Builder builder = SpanNearQuery.newOrderedNearQuery("title");
builder.addClause(spanQueries[0]).addGap(3).setSlop(5).addClause(spanQueries[1]);
SpanNearQuery nearQuery = builder.build();
TopDocs search = searcher.search(nearQuery, 1000);
Assert.assertEquals(0, search.totalHits.value);
}
}
SpanQuery[] spanQueries = new SpanQuery[] { new SpanTermQuery(new Term("title", "The")), new SpanTermQuery(new Term("title", "Story")) };
{
SpanQuery spanQuery = new SpanTermQuery(new Term("title", "Angels"));
SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, true);
SpanNotQuery notQuery = new SpanNotQuery(nearQuery, spanQuery);
TopDocs search = searcher.search(notQuery, 1000);
Assert.assertEquals(1, search.totalHits.value);
}
{
SpanQuery spanQuery = new SpanTermQuery(new Term("title", "Day"));
SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, true);
SpanNotQuery notQuery = new SpanNotQuery(nearQuery, spanQuery);
TopDocs search = searcher.search(notQuery, 1000);
Assert.assertEquals(0, search.totalHits.value);
}
SpanQuery[] spanQueries = new SpanQuery[] { new SpanTermQuery(new Term("title", "The")), new SpanTermQuery(new Term("title", "Story")) };
SpanNotQuery leftQuery = null;
{
SpanQuery spanQuery = new SpanTermQuery(new Term("title", "Angels"));
SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, true);
leftQuery = new SpanNotQuery(nearQuery, spanQuery);
}
SpanNotQuery rightQuery = null;
{
SpanQuery spanQuery = new SpanTermQuery(new Term("title", "Day"));
SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, true);
rightQuery = new SpanNotQuery(nearQuery, spanQuery);
}
SpanOrQuery orQuery = new SpanOrQuery(new SpanQuery[] { leftQuery, rightQuery });
TopDocs search = searcher.search(orQuery, 1000);
Assert.assertEquals(1, search.totalHits.value);
名称 | 说明 |
---|---|
// 精确查询
Query exactQuery = IntPoint.newExactQuery("id", 1);
TopDocs search = searcher.search(exactQuery, 1000);
Assert.assertEquals(1, search.totalHits.value);
// 范围查询
Query rangeQuery = IntPoint.newRangeQuery("id", 501, 1000);
TopDocs search = searcher.search(rangeQuery, 1000);
Assert.assertEquals(500, search.totalHits.value);
// 集合查询
Query setQuery = IntPoint.newSetQuery("id", 1, 10, 100, 1000);
TopDocs search = searcher.search(setQuery, 1000);
Assert.assertEquals(4, search.totalHits.value);
名称 | 说明 |
---|---|
名称 | 说明 |
---|---|
名称 | 说明 |
---|---|