Zend Search Lucene and large result sets
Index size does not affect Lucene's search speed per-se: what matters is the frequency of the search terms. And terms tend to have larger frequencies in larger indexes. (Doug Cutting on java-user)
Suppose our index includes a keyword field that indicates a type, e.g. whether an index entry represents an article or a document. Queries are made on these type subsets, for example to match only articles that contain `lucene' in the title. Approximately 80-90% of all indexed documents are articles, and the index holds roughly 500'000 documents.
A Boolean query consisting of two mandatory subqueries, `title:lucene' and `type:article', takes unexpectedly long to execute. The delay can clearly be blamed on the `type:article' subquery, which matches a very large result set.
// Subquery on the type keyword field
$term1 = new Zend_Search_Lucene_Index_Term('t0', 'type');
$queryt1 = new Zend_Search_Lucene_Search_Query_Term($term1);
// Subquery on the title field
$term2 = new Zend_Search_Lucene_Index_Term('lucene', 'title');
$queryt2 = new Zend_Search_Lucene_Search_Query_Term($term2);
// Combine both; passing true as the second argument marks a subquery as required
$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($queryt1, true);
$query->addSubquery($queryt2, true);
Measurements:
| Query              | Time      |
| Subquery 1 (type)  | 10.5533s  |
| Subquery 2 (title) | 0.0889s   |
| Combined           | 11.68558s |
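Timings like these can be taken with a small wall-clock helper. The following is only a sketch: `timeCall()` is a hypothetical helper, not part of Zend Framework, and the `usleep()` call is a stand-in for the real workload, which in this setting would be the index's `find()` call with the query in question.

```php
<?php
/**
 * Run a callable and return the elapsed wall-clock time in seconds.
 * timeCall() is a hypothetical helper, not part of Zend Framework.
 */
function timeCall(callable $fn): float
{
    $start = microtime(true);
    $fn();
    return microtime(true) - $start;
}

// Stand-in workload; replace the body with the search call to be measured.
$elapsed = timeCall(function () {
    usleep(1000);
});

printf("%.5fs\n", $elapsed);
```

Measuring each subquery on its own, as well as the combined query, is what makes it possible to attribute the delay to a single term.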
Lucene's inherent way of retrieving documents is to search for every term of a query in turn, collect the results, and then compute unions or intersections according to the operators attached to the search terms. The complexity of the query syntax and its semantics probably rules out a reasonable search-within-search feature that would narrow the search space before the next term is evaluated (e.g. fetch all documents matching the title, then subtract all items not matching the type). Moreover, all limiting and sorting seems to be applied only after the full retrieval is completed.
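The retrieval scheme described above can be illustrated with plain arrays. This is a simplified sketch, not Zend Search Lucene's actual implementation: `intersectSorted()` and the sample id lists are assumptions made up for illustration. It shows that the full list for the broad term has to be built before the intersection with the small list can even start.

```php
<?php
/**
 * Intersect two lists of document ids, both sorted ascending.
 * Hypothetical helper for illustration only.
 */
function intersectSorted(array $a, array $b): array
{
    $result = array();
    $i = 0;
    $j = 0;
    $na = count($a);
    $nb = count($b);
    while ($i < $na && $j < $nb) {
        if ($a[$i] === $b[$j]) {
            $result[] = $a[$i];
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++;
        } else {
            $j++;
        }
    }
    return $result;
}

// Made-up data: the 'type' term matches 80% of a 1000-entry index ...
$typeMatches = array();
for ($id = 0; $id < 1000; $id++) {
    if ($id % 5 !== 0) {
        $typeMatches[] = $id;
    }
}
// ... while the 'title' term matches only a handful of entries.
$titleMatches = array(3, 10, 42, 500, 777);

// Both lists are fully materialised before they are combined.
$both = intersectSorted($typeMatches, $titleMatches);
print_r($both);
```

The intersection itself is cheap here. The cost sits in producing `$typeMatches`, which in the real index corresponds to several hundred thousand entries.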
However, Java Lucene offers various Filters that work with cacheable BitSet objects to post-process results efficiently, so the expensive `type' subquery could be implemented that way. Zend Search Lucene does not have filters (yet). For this particular case, using termDocs() to retrieve the ids of the documents matching the title, followed by a crude loop that skips all non-article entries, proved efficient (measured: 0.08670s).
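$hits = array();
$term = new Zend_Search_Lucene_Index_Term('lucene', 'title');
// Ids of all documents whose title contains the term
$docIds = $this->searchIndex->termDocs($term);
foreach ($docIds as $docId) {
    // Stop once $maxRes hits are collected ($maxRes is defined outside this excerpt)
    if (count($hits) >= $maxRes) {
        break;
    }
    // Load the stored document and keep it only if it is an article
    $doc = $this->searchIndex->getDocument($docId);
    if ($doc->type === 'article') {
        $hits[] = $doc;
    }
}
return $hits;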
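For comparison, the cached-filter idea known from Java Lucene could be approximated in PHP roughly as follows. This is a sketch under assumptions: `CachedIdFilter` is a hypothetical class, not part of Zend Search Lucene, and the closure returning a fixed id list stands in for the expensive lookup (with a real index, presumably the ids that termDocs() returns for the type term).

```php
<?php
/**
 * Hypothetical filter: caches the ids matching one term as a lookup set,
 * so later queries can post-filter their hits without re-running the term.
 * This class is NOT part of Zend Search Lucene.
 */
class CachedIdFilter
{
    private $loader;        // callable returning an array of matching doc ids
    private $idSet = null;  // built lazily: docId => true
    public $loads = 0;      // how often the expensive lookup actually ran

    public function __construct(callable $loader)
    {
        $this->loader = $loader;
    }

    /** Keep only the ids accepted by the filter, up to $limit results. */
    public function apply(array $docIds, int $limit): array
    {
        if ($this->idSet === null) {
            $this->idSet = array_fill_keys(call_user_func($this->loader), true);
            $this->loads++;
        }
        $kept = array();
        foreach ($docIds as $docId) {
            if (count($kept) >= $limit) {
                break;
            }
            if (isset($this->idSet[$docId])) {
                $kept[] = $docId;
            }
        }
        return $kept;
    }
}

// Stand-in for the expensive lookup of all ids matching the type term.
$articleFilter = new CachedIdFilter(function () {
    return array(1, 2, 4, 5, 7, 8);
});

// Made-up id lists, as two different title terms might return them.
$first  = $articleFilter->apply(array(2, 3, 4, 9), 10);
$second = $articleFilter->apply(array(5, 6, 7, 8), 2);
```

The expensive lookup runs once and is reused by every later query. The trade-off is that the cached set has to be rebuilt whenever the index changes, and that it needs to be stored somewhere between requests to pay off in a typical PHP setup.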







