Configure the Lucene Analyzer
Introduction
Hippo Repository uses org.hippoecm.repository.query.lucene.StandardHippoAnalyzer as the default Lucene Analyzer for stored content. This analyzer strips stop words for English, German, Dutch, French, Spanish and Brazilian Portuguese. It also applies an ISO Latin 1 accent filter, which replaces accented letters with their unaccented equivalents, for example ç with c and ï with i.
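The effect of accent folding can be illustrated with a minimal sketch. It does not use the analyzer's own filter: it approximates the behavior with java.text.Normalizer from the JDK, and the class and method names (AccentFoldingSketch, foldAccents) are hypothetical. Letters that have no canonical decomposition are left unchanged here, so the result may differ from what the real filter produces for such characters.

```java
import java.text.Normalizer;

final class AccentFoldingSketch {

    // Hypothetical helper, not part of Hippo or Lucene.
    static String foldAccents(String input) {
        // NFD decomposition splits an accented letter into its base letter
        // plus one or more combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove the combining marks, keeping only the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e7 is c with cedilla, \u00ef is i with diaeresis.
        System.out.println(foldAccents("gar\u00e7on")); // prints "garcon"
        System.out.println(foldAccents("na\u00efve"));  // prints "naive"
    }
}
```

With folding applied both at index time and at query time, a search for "garcon" can match content containing "garçon".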
Customizing Stop Words of StandardHippoAnalyzer
The stop words of org.hippoecm.repository.query.lucene.StandardHippoAnalyzer are stored in one classpath resource file per language. You can customize the stop words by shadowing these resource files on the classpath:
- Default: classpath:org/hippoecm/repository/query/lucene/StandardHippoAnalyzer.properties
- English: classpath:org/hippoecm/repository/query/lucene/StandardHippoAnalyzer_en.properties
- Spanish: classpath:org/hippoecm/repository/query/lucene/StandardHippoAnalyzer_es.properties
- French: classpath:org/hippoecm/repository/query/lucene/StandardHippoAnalyzer_fr.properties
- German: classpath:org/hippoecm/repository/query/lucene/StandardHippoAnalyzer_de.properties
- Dutch: classpath:org/hippoecm/repository/query/lucene/StandardHippoAnalyzer_nl.properties
- Brazilian Portuguese: classpath:org/hippoecm/repository/query/lucene/StandardHippoAnalyzer_pt_BR.properties
- Czech: classpath:org/hippoecm/repository/query/lucene/StandardHippoAnalyzer_cs.properties
For example, the stop words file for English looks like the following:
# The delimiters to use when splitting stopwords.split.tokens value.
stopwords.split.delimiters=,
# Whether or not to preserve all the tokens including empty string token.
stopwords.split.preserveAllTokens=true
# Stopwords tokens.
stopwords.split.tokens=a,and,are,as,at,be,but,by,for,if,in,into,is,it,no,not,of,on,or,s,such,t,that,the,their,then,there,these,they,this,to,was,will,with,,www
For example, to add stop words such as "etc" or "ie", append them, separated by commas, to the stopwords.split.tokens property. To shadow the default file, add your customized copy as cms/src/main/resources/org/hippoecm/repository/query/lucene/StandardHippoAnalyzer_en.properties, for instance if cms/ is the only submodule containing the repository instance.
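To check how a customized file would be read, the following sketch parses the three properties with plain JDK classes. It is an illustration only and not the analyzer's actual parsing code: the class and method names (StopWordsFileSketch, readStopWords) are hypothetical, and it assumes the value of stopwords.split.delimiters is used as a single literal delimiter.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;
import java.util.regex.Pattern;

final class StopWordsFileSketch {

    // Hypothetical helper: reads stop words from the text of a properties file.
    static Set<String> readStopWords(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String delimiter = props.getProperty("stopwords.split.delimiters", ",");
        String tokens = props.getProperty("stopwords.split.tokens", "");
        boolean preserveAll = Boolean.parseBoolean(
                props.getProperty("stopwords.split.preserveAllTokens", "false"));

        Set<String> stopWords = new LinkedHashSet<>();
        // A limit of -1 keeps empty tokens, such as the one between ",," below.
        for (String token : tokens.split(Pattern.quote(delimiter), -1)) {
            if (preserveAll || !token.isEmpty()) {
                stopWords.add(token);
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        // Shortened token list with the custom words "etc" and "ie" appended.
        String customized = String.join("\n",
                "stopwords.split.delimiters=,",
                "stopwords.split.preserveAllTokens=true",
                "stopwords.split.tokens=the,with,,www,etc,ie");
        System.out.println(readStopWords(customized));
    }
}
```

In this sketch the empty token between the two consecutive commas is kept because stopwords.split.preserveAllTokens is true; setting it to false drops that token.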
Custom Lucene Analyzer
You can configure a custom language analyzer, for example one that also applies stemming. A side effect of stemming is that it breaks wildcard searching. Explaining why in detail is beyond the scope of this page, as it involves general concepts of inverted indexes such as Lucene's. We advise sticking to the StandardHippoAnalyzer if you want to avoid wildcard searching issues.
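Without going into index internals, the interaction between stemming and wildcards can be seen in a toy model. The sketch below does not use Lucene at all: the stemmer only strips a trailing "ing", the "index" is a plain list of terms, and all names (StemmingWildcardSketch, stem, index, prefixMatches) are hypothetical. It assumes, as is typical for wildcard queries, that the query prefix is compared against the indexed terms as typed, without being stemmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class StemmingWildcardSketch {

    // Toy stemmer (assumption): strips a trailing "ing" from longer words.
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (lower.endsWith("ing") && lower.length() > 5) {
            return lower.substring(0, lower.length() - 3);
        }
        return lower;
    }

    // Builds the list of indexed terms, with or without stemming.
    static List<String> index(List<String> words, boolean stemming) {
        List<String> terms = new ArrayList<>();
        for (String word : words) {
            terms.add(stemming ? stem(word) : word.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    // Models a wildcard query such as "walki*": the prefix is matched
    // against the indexed terms exactly as the user typed it.
    static boolean prefixMatches(List<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> content = Arrays.asList("walking");
        // Without stemming the term "walking" is indexed, so "walki*" matches.
        System.out.println(prefixMatches(index(content, false), "walki")); // prints "true"
        // With stemming only "walk" is indexed, so "walki*" finds nothing.
        System.out.println(prefixMatches(index(content, true), "walki"));  // prints "false"
    }
}
```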
Modify the Analyzer class
The Analyzer class is configured in the repository.xml file.
Change the value of
<param name="analyzer" value="org.hippoecm.repository.query.lucene.StandardHippoAnalyzer"/>
to the fully qualified class name of your analyzer.
See Repository deployment settings for how to use your customized repository.xml.