Magnolia 5.7 reached extended end of life on May 31, 2022. Support for this branch is limited; see the End-of-life policy. Please note that, to cover the extra maintenance effort, this EEoL period is a paid extension of the branch's life. Customers who opt for the extended maintenance need a new license key to run future versions of Magnolia 5.7. If you have any questions or wish to subscribe to the extended maintenance, please get in touch with your local contact at Magnolia.
This page describes how to configure the Content Indexer submodule of the Magnolia Solr module to index Magnolia workspaces and crawl a website. The Solr module allows you to use Apache Solr, a standalone enterprise-grade search server with a REST-like API, for indexing and crawling Magnolia content, especially when you need to manage assets in high volumes (100,000+ DAM assets).
Configuring Solr clients
From version 5.2, the Solr module supports multiple Solr servers/cores. You can configure a client for every server/core under Configuration > /modules/solr-search-provider/config/solrClientConfigs. It is recommended to have one client named `default`. This default client is used when no specific client is defined for the indexer, crawler, or search result page template.
If you need more servers/cores, duplicate the `default` client and change the `baseURL` property to point to the other server/core.
Node name | Value |
---|---|
solr-search-provider | |
config | |
solrClientConfigs | |
default | |
allowCompression | false |
baseURL | http://localhost:8983/solr/magnolia |
connectionTimeout | 100 |
soTimeout | 1000 |
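For example, to index into a second core you might duplicate the `default` client as follows (the client name `second` and the core name `anothercore` are illustrative, not defaults):

Node name | Value |
---|---|
solr-search-provider | |
config | |
solrClientConfigs | |
default | |
baseURL | http://localhost:8983/solr/magnolia |
second | |
baseURL | http://localhost:8983/solr/anothercore |

An indexer or crawler can then reference the `second` client instead of falling back to `default`.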
The value of the `baseURL` property must conform to the following syntax:
<protocol>://<domain_name>:<port>/solr/<solr_core_name>
If the Solr server is installed as described in Installing Apache Solr, the value is
http://localhost:8983/solr/magnolia
. For a description of the other properties, see the HttpSolrClient.Builder Javadoc and Using SolrJ - Common Configuration Options.
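As a minimal sketch, the `baseURL` syntax can be illustrated in plain Java. The helper below is illustrative only (it is not part of the Solr module or SolrJ); it composes a base URL from its parts and checks the result against the documented pattern:

```java
// Illustrative helper: composes a Solr client baseURL and validates it
// against the documented <protocol>://<domain_name>:<port>/solr/<solr_core_name> syntax.
// The host, port, and core name used in main() match a default local install.
public class SolrBaseUrl {

    static String baseUrl(String protocol, String host, int port, String core) {
        return String.format("%s://%s:%d/solr/%s", protocol, host, port, core);
    }

    static boolean isValid(String url) {
        // <protocol>://<domain_name>:<port>/solr/<solr_core_name>
        return url.matches("[a-z]+://[^/:]+:\\d+/solr/[^/]+");
    }

    public static void main(String[] args) {
        String url = baseUrl("http", "localhost", 8983, "magnolia");
        // prints: http://localhost:8983/solr/magnolia valid=true
        System.out.println(url + " valid=" + isValid(url));
    }
}
```

A real client would of course be built with SolrJ's `HttpSolrClient.Builder` using such a URL; see the Javadoc linked above.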
Indexing Magnolia workspaces
The Content Indexer module is a recursive repository indexer and an event-based indexer. You can configure multiple indexers for different sites and document types. The Content Indexer also allows you to crawl external websites using JSoup and CSS selectors. You then define field mappings that specify which fields are obtained from each node and how they are indexed in Solr.
IndexService
Both the indexer and the crawler use the IndexService to handle the indexing of content. A basic implementation is configured by default: `info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService`. You can define and configure your own IndexService for specific needs.
Implement the IndexService interface:
```java
public class I18nIndexerService implements info.magnolia.module.indexer.indexservices.IndexService {

    private static final Logger log = LoggerFactory.getLogger(I18nIndexerService.class);

    @Override
    public boolean index(Node node, IndexerConfig config) {
        ...
```
For a globally configured indexing service, register the IndexService in the configuration of the Content Indexer module. For your custom indexing service, set the `indexServiceClass` property (also described in the indexer properties table below):
Node name | Value |
---|---|
modules | |
content-indexer | |
config | |
indexServiceClass | info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService |
Indexer configuration
You can configure an indexer in Configuration > /modules/content-indexer/config/indexers. See an example configuration for indexing assets and folders in the DAM workspace, or the example below for indexing content in the `website` workspace:
Node name | Value |
---|---|
modules | |
content-indexer | |
config | |
indexers | |
websiteIndexer | |
clients | |
default | default |
fieldMappings | |
abstract | abstract |
author | author |
date | date |
teaserAbstract | mgnlmeta_teaserAbstract |
text | content |
title | title |
enabled | true |
indexed | false |
pull | false |
rootNode | / |
type | website |
workspace | website |
Properties
Property | Description |
---|---|
`enabled` | required. Set to `true` to enable the indexer. |
`indexed` | required. Indicates whether indexing was done. When Solr finishes indexing, the content-indexer sets this property to `true`. |
`nodeType` | optional. JCR node type to index. For example, if you were indexing assets in the Magnolia DAM you would set this to `mgnl:asset`. |
`pull` | optional, default is `false`. Pull URLs instead of pushing them. |
`assetProviderId` | optional, default is `jcr`. ID of the asset provider used when indexing assets. |
`rootNode` | required. Node in the workspace where indexing starts. Use this property to limit indexing to a particular site branch. |
`type` | required. Sets the type of the indexed content, such as `website`. |
`workspace` | required. Workspace to index. |
`indexServiceClass` | optional (Solr module version 5.2+). Custom IndexService used by this indexer. If not defined, the global one is used. |
`fieldMappings` | required. Defines how fields in Magnolia content are mapped to Solr fields: the left side is the Magnolia field, the right side the Solr field. You can use the fields available in the schema. If a field does not exist in Solr's schema, you can use a dynamic field such as `mgnlmeta_teaserAbstract`. |
`clients` | optional, default is `default`. Solr clients used by this indexer. Allows indexing content into multiple Solr instances. |
Crawler configuration
The crawler mechanism uses the Scheduler to crawl a site periodically.
You can configure crawlers in Configuration > /modules/content-indexer/config/crawlers/.
Crawler properties
Property | Description |
---|---|
`enabled` | required. When a crawler is enabled, a scheduler job is registered to crawl the site periodically. |
`depth` | required. The maximum depth of a page in terms of distance in clicks from the root page. This should not be too high; ideally 2 or 3 at most. |
`nbrCrawlers` | required. The maximum number of simultaneous crawler threads that crawl a site. 2 or 3 is enough. |
`crawlerClass` | optional, since version 3.0. Implementation of the crawler used to visit the site's pages. |
`catalogName` | optional, since version 3.0, default is `content-indexer`. Name of the catalog where the command resides. |
`commandName` | optional, since version 3.0. Command used to instantiate and trigger the crawler. |
`activationOnly` | optional, since version 3.0. If set to `true`, the crawler is triggered only during activation; no scheduler job is registered for this crawler. |
`delayAfterActivation` | optional, since version 3.0, default is `5`. Delay (in seconds) after which the crawler starts once activation is done. |
`cron` | optional, default is every hour. A CRON expression that specifies how often the site is crawled, for example `0 0 * * * ?` for every hour. CronMaker is a useful tool for building expressions. |
`type` | optional. Sets the type of the crawled content. |
`indexServiceClass` | optional, since version 5.2. Custom IndexService used by this crawler. If not defined, the global one is used. |
`clients` | optional, since version 5.2, default is `default`. Solr clients used by this crawler. Allows indexing content into multiple Solr instances. |
`<client>` | required. Name of the client. |
`fieldMappings` | required. Defines how fields parsed from the site pages are mapped to Solr fields: the left side is the Solr field, the right side a selector for the crawled site. |
`<field>` | required. Any CSS selector targeting an element on the page, for example `#main .text`. You can also use custom syntax to get content from inside attributes; for example, meta keywords can be extracted from the `content` attribute of the `meta[name=keywords]` element. |
`jcrItems` | optional, since version 3.0. List of JCR items. If any of these items is activated, the crawler is triggered. |
`name` | optional, since version 3.0. Name of the JCR item. |
`workspace` | required, since version 3.0. Workspace where the JCR item is stored. |
`path` | required, since version 3.0. Path of the JCR item. |
`authenticationInfo` | optional, since version 5.0.2. Authentication information that allows crawling a password-restricted area. |
`username` | required, since version 5.0.2. Username used to log into the restricted area. |
`password` | required, since version 5.0.2. Password used to log into the restricted area. |
`loginUrl` | required, since version 5.0.2. URL of the page with the login form. |
`usernameField` | required, since version 5.0.2, default is `mgnlUserId`. Name of the input field for entering the username in the login form. |
`passwordField` | required, since version 5.0.2, default is `mgnlUserPSWD`. Name of the input field for entering the password in the login form. |
`logoutUrlIdentifier` | required, since version 5.0.2, default is `logout`. String that identifies the logout URL. The crawler does not crawl URLs that contain this string. |
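Putting these properties together, a minimal crawler configuration might look as follows. This is a sketch: the crawler name, the `cron` value, and the selectors on the right side of `fieldMappings` are illustrative and must be adapted to the crawled site's markup:

Node name | Value |
---|---|
modules | |
content-indexer | |
config | |
crawlers | |
exampleCrawler | |
clients | |
default | default |
fieldMappings | |
title | h1 |
text | #main .text |
cron | 0 0 * * * ? |
depth | 2 |
enabled | true |
nbrCrawlers | 2 |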
Configuration of crawler commands
You can configure crawler commands in Configuration > /modules/content-indexer/commands/.
By default, the crawler mechanism is connected with the CleanSolrIndexCommand, which removes outdated documents (pages) from the index. The CleanSolrIndexCommand is chained before the CrawlerIndexerCommand.
Node name | Value |
---|---|
modules | |
content-indexer | |
commands | |
content-indexer | Note: The name of this folder (the catalog) is referenced by the crawler's `catalogName` property. |
<crawler-name> | Note: The name of this node is referenced by the crawler's `commandName` property. |
cleanSolr | |
class | info.magnolia.search.solrsearchprovider.logic.commands.CleanSolrIndexCommand |
crawler | Note: The name of this node is arbitrary. |
class | info.magnolia.module.indexer.crawler.commands.CrawlerIndexerCommand |
Properties for the cleanSolr command
| optional, since version 5.0.1, default value is Maximum number of documents to be checked. |
| optional, since version 5.5.1, default value is If set to If the |
| optional, since version 5.5.2, default value is If set to |
statusCodes | optional, since version 5.0.1. List of status codes. If a page returns any of the status codes listed, the page is removed from the index. By default there is no list, but if a page returns 404 at any time, the page is removed from the index. |
| optional, since version 5.5.4, default value is If set to |
| optional, since version 5.5.1, default value is If set to Normally, if the |
Crawling triggered by publishing (activation)
From version 3.0, crawlers can also be connected with the publishing (activation) process by adding `info.magnolia.module.indexer.crawler.commands.CrawlerIndexerActivationCommand` into the command chain of the activation command. By default, this is done for these commands:
- If you are using the Publishing module:
  - catalog: `default`, command: `publish` - configured under `/modules/publishing-core/commands/default/publish`
  - catalog: `default`, command: `unpublish` - configured under `/modules/publishing-core/commands/default/unpublish`
- If you are using the Activation module:
  - catalog: `default`, command: `activate` - configured under `/modules/activation/commands/default/activate`
  - catalog: `default`, command: `deactivate` - configured under `/modules/activation/commands/default/deactivate`
- catalog: `default`, command: `personalizationActivation` - configured under `/modules/personalization-integration/commands/default/personalizationActivation`
If you are using a custom activation command and wish to connect it with the crawler mechanism, you can use the `info.magnolia.module.indexer.setup.AddCrawlerIntoCommandChainTask` install/update task.