Develop a Custom Collector

Introduction

Goal

Develop a collector to collect domain-specific targeting data about visitors.

Background

The Relevance Module includes a number of default collectors. However, effective use of the Relevance Module often requires collecting additional, domain-specific data about visitors. This page explains how to develop a custom collector in order to achieve this.

Maven Dependencies

Collectors must be part of the CMS (platform) application. The best approach is to add them to a separate Maven module for your own collector(s) and let the myproject-cms-dependencies module depend on that module. In a platform deployment without CMS, have the platform module depend on that separate module instead.

Add the following Maven dependency to the module that will contain your custom collector:

<dependency>
  <groupId>com.onehippo.cms7</groupId>
  <artifactId>hippo-addon-targeting-api</artifactId>
</dependency>

Optionally, if you want to extend the base class AbstractCollector (see Collector Class below), add the following dependency too:

<dependency>
    <groupId>com.onehippo.cms7</groupId>
    <artifactId>hippo-addon-targeting-collectors</artifactId>
</dependency>

Collector Configuration

Using the Console, add a node of type targeting:collector to /targeting:targeting/targeting:collectors.

The name of the node will be used as the ID of the collector.

Add a String property targeting:className to the new node. Its value should be the fully qualified name of your custom collector class.

/targeting:targeting/targeting:collectors:
  /mycollector:
    jcr:primaryType: targeting:collector
    targeting:className: org.example.MyCollector

Collector Class

Add a class that implements the interface com.onehippo.cms7.targeting.Collector. The class should have a constructor that takes a String and a JCR Node object. The String is the configured ID of the collector (e.g. mycollector). The Node is the configuration JCR node of the collector (e.g. /targeting:targeting/targeting:collectors/mycollector).

An alternative to implementing the Collector interface directly is to extend the base class AbstractCollector. Extending this class saves you from writing your own JSON serialization code (more about that below).

The example implementation below also extends AbstractCollector.

MyCollector.java:

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.servlet.http.HttpServletRequest;
import com.onehippo.cms7.targeting.collectors.AbstractCollector;

public class MyCollector extends AbstractCollector<MyTargetingDataImpl,
                                                   MyRequestData> {

    public MyCollector(String id, Node node) throws RepositoryException {
        super(id);
        // read any collector-specific configuration properties from the node
    }

    /**
     * Get the targeting data that this collector provides 
     * for the current request.
     * This allows decoupling of runtime request information 
     * and the generation of statistics.
     *
     * @param request               the <code>request</code> to inspect 
     *                              for new targeting information to add to the
     *                              data.
     * @param newVisitor
     * @param newVisit
     * @param previousTargetingData the previous collected data 
     *                              for this Collector for the 
     *                              current visitor, which can be null
     * @return processed request data, or {@code null} if 
     * no relevant data is available
     */
    @Override
    public MyRequestData getTargetingRequestData(HttpServletRequest request,
                              boolean newVisitor,
                              boolean newVisit,
                              MyTargetingDataImpl previousTargetingData) {
        // TODO: implement; returning null means no relevant data is available
        return null;
    }

    /**
     * Update the targeting data of this visitor 
     * with the request data gathered by
     * {@link #getTargetingRequestData(javax.servlet.http.HttpServletRequest, 
     * boolean, boolean, TargetingData)}
     *
     * @param targetingData the {@link TargetingData} to update. May be {@code null}
     *                      if this is the first time this collector is 
     *                      called for this visitor.
     * @param requestData   the requestData that resulted from processing 
     *                      the current request. May be {@code null}.
     * @return the updated {@link TargetingData}. Null if both 
     * the passed in {@link TargetingData} was null and there
     * was no new information to store.
     */
    @Override
    public MyTargetingDataImpl updateTargetingData(MyTargetingDataImpl targetingData,
                                        MyRequestData requestData)
                                        throws IllegalArgumentException {
        // TODO: implement; returning the data unchanged stores nothing new
        return targetingData;
    }
}

Your collector will most likely use its own objects to store targeting data and request data. In the example, these are called MyTargetingDataImpl and MyRequestData. MyTargetingDataImpl must be a POJO that can be mapped from Java to JSON and vice versa via com.fasterxml.jackson.databind.ObjectMapper.

The targeting data bean stores all targeting data of a visitor. The collector is responsible for updating the data in the bean. The bean implements the TargetingData interface, which only defines the method getCollectorId.

MyTargetingDataImpl.java:

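import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

public class MyTargetingDataImpl extends AbstractTargetingData {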

    @JsonCreator
    public MyTargetingDataImpl(@JsonProperty("collectorId") String collectorId,
                            ...) {
        super(collectorId);
    }

    // add any custom fields, getters, and setters.

}

The request data bean contains all data collected from a single HTTP request. This bean will be stored in the request log of the targeting engine.

MyRequestData.java:

public class MyRequestData {

    public MyRequestData(...) {

    }

    // add any custom fields, getters, and setters.

}
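
The null-handling contract described in the Javadoc of updateTargetingData can be sketched as follows. This is a minimal, library-free illustration, not the module's own code: CategoryRequestData, CategoryTargetingData, and UpdateContractSketch are hypothetical stand-ins for MyRequestData, MyTargetingDataImpl, and the collector, and the "visited categories" field is an assumption made purely for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for MyRequestData: what one request contributed.
final class CategoryRequestData {
    private final String category;

    CategoryRequestData(String category) {
        this.category = category;
    }

    String getCategory() {
        return category;
    }
}

// Hypothetical stand-in for MyTargetingDataImpl: everything known about a visitor.
final class CategoryTargetingData {
    private final String collectorId;
    private final Set<String> categories = new LinkedHashSet<>();

    CategoryTargetingData(String collectorId) {
        this.collectorId = collectorId;
    }

    String getCollectorId() {
        return collectorId;
    }

    Set<String> getCategories() {
        return categories;
    }
}

// Hypothetical stand-in for the collector; only the update step is shown.
final class UpdateContractSketch {
    private final String id;

    UpdateContractSketch(String id) {
        this.id = id;
    }

    CategoryTargetingData updateTargetingData(CategoryTargetingData targetingData,
                                              CategoryRequestData requestData) {
        if (requestData == null) {
            // No new information: return the existing data, which may itself be null.
            return targetingData;
        }
        if (targetingData == null) {
            // First time this collector stores something for this visitor.
            targetingData = new CategoryTargetingData(id);
        }
        targetingData.getCategories().add(requestData.getCategory());
        return targetingData;
    }
}
```

With this shape, the method returns null only when both arguments are null, which matches the Javadoc of updateTargetingData in the collector class above.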

JSON Serialization

The targeting and request data are serialized to JSON when communicating with the CMS UI and when they are persisted. The default serialization is based on Jackson and can be tuned using its annotations, such as @JsonCreator and @JsonProperty. Because annotations may not give sufficient control, may be inconvenient, or may not let you adapt data that was serialized in an older format, the actual serialization is delegated to the Collector implementation.

The methods that will be invoked for (de)serializing request & targeting data are (see the Collector interface):

T convertJsonToTargetingData(ObjectNode root, ObjectMapper objectMapper)
                                                        throws IOException;

JsonNode convertTargetingDataToJson(T data, ObjectMapper objectMapper)
                                                        throws IOException;

U convertJsonToRequestData(JsonNode root, ObjectMapper objectMapper)
                                                        throws IOException;

JsonNode convertRequestDataToJson(U data, ObjectMapper objectMapper)
                                                        throws IOException;

AbstractCollector provides default implementations of these methods. The MyTargetingDataImpl example shows how the collector ID should be passed to the AbstractTargetingData base class. Other properties can be set with setters that follow the JavaBeans convention, or initialized through additional @JsonProperty-annotated constructor parameters, which makes it possible to create immutable data structures.

Access the Collector's TargetingData in the HST Site Webapp

In general, you do not need to access targeting data in the HST site webapp. If you do need it, you can access the collector's targeting data via the TargetingProfile, for example:

TargetingProfile profile = TargetingStateProvider.get().getProfile();
Map<String, TargetingData> targetingData = profile.getTargetingData();
GeoIPTargetingData geoIpTargetingData = (GeoIPTargetingData)targetingData.get("geo");

However, since the TargetingData implementations live in the CMS (platform) webapp (since version 13.0.0; in earlier versions they were part of the HST site webapp), you cannot cast to the implementations. The code above works because GeoIPTargetingData is an interface that lives in the shared lib (hippo-addon-targeting-shared-api). If you want your own collector's targeting data to be available in the HST site webapp, extract an interface and make sure it is part of a Maven module that ends up in the shared lib. Note that all JSON annotations must be on the implementation, not on the interface, because the shared lib does not contain the Jackson jars.
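
The interface extraction described above might look roughly like the sketch below. It is library-free so that it can be read in isolation: the TargetingData interface here is a stand-in for the real one, and MyTargetingData, getVisitCount, and SiteSideAccessSketch are hypothetical names chosen for illustration.

```java
import java.util.Map;

// Stand-in for the real TargetingData interface, which only defines getCollectorId().
interface TargetingData {
    String getCollectorId();
}

// Hypothetical interface that would live in a Maven module ending up in the shared lib.
// It carries no Jackson annotations.
interface MyTargetingData extends TargetingData {
    int getVisitCount();
}

// Hypothetical implementation that would live in the CMS (platform) webapp.
// In a real collector, the Jackson annotations would go on this class.
final class MyTargetingDataImpl implements MyTargetingData {
    private final String collectorId;
    private final int visitCount;

    MyTargetingDataImpl(String collectorId, int visitCount) {
        this.collectorId = collectorId;
        this.visitCount = visitCount;
    }

    @Override
    public String getCollectorId() {
        return collectorId;
    }

    @Override
    public int getVisitCount() {
        return visitCount;
    }
}

// Site-side code refers to the shared interface only, never to the implementation class.
final class SiteSideAccessSketch {
    static int visitCount(Map<String, TargetingData> targetingData, String collectorId) {
        TargetingData data = targetingData.get(collectorId);
        if (data instanceof MyTargetingData) {
            return ((MyTargetingData) data).getVisitCount();
        }
        // No data collected yet, or data of another type.
        return 0;
    }
}
```

The instanceof check keeps the site code safe when the collector has not stored anything for the visitor yet.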

Replace Dots in Field Names with Underscore

Elasticsearch 2, used to store visits, does not allow dots in field names. Therefore collectors are required to make sure no field names with dots show up in serialized targeting data.

In some cases, collecting data whose names contain dots is unavoidable, for example when collecting a cookie named myapp.rememberme.cookie. In such cases, it is recommended that the collector replaces all dots with underscores. All default collectors included in the Relevance Module follow this best practice and are 'dot safe'.

If a field name containing a dot is encountered while processing serialized targeting data, the dot is replaced by an underscore and a warning is logged.
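
A collector can apply the replacement itself before the data is serialized. The helper below is a minimal sketch of that idea; DotSafeKeys and its methods are hypothetical names and are not part of the Relevance Module API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Relevance Module.
final class DotSafeKeys {

    // Replaces every dot in a field name with an underscore.
    static String toDotSafe(String fieldName) {
        return fieldName.replace('.', '_');
    }

    // Returns a copy of the map whose keys contain no dots; insertion order is kept.
    static <V> Map<String, V> withDotSafeKeys(Map<String, V> source) {
        Map<String, V> result = new LinkedHashMap<>();
        for (Map.Entry<String, V> entry : source.entrySet()) {
            result.put(toDotSafe(entry.getKey()), entry.getValue());
        }
        return result;
    }
}
```

Note that two keys that differ only in dot versus underscore (for example a.b and a_b) end up under the same name, in which case the later entry overwrites the earlier one.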

Avoid the Elasticsearch Mapping Explosion Problem

If Experiments or Trends are used, the request data will also be stored in Elasticsearch. If the serialization of your model contains objects with arbitrary keys, such as timestamps or query parameters, your collector is prone to cause the so-called Mapping Explosion Problem in Elasticsearch.

Elasticsearch dynamically updates the mapping (similar to a database schema) of a type whenever new fields are detected in incoming data. For each new field, Elasticsearch builds an index, which consumes memory, CPU, and disk space. Therefore, the number of fields for a type in Elasticsearch should remain bounded.

If your model serialization contains objects with unconstrained numbers of keys, this will lead to an unconstrained number of indices. There are several ways of avoiding this:

  1. The simplest way is to avoid having Map<> properties on your model beans at all.
  2. If you must use a Map<> property but are certain that the keys cannot be controlled by site visitors and that the number of distinct keys will remain relatively small, you can mark the getter for this property with the @LimitedKeySet annotation. This is useful if the keys are, for example, document types or are explicitly listed as a configuration option. The @LimitedKeySet annotation indicates that the property is safe to store in Elasticsearch.
  3. If your collector really requires arbitrary visitor-controlled map keys, you must write your own serialization logic by overriding the methods convertJsonToTargetingData(), convertTargetingDataToJson(), convertJsonToRequestData() and convertRequestDataToJson().

Example of @LimitedKeySet:

public class MyRequestData {
  @JsonProperty
  @LimitedKeySet("Keys are bounded because there are only 5 categories")
  private Map<String,Integer> categoryViewPercentage;
  ...
}
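
For the third option in the list above, one possible approach (an assumption for illustration, not something the Relevance Module prescribes) is to have the custom serialization emit an arbitrary-key map as a list of entries with a fixed shape. However many distinct keys visitors produce, the serialized data then only ever contains the two field names key and value. BoundedFieldsSketch is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Relevance Module.
final class BoundedFieldsSketch {

    // Turns {"utm_source": "x", "q": "y"} into
    // [{"key": "utm_source", "value": "x"}, {"key": "q", "value": "y"}].
    static List<Map<String, Object>> toKeyValueList(Map<String, ?> source) {
        List<Map<String, Object>> entries = new ArrayList<>();
        for (Map.Entry<String, ?> e : source.entrySet()) {
            Map<String, Object> entry = new LinkedHashMap<>();
            entry.put("key", e.getKey());     // visitor-controlled text becomes a value
            entry.put("value", e.getValue());
            entries.add(entry);
        }
        return entries;
    }
}
```

A collector could call such a helper from its convertRequestDataToJson() and convertTargetingDataToJson() overrides and apply the inverse transformation in the two convertJsonTo...() methods.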

Enable Alter Ego

To enable users to use the Alter Ego feature in the Experience manager to override data collected by your custom collector, you must also develop a collector plugin.
