This course is currently in draft.

Introduction

This course is for search administrators and covers the skills required to maintain and troubleshoot existing Funnelback searches.

Special tips and hints will appear throughout the exercises as well, providing extra knowledge or advice. They will look like:

This box is used to provide links to further reading available on the current topic.
This box is used to provide tips or hints related to the current topic.
This box is used to provide important advice relating to the current topic. This includes advice to be aware of to avoid common errors.
This box contains advice specific to the Linux version of Funnelback.
This box contains advice specific to the Windows version of Funnelback.

What this workshop will cover:

  • Funnelback logs

  • Troubleshooting updates

  • Troubleshooting query-time issues

Prerequisites to completing the course:

  • FUNL101, FUNL201, FUNL202, FUNL203

  • Extensive experience with Funnelback implementation.

  • Basic system administration skills.

1. Collections, profiles and frontend services

There are a number of concepts that need to be understood before you can start managing your search.

The diagram below shows users interacting with a Funnelback search.

[Diagram: collections, profiles and frontend services]

Interaction between a user and Funnelback occurs via a frontend service that is configured within Funnelback. A frontend service provides the user interface and also defines the display and ranking options that apply when a search is run.

This service extends a search profile, which provides a view or scope of a collection. Profiles provide access to a sub-set of the items within the search indexes.

A collection provides a way of searching across one or more search indexes.

An index is built for each data source or repository that you wish to provide search across.

1.1. Frontend services and profiles

A frontend service within Funnelback directly relates to a service that is provided to end users when they perform a search query.

A frontend service includes independent:

  • search templates

  • ranking and display settings

  • synonyms

  • interface functionality: faceted navigation, best bets, curator rule sets

  • knowledge graphs

  • search analytics

For example, a search conducted on a university website might include results from a number of different content sources (websites, social media, and course experts). Separate searches can then be provided for different contexts within the university website - for example, a frontend service that only searches for courses (i.e. a course finder) or a search that covers the whole university.

A frontend service would exist in Funnelback corresponding to each of these searches. The services would contain their own usage information and configuration and would be treated as independent searches, even though they might share underlying content and indexes.

A search profile is essentially the same as a service, except it is not exposed to search administrators via the marketing dashboard.

Profiles are usually used for system-based search activities such as using Funnelback’s search index as a database for dynamic web page content.

System-based search activities don’t usually require custom templating or search analytics.

1.2. Collections

A search collection in Funnelback is similar in concept to a collection that might be housed within a library or museum.

A collection generally contains a set of items that are related in some way. In a library this could be all of the non-fiction books, or all objects relating to Charles Darwin. In Funnelback a collection is usually determined by the type of repository - e.g. a website or set of websites, a database or a fileshare repository.

Each collection in Funnelback is updated and indexed separately and can also be queried separately.

Collections in Funnelback generally include configuration to gather content from a single data source and produce a single search index that corresponds to the data source that needs to be searched.

However, there is a special collection type (a meta collection) that includes other collections - and is used to aggregate search indexes of child collections.

One or more profiles (and services) can be defined within a collection to scope down the index and also provide search interface functionality, ranking and analytics.

1.2.1. Standard collections

Standard collections are self-contained and include configuration plus live and offline indexes. They follow an update cycle where the update builds an offline copy that is swapped to live on successful completion.

Standard collection types are:

  • custom

  • database

  • directory

  • facebook

  • filecopy

  • flickr

  • local

  • twitter

  • youtube

  • web

Notes:

  • local collections do not have a gather/filter phase as the data is indexed directly at the (local) source. This also means that the live and offline data folders will be empty.

  • custom collections will normally follow the same update cycle as a standard collection but may also be linked to a push collection and work as a gather-only collection. This is dependent on how the custom gatherer is implemented.

Standard collection types: update process

Standard collections have a cyclic update process that gathers and filters content and builds an offline set of indexes, which is swapped with the live indexes upon successful completion of the update. Updates are usually scheduled using the OS task scheduler.

Standard collections usually have live and offline sets of data, indexes and logs stored within the collection’s data folder.

[Diagram: standard collection update process]
Standard collection files

All the files for a collection are contained within the following folders:

  • Configuration: $SEARCH_HOME/conf/<COLLECTION-ID>

  • Data: $SEARCH_HOME/data/<COLLECTION-ID>

  • Reports:

    • $SEARCH_HOME/admin/reports/<COLLECTION-ID>

    • $SEARCH_HOME/admin/data-report/<COLLECTION-ID>

  • Users:

    • relevant users from $SEARCH_HOME/admin/users

  • Other configuration files:

    • Domain redirects: $SEARCH_HOME/conf/redirects.cfg

Data folder structure:

  • Collection logs: $SEARCH_HOME/data/<COLLECTION-NAME>/log

  • Update logs (live view): $SEARCH_HOME/data/<COLLECTION-NAME>/live/log

  • Update logs (offline view): $SEARCH_HOME/data/<COLLECTION-NAME>/offline/log
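
For example, on a Linux install you could list the update logs for the live view of a collection like this (the collection name is a placeholder):

# list the live-view update logs for a hypothetical collection
ls $SEARCH_HOME/data/example-collection/live/log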

1.2.2. Gather-only collections

Gather-only collections are similar to a standard collection but only implement the gather and sometimes the filter phases of an update.

A gather-only collection is responsible for connecting to the data source and gathering the content that will be included in the index. It can optionally filter the content before submitting it to a linked push collection. This means that there are only logs for gathering and the collection will contain no data within the data folders.

Gather-only collection types are:

  • custom

  • slackpush

  • trimpush

Notes:

  • custom collections may be a gather-only collection and can be linked to a push collection. This is dependent on how the custom gatherer is implemented.

1.2.3. Push collections

Push collections have a transactional update model that involves content being added, updated and removed via an API. Push collections do not gather content but are responsible for building and maintenance of the search indexes.

[Diagram: push collections]

When content is added to a push collection, it is inserted into the current index generation. A push collection will have one or more index generations.

The backend structure of a push collection is similar to a meta collection, with each generation forming a component index of that meta collection where the components are dynamically updated as new generations are created and merged.

Push collections automatically merge and clean up the index generations in the background.

Push collections: update process

Push collections have a transactional update process that optionally filters content and builds and maintains the push collection indexes.

[Diagram: push collection update process]

The push API listens for requests and updates the indexes as requests are submitted.

When a request comes in it is queued. The index is then updated when the requests are committed. The diagram provides high level background on the processes that run when an API request comes in to maintain the content of the indexes. The changes are maintained within an index generation. A new generation is automatically spawned when certain conditions are met and merged either automatically or if an explicit index merge is requested via the API.

It is the responsibility of the submitting process to handle any errors that are returned by the push API. This includes retrying or queuing requests that were unsuccessful.
Push collection files
  • Collection logs: data/<COLLECTION_NAME>/log

    • Analytics, pattern analyser and knowledge graph update logs only

  • Update logs:

    • Generation logs: data/<COLLECTION_NAME>/live/log/<GENERATION>-/

    • Metalogs: data/<COLLECTION_NAME>/live/log/metalog/

1.2.4. Meta collections

Meta collections provide a way of combining the indexes of a set of collections housed within the same Funnelback server.

The combined index inherits a merged set of all the metadata classes and gscopes that are defined on each of the component collections.

Meta collections have a minimal set of index files consisting of:

  • index.sdinfo: indicates which indexes belong to the meta collection, and any relative weightings to apply (if -cool.21 is set)

  • spelling indexes (index.suggest) for the combined meta collection

  • auto-completion indexes for each profile (index.autoc_<PROFILE>)

Meta collections: update process

Meta collections are not updated directly, but by a component collection updating.

When a component collection reaches the meta-dependencies phase of a running update it triggers an update on each meta collection of which it is a component. The update causes the meta collection to rebuild the spelling index and the auto-completion index files for each profile that is set up on the meta collection.

Meta collection files
  • Collection logs: data/<COLLECTION_NAME>/log

  • Update logs (live view): data/<COLLECTION_NAME>/live/log

  • Update logs (offline view): data/<COLLECTION_NAME>/offline/log

2. Troubleshooting issues with collections

Collection issue troubleshooting usually starts with one of the following scenarios.

  • Responding to an error message, usually sourced from a log file. e.g.

    • An error message in a log such as Can’t access seed page.

    • A detailed Java stack trace.

  • Observing an unexpected behaviour, usually in the search results or reports. e.g.

    • Why is page X not returned in the search results?

    • Why are my results ranking in a particular way?

    • What are the weird queries showing in my analytics?

    • Why is my metadata missing in the search results?

    • Why is the metadata or summary truncated in the search results?

When you start debugging an issue it is often not obvious where to start looking. However, there are a few basic checks you can do as part of your initial investigation.

  1. If it’s obvious that it’s an update issue (as you’re responding to an error from your logs) then you can start investigating the issue.

  2. If the issue is showing up in the search interface then start at the frontend and work backwards. Make sure you start your investigation by accessing Funnelback directly with a query that triggers the issue - i.e. isolate Funnelback from anything that wraps it (see the tutorial below on how to do this).

  3. If your search has separate query processors make sure you are accessing those if you’re testing a frontend issue. If the issue is intermittent make sure you check each query processor individually as it could be an index that is out of sync.

  4. If you can see the error being returned by Funnelback when accessing one of Funnelback’s endpoints directly then check the underlying data model to see if the issue is reflected there.

    • If the issue is present when accessing search.html but not in search.json or search.xml then there is probably an issue in your search template.

    • If the issue is relating to missing or truncated metadata then check your query processor options to ensure that the interface is configured to return the fields (e.g. included in the summary fields list, and that the summary mode is set correctly) and that the buffer sizes (MBL, SBL) are large enough.

  5. If the error is reflected in the data model, check:

    • any hook scripts that might modify the data model

    • the index/configuration settings (e.g. metadata mappings) and the index and gather logs.

2.1. Update and query time issues

When troubleshooting a collection issue it is vital to determine if the error is a query-time or update-time issue.

Some problems could be the result of a query time or update time issue.

2.1.1. Query-time issues

  • Query-time issues are problems that occur when a query is processed by Funnelback. Query time issues include:

    • Search result ranking issues

    • Template errors

    • Missing metadata

2.1.2. Collection update issues

  • Update-time issues are problems that occur during the collection update. This includes:

    • Anything that results in an update failure email being sent, or a collection update failure recorded in the administration interface.

    • Update failures or missing documents as a result of network errors, timeouts or misconfiguration.

    • Missing metadata in the index (e.g. because it hasn’t been mapped, due to a filter error or truncated by the indexer).

2.1.3. Analytics update issues

Analytics update issues can show up in a number of ways:

  • Analytics reports that have empty data, or empty data beyond a certain date.

  • Analytics reports where the search keyword reports are generated but click reports are empty.

  • Search keyword reports that display weird queries.

  • Trend alert reports that are empty for all periods.

  • Location reports that indicate all traffic coming from a specific location.

  • Location reports that show internal IP addresses.

2.1.4. Knowledge graph update issues

Issues that occur during the update of knowledge graph are usually caused by one of the following:

  • Incorrectly configured / specified knowledge graph metadata. Every node in the knowledge graph must have a single FUNkgNodeLabel and one or more FUNkgNodeName values. Items that do not satisfy this will be excluded from the graph.

  • An inadequate memory allocation to the funnelback-graph service. The default allocation is only suitable for small knowledge graphs.

3. Troubleshooting techniques

3.1. Isolating Funnelback

It is not uncommon for Funnelback search results to be called indirectly by another system. This commonly occurs if the search results are fetched as JSON or XML or as an HTML fragment that is nested into a page within a CMS.

You can tell that the search results are being nested if the URL that you are accessing when viewing the search results doesn’t access one of Funnelback’s public search endpoints (/s/search.html, /s/search.xml, /s/search.json, /s/search.classic).

To isolate Funnelback you will need to know which Funnelback server(s) host the search indexes.

In this tutorial we’ll examine the process of determining the query that Funnelback received and how to view the raw Funnelback response.

Consider a site that nests Funnelback search results, either by reading in raw XML/JSON or nesting an HTML chunk containing the search results HTML.

  1. Open up a terminal session on the Funnelback server (or each of the Funnelback servers) that hosts the search index used by the search.

  2. Change to the folder containing the Jetty web server logs.

    cd $SEARCH_HOME/web/logs
  3. Decide on a test query (make sure it’s something that’s quite unique) to run and tail the web server access logs grepping for this query term. e.g. we’ll use a test query of funneltest. The following command will print out anything that is added to either the access.public.log or access.admin.log, but only if funneltest is present somewhere in the line.

    tail -f access.public.log access.admin.log | grep 'funneltest'

    Note: The above command is watching both the public and admin access logs - the incoming query should be on the public log if the search has been configured correctly. If the incoming query is in the admin log then the requests are being sent to the Jetty administration context (e.g. https://<FUNNELBACK-SERVER>:8443/s/search.html) - if this occurs the integration should be updated to call the public search interface.

  4. Return to your web browser and run a search for funneltest on the website you are trying to debug. You should see a line appear in your terminal that shows the request that was received by Funnelback. e.g.

    /opt/funnelback/web/logs/access.public.log:10.0.2.2 - - [25/Jun/2019:02:58:52 +0000] "GET /s/search.html?collection=directorysearch&query=funneltest&sort=title&num_ranks=50 HTTP/1.1" 200 133512 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36"
  5. From this information you can see that entering funneltest as a query on the website results in Funnelback running a query of /s/search.html?collection=directorysearch&query=funneltest&sort=title&num_ranks=50. To see the exact response from Funnelback, pass this to the Funnelback server URL. e.g. if your Funnelback server is

    http://search.mysite.com

    then access

    http://search.mysite.com/s/search.html?collection=directorysearch&query=funneltest&sort=title&num_ranks=50

    and inspect the response from Funnelback - this is what is returned to the system that wraps the response.

Once you have this response you can inspect it and confirm if there is an error in the response. If Funnelback’s response looks correct then the error is in the code that processes the Funnelback response and needs to be addressed by the site owner.

If the problem is reflected in Funnelback’s response then you can continue your investigation.

3.2. View the Funnelback data model

Viewing the underlying data model for a query is a vital technique for successfully debugging query-time issues.

The data model is accessed by viewing either the search.json or search.xml endpoints. The JSON endpoint is more useful because it also indicates the type of each field.

Viewing the data model is useful because it shows the raw values available for each element. If there is a problem present in the data model then it eliminates the template as being the problem. Note that hook scripts can modify the data model so it is often a good idea to disable them when testing (see below).

The data model endpoints are worth checking where any type of template is involved (including any Freemarker ftl templates that return custom JSON/XML) and also when the JSON or XML is consumed directly by another system (which may break something when consuming the data).
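
For example, the data model for the query used earlier in this course can be fetched directly (the server and collection names are the placeholders introduced in the previous tutorial):

# view the JSON data model for a test query
curl 'http://search.mysite.com/s/search.json?collection=directorysearch&query=funneltest'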

3.3. Disable your hook scripts

Hook scripts allow significant customisation of how a query runs and also to alter what is returned to the end user. Errors in a hook script are a common cause of query-time problems and it is often useful to temporarily disable hook scripts as part of the process of pinpointing the cause of an error.

To disable a hook script requires you to rename the hook script to something else so that it isn’t detected when you run the query.
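
As a minimal sketch, assuming a post-process hook script named hook_post_process.groovy in the collection’s configuration folder (rename it back once testing is complete):

# temporarily disable the post-process hook script by renaming it
mv $SEARCH_HOME/conf/<COLLECTION-ID>/hook_post_process.groovy $SEARCH_HOME/conf/<COLLECTION-ID>/hook_post_process.groovy.disabled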

Disabling the hook scripts (all of them or specific ones) can help you to determine if the hook script is the cause of the issue you are seeing. However if the hook script is significantly changing the way the query works or other things such as the template have dependencies on changes made by a hook script then disabling them may not offer much help.

3.4. Try using the default template

Attempting to view the search results using the default template is a good way of determining if there is an issue in a custom template.

The default template (simple.ftl) is usually customised as part of an implementation. In order to test with the default template you’ll need to copy the system default one back to the collection.

To do this, log in to the Funnelback server and copy $SEARCH_HOME/conf/simple.ftl.dist to the collection’s conf folder as a different template. e.g. copy it as default.ftl (note: before copying, make sure there isn’t an existing template called default, otherwise you might overwrite it):

cp $SEARCH_HOME/conf/simple.ftl.dist $SEARCH_HOME/conf/<COLLECTION-ID>/<PROFILE-ID>_preview/default.ftl

You can then test the same query but using the default template by setting &form=default in the query.

Note: if there are hook scripts you will also need to check if they do anything that is dependent on the template that is being used.

3.5. View the stored source document

If you are experiencing a content-related issue, viewing the source code of the stored document can be very useful in pinpointing the issue.

The simplest way to view the source document is to view the cached version of it, either by directly selecting view cached version from the search results (if this is exposed in the template), or by extracting the cache URL from the data model and then accessing this URL.

This will show you the content of the source document - which is what Funnelback scanned when creating the index.

This will sometimes be different from what you see when you access the live URL. Reasons for this could be:

  • The content has changed since Funnelback gathered the document.

  • A filter that processes the document has modified it (and there may be an error in the filter).

  • The content may be returned to Funnelback in a different format to what a connected user sees (for example because of Javascript in the page that alters the page DOM in memory or the content that it includes, because Funnelback is crawling with a different user agent, because Funnelback is crawling as a specific user).

3.6. Understand which collection to investigate

When troubleshooting a collection-related problem it is important to be able to figure out which collection you need to inspect.

If you are investigating a collection update failure then it will be obvious where you need to look. However if the issue is showing up in the search results then things become a bit more complicated.

If the problem relates to search result formatting, synonyms, best bets, curator rules, search sessions or any other front-end related feature then investigate the collection that is indicated by the collection parameter in the search query. This will often be a meta collection. If the problem is related to profile/frontend service functionality then make sure the correct profile is selected.

If the issue relates to auto-completion or extra searches then the issue may lie in the collection being queried (as above), or it could be in another collection - which will be determined by the collection/profile that is being used to source the auto-completion or extra search.

If the issue relates to search result content (such as missing metadata) then you will need to check the collection where the search result has been indexed. If you’re not sure what this is you can view the collection property of the individual search result in the data model. This will tell you which collection the result belongs to (if the results are aggregated in a meta collection).

3.7. Understand which view to investigate

Understanding the correct view to inspect is also very important when troubleshooting problems.

If the issue relates to search results seen by users, or an update that was successful, then inspect the live view of a collection.

If the issue relates to an update that failed, or you want to find out information about an update that is currently running, or you are interested in the previously successful update, then you need to inspect the offline view. Note: the state of the offline view will depend on whether an update of the collection is currently running.

Which view would hold the relevant files for the following cases?

  • The administration interface shows that my collection failed to update

  • When I search for a document it appears to be missing from the search results

  • The metadata that is displaying in the results is wrong

If the offline view contains the previously successful update then you can revert the search by swapping the live and offline views. This is achieved by running an advanced update on the collection and selecting the swap live and offline views option.

This can occasionally solve a problem where a change caused an update to omit required content. Swapping the views makes the previously live indexes live again and allows you to start off another update.

3.9. Funnelback support packages

Funnelback includes an option within the administration interface to export a support package. This is very useful when supporting installed (non-hosted) instances of Funnelback.

This is a useful tool for remote support. The support package is generated by an administrator that has access to the Funnelback server. To export a support package the administrator logs in to the administration interface, changes to the affected collection then clicks the download support package link available from the administer tab in the administration interface.

The generated support package bundles up the global configuration, collection specific configuration and various logs into a file that can be supplied along with a support request.

3.9.1. Using the support package

Support packages are supplied as .tar.gz (Linux) or .zip (Windows) files. When decompressed these contain the same folder structure as a standard Funnelback instance, but only include the files relevant to the exported support package.

This can be decompressed and inspected to diagnose issues and in many cases can be transferred onto a local Funnelback installation allowing the collection to be mirrored locally and tested.

Caveats:

  • Always make sure any testing is performed on the same version of Funnelback as the installed version (unless you are attempting to test if a newer version solves an issue).

  • It may not be possible to create your own mirror of the installed search if the content included in the collections is private or the configuration accesses private content.

  • There are sometimes issues exporting the support package on Windows which produces a corrupt support package. If you are having problems exporting a support package on Windows you can use a tool such as 7-zip to package up the log files and relevant configuration.

4. Troubleshooting specific issues

4.1. Search results are missing a particular URL

There are many reasons why a URL might be missing from search results including:

  • If the URL is rejected as it fails to match any include patterns

  • If the URL is rejected as it matches an exclude pattern

  • If the URL is rejected due to a match against robots.txt or <meta> robots directives, or because it was linked with a rel="nofollow" attribute

  • If the URL is rejected due to a match against file type/mime type rules

  • If the URL is rejected due to exceeding the configured maximum filesize

  • If the URL is killed (in a kill_exact.cfg or kill_partial.cfg) or detected as a duplicate of another page

  • If a crawler trap rule is triggered (because there are too many files within the same folder or too many repeated folders in the directory structure)

  • If a canonical URL is detected

  • If the URL redirects to another URL (and if this is rejected)

  • The URL may have timed out or returned an error when it was accessed

  • If an error occurred while filtering the document

  • The update may have timed out before attempting to fetch the URL

  • The URL may be an orphan (unlinked) page

  • If the SimpleRevisitPolicy is enabled then the crawler may not have attempted to fetch the URL if it was linked from a page that rarely changes

  • The licence may not be sufficient to index all of the documents

Tutorial: Investigate missing pages

Step 1: check to see if the URL exists in the search index

Try the /collection-info/v1/collections/<COLLECTION-ID>/url API call, which should report on whether the URL is in the index.

If the URL is in the index but you can’t see it in the search results first try searching for the URL using the u and v metadata classes - it could be that the item ranks really badly.

The u metadata class holds the hostname component of the URL and the v metadata class holds the path component of the URL. To search for http://example.com/example/file.html you could run the query: u:example.com v:example/file.html.

If you find it when searching for the URL directly then you might need to figure out why it’s ranking badly. SEO auditor can assist you with this investigation.

If you can’t find the URL using the u and v metadata classes check that the collection/profile you are searching isn’t scoped (e.g. via gscopes) or that hook scripts are not modifying your query.
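
As a sketch, the same URL-scoped check can be run directly against the data model (the server and collection names are placeholders, and the query is URL-encoded):

# search for the URL using the u and v metadata classes
curl 'http://search.mysite.com/s/search.json?collection=directorysearch&query=u%3Aexample.com+v%3Aexample%2Ffile.html'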

Step 2: check the index and gather logs

If it doesn’t show up as being in the index then the next thing to check is if the URL appears in the Step-Index.log for the collection that gathered the URL. It’s possible that the indexer detected it as a duplicate and removed it from the index, or that a canonical URL means that it was indexed with a different URL.
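
For example, assuming the index log sits within the view’s log folder (following the data folder structure listed earlier in this course):

# check whether the URL is mentioned in the index log
grep '<URL>' $SEARCH_HOME/data/<COLLECTION-ID>/<VIEW>/log/Step-Index.log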

If it’s missing from the Step-Index.log then check the gather logs. For a web collection start with the url_errors.log then look at the crawl.log.X files. You may need to grep the log files for the URL using a command similar to:

grep '<URL>' $SEARCH_HOME/data/<COLLECTION-ID>/<VIEW>/*.log

You may see some log messages similar to the following in the crawl.log.X files:

Rejected: <URL>

This usually means that the URL was rejected due to a match against robots.txt

Unacceptable: <URL>

This usually means that the URL was rejected due to a match against an exclude pattern or not matching any of the include patterns.

Unwanted type: [type] <URL>

This means that the URL was rejected due to a match against an unwanted mime type.

The url_errors.log file reports on specific errors that occurred (such as HTTP errors) when accessing URLs. The log lines follow a format similar to the line below and are fairly self-explanatory:

E <URL> [ERROR TYPE]

Step 3: access the URL directly using the DEBUG API

Funnelback provides an API call that can be used to assist in debugging http requests made by the crawler. This tool is particularly useful for debugging form-based authentication but it is also a very useful tool for debugging other missing URLs.

4.2. Search results are missing metadata

Empty or missing metadata in the search results is quite a common problem which can have a number of causes:

  • Funnelback hasn’t been configured to return the desired metadata classes by setting appropriate summary fields (-SF) and summary mode (-SM) query processor options.

  • The template hasn’t been configured to display the correct metadata classes.

  • The collection hasn’t been correctly configured to map the metadata fields to the metadata classes.

  • The metadata is not present in the source documents.

  • Metadata that is applied via external metadata is not correctly specified.

  • Any filters that generate/extract metadata are not working correctly.

Also, recall that metadata classes are case sensitive, so check that the case of the metadata class used in config matches what has been configured in the metadata mappings configuration.

Tutorial: Investigate missing metadata

This tutorial covers the different things to check when debugging missing metadata.

  1. Run a search that returns results that have metadata that is expected but missing.

  2. View the data model for the search and locate the result with the missing metadata. Check to see if the metadata is listed in the metaData and listMetadata elements of the search result. If the metadata is listed and showing in the data model but not showing to the end user then the template is misconfigured.

  3. If the metadata is missing from the data model, alter the data model URL to force the metadata summary mode and to return the fields that you are interested in (&SM=meta&SF=[<METADATA-FIELDS>]); an example URL is shown after this list. Update <METADATA-FIELDS> to be a comma-separated list of fields you wish to see. Again inspect the metaData and listMetadata elements. If the items are now present then you need to update the query processor options to set the SF value to add the missing metadata fields, and possibly also set an SM value (metadata should be returned by default for most queries unless SM is set to override the default settings). Remember these need to be set on the collection that is accepting the queries (often a meta collection rather than the component).

  4. If a missing query processor option isn’t the cause then we need to look at the index itself. The first thing to check is the metadata mappings on the source collection to verify that the metadata fields are mapped to the correct classes. If the metadata mapping is missing add the mapping then re-index the collection and this should fix the issue.

  5. If the metadata mappings are present then you’ll need to check the source data to see what Funnelback is indexing. The next step will depend on where the metadata comes from.

    1. Look at the cached version of the page and check the embedded metadata - this will show you what Funnelback indexed and will include metadata that was present in the source document and anything that was injected via filters and written into the page. Also check the HTTP headers at the bottom of the cached copy.

    2. Look at the live version of the page and check the embedded metadata - this will show you what is currently embedded in the page. Note: this could have changed since Funnelback indexed it but it’s worth checking here and then comparing to the cached version of the page.

    3. Check external_metadata.cfg for the collection to see if the metadata is attached to the page this way. If external metadata is being used to add the metadata then there may be an error in the external metadata file (such as multiple matching lines), or the URL may not match correctly. Check the Step-Index.log and locate the section where the external metadata rules are loaded. If there are problems with the external metadata file then messages will be printed to the log.

    4. If the missing metadata is injected via a filter then check the filter logs for any messages relating to filtering - if the filter had any syntax errors it won’t have been loaded at all. You may also need to modify the filter to print out more information if there’s an error in the logic implemented in the filter.
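
As an illustration of step 3 above, a data model URL that forces the metadata summary fields might look like the following (the server, collection and field names are placeholders):

http://search.mysite.com/s/search.json?collection=directorysearch&query=funneltest&SM=meta&SF=[title,author]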

4.3. Metadata is truncated in search results

Sometimes you will see metadata presented in the search results that is cut off or truncated. There are a number of different causes for this:

  • The template is truncating the metadata when it is displayed (in the Freemarker template, using something like the <@s.Cut> macro or one of the Freemarker built-in functions).

  • A hook script is truncating the metadata in the data model.

  • The metadata buffer is too small causing the metadata to be truncated when it is returned from the index.

  • The maximum field length for indexed metadata is too small (controlled by the -mdsfml indexer option).

Tutorial: Investigate truncated metadata

This tutorial goes through the steps required to debug truncated metadata.

  1. Run a search that results in metadata getting truncated.

  2. View the data model for the search and locate the result with the truncated metadata.

  3. Check to see if the metadata is truncated in the data model - if it is complete when viewing the data model but truncated in the search results that you viewed then check the search result template as this will be truncating the metadata when it’s displayed (e.g. with a <@s.Cut> or similar).

  4. If the metadata is truncated in the data model as well then the next thing to try is to increase the size of the metadata buffer. The metadata buffer size usually only needs to be altered when you have a large amount of metadata for a collection. Add &MBL=10000 to the URL for the data model results and press enter. Recheck the metadata field to see if it is still truncated. If the metadata field is no longer truncated then you will need to increase the size of the metadata buffer in the query processor options for the search (either in padre_opts.cfg for the profile, or in collection.cfg to set it at the collection level; experiment with smaller values of MBL to find an appropriate value, as large values will increase the memory required to run the search). Once an appropriate value is determined, set the buffer size using the -MBL query processor option (a sketch of this is shown after this list).

  5. If the metadata is still truncated even with a large metadata buffer then it is most likely that the truncation has occurred at index time (or it is truncated in the source data). However before you check this inspect the collection’s post-process and post-datafetch hook scripts to ensure that these are not responsible for truncating the metadata.

  6. If the hook scripts are not responsible try increasing the metadata field length indexer option (-mdsfml) on the collection that contains the result. Note that if you are searching a meta collection then you will need to change this setting on the component collection that includes the search result. After you change the -mdsfml value you will need to reindex the live view of the collection. The Step-Index.log will indicate if metadata has been truncated.

  7. After the re-indexing is complete recheck the searches - both at the data model level, then what is returned to the user. The metadata field should now be complete (or if it is still truncated then re-run through the steps above trying larger values of the settings).
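
As a sketch of step 4 above, once an appropriate value has been determined the buffer size could be set via the query processor options in collection.cfg (the value shown is only an example, and any existing query processor options for the collection must be preserved on the same line):

query_processor_options=-MBL=2000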

4.4. Curator rule / best bets is not working

Most issues with curator will be a result of:

  • curator rules that have not been published

  • trigger terms not matching the query terms

  • promote/remove URL actions operating on a URL that isn’t in the result set

  • templates that are not configured to display the curator exhibits/best bets

4.4.1. Check the triggers

If you find a curator rule is not triggering as expected then carefully check the trigger configuration. Things to watch out for include:

  • Be careful to note the match type as this affects how the keywords are considered for the trigger match

  • Define standard triggers in lowercase (the triggers are currently case sensitive by default)

  • For regular expression based triggers carefully check the pattern. Ensure that the pattern is configured to be case insensitive (prefixed with (?i))

Tutorial: Promote URL or remove URL action is not working

Before you start: check to ensure that the query isn’t processed using term at a time (TAAT) mode. Check the query_processor_options that are set in collection.cfg of the affected collection or in padre_opts.cfg of the affected profile for any of the following:

  • -daat=0

  • -service_volume=low

If either of these are set then the URL promotion and removal functionality will not work.
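
A quick way to check for these options on the server (configuration paths follow those used elsewhere in this course; the profile folder layout is assumed):

# look for term-at-a-time options in the collection and profile configuration
grep -E 'daat=0|service_volume=low' $SEARCH_HOME/conf/<COLLECTION-ID>/collection.cfg $SEARCH_HOME/conf/<COLLECTION-ID>/*/padre_opts.cfg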

If these are not set then:

  1. Run a search using search.json on the collection that has the curator rule defined for keywords that match the trigger.

  2. Check the question element of the data model and look for the following elements:

    • question.additionalParameters.promote_urls: this will be defined and set to the URL as defined in a promote URL action if the trigger matches.

    • question.additionalParameters.remove_urls: this will be defined and set to the URL as defined in a remove URL action if the trigger matches.

  3. If these are set as expected then the curator rule is functioning correctly.

If you don’t see the URL promoted or removed as expected then closely inspect the URL as this must exactly match the URL that is in the search result set.

Common reasons for URLs not being promoted or removed:

  • The URL doesn’t match the URL in the result set (this includes http/https differences in the URL) so it can’t be promoted or removed.

  • The URL is not returned in the results when the query is run (so it can’t be promoted).

Tutorial: Advert or simple message is not displaying
  1. Run a search using search.json on the collection that has the curator rule defined for keywords that match the trigger.

  2. Check the response element of the data model and look for the following elements:

    • response.curator.exhibits: this will be populated with any curator exhibits that are triggered by the query.

  3. If this is set as expected then the curator rule is functioning correctly.

If you don’t see the exhibit displayed in the search results then you need to check the search results template to ensure it is configured to display curator exhibits.

Tutorial: Best bet is not displaying
  1. Run a search using search.json on the collection that has the best bet defined for keywords that match the trigger.

  2. Check the response element of the data model and look for the following elements:

    • response.curator.exhibits: this will be populated with any curator exhibits that are triggered by the query. (A best bet is a curator exhibit that has a category of BEST_BETS)

  3. If this is set as expected then the curator rule is functioning correctly.

If you don’t see the best bet displayed in the search results then you need to check the search results template to ensure it is configured to display best bets.

Don’t be confused by the response.resultPacket.bestBets element which is used by the deprecated (v14.2 and earlier) best bets feature.

4.5. Troubleshooting sorting

4.5.1. Sorting all the results

By default the search results are grouped into tiers which are then sorted. This means that sorting of the results may seem odd because you choose to sort by title or date but the set of results returned doesn’t appear to reflect that: all the fully matching results are returned (sorted), then the results matching all but one constraint, and so on.

To sort all the results requires -sortall=true to be set as a query processor option.
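
A minimal sketch of enabling this in collection.cfg (any existing query processor options for the collection should be kept on the same line):

query_processor_options=-sortall=true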

4.5.2. Sorting by metadata field

  • Check the data model and confirm that the metadata is containing the values you expect

  • Check that sort isn’t overridden in a pre-process or pre-datafetch hook script

  • Check that the field you are sorting by is mapped in your metadata mappings, and that the type is appropriate (e.g. you haven’t set up a number type field that contains a text string). If it’s in a meta collection ensure that the type is the same in different components.

  • Check that the result sorting is not breaking when the result tier changes (result.tier data model element). If the sort is only working for the first tier then you may need to set the -sortall=true query processor option.

4.5.3. How can I sort by more than one field?

It is not possible to directly sort the search results by more than one key. If there is a need to do this then it can be achieved by creating a sort key at filter time (but this won’t be suitable for all use cases as the sort key(s) need to be precomputed and indexed as an extra metadata field.) See: Funnelback knowledgebase - Sorting by multiple keys

4.5.4. Random (shuffle) sort

When sorting randomly it is important to supply an rseed parameter to ensure that the random sort is applied per search session. If rseed is omitted then every search will be completely random meaning that pagination of the search results becomes meaningless.

4.6. Error message displayed on search results screen

4.6.1. An error occurred in the search system

This indicates something went wrong with the padre query. This could be an issue with the indexes (such as a corrupted or missing index) or it could be a result of padre returning malformed XML to the modern UI which then cannot be parsed into the data model.

Inspect the source code of the search results page - there may be additional information such as a Java stack dump inside HTML comments. Also inspect the modern ui logs for the collection.

It is also worth checking the search.json response which may include a more detailed stack dump.

The raw padre XML can be viewed by accessing the padre-sw.cgi endpoint. To access this edit your URL to replace search.html or search.json with padre-sw.cgi.
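
For example, using the placeholder server and collection from earlier in this course, the raw padre XML would be accessed via:

http://search.mysite.com/s/padre-sw.cgi?collection=directorysearch&query=funneltest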

4.7. Troubleshooting auto-completion

4.7.1. Auto-completion not working

  • Open up the browser debugger and view the network panel. Start typing into the search box and watch the network panel for any requests to suggest.json. If you can see requests being sent then view the response from one of these. The response should contain a JSON packet with any suggestions. If the JSON is empty then there are either no suggestions to return, or autocompletion may not have been generated. If you are using concierge auto-completion there will be a suggest.json request for each column of the auto-completion. Also check the parameters passed to suggest.json and confirm that the correct collection/profile parameters are passed (these are based on the Javascript configuration and won’t necessarily match the collection and profile used for the search query). An example request is shown after this list.

  • Open up the browser debugger and view the Javascript console. Refresh the page and check to see if there are any errors occurring that might stop the Javascript from processing.

  • Check the HTML source for the page and ensure that all the auto-completion resources and dependencies are being downloaded and that there are not multiple versions of any of the dependencies (e.g. JQuery) being imported. If there are multiple versions of JQuery the order in which the files are loaded is important. If a version of JQuery is loaded after auto-completion was initialised it overwrites the JQuery object and auto-completion won’t be set up correctly. When this happens there will be errors logged to the Javascript console relating to missing functions.
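
As a sketch of the request described in the first point above (the collection and profile values are placeholders, and the parameter names should be confirmed against what appears in your network panel):

http://search.mysite.com/s/suggest.json?collection=directorysearch&profile=_default&partial_query=fun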

4.7.2. Auto-completion suggestions not displaying

If suggest.json is returning suggestions but they are not displaying, or are displaying [Object object] then the problem is likely to be in the Javascript configuration for the auto-completion template or in the styling used within the template.

Also make sure that you check to ensure that all the fields you require are returned in the auto-completion JSON. It is possible that fields were missing in the CSV used to generate the auto-completion.

4.7.3. Removing terms from auto-completion

Non-CSV based auto-completion suggestions are based on words found within the spelling index. Terms can be added or removed from the set of suggestions by editing the spelling black and white lists for the collection that is used to generate the auto-completion.

4.7.4. Auto-completion not generated during update

Check that auto-completion=disabled is not set in the collection.cfg of the collection that is supplying the auto-completion to suggest.json.

Also check the Step-BuildAutocompletion logs for the collection and look for errors and also for the counts of how many suggestions were generated. If the suggestions are CSV based there could be an issue with the CSV file format. Also check the padre_opts.cfg on the profile to ensure that the index isn’t scoped in a way that it no longer contains any results. This will cause an empty suggestions file to be generated.
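
A quick check for the setting mentioned above (path as used elsewhere in this course):

# confirm whether auto-completion has been disabled for the collection
grep 'auto-completion' $SEARCH_HOME/conf/<COLLECTION-ID>/collection.cfg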

4.8. Troubleshooting hook scripts

When writing hook scripts it is critical to understand the request/response pipeline and ensure that any changes to the data model are made at an appropriate time. See: Funnelback knowledgebase - general advice for working with the data model for more information on what happens between each phase.

If you modify an element at the wrong point of the pipeline then the change may not have any effect, or may be overwritten by other operations.

Errors that occur within hook scripts will be logged to the collection’s modern ui logs. If an error occurs in a hook script the processing of that hook script is aborted but the query will still run and return results as though the hook script did not exist.

4.9. Troubleshooting saved searches, search and click history (search sessions and history)

The sessions and history functionality within Funnelback relies on HTTP cookies - when sessions are enabled on a collection a session ID is stored within this cookie and transmitted with all search queries, with the saved searches and history information being stored within a database on the Funnelback server.

If you are experiencing issues with the sessions and history functionality:

  • check if the same browser is being used (because the session ID is stored within a browser cookie) and that the user is not using incognito mode.

  • view the browser web developer tool and check both the console and network tabs for errors. The console will log any Javascript errors that occur while the network tab will display any errors returned by the API calls. If the API is returning an error there may be an issue with the session database. If there are Javascript errors in the page this could cause the sessions code to not run.

4.10. Template issues

4.10.1. Debugging template errors

Funnelback logs user interface issues to the modernui log. These are accessible via the log viewer from the collection log files for the collection. The relevant log files are:

  • modernui.Public.log contains errors for searches made against the public HTTP and HTTPS ports.

  • modernui.Admin.log contains errors for searches made against the administration HTTPS port.

In addition, there are also system-wide modern UI logs located with the web logs (available from the system menu in the administration interface).

On Windows these log files are located with the web logs (same folder as the system wide modern ui logs) and are named modernui.<COLLECTION-ID>.Public.log and modernui.<COLLECTION-ID>.Admin.log.

Funnelback also includes some settings that allow this behaviour to be modified so that some errors can be returned as comments within the (templated) web page returned by Funnelback.

The following configuration options can be set on your collection’s configuration while you are working on template changes:

ui.modern.freemarker.display_errors=true
ui.modern.freemarker.error_format=string

The error format can also be set to return as HTML or JSON comments.

The string error format will return the error as raw text, which results in the error message being rendered (albeit unformatted) when you view the page - this is the recommended option to choose while working on an HTML template as the errors are not hidden from view.

Setting these causes the template execution to continue but return an error message within the code.

Syntax errors in the template resulting in a 500 error are not returned to the user interface - the browser will return an error page. More details about the error will be logged in the modern UI logs. Be sure to check both the collection’s modern UI log and also the system wide modern UI log.

4.10.2. Data model log object

The data model includes a log object that can be accessed within the Freemarker template allowing custom debug messages to be printed to the modern UI logs.

The log object is accessed as a Freemarker variable and contains methods for different log levels. The parameter passed to the object must be a string.

The default log level used by the modern UI is INFO. It is not possible to change this on a per-collection level and changing the level globally requires administrative access.

This means when debugging your templates you will probably need to print to the INFO level - but don’t forget to remove your logging once you have completed your testing otherwise a lot of debugging information will be written to the log files for every single query.

e.g.

<#-- print the query out at the INFO log level -->
${Log.info("The query is: "+question.query)}
<#-- print the detected origin out at the DEBUG log level -->
${Log.debug("Geospatial searches are relative to:  "+question.additionalParameters["origin"]?join(","))}

Messages written via the log object will only be logged when accessing the search.html endpoint.

4.10.3. Template changes saved but not displaying

If the search template has been saved but changes are not being reflected when a search is run:

  • Check that the template has been published

  • Check the URL and verify that the correct collection/profile/form parameters are defined.

  • Try adding a simple change (such as prefixing a title with some debug print) and publishing this to ensure that a simple change is displayed

  • Check the modern UI logs for any errors generated when accessing the template

  • Check the underlying data model to see that any required data is present

  • If the template change relates to a more specific piece of functionality (such as IncludeURL or curator) then check that the feature is appropriately configured and not generating feature specific errors.

4.11. Extra searches

If extra search results are not showing:

  • Check the data model to see if the extra search is returned. Each extra search that runs should create an element beneath the extraSearches node at the top level of the data model.

  • Check the template to ensure that the Freemarker required to display the extra search has been added.

  • Check collection.cfg to ensure that the extra search is configured to run for the collection.

4.12. Troubleshooting IncludeURL issues

For problems with IncludeURL start by checking the generated source code for the search results to see if any errors are printed in html comments when the include is loaded. If an error occurs while attempting to fetch the include then the template will return whatever is in the IncludeURL cache for the particular URL (even if the cache has expired).

4.12.1. Error fetching include

Errors while attempting to fetch the IncludeURL could be caused by:

  • The included URL changing (or being redirected elsewhere)

  • Issues relating to HTTPS and the server’s SSL certificate

4.12.2. Updated include is not reflected in the search results template

If an include has been updated but is not displaying it could be that the cache has not yet expired. If you need to refresh the include:

  • Try changing the expiry value in the template and publish this (don’t forget to set it back to something sensible after you’ve got the include to update)

  • If this doesn’t work you could try adding a fake parameter to the URL (e.g. if you were including http://example.com/includes/header.html try updating the include to http://example.com/includes/header.html?1).

  • Finally, restarting the Jetty web service will completely clear the IncludeURL cache.

4.12.3. Include is not displaying correctly

If the include is loading but not displaying correctly:

  • Check the browser web developer tools and inspect both the browser console and network tab.

    • Look out for mixed content or CORS errors that may prevent linked resources such as JS and web fonts from displaying.

    • If there are lots of 404/not found errors to linked resources check that the include URL call is setting the convertrelative parameter and also inspect the code of what is being included to see if there are relative links that are not being converted correctly. Also check for <base href> tags that might break the converted links.

  • If the include is displaying the whole page of the include within the search results template check the start and end parameters on the IncludeUrl macro to ensure that they are configured correctly and match the source code returned in the include.

In some cases it is not possible to use IncludeURL on the remote page because it can’t be converted correctly to a suitable include. If this happens it may be necessary to create a local file to include, or to hardcode the included code within the search template.

4.13. Knowledge graph

4.13.1. Widget displays an error message

  • Ensure that source collections have FUNkgNodeLabel and FUNkgNodeNames metadata classes defined and that these are mapped correctly and the indexer has found items that have these fields defined.

  • Ensure that the KG is updated every time the source data is updated. If the search index updates to include additional items that have FUNkgNodeLabel and FUNkgNodeNames metadata fields and the knowledge graph is not updated, the widget may try to access nodes that have not been added to the graph database, as the search index is used to return the initial node and also for the widget search results screen.

  • Try running a null query search within the widget and see if any nodes are returned.

  • Check the knowledge graph update log for the correct service/view for errors. Items are only included in the knowledge graph if they have exactly one FUNkgNodeLabel value and one or more FUNkgNodeNames values. If an item is missing from the graph check the log for that specific URL and also check to see if any errors (such as Neo4J running out of memory) occurred.

  • Check the browser debugger and inspect any API calls listed in the network tab of the debugger for errors, or to see what was returned by the API.

4.13.2. Widget is not displaying properties

  • Ensure that the KG is updated every time the source data is updated. This includes updating it after any metadata mapping changes are made.

  • Check the browser debugger and inspect any API calls listed in the network tab of the debugger for errors, or to see what was returned by the API.

  • Ensure that any properties that you wish to display in the widget are mapped to metadata classes in the source collection.

4.13.3. Widget is missing relationships

  • Check that any custom relationships have been configured and published for the widget.

  • Ensure that the knowledge graph is updated after any changes to the relationship definitions.

  • Check the knowledge graph update logs for the relationship counts found.

  • Check the search results and inspect the values of the FUNkgNodeNames field. The individual values (see the result listMetadata["FUNkgNodeNames"] element) detected for FUNkgNodeNames are used for determining relationships, so the metadata field must be correctly split into individual values otherwise relationships won't be detected. For mentions relationships, any of these values must appear somewhere in the content as a substring match for a relationship to be created. For custom relationships, any of these values must exactly match the specified metadata field for the custom relationship for a relationship to be created.
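A quick way to check how FUNkgNodeNames has been split is to query the JSON endpoint and inspect the listMetadata element of a result. A sketch, assuming jq is available and substituting your own server, collection and query:

    # Hypothetical server, collection and query - inspect the individual FUNkgNodeNames values
    curl "https://funnelback-server/s/search.json?collection=example&query=darwin" \
      | jq '.response.resultPacket.results[0].listMetadata["FUNkgNodeNames"]'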

4.13.4. Knowledge graph updates

Tutorial: Investigate a failed knowledge graph update
  1. View the knowledge graph update log from the collection-level update logs. This will provide high level information on the status of an update.

  2. Additional information may be logged in the global neo4j debug log, which tracks access to the neo4j database (available by inspecting $SEARCH_HOME/logs/neo4j/debug.log).
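For example, to watch the graph database log while an update runs:

    tail -f $SEARCH_HOME/logs/neo4j/debug.log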

5. Collection updates

5.1. Investigate a failed collection update

For this scenario you are alerted to an issue because a collection update failure has been recorded. You might notice this either by observing a failed update status when logged in to the administration interface, or by receiving a failed update email (if the admin email for the collection is configured).

Tutorial: Investigate a failed collection update
  1. View the update log from the collection-level update logs. This will provide high level information on the status of an update.

  2. Check the update logs for the collection. These are contained within the live and offline log folders. The folder that you inspect will depend on the state of the collection and what you are trying to debug.

    Inspect the offline logs folder if:

    • The update failed and you are troubleshooting the cause of the failure.

    • The collection is currently updating and you wish to view logs relating to the currently live search indexes.

    • The collection is not currently updating and you wish to view the logs relating to the previously successful update.

    Inspect the live logs folder if:

    • The collection is not currently updating and you wish to view the logs relating to the currently live search indexes.

  3. Inspect the update.log of the relevant view (live or offline). This log provides the top level logging for each of the steps within the update pipeline. It should provide information on where within the update pipeline the error occurred.

  4. Use the information on where the error occurred to inspect the more detailed logs. For example, if the gather process failed then view the gather logs that correspond to the collection type. If the error occurred during indexing then inspect the indexing logs.
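In practice this usually means working through the log folders on disk, for example (the paths assume a default installation layout; substitute the collection ID and use the live folder where appropriate):

    # High level update log for the offline view of a failed update
    less $SEARCH_HOME/data/<COLLECTION-ID>/offline/log/update.log

    # Search all logs for that view for errors
    grep -ri error $SEARCH_HOME/data/<COLLECTION-ID>/offline/log/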

5.2. Debugging form-based authentication

Funnelback can be configured to perform form-based authentication when connecting to a website by configuring a form interaction. Form-based authentication covers situations where a user has to fill in an HTML form to log in to a website. The underlying authentication mechanism can vary and may include types of authentication such as SAML.

Form-based authentication can be quite tricky to configure and is notoriously difficult to debug, because there is often a Javascript layer built into the form that the implementer will need to account for when configuring the form interaction, as well as various redirects that may occur during authentication.

The best way to troubleshoot form-based authentication is to use Funnelback’s debug API, which will request a specified URL and show all the request and response headers, returned data and redirects that occur when the request is made.

This tutorial assumes that form interaction has been set up for the collection.

  1. Log in to the administration interface and select View API UI from the system menu.

  2. The Admin API calls will be listed. Scroll down then expand the debug section.

  3. The http-request call allows you to debug the set of requests that occur when Funnelback requests a URL listed inside the collection configuration using either a crawler.form_interaction.in_crawl.<GROUP-ID>.url_pattern or crawler.form_interaction.pre_crawl.<GROUP-ID>.url setting.

  4. Click on the GET /crawler/v1/debug/collections/{collection}/http-request heading to expand the API test form. Fill in the form with the collection ID (this will cause the API to load the form interaction configuration from the specified collection), the URL to test (this should be the URL that you use inside the form_interaction.cfg) and then select the level to debug. The different level values provide varying degrees of information: BODY (the default value) provides the most information, including request and response headers as well as the response data. It is often best to start with BASIC and gradually increase the level, as this will give insight into redirects that may be occurring and the values of any cookies returned in the HTTP headers.

The information that is returned by the debug call will provide an insight into what will need to be changed.
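The same call can also be made directly, for example with curl. This is a sketch: the host, port and credentials are placeholders, and the url and level parameter names mirror the fields shown in the API UI, so confirm them there before relying on them.

    # -G sends the parameters as a GET query string; --data-urlencode handles special characters in the URL
    curl -u admin -G 'https://funnelback-server:8443/crawler/v1/debug/collections/<COLLECTION-ID>/http-request' \
      --data-urlencode 'url=https://example.com/login' \
      --data 'level=BASIC'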

Common problems with form interaction include:

  • Javascript that builds up the request. The implementer will need to work out what request is actually built and put the resulting request URL into the form interaction configuration.

  • Circular redirects that may occur when requesting the URL. Some of the redirects may also rely on Javascript, which again must be worked around. It may also be that the URL set for form interaction needs to be changed to another URL that is produced as part of the Javascript processing.

5.3. Crawl runs fine for a while then everything times out

This could be caused by a number of things including:

  • The internet connection goes down during the crawl

  • There’s some network device that detects the crawl as a denial of service attack (so it starts silently dropping the crawler’s requests)

  • You’ve reached some crawl limit enforced at the customer’s end (e.g. 20 requests allowed per minute)

  • A session cookie required for the crawl has expired

  • The crawler has run out of memory

5.4. Workflow

Funnelback provides a workflow mechanism that allows a series of commands to be executed before and after each update phase that occurs during an update.

This provides a great deal of flexibility, allowing a wide range of additional tasks to be performed as part of the update.

These commands can be used to pull in external content, connect to external APIs, to manipulate the content or index, or generate output.

Effective use of workflow commands requires a good understanding of the Funnelback update cycle and what happens at each stage of the update. It also requires an understanding of how Funnelback is structured as many of the workflow commands operate directly on the file system.

Each phase in the update has a corresponding pre phase and post phase command.

The exact commands that are available will depend on the type of collection. The most commonly used workflow commands are:

  1. Pre gather: This command runs immediately before the gather process begins. Example uses:

    • Pre-authenticate and download a session cookie to pass to the web crawler.

    • Download an externally generated seed list

  2. Pre index: This command runs immediately before indexing starts. Example uses:

    • Download and validate external metadata from an external source

    • Connect to an external REST API to fetch XML content

    • Download and convert a CSV file to XML for a local collection

  3. Post index: This command runs immediately after indexing completes. Example uses:

    • Generate a structured auto-completion CSV file from metadata within the search index and apply this back to the index

  4. Post swap: This command runs immediately after the indexes are swapped. Example uses:

    • Publish index files to a multi-server environment

  5. Post update: This command runs immediately after the update is completed. Example uses:

    • Perform post update cleanup operations such as deleting any temporary files generated by earlier workflow commands

Workflow is not supported when using push collections.

5.4.1. Specifying workflow commands

Workflow commands are specified in the collection’s main configuration file (collection.cfg).

Each workflow command has a corresponding collection.cfg option. The commands take the form pre/post_phase_command=<command>. For example: pre_gather_command; post_index_command; post_update_command.

Each of these collection.cfg options takes a single value, which is the command to run.

The workflow command can be one or more system commands (commands that can be executed on a command line such as bash on Linux, or PowerShell or the cmd prompt on Windows). These commands operate on the underlying filesystem and will run as the user that runs the Funnelback service - usually the search user (Linux) or a service account (Windows).

Care needs to be exercised when using workflow commands as it is possible to execute destructive code.
When running commands that interact with Funnelback indexes (either via an index stem, or by calling the search via HTTP) ensure that the relevant view is being queried. Use $CURRENT_VIEW whenever the view needs to match the context of the update - this will be the case most of the time - and only hardcode live or offline when the command must always operate on that specific view.

The command that is specified in collection.cfg can contain the following collection.cfg variables, which should generally be used to assist with collection maintenance:

  1. $SEARCH_HOME: this expands to the home folder for the installation. (e.g. /opt/funnelback or d:\funnelback)

  2. $COLLECTION_NAME: this expands to the collection’s ID.

  3. $CURRENT_VIEW: this expands to the current view (live or offline) depending on which phase is currently running and the type of update. This is useful particularly for commands that operate on the gather and index phases as the $CURRENT_VIEW can change depending on the type of update that is running. If you run a re-index of the live view $CURRENT_VIEW is set to live for all the phases. For other update types $CURRENT_VIEW is set to offline for all workflow commands before a swap takes place. The view being queried on a search query can be specified by setting the view CGI parameter (e.g. &view=offline).

  4. $GROOVY_COMMAND: this expands to the binary path for running groovy on the command line.

If multiple commands are required for a single workflow step then it is best practice to create a shell script (Linux) or batch or PowerShell file (Windows) that scripts all the commands, taking the relevant collection.cfg variables as parameters so that you can make use of the values from within your script.
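As a hedged illustration, a pre-index workflow step that hands the standard variables to a wrapper script might look like the following. The script name, its location and the downloaded URL are hypothetical; only the collection.cfg option name and the variables are taken from above.

    # collection.cfg
    pre_index_command=$SEARCH_HOME/conf/$COLLECTION_NAME/@workflow/pre_index.sh $SEARCH_HOME $COLLECTION_NAME $CURRENT_VIEW

    # conf/<COLLECTION-ID>/@workflow/pre_index.sh (hypothetical wrapper script)
    #!/bin/bash
    SEARCH_HOME="$1"; COLLECTION="$2"; VIEW="$3"
    # e.g. fetch externally maintained metadata before the index is built
    curl -o "$SEARCH_HOME/conf/$COLLECTION/external_metadata.cfg" "https://example.com/metadata.txt"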

5.5. Debugging workflow commands

There are many things that can cause problems with workflow commands. Having a good understanding of Funnelback’s update phases is key to writing effective workflow.

Common problems include:

  • Failing to use collection.cfg variables, particularly $CURRENT_VIEW.

    It is very important to use the $CURRENT_VIEW variable for commands that operate on the live or the offline view depending on the type of update being run. $CURRENT_VIEW should always be used unless the command must always operate on a specific view. For example, if you are running a command that always needs to modify the live index then live should be hardcoded into the command instead of using $CURRENT_VIEW.

  • Running the workflow command in the wrong phase. Understanding what occurs during each update phase is critical to ensure that the commands operate correctly.

6. Debugging filtering

There are a number of things to check if filtering appears to not be working as expected:

  • Start by checking the filter logs - this will vary from collection to collection but typically it’s the crawler.central.log for web collections and the gather log for other collection types. Inspect the file for errors and search within the log for the specific URL.

    If there is an error in the log it probably means that there is an error within the filter code. The solution is to fix the code (and this obviously depends on the error). Note: it’s not uncommon to see errors generated in the filter logs for the accessibility auditor and also for Tika conversion (from converting binary files to text).

  • Other causes of filtering issues could be:

    • The filter does not run on the specific URL because it fails the programmed test for the filter (e.g. wrong mime type)

    • The filter is missing from the filter chain

    • The filter is included in the filter chain but the filters are not specified in the correct order, e.g. a filter that analyses document text won’t work if it runs before the document is converted from binary to text.

If a filter error occurs it may result in the document being skipped and ultimately be missing from the index.
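As a sketch only: the filter chain is normally controlled by the filter.classes option in collection.cfg, and the order of entries determines when each filter runs. The custom filter name below is hypothetical and the default chain varies between versions, so check the filtering documentation for your release before copying this.

    # Hypothetical chain: a custom filter placed after the binary-to-text conversion step
    filter.classes=TikaFilterProvider,ExternalFilterProvider:MyCustomFilter:JSoupProcessingFilterProvider,DocumentFixerFilterProvider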

7. Troubleshooting analytics

7.1. Empty search query reports

Empty search query reports could be caused by a number of things:

  • Ensure that analytics is scheduled to update. Select schedule automatic updates from the system menu in the administration interface and ensure that query reports update is scheduled to run.

  • Ensure that you are viewing the search query report for the collection/profile that receives the search queries (check the URL that is accessed when you run a query and look at the collection and profile parameters - this will indicate where the analytics are logged).

  • Ensure that queries are not being logged to a preview profile - only live profile analytics are displayed in the marketing dashboard.

  • Check that scheduled analytics is not disabled for the collection (analytics.scheduled_database_update collection.cfg option).

  • Check the update_reports.log of the collection that is logging the queries for any errors. Errors that may cause the analytics update to fail include:

    • Running out of memory (this can be increased using analytics.max_heap_size collection.cfg option).

    • Corrupted log file lines (this may require some manual cleaning of log files).

  • Try manually updating analytics for the collection then check the update_reports.log.

  • Check that the reporting blacklist isn’t excluding all of your queries.
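For reference, the two collection.cfg options mentioned above might be set along these lines. The values are illustrative only - check the documentation for the expected units (the heap size is assumed here to be in MB).

    # Allow scheduled analytics updates and give the reports update more memory
    analytics.scheduled_database_update=true
    analytics.max_heap_size=2048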

7.2. Empty click reports

Empty click reports are usually a result of click links not being used in the search results template. When a search result is clicked it needs to use the click URL (similar to http://<FUNNELBACK-SERVER>/s/redirect with a set of parameters), which logs the click before redirecting the user to the target URL.

A failed analytics update (see the information for empty search query reports above) could also result in empty click reports.

Low click volumes could simply be a result of not many results actually being clicked on. A highly effective concierge auto-complete can mean that users rarely need to click on results.

7.3. Empty trend alert reports

Empty trend alert reports could be caused by:

  • An issue with trend alerts updating (check the logs)

  • It is also possible that no spikes have been detected in the logged queries as there are a number of factors which combine to generate a trend alert.

  • Check that there is a cron job or scheduled task for the update outliers task, which should run once per hour. Note: this is not shown on the schedule automatic updates screen in the administration interface as it is an internal task.

7.4. All queries coming from a single or small number of IP addresses

This is likely to be a poorly configured integration where all traffic comes via another system (e.g. the search is run via a CMS which makes the request to Funnelback on behalf of the user). For this type of integration it is important that the system sends an X-Forwarded-For header along with the search query so that the correct remote address is logged by analytics.
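For example, a server-side integration making the request on behalf of a user might pass the original client address like this (hypothetical host, collection and query):

    curl -H "X-Forwarded-For: 203.0.113.42" \
      "https://funnelback-server/s/search.html?collection=example&query=enrolment"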

7.5. Blacklisted IP addresses and queries showing in analytics reports

Changes to the reporting blacklist require search analytics to be rebuilt for the changes to be retrospectively applied.

7.6. Search query reports display weird queries

Analytics search query reports will sometimes display strange looking queries. The queries displayed in the report are specified using Funnelback’s query language - so advanced queries (such as those submitted via advanced search forms or via some integrations) will often contain query language operators. These advanced parameters should ideally be passed as system query parameters or directed to a different profile (so that the analytics only report on what the user actually typed into a search box).

7.7. How are queries counted?

Queries are counted by session - this means that if a user accesses multiple pages of the search results, this will only count as a single query in the search analytics.

If the query is modified at all (including applying facets or other constraints) then this will count as a new query, though it may look like the same query in the analytics if the query term is the same and other constraints are set as system parameters.

7.8. Search analytics updates

  • The update_reports.log for the collection captures messages generated during an analytics update.

8. Managing system services

Funnelback utilises a number of services that are installed alongside the product on the server.

8.1. Starting/stopping/restarting services

Starting and stopping of services in Funnelback requires administrator access to the server.

Funnelback uses four services:

  • funnelback-daemon

  • funnelback-jetty-webserver

  • funnelback-graph

  • funnelback-redis

All services need to be running for correct functioning of Funnelback.

The order in which services are started and stopped is not important.

Under Linux:

  • Services are installed with other system services.

  • Services are started (as the root user) using

    /bin/systemctl start <servicename>
  • Services are stopped (as the root user) using

    /bin/systemctl stop <servicename>
  • Services are restarted (as the root user) using

    /bin/systemctl restart <servicename>
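For example, to restart the Jetty web server (which, as noted earlier, also clears the IncludeURL cache):

    /bin/systemctl restart funnelback-jetty-webserver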

Under Windows:

  • Services are installed with other system services.

  • Services are controlled (as administrator) using the Microsoft Windows Services Console (services.msc) in the same manner as other Windows services.

9. Managing users

Access to the Funnelback administration interface is controlled via administration interface user accounts.

The administration interface user accounts are Funnelback-specific accounts (LDAP/domain accounts are not currently supported).

Funnelback administration users have access to various administration functions depending on the set of permissions applied to the user. Sets of permissions can be grouped into roles and applied to a user. This allows default permission sets to be defined and reused across many users.

Funnelback ships with five default permission sets:

  • default-administrator: full access to all available permissions.

  • default-analytics: read only access to analytics.

  • default-marketing: provides access to functionality available via the marketing dashboard.

  • default-support: provides access to functionality suitable for users in a support role.

  • default-implementer: provides access to functionality suitable for most implementers.

Roles can also be used to define sets of resources (such as collections, profiles and licences) that may be shared by multiple users.

When creating a user it is best practice to separate the resources and permissions into different roles and combine these. This allows you to create a role that groups the collections, profiles and licences available for a set of users and combine this with one or more roles that define the permission sets that apply for the user.

10. Maintenance tasks

10.1. Rename a collection / profile / service

The renaming of collections/profiles/services is not recommended and risks losing analytics and WCAG data. Consider using generic naming when creating collections (e.g. for the Department of Health consider using an ID of health instead of doh for the collection ID). Using a generic name future-proofs against any need to change names.

If a new name is required it is better to create a new collection and migrate the configuration and analytics from the old collection to the new collection.

10.2. Rebuild the analytics database

  1. Log in to the administration interface and switch to the affected collection

  2. Edit the collection.cfg by selecting browse collection configuration files from the administer tab, then click on the collection.cfg link in the file manager display.

  3. Disable incremental analytics updates by adding the following line then save the file:

    analytics.reports.disable_incremental_reporting=true
  4. Select update analytics now from the analytics tab.

  5. The analytics reports for the collection will contain data once the database has finished rebuilding. This can take a while for large collections.

  6. During the rebuild messages are logged to

    $SEARCH_HOME/data/COLLECTION_NAME/log/update_reports.log
  7. Once the analytics build is finished edit the collection.cfg and remove the configuration line added above, or set

    analytics.reports.disable_incremental_reporting=false

10.3. Migrating analytics from one collection to another

  1. Move the query and click log files from $SEARCH_HOME/data/<SOURCE-COLLECTION>/archive to $SEARCH_HOME/data/<DESTINATION-COLLECTION>/archive

  2. Log into the administration interface and switch to <DESTINATION-COLLECTION>

  3. Rebuild the analytics database for <DESTINATION-COLLECTION>
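On Linux, step 1 might look like the following sketch (adjust the collection IDs and check that the destination archive folder exists and that file ownership is preserved):

    # Move historical query and click logs to the destination collection's archive
    mv $SEARCH_HOME/data/<SOURCE-COLLECTION>/archive/* \
       $SEARCH_HOME/data/<DESTINATION-COLLECTION>/archive/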

10.4. Moving a collection from one Funnelback server to another

10.4.1. Non-push collections

The following folders hold configuration and data for a Funnelback collection:

$SEARCH_HOME/conf/<COLLECTION-ID>
$SEARCH_HOME/data/<COLLECTION-ID>
$SEARCH_HOME/admin/reports/<COLLECTION-ID>
$SEARCH_HOME/admin/data_report/<COLLECTION-ID>

Of these only the $SEARCH_HOME/conf/<COLLECTION-ID> folder is absolutely required, along with $SEARCH_HOME/data/<COLLECTION-ID>/archive which contains all the historical query data (if analytics should be preserved on the new server).

  1. Transfer the collection’s configuration folder from the old server to the new server. It’s normally easiest to compress the collection’s conf folder on the source machine, transfer the file then decompress it inside the conf folder on the target machine (see the sketch after this list). If you have $SEARCH_HOME/conf/<COLLECTION-ID> on the source machine you should end up with $SEARCH_HOME/conf/<COLLECTION-ID> on the target machine.

  2. If you are moving from Linux to Windows or vice versa, edit the collection.cfg file and ensure any expanded references to $SEARCH_HOME are converted to the equivalent for the target OS (e.g. if going from Windows to Linux you might convert any references to d:\Funnelback to /opt/funnelback, or better still replace these with $SEARCH_HOME or %SEARCH_HOME%). Any workflow will need to be similarly updated and any bash/shell scripts converted to equivalents that are suitable for the target OS.

  3. Create the collection’s folder structure by running:

    $SEARCH_HOME/bin/create_collection.pl $SEARCH_HOME/conf/<COLLECTION-ID>/collection.cfg

    Note: you will need to assign a licence after you’ve run the create collection command, or call the command with one of the licence switches (run the command without arguments to get a list of switches).

  4. At this point you should be able to update the collection. However, you may wish to migrate the collection’s data and analytics across as well.

  5. If you wish to bring across all of the existing data, transfer the collection’s complete data folder from the source machine and unpack it on the target machine, ensuring paths are preserved.

  6. If you wish to only transfer the collection’s historical query data then just transfer the $SEARCH_HOME/data/<COLLECTION-ID>/archive folder.

  7. If you wish to transfer existing analytics data across then transfer the admin folders listed above.

  8. If the Funnelback installations on the source and target machines are the same version (and same OS) you should be able to start querying the collection (assuming the complete data folder was transferred).

  9. If Funnelback on the source and target machines are different versions or you didn’t transfer all of the data across then you will need to update the collection. The type of update you need to run will depend on what you’ve copied. If you copied the complete data folders across then you should be able to rebuild the index from the data you’ve copied - run a reindex of the live view. Upon completion you should have a working index. If you didn’t copy the complete data folder across (and just copied archive logs, or nothing at all) then you’ll need to run a normal update of the collection - just start an update from the administration interface.

  10. If you just copied the archive logs across and wish to rebuild the analytics database start an analytics update from the analytics tab for the collection.
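A minimal sketch of the transfer in step 1 and the collection creation in step 3, assuming a Linux to Linux move, SSH access between the servers and that $SEARCH_HOME is set in both environments:

    # On the source server: compress the collection's configuration folder
    cd $SEARCH_HOME/conf && tar czf /tmp/<COLLECTION-ID>-conf.tar.gz <COLLECTION-ID>

    # Transfer and unpack on the target server, preserving the same relative path
    scp /tmp/<COLLECTION-ID>-conf.tar.gz target-server:/tmp/
    ssh target-server 'cd $SEARCH_HOME/conf && tar xzf /tmp/<COLLECTION-ID>-conf.tar.gz'

    # On the target server: create the collection's folder structure
    $SEARCH_HOME/bin/create_collection.pl $SEARCH_HOME/conf/<COLLECTION-ID>/collection.cfg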

10.4.2. Push collections

If you need to move a push collection you should ensure that the source and target machines are running the same Funnelback version.

  1. If possible block access to the push collections so no more add/remove operations can be run against the push collection (e.g. define a firewall rule to prevent access).

  2. Snapshot the push collection by calling the snapshot push API call.

    PUT /push-api/v1/collections/<COLLECTION-ID>/snapshot/<SNAPSHOT-NAME>
  3. On the target server create a new push collection with exactly the same collection ID.

  4. Ensure the new collection is stopped by checking the state of the collection

    GET /push-api/v1/collections/<COLLECTION-ID>/state
  5. If the collection is running then stop it

    POST /push-api/v1/collections/<COLLECTION-ID>/state/stop
  6. Copy the configuration folder for the push collection ($SEARCH_HOME/conf/<COLLECTION-ID>/) from the old server to the new server, overwriting all files.

  7. Copy the snapshot from the old server (from $SEARCH_HOME/data/<COLLECTION-ID>/snapshot/<SNAPSHOT-NAME>) to the live folder for the collection on the new server ($SEARCH_HOME/data/<COLLECTION-ID>/live/).

  8. Rebuild the analytics.

10.5. Starting, stopping or restarting a collection update

The administration interface allows an administrator to start, stop and restart collection updates.

Note: this does not apply to push collections or meta collections.

  • Meta collections are never updated

  • Push collection updates are controlled via the push collection API.

10.5.1. Starting an update

Collection updates are started either by an automatic schedule (see scheduling updates) or by manually starting an update from the administration interface. The types of available update modes will depend on the type of collection.

10.5.2. Stopping an update

A running collection update can be stopped by switching to the collection in the administration interface and clicking the stop update button from the update tab, or by clicking the stop icon in the actions column under the collection overview listing.

10.5.3. Restarting a stopped update

An update for a collection may be restarted by selecting advanced update from the update tab for a collection, then choosing the phase from which to restart the update.

You can restart an update from a different point to the one where it was stopped. For example:

  • Restart an update that was halted at crawl time at the indexing phase to just build an index on whatever was crawled so far.

  • Restart an update that completed after hitting the crawl timeout (and has swapped the views) at the crawl phase, to continue crawling from where it got up to when the timeout was reached.

10.5.4. Killing and resetting a stalled update

If an update becomes stuck it can usually be stopped via the administration interface (see the instructions above for stopping an update). If this fails to stop the update and the collection remains in a stopping state then the following steps can be used to reset the update:

  1. Log in to the server as the search user and check the running processes to see if there is an update running for the collection (e.g. ps -fewjaH | grep <COLLECTION-ID>).

  2. If there is an update running (there will be an update.pl process running with the collection ID as part of the arguments) you may need to kill the process.

  3. If there isn’t a process running for the update or you have killed it then you can safely clear the locks on the collection. This can be done in two ways:

    1. From the administration interface navigate to the update tab for the collection that is stuck and click on the clear locks link that appears as part of the stop update button

    2. or run the following command on the command line /opt/funnelback/bin/mediator.pl ClearLocks collection=<COLLECTION-ID>

  4. Refresh the administration interface and the lock should be cleared.

  5. Optionally start or restart the update.
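On Linux, the process check, kill and lock clearing from the steps above might look like this (substitute the collection ID and the process ID reported by ps):

    # Find any update.pl process running for the collection
    ps -fewjaH | grep <COLLECTION-ID>

    # Kill the stalled update process if one is running
    kill <PID>

    # Clear the collection's locks
    $SEARCH_HOME/bin/mediator.pl ClearLocks collection=<COLLECTION-ID>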