Introduction

This course is aimed at frontend and backend developers and takes you through collection creation and advanced configuration of Funnelback implementations. The course is made up of a series of exercises, each of which includes:

  • A summary: A brief overview of what you will accomplish and learn throughout the exercise.

  • Exercise requirements: A list of requirements, such as files, that are needed to complete the exercise.

  • Detailed step-by-step instructions: instructions that guide you through completing the exercise.

  • Some extended exercises are also provided. These can be attempted if the standard exercises are completed early, or used as review exercises in your own time.

Special tips and hints will appear throughout the exercises, providing extra knowledge or advice for completing the exercise. They will look like:

This box is used to provide links to further reading available on the current topic.
This box is used to provide tips or hints related to the current topic.
This box is used to provide important advice relating to the current topic. This includes advice to be aware of to avoid common errors.
This box contains advice specific to the Linux version of Funnelback.
This box contains advice specific to the Windows version of Funnelback.

What this workshop will cover:

  • Collections

  • Web collection update cycle

  • Checking and debugging updates

  • Meta collections

  • XML content

  • Geospatial and numeric metadata

  • Generalised scopes

  • Social media collections

  • Push collections

  • Search result manipulation

  • Query processing pipeline

  • Hook scripts

  • Removing items from the index

  • Alternate output formats

  • Extra searches

  • Configuring the content auditor

  • Configuring the accessibility auditor

Prerequisites to completing the course:

  • FUNL201 - Funnelback for implementers.

  • HTML, JavaScript and CSS familiarity.

What you will need before you begin:

  • Access to the Funnelback training VM that has been set up for this course.

  • Internet access.

1. Collections

A search collection in Funnelback is similar in concept to a collection that might be housed within a library or museum.

A collection generally contains a set of items that are related in some way. In a library this could be all of the non-fiction books, or all objects relating to a specific topic such as Albert Einstein. In Funnelback a collection is usually determined by the type of repository being crawled - e.g. a website or set of websites, a database or a TRIM or fileshare repository.

Each collection is updated and indexed separately and can also be searched separately.

When defining a collection there are a few things that will determine if a separate collection is required:

  • How the content is gathered. Different types of content must be in different collections because the method of fetching the content is different. For example: web crawler, database driver, fileshare copy.

  • How often the content needs to be updated. If you wish to update different parts of the same website on a different update cycle a separate collection may be required. For example: media releases on a website.

A collection also needs to have bounds placed upon it - determining what content should be included or excluded (e.g. crawl www.mysite.com but exclude PDF documents).

Funnelback has four broad types of collections:

  • Standard collections gather content and store it to disk following an update cycle that includes gather, filter, index, swap. This cycle is examined in detail below. Standard collections include a live and offline copy of indexes and data. Updated material appears in the index at the end of a successful update. Standard collection types include: web, database, custom, directory, filecopy, local, matrix.

  • Push collections are updated using an API that includes add and delete operations. Push collections handle the indexing of content in near real-time. Push collections are all of the push collection type.

  • Push gatherer collections are linked to a push collection and implement logic to gather and filter content. These collections interact with the push collection API. Push gather collection types include: trimpush.

  • Meta collections are a special collection type that allow the aggregation of standard and push collections. A search against a meta collection will search across the indexes of all collections that are members of the meta collection. All meta collections are of the meta collection type.

Most of the training will focus on standard collection types. Push and meta collections will be examined in more detail later in the training.

Exercise 1: Creating a web collection

In this exercise you will create a web collection and create a searchable index of a website.

A web collection is used to create a search of a website or set of websites. Web collections contain HTML, PDF and MS Office files that are gathered by crawling a website or set of websites.

  1. Open the administration interface by visiting https://training-admin.clients.funnelback.com/search/admin.

    You will be prompted for your admin interface username and password to log in. Enter admin as both the username and password.

  2. Click create collection on the menu bar at the top of the administration interface.

    exercise creating a web collection 01
  3. The first collection creation screen requests a number of generic properties that apply to all types of collections.

  4. Enter the following information. The project group ID is used to group together a set of collections; these collections are also grouped in the collection switcher in the administration interface. Set this to Training 202 so that all collections created as part of this training course are grouped together.

    • Project group ID: Training 202

  5. Define the collection identifier. This is used by Funnelback to identify the collection internally - it must be a unique identifier containing only letters, numbers and dash/underscore characters.

    • Collection ID: funnelback-website

  6. The collection type sets a number of parameters that will define how the collection will work. This includes what type of gatherer is used. Choose web from the drop down list.

  7. Choose the licence to associate with this collection. Any URLs stored in the index will count towards the licence limit for the assigned licence. If there are multiple licences available you will be able to choose the relevant licence from the drop down menu. Leave the other settings as their default values then click the create collection button.

    exercise creating a web collection 02
  8. The configuration editor loads with the default web collection values defined.

  9. Enter the following into the start_url field:

    http://training-search.clients.funnelback.com/training/training-data/funnelback-website/www.funnelback.com/

    The start URL(s) field contains a list of URLs from which Funnelback’s web crawl will start. Each URL will be accessed by the web crawler and links extracted and fetched. This process continues until there are no more URLs in the list of known URLs (the crawl frontier) or a timeout is reached.

  10. Enter the following into the include_patterns field:

    /training/training-data/funnelback-website/www.funnelback.com

    The include_patterns field contains a list of patterns that are matched (as substrings) against each URL that is seen by Funnelback during the crawl. If the URL matches any of these include patterns then the URL is added to the list of URLs to fetch.

  11. Leave the exclude content from field as the default value.

    The exclude content from field contains a list of patterns that are matched (as substrings) against each URL that is seen by Funnelback during the crawl. If the URL matches any of these exclude patterns then it is skipped by the crawler.

  12. Leave the crawler timeout at the default value. Note: when first setting up a collection it is often a good idea to set this value to something small (e.g. 5 minutes) to run some exploratory crawls and test the small indexes until you have Funnelback gathering correctly.

  13. Add a human-friendly name to the collection. Click the add new button and add a service_name. Set the value to Funnelback. (These settings are saved to the collection's configuration file - a sketch is shown after this exercise.)

  14. The screen should now look similar to the screenshot below.

    exercise creating a web collection 03
  15. Return to the administration home page by clicking admin home in the breadcrumb trail.

  16. Update the collection by clicking the update this collection button on the update tab.

  17. Funnelback will provide status update messages to the administration interface while the update is running. The search collection can be queried once the first update has completed.
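
The settings entered in this exercise are written to the collection's configuration file (collection.cfg) as key-value pairs. The sketch below shows roughly how the values from this exercise would appear - key names and defaults can vary between Funnelback versions, so treat it as illustrative only:

start_url=http://training-search.clients.funnelback.com/training/training-data/funnelback-website/www.funnelback.com/
include_patterns=/training/training-data/funnelback-website/www.funnelback.com
service_name=Funnelback

Options left at their defaults (such as the exclude patterns and the crawler timeout in this exercise) do not need to appear in the file.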

1.1. Review questions: collections

  1. How do the collection identifier and collection title fields differ?

  2. What would you need to change in order to add http://docs.funnelback.com to the search? 

2. Web collections - update cycle

Clicking the update this collection link starts Funnelback on a journey that consists of a number of processes or phases.

web collections update cycle 01

Each of these phases must be completed for a successful update to occur. If something goes wrong an error will be raised and this will result in a failed update.

The exact set of phases will depend on what type of collection is being updated - however all collections generally have the following phases:

  1. A gather phase that is the process of Funnelback accessing and storing the source data.

  2. A filter phase that transforms the stored data.

  3. An index phase that results in a searchable index of the data.

  4. A swap phase that makes the updated search live.

2.1. Gathering

The gather phase covers the set of processes involved in retrieving the content from the data source.

The gather process needs to implement any logic required to connect to the data source and fetch the content.

The overall scope of what to gather also needs to be considered.

For a web collection the process of gathering is performed by a web crawler. The web crawler works by accessing a seed URL (or set of URLs). This is fetched by the web crawler and stored locally. The crawler then parses the downloaded HTML content and extracts all the links contained within the file. These links are added to a list of URLs (known as the crawl frontier) that the crawler needs to process.

Each URL in the frontier is processed in turn. The crawler needs to decide if this URL should be included in the search - this includes checking a set of include / exclude patterns, robots.txt rules, file type and other attributes about the page. If all the checks are passed the crawler fetches the URL and stores the document locally. Links contained within the HTML are extracted and the process continues until the crawl frontier is empty, or a pre-determined timeout is reached.
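
The crawl loop described above can be summarised with a short sketch. This is illustrative pseudocode only (written in Python for readability); the real crawler adds politeness delays, revisit logic, parallel fetch threads and many other features that are not shown:

from collections import deque

def crawl(seed_urls, should_include, fetch, extract_links, timed_out):
    """Simplified model of the crawl loop - not the real Funnelback crawler."""
    frontier = deque(seed_urls)      # the crawl frontier
    seen = set(seed_urls)
    stored = []

    while frontier and not timed_out():
        url = frontier.popleft()

        # Check include/exclude patterns, robots.txt rules, file type, etc.
        if not should_include(url):
            continue

        document = fetch(url)        # fetch and store the document locally
        stored.append((url, document))

        # Extract links and add any unseen URLs to the frontier.
        for link in extract_links(document):
            if link not in seen:
                seen.add(link)
                frontier.append(link)

    return stored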

The logic implemented by the web crawler includes a lot of additional features designed to optimise the crawl. On subsequent updates this includes the ability to decide if a URL has changed since the last visit by the crawler.

2.1.1. Crawler limitations

The web crawler has some limitations that are important to understand:

  • The web crawler does not process JavaScript. Any content that can’t be accessed when JavaScript is disabled will be hidden from the web crawler.

  • It is possible to crawl some authenticated websites, however this happens as a specified user. If content is personalised, then what is included in the index is what the crawler’s user can see.

  • By default the crawler will skip documents that are larger than 10MB in size (this value can be adjusted - see the configuration sketch below).
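
If larger documents need to be included in the index, the maximum download size can be raised in the collection configuration. A sketch is shown below; check the key name and units (crawler.max_download_size, a value in megabytes) against the documentation for your Funnelback version:

crawler.max_download_size=40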

2.2. Filtering

Filtering is the process of transforming the downloaded content into text suitable for indexing by Funnelback.

This can cover a number of different scenarios including:

  • File format conversion - converting binary file formats such as PDF and Word documents into text.

  • Text mining and entity extraction

  • Document geocoding

  • Metadata generation

  • Content and WCAG checking

  • Content cleaning

  • Custom filters

2.3. Indexing

The indexing phase creates a searchable index from the set of filtered documents downloaded by Funnelback.

The main search index is made up of an index of all words found in the filtered content and where the words occur. Additional indexes are built containing other data pertaining to the documents. These indexes include document metadata, link information, auto-completion and other document attributes (such as the modified date and file size and type).

Once the index is built it can be queried and does not require the source data used to generate the index.

2.4. Swap views

The swap views phase serves two functions - it provides a sanity check on the index size and performs the operation of making the updated indexes live.

Most Funnelback collections (non-push collections) maintain two copies of the search index known as the live view and the offline view.

When the update is run all of the processes operate on an offline view of the collection. This offline view is used to store all the content from the new update and build the indexes. Once the indexes are built they are compared to what is currently in the live view - the set of indexes that are currently in a live state and available for querying.

The index sizes are compared. If Funnelback finds that the index has shrunk in size below a definable value (e.g. 50%) then the update will fail. This sanity check means that an update won’t succeed if the website was unavailable for a significant duration of the crawl.

An administrator can override this if the size reduction is expected. e.g. a new site has been launched and it’s a fraction of the size of the old site.
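
The threshold used for this sanity check is configurable per collection. As an illustrative sketch (check the exact key name against the documentation for your version), the collection configuration might contain:

changeover_percent=50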

Push collections use an API-based mechanism to update and will be covered separately.

2.5. Review questions: update cycle

  1. The offline view can contain three different states (for example one of these is the currently updating collection) - what are the other two and under what conditions do these exist?

3. Checking an update

Funnelback maintains detailed logs for all of the processes that run during an update.

When there is a problem and an update fails the logs should contain information that allows the cause to be determined.

It is good practice to also check the log files while performing the setup of a new collection - some errors don’t cause an update to fail. A bit of log analysis while building a collection can allow you to identify:

  • pages that should be excluded from the crawl

  • crawler traps

  • documents that are too large

  • documents of other types

Each of the phases above generate their own log files - learning what these are and what to look out for will help you to solve problems much more quickly.

Exercise 2: Examine update logs

In this exercise some of the most useful log files generated during an update of a web collection will be examined.

  1. Load the administration interface and switch to the funnelback-website collection.

  2. Open the log viewer by selecting browse log files from the administer tab.

    exercise examine update logs 01
  3. Observe the file manager view of the available log files. The files are grouped under several headings including: collection, offline, live and archive log files.

    exercise examine update logs 02

    The collection log files section contains the top-level update and report logs for the collection. The top-level update log contains high-level information relating to the collection update.

    The live log files section includes all the detailed logs for the currently live view of the search. This is where you will find logs for the last successful update.

    The offline log files section includes detailed logs for the offline view of the search. The state of the log files will depend on the collection’s state - it will contain one of the following:

    • Detailed logs for an update that is currently in progress

    • Detailed logs for the previous update (that failed)

    • Detailed logs for the successful update that occurred prior to the currently live update.

    The archive log files section contains all the historical query logs, used to generate the search analytics.

  4. Open the update-funnelback-website.log from the collection log files section by clicking on the filename.

  5. Observe that the log follows the overall update process described above, with messages relating to crawling, indexing and the swapping of views. The log file should indicate a successful update with the different phases exiting with a status of 0.

  6. Open the update.log from under the live log files - this provides a step by step overview of the update.

  7. Return to the file manager display of logs and inspect the url_errors.log and stored.log from under the live log files heading. The url_errors.log includes messages about errors that were detected when accessing URLs - this can include timeouts and other server errors, or messages about documents being larger than the maximum download size. The stored.log shows what URLs were downloaded during the crawl.

    The information in the url_errors.log and stored.log can be used to optimise the crawl. For example:

    • Crawler traps can be identified by viewing the end of the stored.log. If the end of the log is filled with lots of very similar URLs that differ only in their parameters, this might indicate a crawler trap such as a calendar that doesn’t contain useful content; this can be fixed by adding an appropriate exclude pattern (see the example below).

    • The url_errors.log will show items that were too large (meaning that the maximum download size may need to be increased if appropriate). Lots of unexpected timeouts should be addressed by increasing the timeouts used by the web crawler.
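
    For example, if the end of stored.log is filled with an ever-growing set of URLs under a calendar path (a hypothetical /events/calendar/ path is used here purely for illustration), the trap can be cut off by adding that path to the collection's exclude patterns (the exclude content from field used in the earlier exercise):

    /events/calendar/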

3.1. Debugging failed updates

An update can fail for numerous reasons. The following provides some high-level guidance, listing some common failures and how to debug them.

The first step is to check the collection’s update log and see where and why the update failed. Look for error lines. Common errors include:

  • Failed to access seed page: For some reason Funnelback was unable to access the seed page so the whole update failed (as there is nothing to crawl). Look at the offline crawl logs (crawl.log, crawl.log.X.gz) and url_errors.log for more information. The failure could be the result of a timeout, or a password expiring if you are crawling with authentication.

  • Failed changeover conditions: After building the index a check is done comparing with the previous index. If the index shrinks below a threshold then the update will fail. This can occur if one of the sites was down when the crawl occurred, or if there were excessive timeouts, or if the site has shrunk (e.g. because it has been redeveloped or part of it archived). If a shrink in size is expected you can run an advanced update and swap the views.

  • Failures during filtering: Occasionally the filtering process crashes causing an update to fail. The crawl.log or crawler.central.log may provide further information about the cause.

  • Lock file exists: The update could not start because a lock file was preventing the update. This could be because another update on the collection was running; or a previous update crashed leaving the lock files in place. The lock can be cleared from the administration interface by selecting the collection then clicking on the clear locks link that should be showing on the update tab.

  • Failures during indexing: Have a look at the offline index logs (Step-*.log) for more details.

Exercise 3: Debug a failed update
  1. Open the administration interface and switch to the funnelback-website collection.

  2. Select edit collection settings from the administer tab

  3. Update the start url to http://training-search.clients.funnelback.com/training/training-data/funnelback-website/www.funnelback.co/ (change .com to .co) then save your changes. We are intentionally using a URL containing a typo in this example so we can examine the resulting errors.

  4. Start an update. The update should fail almost immediately.

  5. Switch to the log screen by selecting browse log files from the administer tab. Inspect the log files for the collection - starting with the update-funnelback-website.log. The update-<COLLECTION-ID>.log provides an overview of the update process and should give you an idea of where the update failed. Observe that there is an error returned in the crawl step. This suggests that we should investigate the crawl logs further.

    exercise debug a failed update 01
  6. As the collection update failed the offline view will contain all the log files for the update that failed. Locate the crawl.log from the offline logs section and inspect this. The log reports that no URLs were stored.

    exercise debug a failed update 02
  7. Examine the url_errors.log which logs errors that occurred during the crawl. From this log you can see that a 404 not found error was returned when accessing http://training-search.clients.funnelback.com/training/training-data/funnelback-website/www.funnelback.co/ which is the seed URL for the crawl. This explains why nothing was indexed because the start page was not found, so the crawl could not progress any further.

    E http://training-search.clients.funnelback.com/training/training-data/funnelback-website/www.funnelback.co/ [404 Not Found] [2017:01:05:00:37:51]
  8. With this information at hand you can investigate further. In this case the reason the crawl failed was due to the seed URL being incorrectly typed. But you might visit the seed URL from your browser to investigate further.

  9. Run a search against the collection and observe that results are still being returned. Recall that Funnelback updates an offline copy of the index when an update is run. When an update fails the previous successful update remains live meaning that the search continues to function.

  10. Return to the edit collection settings screen and correct the start URL.

3.2. Review questions: debugging failed updates

  1. What’s the difference between the live and offline logs and when would you look at logs from each of these log folders?

  2. Which logs would you look at to solve the following?

    1. Find files that were rejected due to size during the last update?

    2. Find the cause of an update that failed?

    3. Determine why a URL is missing from an index?

    4. Identify unwanted items that are being stored? 

4. Meta collections

A meta collection is a special collection type that combines the indexes of a set of collections into a single item that can be queried.

A meta collection doesn’t gather content or build indexes - in fact a meta collection is never 'updated' - the content of a meta collection updates by virtue of the included collections being updated.

A meta collection has query-side configuration including ranking and display options, individual templates, best bets, curator rules and synonyms, and also has its own analytics.

Creating a meta collection is as simple as picking the collections that should be combined.

Once created, querying the meta collection will search across the sub-collections. Services and profiles are also set up on meta collections in the same way as for a standard collection. Any front end customisation, e.g. custom templating, synonyms, faceted navigation, display options (such as the SF value), should be set within a service on the meta collection.
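
The set of component collections is normally managed through the administration interface (as in the exercise below), but behind the scenes it is simply a list of collection IDs held in the meta collection's configuration - typically a meta.cfg file with one collection ID per line, for example:

funnelback_documentation
funnelback-website

This list should be maintained via the meta components editor rather than edited by hand.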

Exercise 4: Create a meta collection

In this exercise you will create a search that includes content from the Funnelback website collection that you previously configured combined with the built-in Funnelback documentation collection.

  1. Log in to the administration interface and click on create collection on the top menu bar.

    exercise create a meta collection 01
  2. Enter the following information:

    • Project group ID: Training 202

    • Collection ID: funnelback-search

    • Collection type: meta

    Assign an appropriate licence then click create collection

    exercise create a meta collection 02
  3. Assign a human readable collection name by adding a service_name and setting it to Funnelback - combined search. Define which collections belong to this meta collection by clicking on the meta component editor link (in the blue box above the parameter listing). This opens the meta components editor screen.

    exercise create a meta collection 03
  4. Tick the Funnelback website and Funnelback Documentation items from the sub-collections list. This controls which collections make up the content of the meta collection you are creating.

    exercise create a meta collection 04
  5. Run a search for summaries from the quick search box - observe that search results are being returned from the Funnelback documentation and Funnelback website collections. Also observe that no updates were required - you could start searching the meta collection as soon as the collections were added.

  6. Click on the collection switcher in the administration interface and observe that the two collections you have created so far are listed under the Training 202 heading you defined when creating the collection.

    exercise create a meta collection 05

4.1. Understanding how meta collections combine indexes

It is important to understand a few basics about how meta collections aggregate content from the different indexes.

  • Metadata class names are shared across the sub collections: this means if you have a class called title in collection A and a class called title in collection B there will be a field called title in the meta collection that searches the title metadata from both sub collections. This means you need to be careful about the names you choose for your metadata classes, ensuring that they only overlap when you intend them to. One technique you can use to avoid this is to namespace your metadata fields to keep them separate (e.g. use something like websiteTitle instead of title in your website collection).

  • Generalised scopes are shared across sub collections: the same principles as outlined above for metadata apply to gscopes. You can use gscopes to combine or group URLs across collections by assigning them the same gscope ID in each collection, but only do this when it makes sense - otherwise you may get results that you don’t want if you choose to scope the search results using your defined gscopes.

  • Geospatial and numeric metadata: these are special metadata types and the value of the fields are interpreted in a special way. If you have any of these classes defined in multiple collections in a meta collection ensure they are of the same type in each collection where they are defined.

  • Meta collection indexes are combined at query time: this means you can add and remove collections from the meta collection and immediately start searching across the indexes. Note: auto-completion and spelling suggestions for the meta collection won’t be updated to match the changed meta collection content until one of the sub-collections completes a successful update. If combined indexes contain an overlapping set of URLs then duplicates will be present in the search results (as duplicates are not removed at query time).

  • You can scope a query to specific collections within a meta collection by supplying the clive parameter with a list of collections to include, as shown in the example below.
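
For example, a query against the funnelback-search meta collection created in this training could be restricted to just the documentation sub-collection by adding clive to the search URL; to include several collections the parameter can be repeated (check the documentation for your version):

/s/search.html?collection=funnelback-search&query=funnelback&clive=funnelback_documentation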

4.2. Configuring meta collections

There are a few things that must be considered when configuring meta collections - this is because queries are run against the meta collection, while the indexes are built and owned by the component collections rather than by the meta collection itself.

When working with a meta collection all queries should be directed to the meta collection (i.e. the collection parameter is set to the meta collection’s collection ID). This means all of the configuration that controls what happens at query time needs to be made within a service that is configured on the meta collection.

These changes include:

  • templates

  • synonyms, best bets and curator rules

  • faceted navigation (note: addition of facets based on new metadata fields or generalised scopes require the component collections to be updated before the facet will be visible)

  • display options

  • ranking options

  • most auto-completion options

  • hook scripts

  • quick links display

  • knowledge graph user interface

  • knowledge graph relationships

However, because the indexes are still built when the sub-collections update, any changes that affect the update or index build process must be made to the sub-collection. These changes include:

  • metadata field mappings and external metadata

  • gscope mappings

  • indexer options

  • quicklinks generation

  • groovy and JSoup filters

  • spelling options

  • knowledge graph metadata mappings (FUNkgNodeNames, FUNkgNodeLabel and other properties used by any knowledge graph)

5. Creating additional profiles (and services)

Every collection in Funnelback contains one or more profiles, which underpin the preview/live templates and configuration, support any configured services, and can be used to provide access to scoped versions of the search.

Every collection includes a default profile - which has both a live and preview version.

These are represented on disk by a _default and _default_preview folder existing within the collection’s configuration folder.
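
For example, on disk the profiles for the funnelback-search collection used in this training sit under the collection's configuration folder, roughly as sketched below (the docs profile is created in a later exercise, and the exact paths depend on your installation - $SEARCH_HOME is typically /opt/funnelback):

$SEARCH_HOME/conf/funnelback-search/_default/
$SEARCH_HOME/conf/funnelback-search/_default_preview/
$SEARCH_HOME/conf/funnelback-search/docs/
$SEARCH_HOME/conf/funnelback-search/docs_preview/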

Collections support an arbitrary number of profiles. Additional profiles can be created from the administration interface.

Once a profile is created default display and ranking options can be applied to the profile.

Every profile that exists can also be set up as a service. A profile that will be used in any of the following ways should be set up as a service:

  • The profile will be searched directly by users (ie. searches will be run where the profile parameter is set to the profile name).

  • Separate analytics are desired for the profile

  • Independent templates, best bets, synonyms or curator rules are required for the profile.

  • Tuning is to be run against the profile.

A profile is set up as a service in the same manner as for the _default profile.

Exercise 5: Create a new profile and front-end service
  1. Log in to the administration interface and switch to the funnelback-search collection.

  2. Select manage profiles from the administer tab.

    exercise create a new profile and front end service 01
  3. Create a profile named docs. Enter docs into the box labelled new profile name then click the create button.

    exercise create a new profile and front end service 02
  4. The table of profiles updates to show the newly created profile

    exercise create a new profile and front end service 03
  5. Press the cancel button to close the manage profile screen. Observe that a profile selector is now displayed for the funnelback-search collection, and that there is a docs choice in the dropdown. Select docs from the menu.

    exercise create a new profile and front end service 04
  6. This sets docs as the current profile, with the customise, analyse and tune tabs becoming specific to the docs profile. Observe that the customise options are greyed out. This is because the docs profile that was just created has not been set to be a frontend service.

    exercise create a new profile and front end service 05
  7. Click the create service button to enable the frontend service features for the docs profile.

    exercise create a new profile and front end service 06
Exercise 6: Define a scope for a profile

When a new profile is created it has the same content as the parent collection.

Profiles are commonly used to provide a search over a sub-set of content within a larger collection. To achieve this the profile should be configured to apply a default scope.

In this exercise the docs profile will be scoped so that only documentation pages are returned when the profile is set to docs in the search query.

  1. Change to the funnelback-search collection.

  2. Change to the docs profile.

  3. Run a search for Funnelback ensuring the preview radio button is selected and observe that pages from both the Funnelback website and Funnelback documentation sites are returned. Pages from the documentation site have URLs prefixed with /opt/funnelback/web/admin/help while pages from the Funnelback website are prefixed with http://training-search.clients.funnelback.com/training/training-data/funnelback-website/www.funnelback.com/

  4. Open the file manager and create a new padre_opts.cfg file within the funnelback-search / Profile: docs (preview) section. The padre_opts.cfg holds a list of options that get passed to the query processor. These options are the same display and ranking options set in the query processor options field in the main collection configuration, but apply only to the profile.

  5. Add a query processor option to scope the collection - various options can be used to scope the collection including scope, xscope, gscope1 and clive. The clive parameter is a special scoping parameter that can be applied to a meta collection to restrict the results to only include pages from a specified collection or set of collections. Add a clive parameter to scope the profile to pages from the funnelback_documentation collection then save (but do not publish) the file.

    -clive=funnelback_documentation
  6. Rerun the search for funnelback against the funnelback-search collection specifying the docs_preview profile (&profile=docs_preview) and observe that the results are now restricted to only pages from the documentation site.

  7. Rerun the search for funnelback against the funnelback-search collection, this time using the docs profile (&profile=docs). Observe that pages are returned from both sites - this is because the padre_opts.cfg must be published for the changes to take effect on the live profile.

  8. Return to the file manager and publish the padre_opts.cfg file that you created on the funnelback-search / Profile: docs (preview) profile.

  9. Rerun the search for funnelback against the funnelback-search collection, again using the docs profile (&profile=docs). Observe that pages are now restricted to documentation site pages.

Exercise 7: Set default profile display and ranking options

Display and ranking options can be set independently for each profile.

This allows the same collection to be used to serve search results with quite different ranking/display options or to be scoped to a subset of the collection’s data. These options are set in the same way as the scoping from the previous exercise by adding options to the padre_opts.cfg.

  1. Return to the file manager (on the funnelback-search collection) and edit the padre_opts.cfg file within the funnelback-search / Profile: docs (preview) section that was created in the previous exercise. Set the profile to return 5 results per page and sort alphabetically then save and publish the file.

    -clive=funnelback_documentation  -sort=title -num_ranks=5
  2. Rerun the search for funnelback against the funnelback-search collection specifying the docs_preview profile (&profile=docs_preview) and observe that the results are sorted alphabetically by title and only 5 results are returned per page. Results are all from the Funnelback documentation site.

6. Working with XML content

Funnelback can index XML documents and there are some additional configuration files that are applicable to indexing XML files.

  • You can map elements in the XML structure to Funnelback metadata classes.

  • You can display cached copies of the document via XSLT processing.

Funnelback can be configured to index XML content, creating an index with searchable, fielded data.

Funnelback metadata classes are used for the storage of XML data – with configuration that maps XML element paths to internal Funnelback metadata classes – the same metadata classes that are used for the storage of HTML page metadata. An element path is a simple XML X-Path.

XML files can be optionally split into records based on an X-Path. This is useful as XML files often contain a number of records that should be treated as individual result items.

Each record is then indexed with the XML fields mapped to internal Funnelback metadata classes as defined in the XML mappings configuration file.

6.1. XML configuration

The collection’s XML configuration defines how Funnelback’s XML parser will process any XML files that are found when indexing.

The XML configuration is made up of two parts:

  1. XML special configuration

  2. Metadata classes containing XML field mappings

The XML parser is used for the parsing of XML documents and also for indexing of most non-web data. The XML parser is used for:

  • XML, CSV and JSON files,

  • Database, social media, directory, HP CM/RM/TRIM and most custom collections.

6.2. XML element paths

Funnelback element paths are simple X-Paths that select on fields and attributes.

Absolute and unanchored X-Paths are supported, however for some special XML fields absolute paths are required.

  • If the path begins with / then the path is absolute (it matches from the top of the XML structure).

  • If the path begins with // it is unanchored (it can be located anywhere in the XML structure).

XML attributes can be used by adding @attribute to the end of the path.

Element paths are case sensitive.

Attribute values are not supported in element path definitions.

Example element paths:

X-Path                       Valid Funnelback element path?
/items/item                  VALID
//item/keywords/keyword      VALID
//keyword                    VALID
//image@url                  VALID
/items/item[@type=value]     NOT VALID

6.2.1. Interpretation of field content

CDATA tags can be used for fields that contain reserved characters; alternatively, the reserved characters should be HTML encoded.

Fields containing multiple values should either be delimited with a vertical bar character, or the field should be repeated with a single value in each repeated field.

e.g. The indexed value of //keywords/keyword and //subject below would be identical.

<keywords>
    <keyword>keyword 1</keyword>
    <keyword>keyword 2</keyword>
    <keyword>keyword 3</keyword>
</keywords>

<subject>keyword 1|keyword 2|keyword 3</subject>

6.3. XML special configuration

There are a number of special properties that can be configured when working with XML files. These options are defined from the XML configuration screen, by selecting XML processing from the administer tab in the administration interface.

xml special configuration 01

6.3.1. XML document splitting

A single XML file is commonly used to describe many items. Funnelback includes built-in support for splitting an XML file into separate records.

Absolute X-Paths must be used and should reference the root element of the items that should be considered as separate records.

Splitting an XML document using this option is not available on push collections. A filter can be used as an alternative approach for splitting documents when using a push collection.

6.3.2. Document URL

The document URL field can be used to identify XML fields containing a unique identifier that will be used by Funnelback as the URL for the document. If the document URL is not set then Funnelback auto-generates a URL based on the URL of the XML document. This URL is used by Funnelback to internally identify the document, but is not a real URL.

Setting the document URL to an XML attribute is not supported.

Setting a document url is not available on push collections.

6.3.3. Document file type

The document file type field can be used to identify an XML field containing a value that indicates the filetype that should be assigned to the record. This is used to associate a file type with the item that is indexed. XML records are commonly used to hold metadata about a record (e.g. from a records management system) and this may be all the information that is available to Funnelback when indexing a document from such a system.

6.3.4. Special document elements

The special document elements can be used to tell Funnelback how to handle elements containing content.

Inner HTML or XML documents

The content of the XML field will be treated as a nested document and parsed by Funnelback. It must be XML encoded (i.e. with entities) or wrapped in a CDATA declaration to ensure that the main XML document is well formed.

The indexer will guess the nested document type and select the appropriate parser:

The nested document will be parsed as XML if (once decoded) it is well formed XML and starts with an XML declaration similar to <?xml version="1.0" encoding="UTF-8"?>. If the inner document is identified as XML it will be parsed with the XML parser and any X-Paths of the nested document can also be mapped. Note: the special XML fields configured on the advanced XML processing screen do not apply to the nested document. For example, this means you can’t split a nested document.

The nested document will be parsed as HTML if (once decoded) it starts with a root <html> tag. Note that if the inner document contains HTML entities but doesn’t start with a root <html> tag, it will not be detected as HTML. If the inner document is identified as HTML and contains metadata then this will be parsed as if it was an HTML document, with embedded metadata and content extracted and associated with the XML records. This means that metadata fields included in the embedded HTML document can be mapped in the metadata mappings along with the XML fields.

The inner document in the example below will not be detected as HTML:

<root>
  <name>Example</name>
  <inner>This is &lt;strong&gt;an example&lt;/strong&gt;</inner>
</root>

This one will:

<root>
  <name>Example</name>
  <inner>&lt;html&gt;This is &lt;strong&gt;an example&lt;/strong&gt;&lt;/html&gt;</inner>
</root>
Indexable document content

Any listed X-Paths will be indexed as unfielded document content - this means that the content of these fields will be treated as general document content but not mapped to any metadata class.

For example, if you have the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <title>Example</title>
  <inner>
  <![CDATA[
  <html>
    <head>
      <meta name="author" content="John Smith">
    </head>
    <body>
      This is an example
    </body>
  </html>
  ]]>
  </inner>
</root>

With an Indexable document content path of //root/inner, the indexed document content will include "This is an example"; however, the author metadata will not be mapped to "John Smith". To have the metadata mapped as well, the Inner HTML or XML document path should be used instead.

Include unmapped elements as content

If there are no indexable document content paths mapped, Funnelback can optionally choose how to handle the unmapped fields. When this option is selected, all unmapped XML fields will be considered part of the general document content.

Exercise 8: Creating an XML collection

In this exercise you will create a searchable index based off records contained within an XML file.

For this exercise we will use a web collection, though XML content can exist in many different collection types (eg. custom, database, trimpush, directory, local etc.).

The XML file that we will be indexing includes a number of individual records that are contained within a single file, an extract of which is shown below.

<?xml version="1.0" encoding="UTF-8"?>
<tsvdata>
        <row>
                <Airport_ID>1</Airport_ID>
                <Name>Goroka</Name>
                <City>Goroka</City>
                <Country>Papua New Guinea</Country>
                <IATA_FAA>GKA</IATA_FAA>
                <ICAO>AYGA</ICAO>
                <Latitude>-6.081689</Latitude>
                <Longitude>145.391881</Longitude>
                <Altitude>5282</Altitude>
                <Timezone>10</Timezone>
                <DST>U</DST>
                <TZ>Pacific/Port_Moresby</TZ>
                <LATLONG>-6.081689;145.391881</LATLONG>
        </row>
        <row>
                <Airport_ID>2</Airport_ID>
                <Name>Madang</Name>
                <City>Madang</City>
                <Country>Papua New Guinea</Country>
                <IATA_FAA>MAG</IATA_FAA>
                <ICAO>AYMD</ICAO>
                <Latitude>-5.207083</Latitude>
                <Longitude>145.7887</Longitude>
                <Altitude>20</Altitude>
                <Timezone>10</Timezone>
                <DST>U</DST>
                <TZ>Pacific/Port_Moresby</TZ>
                <LATLONG>-5.207083;145.7887</LATLONG>
        </row>
        <row>
                <Airport_ID>3</Airport_ID>
                <Name>Mount Hagen</Name>
                <City>Mount Hagen</City>
                <Country>Papua New Guinea</Country>
                <IATA_FAA>HGU</IATA_FAA>
...

Indexing an XML file like this requires two main steps:

  1. Configuring Funnelback to fetch and split the XML file

  2. Mapping the XML fields for each record to Funnelback metadata classes.

Exercise 8 steps
  1. Open the administration interface and create a new web collection for the airports XML data (refer to the create a web collection exercise if you need a reminder of the steps). Use the Training 202 project group and set the collection ID to airports, as later exercises refer to the collection by this name. The start URL should point to the airports XML file supplied with the training data.

  2. Run a search for !showall using the search preview - the ! means that the search will return results that do not include the word showall. Observe that only one result (to the source XML file) is returned.

    exercise creating an xml collection 01
  3. Configure Funnelback to split the XML files into records. To do this we need to inspect the XML file(s) to see what elements are available.

    Ideally you will know the structure of the XML file before you start to create your collection. However, if you don’t know this and the XML file isn’t too large you might be able to view it by inspecting the cached version of the file, available from the drop down link at the end of the URL. If the file is too large the browser may not be able to display the file.

    Inspecting the XML (displayed above) shows that each airport record is contained within the <row> element that is nested beneath the top level <tsvdata> element. This translates to an X-Path of /tsvdata/row.

    Log in to the administration interface and select XML processing from the administer tab. The XML processing screen is where all the special XML options are set.

    exercise creating an xml collection 02
  4. The XML document splitting field configures the X-Path(s) that are used to split an XML document. Select /tsvdata/row from the listed fields.

    exercise creating an xml collection 03
  5. If possible also set the document URL to an XML field that contains a unique identifier for the XML record. This could be a real URL, or some other sort of ID. Inspecting the airports XML shows that the Airport_ID can be used to uniquely identify the record. Select /tsvdata/row/Airport_ID from the dropdown for the document URL field.

    exercise creating an xml collection 04
    If you don’t set a document URL Funnelback will automatically assign a URL.
  6. The index must be rebuilt for any XML processing changes to be reflected in the search results. Return to the administration interface home page by clicking the admin home link (in the breadcrumbs) then rebuild the index by switching to the update tab and selecting reindex the live view from the advanced update options.

  7. Run a search for !showall using the search preview and confirm that the XML file is now being split into separate items. The search results display currently only shows the URLs of each of the results (which in this case is just an ID number). In order to display sensible results the XML fields must be mapped to metadata and displayed by Funnelback.

    exercise creating an xml collection 05
  8. Map XML fields by selecting configure metadata mappings from the administer tab.

    exercise creating an xml collection 06
  9. The metadata screen lists a number of pre-configured mappings. Because this is an XML data set the mappings will be of no use so clear all the mappings by selecting clear all metadata mappings from the tools menu.

    exercise creating an xml collection 07
  10. Click the add new button to add a new metadata mapping. Create a class called name that maps the <Name> xml field. Enter the following into the creation form:

    • Class: name

    • Type: Text

    • Search behaviour: Searchable as content

    exercise creating an xml collection 08
  11. Add a source to the metadata mapping. Click the add new button in the sources box. This opens up a window that displays the metadata sources that were detected when the index was built. Display the detected XML fields by clicking on the XML path button for the type of source, then choose the Name (/tsvdata/row/Name) field from the list of available choices, then click the save button.

    exercise creating an xml collection 09
  12. You are returned to the create mapping screen. Click the add new mapping button to create the mapping.

    exercise creating an xml collection 10
  13. The metadata mappings screen updates to show the newly created mapping for Name.

    exercise creating an xml collection 11
  14. Repeat the above process to add the following mappings. Before adding the mapping switch the editing context to XML - this will mean that XML elements are displayed by default when selecting the sources.

    exercise creating an xml collection 12
    Class name   Source                    Type   Search behaviour
    city         /tsvdata/row/City         text   searchable as content
    country      /tsvdata/row/Country      text   searchable as content
    iataFaa      /tsvdata/row/IATA_FAA     text   searchable as content
    icao         /tsvdata/row/ICAO         text   searchable as content
    altitude     /tsvdata/row/Altitude     text   searchable as content
    latlong      /tsvdata/row/LATLONG      text   display only
    latitude     /tsvdata/row/Latitude     text   display only
    longitude    /tsvdata/row/Longitude    text   display only
    timezone     /tsvdata/row/Timezone     text   display only
    dst          /tsvdata/row/DST          text   display only
    tz           /tsvdata/row/TZ           text   display only

  15. Note that the metadata mappings screen is displaying a message:

    These mappings have been updated since the last index, perform a re-index to apply all of these mappings.

    Rebuild the index by clicking the admin home link in the breadcrumb trail then selecting start advanced update from the update tab. Select reindex live view from the rebuild live index section and click the update button.

  16. Display options will need to be configured so that the metadata is returned with the search results. Reminder: switch to the administer tab and select edit collection configuration, then click the interface tab and locate the query processor options. Configure the summary fields to include the name, city, country, iataFaa, icao and altitude metadata:

    -stem=2 -SF=[name,city,country,iataFaa,icao,altitude]
  17. Rerun the search for !showall and observe that metadata is now returned for the search result.

    exercise creating an xml collection 13
  18. Inspect the data model. Reminder: edit the url changing search.html to search.json. Inspect the response element - each result should have fields populated inside the metadata sub elements of the result items. These can then be accessed from your Freemarker template and printed out in the search results.

    exercise creating an xml collection 14

7. Advanced metadata

7.1. Geospatial and numeric metadata

Recall that Funnelback supports five types of metadata classes:

  • Text: The content of this class is a string of text.

  • Geospatial x/y coordinate: The content of this field is a decimal lat/long value in the following format: geo-x;geo-y (e.g. 2.5233;-0.95). This type should only be used if there is a need to perform a geospatial search (e.g. this point is within X km of another point). If the geospatial coordinate is only required for plotting items on a map then a text type field is sufficient.

  • Number: The content of this field is a numeric value. Funnelback will interpret this as a number. This type should only be used if there is a need to use numeric operators when performing a search (e.g. X > 2050) or to sort the results in numeric order. If the field is only required for display within the search results a text type field is sufficient.

  • Document permissions: The content of this field is a security lock string defining the document permissions. This type should only be used when working with an enterprise collection that includes document level security.

  • Date: A single metadata class supports a date, which is used as the document’s date for the purpose of relevance and date sorting. Additional dates for the purpose of display can be indexed as either a text or number type metadata class.

Funnelback’s text metadata type is sufficient for inclusion of metadata in the index appropriate for the majority of use cases.

The geospatial x/y coordinate and number metadata types are special metadata types that alter the way the indexed metadata value is interpreted, and provide type specific methods for working with the indexed value.

Defining a field as a geospatial x/y coordinate tells Funnelback to interpret the contents of the field as a decimal lat/long coordinate (e.g. -31.95516;115.85766). This is used by Funnelback to assign a geospatial coordinate to an index item (effectively pinning it to a single point on a map). A geospatial metadata field is useful if you wish to add location-based search constraints (such as showing only items within a specified distance of an origin point), or sort the results by proximity (closeness) to a specific point.

A geospatial x/y coordinate is not required if you just want to plot the item onto a map in the search results (a text type value will be fine as it’s just a text value you are passing to the mapping API service that will generate the map).

Defining a field as a number tells Funnelback to interpret the contents of the field as a number. This allows range and equality comparisons (==, !=, >=, >, <, <=) to be run against the field. Numeric metadata is only required if you wish to make use of these range comparisons. Numbers for the purpose of display in the search results should be defined as text type metadata.
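
Numeric constraints are applied at query time by prefixing the metadata class name with a comparison operator parameter. For example, using an altitude numeric class (as mapped later in this course), adding the following to a search URL restricts results to altitudes from 2000 up to (but not including) 3000; the full list of operators is covered in the Funnelback documentation:

&ge_altitude=2000&lt_altitude=3000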

Only use geospatial and numeric types if you wish to make use of the special type-specific query operators. When defining these fields on a collection that belongs to a meta collection, ensure that any fields shared with other collections are of the same type in each collection.
Exercise 9: Geospatial and numeric metadata

In this exercise we will extend the metadata that is extracted from the XML example. We will include both a geospatial metadata field as well as a numeric metadata field. Recall the record format for the XML data:

        <row>
                <Airport_ID>1</Airport_ID>
                <Name>Goroka</Name>
                <City>Goroka</City>
                <Country>Papua New Guinea</Country>
                <IATA_FAA>GKA</IATA_FAA>
                <ICAO>AYGA</ICAO>
                <Latitude>-6.081689</Latitude>
                <Longitude>145.391881</Longitude>
                <Altitude>5282</Altitude>
                <Timezone>10</Timezone>
                <DST>U</DST>
                <TZ>Pacific/Port_Moresby</TZ>
                <LATLONG>-6.081689;145.391881</LATLONG>
        </row>

The <LATLONG> field contains the geospatial metadata that will be associated with the item. Note: when working with geospatial metadata Funnelback expects the format of the field to contain a decimal X/Y coordinate in the format above (X coordinate;Y coordinate). If the format of the field doesn’t match (e.g. is delimited with a comma) or the X/Y values are supplied separately you will need to clean the XML before Funnelback indexes it (or provide an additional field in the correct format within the source data).

The <Altitude> field will be used as the source of numeric metadata for the purpose of this exercise.

  1. From the administration interface change to the airports collection.

  2. Edit the metadata mappings. (Administer tab, customise metadata mappings).

  3. Modify the mapping for the <LATLONG> field to set the type as a geospatial coordinate. Note: the <LATLONG> field was mapped previously so edit the existing entry.

    exercise geospatial and numeric metadata 01
  4. Modify the mapping for the <Altitude> field to set the type to number, then save the changes. Note: the <Altitude> field was mapped previously so edit the existing entry.

    exercise geospatial and numeric metadata 02
  5. Rebuild the index (as you have changed the metadata configuration). Reminder: Update tab, start advanced update, reindex live view.

  6. Run a search for !showall and inspect the XML or JSON noting that kmFromOrigin elements now appear (due to the elements containing geospatial metadata).

    exercise geospatial and numeric metadata 03
  7. Return to the HTML results and add numeric constraints to the query to return only airports that are located between 2000 ft and 3000 ft: add &lt_altitude=3000&ge_altitude=2000 to the URL, observing that the number of matching results is reduced and that the altitudes of the matching results are all now between 2000 and 3000.

    exercise geospatial and numeric metadata 04
    The full list of numeric operators is available in the online documentation
  8. Remove the numeric constraints and apply some geospatial constraints. For geospatial search you must supply an origin parameter and can optionally supply a maxdist parameter. The maxdist parameter indicates a radius (in km) from the origin point used to limit the result set. Define an origin (Canberra, Australia) by adding &origin=-35.2842245,149.1328911 (note: the origin parameter has the x and y coordinates delimited using a comma) to the URL and observe the XML results. Note the <kmFromOrigin> elements update to contain values relative to the origin you have defined.

    If you don’t define an origin Funnelback will use 0.0,0.0 (somewhere in the Atlantic Ocean) and all kmFromOrigin values will be calculated from this origin.

    When working with geospatial search you may want to consider setting the origin value by reading the location data from your web browser (which might be based on a mobile phone’s GPS coordinates, or on IP address location). Once you’ve read this value you can pass it to Funnelback along with the other search parameters.

    exercise geospatial and numeric metadata 05
  9. Edit the template to print out the kmFromOrigin value in the results. Add the following below the metadata that is printed in the result template (e.g. immediately before the </dl> tag at approx. line 610):

    <#if s.result.kmFromOrigin??>
    <dt>Distance from origin:</dt><dd>${s.result.kmFromOrigin} km</dd>
    </#if>
  10. Run the !showall search again and observe the distance is now returned in the results.

    exercise geospatial and numeric metadata 06
  11. Add a maxdist constraint limiting the search to return items within 400km of the origin (&maxdist=400). Observe that the number of matching results drops, and that all the kmFromOrigin values are 400km or less.

    exercise geospatial and numeric metadata 07
  12. Sort the results by proximity to the origin by adding &sort=prox and observe that the results are now ordered by their kmFromOrigin values (closest first).

    exercise geospatial and numeric metadata 08
Extended exercises: Geospatial search and numeric metadata
  1. Modify the search box to include controls to set the origin using the browser’s location support and to adjust the maxdist. Hint: examine the advanced search form for an example.

  2. Add sort options to sort the results by proximity to the origin.

  3. Modify the search box to set the origin inside a hidden field.

  4. Set the origin parameter using a pre_process hook script. You won’t be able to do this until you have completed the FUNL202 training course. Hint: the maxdist and origin need to be set in the additionalParameters data model element.

  5. Modify the template to plot the search results onto a map. See: https://community.funnelback.com/knowledge-base/implementation/search-interface/using-funnelback-search-results-to-populate-a-map

  6. Add sort options to sort the results numerically by altitude. Observe that the sort order is numeric (1, 2, 10, 11). Update the metadata mappings so that altitude is a standard text metadata field and reindex the live view. Refresh the search results and observe the sort order is now alphabetic (1, 10, 11, 2). This distinction is important if you have a metadata field that you need to sort numerically.

8. Configuring URL sets (generalised scopes)

The generalised scopes mechanism in Funnelback allows an administrator to group sets of documents that match a set of URL patterns (e.g. */publications/*), or all the URLs returned by a specified query (e.g. author:shakespeare).

Once defined these groupings can be used for:

  • Scoped searches (provide a search that only looks within a particular set of documents)

  • Creating additional services (providing a search service with separate templates, analytics and configuration that is limited to a particular set of documents).

  • Faceted navigation categories (Count the number of documents in the result set that match this grouping).

The patterns used to match against the URLs are Perl regular expressions allowing for very complex matching rules to be defined. If you don’t know what a regular expression is don’t worry as simple substring matching will also work.

The specified query can be anything that is definable using the Funnelback query language.

Generalised scopes are a good way of adding some structure to an index that lacks any metadata, by either making use of the URL structure or by creating groupings based on pre-defined queries.

Metadata should always be used in preference to generalised scopes where possible as gscopes carry a much higher maintenance overhead.

URLs can be grouped into multiple sets by having additional patterns defined within the configuration file.

Exercise 10: Configuring URL sets that match a URL pattern

The process for creating configuration for generalised scopes is very similar to that for external metadata.

  1. Load the administration interface and switch to the silent-films collection.

  2. Navigate to the file manager (select browse collection configuration files from the administer tab).

  3. Create a gscopes.cfg by selecting gscopes.cfg from the select menu that appears at the bottom of the silent-films / config file listing then click on the create button.

    exercise configuring url sets that match a url pattern 01
  4. A blank file editor screen will load. We will define a URL grouping that groups together a set of pages about Charlie Chaplin.

    When defining gscopes there are often many different ways of achieving the same result.

    The following pattern tells Funnelback to create a set of URLs with a gscope ID of charlie that is made up of any URL containing the substring /details/CC_:

    charlie /details/CC_

    The following would probably also achieve the same result. This tells Funnelback to tag the listed URLs with a gscope ID of charlie. Note: the match is still a substring but this time the match is much more precise so each item is likely to only match a single URL. Observe also that it is possible to assign the same gscope ID to many patterns:

    charlie https://archive.org/details/CC_1916_05_15_TheFloorwalker
    charlie https://archive.org/details/CC_1916_07_10_TheVagabond
    charlie https://archive.org/details/CC_1914_03_26_CruelCruelLove
    charlie https://archive.org/details/CC_1914_02_02_MakingALiving
    charlie https://archive.org/details/CC_1914_09_07_TheRounders
    charlie https://archive.org/details/CC_1914_05_07_ABusyDay
    charlie https://archive.org/details/CC_1914_07_09_LaffingGas
    charlie https://archive.org/details/CC_1916_09_04_TheCount
    charlie https://archive.org/details/CC_1915_02_01_HisNewJob
    charlie https://archive.org/details/CC_1914_06_13_MabelsBusyDay
    charlie https://archive.org/details/CC_1914_11_07_MusicalTramps
    charlie https://archive.org/details/CC_1916_12_04_TheRink
    charlie https://archive.org/details/CC_1914_12_05_AFairExchange
    charlie https://archive.org/details/CC_1914_06_01_TheFatalMallet
    charlie https://archive.org/details/CC_1914_06_11_TheKnockout
    charlie https://archive.org/details/CC_1914_03_02_FilmJohnny
    charlie https://archive.org/details/CC_1914_04_27_CaughtinaCaberet
    charlie https://archive.org/details/CC_1914_10_10_TheRivalMashers
    charlie https://archive.org/details/CC_1914_11_09_HisTrystingPlace
    charlie https://archive.org/details/CC_1914_08_27_TheMasquerader
    charlie https://archive.org/details/CC_1916_05_27_Police
    charlie https://archive.org/details/CC_1916_10_02_ThePawnshop
    charlie https://archive.org/details/CC_1915_10_04_CharlieShanghaied
    charlie https://archive.org/details/CC_1916_06_12_TheFireman
    charlie https://archive.org/details/CC_1914_02_28_BetweenShowers
    charlie https://archive.org/details/CC_1918_09_29_TheBond
    charlie https://archive.org/details/CC_1918_xx_xx_TripleTrouble
    charlie https://archive.org/details/CC_1914_08_31_TheGoodforNothing
    charlie https://archive.org/details/CC_1914_04_20_TwentyMinutesofLove
    charlie https://archive.org/details/CC_1914_03_16_HisFavoritePasttime
    charlie https://archive.org/details/CC_1917_10_22_TheAdventurer
    charlie https://archive.org/details/CC_1914_06_20_CharlottEtLeMannequin
    charlie https://archive.org/details/CC_1917_06_17_TheImmigrant
    charlie https://archive.org/details/CC_1916_11_13_BehindtheScreen
    charlie https://archive.org/details/CC_1914_08_10_FaceOnTheBarroomFloor
    charlie https://archive.org/details/CC_1914_10_29_CharlottMabelAuxCourses
    charlie https://archive.org/details/CC_1914_10_26_DoughandDynamite
    charlie https://archive.org/details/CC_1914_12_07_HisPrehistoricpast
    charlie https://archive.org/details/CC_1914_02_09_MabelsStrangePredicament
    charlie https://archive.org/details/CC_1914_11_14_TilliesPuncturedRomance
    charlie https://archive.org/details/CC_1915_12_18_ABurlesqueOnCarmen
    charlie https://archive.org/details/CC_1914_08_01_CharolotGargonDeTheater
    charlie https://archive.org/details/CC_1917_04_16_TheCure
    charlie https://archive.org/details/CC_1916_08_07_One_A_M
    charlie https://archive.org/details/CC_1914_08_13_CharliesRecreation
    charlie https://archive.org/details/CC_1914_02_07_KidsAutoRaceAtVenice
    charlie https://archive.org/details/CC_1914_04_04_TheLandladysPet

    Finally the following regular expression would also achieve the same result.

    charlie archive.org/details/CC_.*$

    This may seem a bit confusing, but keep in mind that the defined pattern can be as general or as specific as you like - the trade-off is in what will match. The pattern needs to be specific enough to match the items you want while excluding those that shouldn’t be matched.

    Copy and paste the following into your gscopes.cfg and save the file. This will set up two URL sets - the first matching a subset of pages about Charlie Chaplin and the second matching a set of pages about Buster Keaton.

    charlie /details/CC_
    buster archive.org/details/Cops1922
    buster archive.org/details/Neighbors1920
    buster archive.org/details/DayDreams1922
    buster archive.org/details/OneWeek1920
    buster archive.org/details/Convict13_201409
    buster archive.org/details/HardLuck_201401
    buster archive.org/details/ThePlayHouse1921
    buster archive.org/details/College_201405
    buster archive.org/details/TheScarecrow1920
    buster archive.org/details/MyWifesRelations1922
    buster archive.org/details/TheHighSign_201502
    buster archive.org/details/CutTheGoat1921
    buster archive.org/details/TheFrozenNorth1922
    buster archive.org/details/BusterKeatonsThePaleface
  5. Rebuild the index (Select start advanced update from the update tab, then select reapply gscopes to live view and click update) to apply these generalised scopes to the index.

  6. Confirm that the gscopes are applied. Run a search for day dreams and view the JSON/XML data model. Locate the results and observe the values of the gscopesSet field. Items that match one of the Buster Keaton films listed above should have a value of buster set. Observe that results have other gscopes set that look like FUN followed by a random string of letters and numbers. These are gscopes that are defined by Funnelback when you create faceted navigation based on queries (as was done when setting up the Buster Keaton facet in the faceted navigation exercise in FUNL201).

  7. Use gscopes to scope the search. Run a search for !showeverything - this will return all results that don’t contain the word showeverything (in other words it will return everything). Add &gscope1=charlie to the URL and press enter. Observe that all the results are now restricted to films featuring Charlie Chaplin (and more specifically all the URLs contain /details/CC_ as a substring). Change the URL to have &gscope1=buster and rerun the search. This time all the results returned should be links to films featuring Buster Keaton. Advanced scoping that combines gscopes is also possible using reverse polish notation when configuring query processor options. See the gscopes documentation for more information.

Exercise 11: Configuring URL sets that match a Funnelback query
  1. Load the administration interface and switch to the silent-films collection.

  2. Navigate to the file manager (select browse collection configuration files from the administer tab).

  3. Create a query-gscopes.cfg by selecting query-gscopes.cfg from the select menu that appears at the bottom of the silent-films / config file listing then click on the create button.

    exercise configuring url sets that match a funnelback query 01
  4. A blank file editor screen will load. We will define a URL set containing all silent movies about Christmas.

    The following pattern tells Funnelback to create a set of URLs with a gscope ID of XMAS that is made up of the set of URLs returned when searching for christmas:

    XMAS christmas

    The query is specified using Funnelback’s query language and supports any advanced operators that can be passed in via the search box.
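
    For example, the following (hypothetical) entry would tag every URL returned when searching for the phrase "santa claus" with a gscope ID of SANTA:

     SANTA "santa claus"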

  5. Rebuild the index: (select start advanced update from the update tab, then select reapply gscopes to live index and click update) to apply these generalised scopes to the index.

  6. Confirm that the gscopes are applied. Run a search for christmas and view the JSON/XML data model. Locate the results and observe the values of the gscopesSet field. The returned items should have a value of XMAS set.

    exercise configuring url sets that match a funnelback query 02
  7. Use gscopes to scope the search. Run a search for !showeverything - this will return all results that don’t contain the word showeverything (in other words it will return everything). Add &gscope1=XMAS to the URL and press enter. Observe that all the results are now restricted to the films about Christmas. Replace gscope1=XMAS with gscope1=xmas and observe that the gscope value is case sensitive.

Extended exercises and questions: URL sets (gscopes)
  • Redo the first gscopes exercise, but with the alternate pattern sets defined in step 4 of the exercise. Compare the results and observe that a similar result is achieved with the three different pattern sets.

  • Create a generalised scope that contains all documents where the director is Alfred Hitchcock

  • Why is using gscopes to apply keywords higher maintenance than using a metadata field?

  • Construct a reverse-polish gscope expression that includes charlie OR christmas but not buster. Hint: https://docs.funnelback.com/15.24/develop/reference-documents/gscopes.html#gscope-expressions

9. Social media collections

Funnelback has the ability to index content from the following social media services:

  • YouTube

  • Facebook

  • Flickr

  • Twitter

Additional services can be added by implementing a custom gatherer. Custom gatherers are covered in the FUNL203 training.

There are a number of pre-requisites that must be satisfied before social media services can be indexed. These vary depending on the type of service, but generally involve having an account, channel/service identifier and API key for access to the service.

Exercise 12: Index a YouTube channel

In this exercise you will create a collection that indexes Funnelback’s YouTube channel:

When creating your own social media collections you will need to generate your own API keys for the service you are attempting to interact with. The API key used in this exercise has been generated for the purposes of this Funnelback training exercise. The process to generate an API key varies for each social media type - please refer to the documentation for the social media platform for specific details.
Multiple channels can be configured by listing the channel IDs (separating the IDs with commas). The channel ID is embedded within the source code of the YouTube channel page. To find it, open up your YouTube channel then view the page source in your browser, searching within the text for channel_id. This will appear multiple times within the page.
  1. Log in to the administration interface and create a new collection with the following details:

    • Project group ID: Training 202

    • Collection ID: funnelback-youtube

    • Collection type: youtube

    exercise index a youtube channel 01
  2. Add configuration settings for the following:

    • service_name: Funnelback - YouTube

    • youtube.api-key: AIzaSyAiQT4tINzGie8d5Sjt9Ezes-VdcUCVHcg

    • youtube.channel-ids: UC28P4i0bRdTb08l86PhCXHA

    exercise index a youtube channel 02
  3. Update the collection by selecting update this collection from the update tab.

  4. Inspect the metadata mappings (administer tab, configure metadata mappings) and observe that a set of YouTube specific fields are automatically mapped.

  5. Add some display options to display the YouTube metadata. Add the following to the display options (administer tab, edit collection configuration, interface, query processor options):

    -stem=2 -SF=[c,t,viewCount,likeCount,dislikeCount,duration,imageSmall]
  6. Update the template (create a service then edit templates from the customise tab). Replace the contents of the <@s.Results> tag (approx line 505) with the following code:

    <#if s.result.class.simpleName == "TierBar">
      <#-- A tier bar -->
      <#if s.result.matched != s.result.outOf>
        <li class="search-tier"><h3 class="text-muted">Results that match ${s.result.matched} of ${s.result.outOf} words</h3></li>
      <#else>
        <li class="search-tier"><h3 class="hidden">Fully-matching results</h3></li>
      </#if>
      <#-- Print event tier bars if they exist -->
      <#if s.result.eventDate??>
        <h2 class="fb-title">Events on ${s.result.eventDate?date}</h2>
      </#if>
    <#else>
      <li data-fb-result="${s.result.indexUrl}" class="result<#if !s.result.documentVisibleToUser>-undisclosed</#if> clearfix">
    
        <h4 <#if !s.result.documentVisibleToUser>style="margin-bottom:4px"</#if>>
          <#if s.result.metaData["imageSmall"]??>
            <img class="img-thumbnail pull-left" style="margin-right:0.5em;" src="${s.result.metaData["imageSmall"]?replace("\\|.*$","","r")}" />
          </#if>
    
          <#if question.currentProfileConfig.get("ui.modern.session")?boolean><a href="#" data-ng-click="toggle()" data-cart-link data-css="pushpin|remove" title="{{label}}"><small class="glyphicon glyphicon-{{css}}"></small></a></#if>
            <a href="${s.result.clickTrackingUrl}" title="${s.result.liveUrl}">
              <@s.boldicize><@s.Truncate length=70>${s.result.title}</@s.Truncate></@s.boldicize>
            </a>
    
          <#if s.result.fileType!?matches("(doc|docx|ppt|pptx|rtf|xls|xlsx|xlsm|pdf)", "r")>
            <small class="text-muted">${s.result.fileType?upper_case} (${filesize(s.result.fileSize!0)})</small>
          </#if>
          <#if question.currentProfileConfig.get("ui.modern.session")?boolean && session?? && session.getClickHistory(s.result.indexUrl)??><small class="text-warning"><span class="glyphicon glyphicon-time"></span> <a title="Click history" href="#" class="text-warning" data-ng-click="toggleHistory()">Last visited ${prettyTime(session.getClickHistory(s.result.indexUrl).clickDate)}</a></small></#if>
        </h4>
    
        <p>
        <#if s.result.date??><small class="text-muted">${s.result.date?date?string("d MMM yyyy")}:</small></#if>
        <span class="search-summary"><@s.boldicize><#noautoesc>${s.result.metaData["c"]!"No description available."}</#noautoesc></@s.boldicize></span>
        </p>
    
        <p>
        	<span class="glyphicon glyphicon-time"></span> ${s.result.metaData["duration"]!"N/A"}
        	Views: ${s.result.metaData["viewCount"]!"N/A"}
        	<span class="glyphicon glyphicon-thumbs-up"></span> ${s.result.metaData["likeCount"]!"N/A"}
        	<span class="glyphicon glyphicon-thumbs-up"></span> ${s.result.metaData["dislikeCount"]!"N/A"}
        </p>
      </li>
    </#if>
  7. Run a search for !showall and observe the YouTube results:

    exercise index a youtube channel 03
Extended exercises: social media
  1. Find a random YouTube channel and determine the channel ID. Add this as a second channel ID and update the collection.

  2. Set up a new social media collection using one of the other templates (such as Facebook or Twitter). To do this you will need an appropriate API key for the repository and a channel to consume. This social media article provides some hints on generating the keys: Social media API keys

10. Push collections

We have previously looked at standard Funnelback collections (such as web collections) that follow a linear update cycle of gather, filter, index, swap to produce a searchable index.

Push collections are quite different, handling indexing only - updates to a push collection are made using an API.

This means that a separate process or set of processes is required to handle the gathering and filtering of the content.

Push collections also differ from standard collections in the way indexes are stored and managed - push collections don’t have live and offline versions. Push collections also update whenever changes are committed - this means you don’t necessarily have to wait for an update to complete before you can start searching.

When working with a push collection it is critical for the code that interacts with the push API to handle error conditions (for example catch any errors and queue the items for pushing at a later time).

There are a couple of approaches generally used when interacting with push collections:

  1. Creation of a collection in Funnelback that only handles gathering and filtering of content, interacting with the push collection’s API to add and remove items. This is how trimpush collections work.

  2. Interact directly with the push collection from an external source (eg. from a CMS or database directly calling the push collection API). This is currently the more common approach, but it places the responsibility for the gather and filter processes on the external system that interacts with Funnelback.

The Funnelback administration interface also includes an interface that allows for interaction with the API via web forms, allowing data to be interactively added or removed from the indexes.

Exercise 13: Create a push collection

This exercise uses the API-UI available in the administration interface to interactively run API calls. Interaction with a push collection is usually performed programmatically by contacting the push-api REST endpoint and passing in the appropriate parameters.

  1. Log in to the administration interface and create a new collection.

  2. Enter the following information into the creation form then click create:

    • Project group ID: Training 202

    • Collection ID: push-collection

    • Collection type: push2

    exercise create a push collection 01
  3. Add the following configuration setting:

    • service_name: Training push collection

    exercise create a push collection 02
  4. View the available Push API calls by selecting API-UI from the system menu, then clicking the Push API button on the title bar.

    exercise create a push collection 03
    exercise create a push collection 04
  5. Add a document to the collection. Expand the push-api-content heading then select the PUT documents item.

    exercise create a push collection 05
  6. Enter the following into the parameters section of the PUT documents form:

    • collection: push-collection

    • key: http://mysite/url1

    • content: <html><head><title>Test document</title></head><body><h1>Test document 1</h1><p>This is some sample text.</p></body></html>

    • Parameter content type: text/html

    • Content-type: text/html

    The other fields can be left blank.

    exercise create a push collection 06
  7. Add the document by clicking the execute button. The screen will update with the submitted call and response.

    exercise create a push collection 07
  8. Run a query against push-collection and verify that your document is returned in the search results.

    exercise create a push collection 08
  9. Changes to the push collection will be visible as soon as a commit completes. Push collections will auto-commit based on configured settings, but a commit can also be manually triggered by calling the POST commit API call. This is also available under the push-api-content heading.
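
The same PUT documents and POST commit calls can also be made programmatically. The following Groovy sketch illustrates the general approach only - the base URL, API paths and credentials shown are assumptions, so confirm the exact paths against the API-UI (and note that the admin port normally requires HTTPS and authentication) before adapting it:

// Sketch only: push a document and trigger a commit via the Push API.
// The base URL, paths and credentials below are assumptions - verify them
// against the API-UI on your own server.
def base = "https://localhost:8443/push-api/v1/collections/push-collection"
def auth = "Basic " + "admin:password".bytes.encodeBase64().toString()
def key  = URLEncoder.encode("http://mysite/url2", "UTF-8")

// PUT documents: add (or replace) a document in the push collection
def put = new URL("${base}/documents?key=${key}").openConnection()
put.requestMethod = "PUT"
put.doOutput = true
put.setRequestProperty("Authorization", auth)
put.setRequestProperty("Content-Type", "text/html")
put.outputStream.withWriter("UTF-8") { w ->
    w << "<html><head><title>Test document 2</title></head><body><p>More sample text.</p></body></html>"
}
println "PUT documents returned HTTP ${put.responseCode}"

// POST commit: make the change searchable immediately rather than waiting for an auto-commit
def commit = new URL("${base}/commit").openConnection()
commit.requestMethod = "POST"
commit.setRequestProperty("Authorization", auth)
println "POST commit returned HTTP ${commit.responseCode}"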

11. Manipulating search result content

Funnelback offers a number of options for manipulating the content of search results.

There are three main places where search result content can be manipulated. Which one to choose will depend on how the modifications need to affect the results. The options are (in order of difficulty to implement):

  1. Modify the content as it is being displayed to the end user.

  2. Modify the content after it is returned from the indexes, but before it is displayed to the end user.

  3. Modify the content before it is indexed.

11.1. Modify the content as it is displayed to the user

This is the easiest of the manipulation techniques to implement and involves transforming the content as it is displayed.

This is achieved in the presentation layer and is very easy to implement and test with the results being visible as soon as the changes are saved. Most of the time this means using the Freemarker template. If the raw data model is being accessed the code that interprets the data model will be responsible for implementing this class of manipulation.

A few examples of content modification at display time include:

  • Editing a value as it’s printed (e.g. trimming a site name off the end of a title)

  • Transforming the value (e.g. converting a title to uppercase, calculating a percentage from a number)

Freemarker provides libraries of built-in functions that facilitate easy manipulation of data model variables from within the Freemarker template.

Exercise 14: Use Freemarker to manipulate data as it is displayed

In this exercise we’ll clean up the search result hyperlinks to remove the site name.

  1. Run a search against the funnelback-website collection for funnelback and observe that each result link includes - Funnelback.

  2. Log in to the administration interface and switch to the funnelback-website collection.

  3. Edit the default template by selecting design results page from the customise tab.

  4. Locate the results block in the template, and specifically the variable printed inside the result link. In this case it is the ${s.result.title} variable.

    exercise freemarker to manipulate data as it is displayed 01
  5. Edit the ${s.result.title} variable to use the Freemarker replace function to remove the - Funnelback from each title. Update the variable to ${s.result.title?replace(" - Funnelback","")} then save and publish the template.

  6. Repeat the search against the funnelback-website collection for funnelback and observe that - Funnelback no longer appears at the end of each link. If the text is still appearing check your replace carefully and make sure you haven’t got any incorrect quotes or dashes as the replace matches characters exactly.

  7. The function calls can also be chained - for example adding ?upper_case to the end of ${s.result.title?replace(" - Funnelback","")?upper_case} will remove - Funnelback and then uppercase the remaining title text.

11.2. Modify the content after it is returned from the index, but before it is displayed

This technique involves manipulating the data model that is returned. User interface hook scripts can be used to modify values that exist within the data model. Content modification using this technique is made before the data is consumed by the presentation layer - so the raw XML or JSON response is what is manipulated.

This method requires an understanding of Funnelback’s query processing pipeline, which follows the lifecycle of a query from when it’s submitted by a user to when the results are returned to the user.

We will take a closer look at this technique in the next section.

11.3. Modify the content before it is indexed

This involves creating a filter that receives the raw downloaded data as an input and performs arbitrary processing of the data before returning it to Funnelback.

This is the most complicated of the transformation techniques and involves writing a filter in the Groovy language. The scope of what can be implemented using these techniques is wide and varied. Whatever ends up in the output data is what Funnelback indexes.

Example uses:

  • Format conversion (eg. extract the text from a PDF document)

  • Record geocoding (eg. perform a geospatial lookup based on postcode or address and generate a geo-coordinate)

  • Metadata injection (eg. analyse the reading grade of the document and inject this value as metadata)

  • Data scraping (eg. scrape the breadcrumb trails and write this out as section metadata)

  • Data cleansing (eg. cleanse or replace titles)

  • Report generation (eg. analyse the document for WCAG compliance and write this to an external database)

This sort of technique is required if you wish the modification to apply to searches on the index (rather than just display). For example if you wish to enable geospatial search and require geocoding of results this must be done before the index is constructed so that Funnelback is able to run a geospatial search. The same applies to modification to the start of titles as this affects sorting.

Authoring of filters currently requires some back-end access and is covered in the FUNL203 Funnelback training course. 
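
As a purely conceptual illustration (this is not the actual Funnelback filter framework, whose interfaces are covered in FUNL203), a metadata injection filter essentially boils down to a Groovy function that receives the downloaded content and returns a modified version for indexing. A hypothetical sketch:

// Hypothetical sketch only - the real filter framework and its method
// signatures are covered in the FUNL203 course.
String injectReadingGrade(String html) {
    // Stand-in "reading grade" calculation based on a crude word count
    def words = html.replaceAll("<[^>]+>", " ").trim().split("\\s+").length
    def grade = Math.min(12, words.intdiv(100))

    // Inject the computed value as a meta tag so it can be mapped and indexed as metadata
    return html.replaceFirst("(?i)</head>",
        "<meta name=\"readingGrade\" content=\"${grade}\" /></head>")
}

assert injectReadingGrade("<html><head></head><body><p>Some text.</p></body></html>").contains("readingGrade")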

12. Query processing pipeline

When a query is submitted to Funnelback it passes through a series of steps: the initial submission, the building of the data model question, the actual running of the query and the processing of the result packet, before the final result is delivered to the end user.

query processing pipeline 01

Each step of this lifecycle has a corresponding hook allowing manipulation of the data model using the corresponding user interface hook script.

12.1. Query processing phases

  • Input processor: The user’s query terms and service/collection configuration are used to construct a question object within the data model. This question object outlines all of the attributes that define the question that will be submitted to Funnelback’s query processor. This includes configuring faceted navigation.

  • Extra searches setup: the question object is cloned for any extra searches that are configured to run as part of the overall search process.

  • Results fetching: The question object is used to construct a query that is submitted to Funnelback’s query processor (padre). This is the actual process of looking up the index based on the attributes in the question object. This includes passing on the query constraints as well as configuration options that affect the display and ranking of the results.

  • Output processing: The raw XML response from the query processor is converted into a response object that is added to the data model. The output processing also builds many user interface objects such as the data structure behind faceted navigation.

  • Results rendering: If templating is applied to the results (i.e. the search is accessed via the HTML endpoint, search.html) then a final phase transforms the final data model and produces formatted search results output based on the configured Freemarker templates.

12.2. Hook scripts

The following five hook points are provided as part of the search lifecycle. A corresponding hook script can be run at each of these points, allowing manipulation of the data model at that point in the lifecycle.

Understanding the state of the data model at every step in the process is key to writing an effective hook script.

Hook scripts need to be defined on the collection that is being queried.

The most commonly used hook scripts are the pre and post process hooks.

  • Pre-process: This runs after initial question object population, but before any of the input processing occurs. Manipulation of the query and addition or modification of most question attributes can be made at this point.

    Example uses: modify the user’s query terms; convert a postcode to a geo-coordinate and add geospatial constraints

  • Pre-datafetch: This runs after all of the input processing is complete, but just before the query is submitted. This hook can be used to manipulate any additional data model elements that are populated by the input processing. This is most commonly used for modifying faceted navigation.

    Example uses: Update metadata, gscope or facet constraints.

  • Post-datafetch: This runs immediately after the response object is populated based on the raw XML return, but before other response elements are built. This is most commonly used to modify underlying data before the faceted navigation is built.

    Example uses: Rename or sort faceted navigation categories, modify live URLs

  • Post-process: This is used to modify the final data model prior to rendering of the search results.

    Example uses: clean titles; load additional custom data into the data model for display purposes.

  • Extra searches: This runs after the extra search question is populated but before any extra search runs allowing modification of the extra search’s question. This hook script is deprecated and should no longer be used.

12.3. Groovy interaction with the data model

Hook scripts are written in the Groovy programming language and are used to interact with and manipulate the Funnelback data model.

Recall that the data model is the data structure that sits behind the search results (and is what is rendered when viewing the XML/JSON version of the search results).

The data model is returned inside a transaction object within Groovy that is available from within each hook script.

The transaction object is the base level of the data model - within this object exist the question and response objects as well as some additional items. Other sub-elements can be inferred by viewing the data model as XML/JSON and translating the structure and element type into a variable that can be used in Groovy.

E.g.

transaction.question.inputParameterMap["collection"] = "funnelback-search"
transaction.question.form = "simple"
groovy interaction with the data model 01

The type of the element in the data model is important as this will affect how it is accessed:

  • {} indicates a hash/map

  • [] indicates an array/list

  • "" indicates a string value

  • true/false indicates a Boolean value

  • a number without quotes indicates an integer value.

This is similar to what was previously outlined for Freemarker templating in the FUNL201 course.
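
Putting this together, the following minimal sketch shows how the element types above translate into Groovy when working with the transaction object (using only data model elements that appear elsewhere in this course):

// Read a value from a map element ({}): the input parameters
def userQuery = transaction.question.inputParameterMap["query"]

// Assign a string element (""): select the template used to render the results
transaction.question.form = "simple"

// Iterate a list element ([]): the results, guarding against an unpopulated response
if (transaction.response != null && transaction.response.resultPacket != null) {
    transaction.response.resultPacket.results.each {
        // "it" is the current result; its fields can be read or modified here
        it.title = it.title.trim()
    }
}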

Exercise 15: Clean result titles

In this exercise we will create a hook_post_process.groovy file to clean repeated text from the end of result titles.

Web pages frequently include information such as the organisation name as part of the page titles. This can have the effect of making the internal search results more difficult to read as the titles contain a lot of irrelevant content.

We can use a hook script to change the title after the search results are returned from the Funnelback query processor, but before the results are templated.

The post process hook is the most appropriate place to make this change as we want to change just how things are displayed in the search results.

  1. Open the administration interface and switch to the funnelback-search collection (the meta collection).

  2. Run a search for search and observe the result titles returned. Titles include - Funnelback Documentation - Version 15.x or - Funnelback at the end of each title.

    exercise clean result titles 01
  3. Switch to the XML or JSON view and inspect the title element for each result item. Observe the title endings.

    exercise clean result titles 02
  4. Open the administration interface and switch to the funnelback-search collection (the meta collection).

  5. Open up the file manager (administer tab, browse collection configuration files). Create a new hook_post_process.groovy file. Because we are editing a hook script we need to create this by clicking the edit configuration files button then adding a new hook_post_process.groovy.

    exercise clean result titles 03
    exercise clean result titles 04
  6. Add the following groovy code to the file then save it.

    // Remove site's names from result titles
    if ( transaction.response != null
      && transaction.response.resultPacket != null) {
        transaction.response.resultPacket.results.each() {
          // In Groovy, "it" represents the item being iterated
          it.title = (it.title =~ / - Funnelback$| - Funnelback Documentation.*$/).replaceAll("")
       }
    }
  7. Repeat the search for search and observe the result titles returned now have the endings stripped.

    exercise clean result titles 05
  8. Change to the XML or JSON view and also observe the result titles have also been modified.

    exercise clean result titles 06
    This highlights one of the key advantages of using a hook script over changing the template - changes are applied to the underlying data so anything that integrates with the search will also be affected by the changes.

Further examples of hook scripts can be found in the Funnelback online documentation.

Extended exercise: title cleaning
  1. Observe that there are still some titles that could be cleaned such as Index (Modern UI for Funnelback - Core 15.24.0-SNAPSHOT API). Modify the hook script to add an additional rule to clean these titles.
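
    One possible approach (a sketch only - adjust the pattern to suit the titles in your own index) is to add an extra alternative to the existing regular expression in hook_post_process.groovy:

     it.title = (it.title =~ / - Funnelback$| - Funnelback Documentation.*$| \(Modern UI for Funnelback.*\)$/).replaceAll("")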

13. Removing items from the index

It is not uncommon to find items being returned in the search results that are not useful search results.

Removing these improves the findability of other items in the index and provides a better overall search experience for end users.

There are a few different techniques that can be used to remove unwanted items from the index.

13.1. Prevent access to the item

Prevent the item from being gathered by Funnelback by preventing Funnelback from accessing it. This is something that needs to be controlled in the data source and the available methods are dependent on the data source. The advantage of this technique is that it can apply beyond Funnelback.

For example:

  • For web collections utilise robots.txt and robots meta tags to disallow access for Funnelback or other crawlers.

  • Change document permissions so that Funnelback is not allowed to access the document (e.g. for a filecopy collection only grant read permissions for Funnelback’s crawl user to those documents that you wish to be included in the search).

13.2. Exclude the item

The simplest method of excluding an item from within Funnelback is to adjust the gatherer so that the item is not gathered.

The exact method of doing this varies from gatherer to gatherer. For example:

  • Web collections: use the exclude patterns and crawler.reject_files setting to prevent unwanted URL patterns and file extensions from being gathered.

  • Database collections: adjust the SQL query to ensure the unwanted items are not returned by the query.

  • Filecopy collections: use the exclude patterns to prevent unwanted files from being gathered.

The use of exclude patterns needs to be carefully considered to assess if it will prevent Funnelback from crawling content that should be in the index. For example: excluding a home page in a web crawl will prevent Funnelback from crawling any pages linked to by the home page (unless they are linked from somewhere else) as Funnelback needs to crawl a page to extract links to sub-pages.

13.3. Killing urls / url patterns

It is also possible to remove items from the search index after the index is created.

This can be used to solve the home page problem mentioned above - including the home page in the crawl (so that sub-pages can be crawled and indexed) but removing the actual home page afterwards.

Examples of items that are commonly removed:

  • Home pages

  • Site maps

  • A-Z listings

  • 'Index' listing pages

Removing items from the index is as simple as listing the URLs in a configuration file. After the index is built a process runs that will remove any items that are listed in the kill configuration.

For normal collection types, there are two configuration files that control URL removal:

  • kill_exact.cfg: URLs exactly matching those listed in this file will be removed from the index. The match is based on the indexUrl (as seen in the data model).

  • kill_partial.cfg: URLs with the start of the URL matching those listed in the file will be removed from the index. Again the match is based on the indexUrl (as seen in the data model).

For push collections URLs can be removed using the Push API.

For meta collections URLs are removed by removing the URLs from the sub collection.

Exercise 16: Remove URLs from an index

In this exercise the Funnelback home page will be removed from the search index.

  1. Run a search for home against the funnelback-website collection. Observe the home result (URL: http://training-search.clients.funnelback.com/training/training-data/funnelback-website/www.funnelback.com/) is returned in the search results.

  2. Open the administration interface and switch to the funnelback-website collection.

  3. Open the file manager (administer tab, browse collection configuration files) then create a new kill_exact.cfg using the edit configuration files button under the funnelback-website / config heading. This is the same process as for creating the hook script in the previous exercise.

  4. The editor window will load for creating a new kill_exact.cfg file. The format of this file is one URL per line. If you don’t include a protocol then http is assumed. Add the following to kill_exact.cfg then save the file:

    http://training-search.clients.funnelback.com/training/training-data/funnelback-website/www.funnelback.com/
  5. As we are making a change that affects the makeup of the index we will need to rebuild the index. Rebuild the index by running an advanced update to re-index the live view.

  6. Repeat the search for home and observe that the home result no longer appears in the results.

  7. Run a search for blog - observe that a number of blog results are returned. We will now define a kill pattern to remove all of the blog posts. Observe that all the items we want to kill start with a common URL base.

  8. Return to the administration interface and create a new kill_partial.cfg.

  9. Add the following to the file and then save it:

    http://training-search.clients.funnelback.com/training/training-data/funnelback-website/www.funnelback.com/blog
  10. Rebuild the index then repeat the search for blog. You should see no results returned, as all the previous search results contained the kill string.

  11. Run a search for blog against the funnelback-search collection (the meta collection) and observe that the blog items have also disappeared from this search. 

13.4. Extended exercise: removing items from the index

Configuration files that are managed via the edit configuration files button can also be edited via WebDAV. This currently includes spelling, kill, server-alias, cookies, hook scripts, meta-names, custom_gather, workflow and reporting-blacklist configuration files.

  1. Delete the two kill config files you created in the last exercise then rebuild the index. Observe that the blog pages and items you killed are returned to the results listings.

  2. In your favourite text editor create a kill_exact.cfg and kill_partial.cfg containing the URLs you used in the previous exercise. Save these somewhere easy to access (such as your documents folder or on the desktop).

  3. Using your WebDAV editor (reminder: you used CyberDuck to edit template files in FUNL201) connect to the Funnelback server.

  4. Change to the funnelback-website collection and then browse to the conf folder.

    exercise clean result titles 05
  5. Upload the two kill configuration files you’ve saved locally by dragging them into the conf folder in CyberDuck.

  6. Once the files are uploaded return to the administration interface and return to the configuration file listing screen that is displayed when you click the edit configuration files button from the file manager. Observe that the files you just uploaded are displayed.

  7. Rebuild the index and the URLs should once again be killed from the index.

  8. Return to CyberDuck and this time browse to the conf folder for the funnelback-search collection. Observe that you can edit the hook script you created in the previous exercise.

13.5. Review questions: removing items from the index

  1. What are the advantages of using robots.txt or robots meta tags to control access to website content?

  2. Why can’t you kill a home page (like http://mysite.com/) from the index by adding it to the kill_partial.cfg?

  3. When you need to kill an item from a meta collection where do you add the kill configuration?

14. Alternate output formats

Funnelback includes XML and JSON endpoints that return the raw Funnelback data model in these formats.

These can be useful when interacting with Funnelback, however the data model is complex and contains a lot of data (meaning the response packets can be quite large). Hook scripts can be used to modify the structure of the data model that is returned, performing operations such as removing elements - but this should be done with care as the changes may break other functionality (especially if the HTML endpoint is used for the collection).

Funnelback also provides an endpoint that is designed to stream all the results back to the user in a single call. The all results endpoint can be used to return all matching Funnelback results in JSON or CSV format. This endpoint only returns the results section of the data model and only minimal modifications can be made to the return format. Hook scripts do not apply to the all results endpoint.

Funnelback’s HTML endpoint utilises Freemarker to template the results. This is traditionally used to format the search results as HTML. However this templating is extremely flexible and can be adapted to return any text-based format - such as CSV, RSS or even custom XML/JSON formats.

This is useful if you need to export search results, or use Funnelback’s index as a data source to be consumed by another service.

14.1. Returning search results (only) as CSV or custom JSON

The all results endpoint is ideal for providing a CSV export or custom JSON of search results.

A number of parameters can be supplied to the endpoint that control the fields returned and the field labels used.

The all results endpoint returns the result level data for all the matching results to a query. This is ideal for search results export. If data model elements from outside the results element are required (e.g. faceted navigation counts, search result counts, query) then a Freemarker template will be required to create the custom output.

Exercise 17: Return search results as CSV

In this exercise the all results endpoint will be configured to return the search results as CSV.

  1. Decide what fields you want to return as the columns of your CSV. Any of the fields within the search result part of the data model can be returned. For the airports collection we decide to return only the following fields, which are all sourced from the indexed metadata: "Name","City","Country","IATA/FAA","ICAO","Altitude"

  2. A URL will need to be constructed that defines the fields that are required. Several parameters will be required to set the fields and fieldnames to use:

    • collection: Required. Set this to the collection id.

    • fields: Required. Defines the fields that will be returned by the all results endpoint as a comma separated list of fields. Nested items in the results part of the data model require an xpath style value. e.g. metaData/author will return the author field from the metaData sub-element.

    • fieldnames: Optional list of field names to use. These will be presented as the column values (CSV) or element keys (JSON). If omitted then the raw values from fields will be used (e.g. metaData/author).

    • SF: Required if any metadata elements are being returned. Ensure this is configured to return all the metadata fields that are required for the output.

    • SM: Required if any metadata elements or the automatically generated results summary are being returned.

  3. For the airports collection we will return six columns: Name, City, Country, IATA/FAA, ICAO and Altitude. Open the administration interface, change to the airports collection and view the metadata mappings. Make a note of the metadata class names for these fields.

  4. Define the query string parameters. To return the six fields will require the following parameters:

    • collection=airports

    • fields=metaData/name,metaData/city,metaData/country,metaData/iataFaa,metaData/icao,metaData/altitude

    • fieldnames=Name,City,Country,IATA/FAA,ICAO,Altitude

    • SM=meta

    • SF=[name,city,country,iataFaa,icao,altitude]

  5. Construct a URL from this information and view the response from the all results JSON endpoint. This illustrates the custom JSON that can be returned using the all results endpoint:
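
    For example, assuming the all results JSON endpoint is available at /s/all-results.json on the training server (alongside the all-results.csv endpoint used later in this exercise) and using a query of !showall, the URL query string would look like:

     /s/all-results.json?collection=airports&query=!showall&fields=metaData/name,metaData/city,metaData/country,metaData/iataFaa,metaData/icao,metaData/altitude&fieldnames=Name,City,Country,IATA/FAA,ICAO,Altitude&SM=meta&SF=[name,city,country,iataFaa,icao,altitude]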

  6. Access the corresponding all results CSV endpoint. When you click the link a CSV file should be downloaded to your computer. Open this file after you’ve downloaded it.

  7. The file will have been saved with a file name similar to all-results.csv. The file name can be defined by adding an additional parameter, fileName. Modify the previous URL and add &fileName=airports.csv to the URL then press enter. The file should download and this time be saved as airports.csv.

  8. Open the administration interface and change to the airports collection.

  9. Configure the default profile to be a frontend service (to enable template editing) by clicking the create service button.

  10. Edit the default template for the airports collection (select the simple.ftl from the edit result templates listing on the customise tab). Add a button above the search results that allows the results to be downloaded as CSV. Add the following code immediately before the <ol> tag with an ID of search-results (approx. line 505):

    <p><a class="btn btn-success" href="/s/all-results.csv?collection=${question.collection.id}&query=${question.query}&profile=${question.profile}&fields=metaData/name,metaData/city,metaData/country,metaData/iataFaa,metaData/icao,metaData/altitude&fieldnames=Name,City,Country,IATA/FAA,ICAO,Altitude&SF=[name,city,country,iataFaa,icao,altitude]&SM=meta&fileName=airports.csv">Download results as CSV</a></p>
    exercise return results as csv 03
  11. Run a search for denmark and observe that a download results as CSV button appears above the 28 search results. Click the button and the csv file will download. Open the CSV file and confirm that all the results matching the query are included in the CSV data (matching the 28 results specified by the results summary).

    exercise return results as csv 04

14.2. Using Freemarker templates to return other text-based formats

Funnelback’s HTML endpoint can be used to define a template that returns the search results in any arbitrary text-based format.

This is commonly used for returning custom JSON or XML but can also be used to return other known formats such as tab delimited data or GeoJSON.

Exercise 18: Return search results as GeoJSON

In this exercise a Freemarker template will be created that formats the set of search results as a GeoJSON feed.

  1. Open the administration interface and change to the airports collection.

  2. Create a new custom template. (Reminder: customise tab, edit result templates, add new.)

    exercise return results as geojson 01
  3. Enter geojson.ftl as the name of the file once the new file editor loads.

    exercise return results as geojson 02
  4. Create template code to format the results as GeoJSON. Copy the code below into your template editor and hit save.

    <#ftl encoding="utf-8" output_format="JSON"/>
    <#import "/web/templates/modernui/funnelback_classic.ftl" as s/>
    <#import "/web/templates/modernui/funnelback.ftl" as fb/>
    <#compress>
    <#-- geojson.ftl
    Outputs Funnelback response in GeoJSON format.
    -->
    <#-- Read the metadata field to source the latLong value from from the map.geospatialClass collection.cfg option -->
    <#--<#assign latLong=question.currentProfileConfig.get("map.geospatialClass")/>-->
    <#-- Hard code using the latLong metadata field for this example -->
    <#assign latLong="latlong"/>
    <@s.AfterSearchOnly>
    <#-- NO RESULTS -->
    <#if question.inputParameterMap["callback"]?exists>${question.inputParameterMap["callback"]}(</#if>
    {
    <#if response.resultPacket.resultsSummary.totalMatching != 0>
    <#-- RESULTS -->
            "type": "FeatureCollection",
            "features": [
        <@s.Results>
            <#if s.result.class.simpleName != "TierBar">
              <#if s.result.metaData[latLong]?? && s.result.metaData[latLong]?matches("-?\\d+\\.\\d+;-?\\d+\\.\\d+")> <#-- has geo-coord and it's formatted correctly - update to the meta class containing the geospatial coordinate -->
                <#-- EACH RESULT -->
                {
                    "type": "Feature",
                        "geometry": {
                            "type": "Point",
                            "coordinates": [${s.result.metaData[latLong]?replace(".*\\;","","r")},${s.result.metaData[latLong]?replace("\\;.*","","r")}]
                        },
                        "properties": { <#-- Fill out with all the custom metadata you wish to expose (e.g. for use in the map display -->
                            "rank": "${s.result.rank?string}",
                            "title": "${s.result.title!"No title"}",
                            <#if s.result.date?exists>"date": "${s.result.date?string["dd MMM yyyy"]}",</#if>
                            "summary": "${s.result.summary!}",
                            "fileSize": "${s.result.fileSize!}",
                            "fileType": "${s.result.fileType!}",
                            "exploreLink": "${s.result.exploreLink!}",
                            <#if s.result.kmFromOrigin?? && question.inputParameterMap["origin"]??>"kmFromOrigin": "${s.result.kmFromOrigin?string("0.###")}",</#if>
                            <#-- MORE METADATA FIELDS... -->
    						"metaData": {
    						<#list s.result.metaData?keys as md>
    						      "${md}": "${s.result.metaData[md]}"<#if md_has_next>,</#if>
    						</#list>
                            },
                            "displayUrl": "${s.result.liveUrl!}",
                            "cacheUrl": "${s.result.cacheUrl!}",
                            "clickTrackingUrl": "${s.result.clickTrackingUrl!}"
                        }
                    }<#if s.result.rank &lt; response.resultPacket.resultsSummary.currEnd>,</#if>
                  </#if> <#-- has geo-coord -->
                </#if>
        </@s.Results>
        ]
    </#if>
    }<#if question.inputParameterMap["callback"]?exists>)</#if>
    </@s.AfterSearchOnly>
    </#compress>

    As you can see from the code above, the template is bare bones and only handles the case of formatting search results after the search has run, plus a minimal amount of wrapping code required by the GeoJSON format. Each result is templated as a feature element in the JSON data.

    Also observe that the template defines a JSON output format (<#ftl encoding="utf-8" output_format="JSON"/>) to ensure that only valid JSON is produced.
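
    To illustrate the shape of the output, a single result rendered by this template would produce a Feature similar to the following minimal sketch (all values shown here are purely illustrative). Note that GeoJSON expects coordinates in [longitude, latitude] order, which is why the template outputs the portion of the latlong value after the semicolon first:

    {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [151.20732, -33.86785]
        },
        "properties": {
            "rank": "1",
            "title": "Example result title",
            "summary": "An illustrative summary...",
            "displayUrl": "https://www.example.com/page.html"
        }
    }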

  5. Run a search using the geojson template - click on the search icon (eye icon third from the right) that appears in the available actions within the file listing.

    exercise return results as geojson 03
  6. A blank screen will display - Funnelback has loaded the template but no query is supplied. Specify a query by adding &query=!showall to the end of the URL. A fairly empty JSON response is returned to the screen. The response may look like unformatted text, depending on the JSON browser plugin you have installed. The response is quite empty because the template requires the latlong metadata field to be returned in the response and the collection isn’t currently configured to return this.

    exercise return results as geojson 04
  7. Return to the administration interface and add latlong to the list of summary fields set in the collection.cfg (update the -SF parameter to include latlong - an example is shown below). Refresh the results and observe that you are now seeing data points returned.

    exercise return results as geojson 05
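
    For example, the query_processor_options line in collection.cfg might end up looking something like the following sketch (the other options shown are illustrative - keep whatever options your collection already defines and simply add latlong to the -SF list):

    query_processor_options=-stem=2 -SF=[title,latlong]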
  8. The template is now configured to return the results in GeoJSON format. To ensure that your browser correctly detects the JSON we should also configure the GeoJSON template to return the correct MIME type in the HTTP headers. The correct MIME type for JSON is application/json - Funnelback templates are returned as text/html by default (which is why the browser renders the text). Return to the administration interface and edit the profile configuration (administer tab, edit profile configuration).

    If you are returning JSON as a JSONP response you should use a text/javascript MIME type as the JSON is wrapped in a JavaScript callback.
    exercise return results as geojson 06
  9. Add the following setting to the profile configuration then save and publish. This option configures Funnelback to return the text/javascript content type header when using the geojson.ftl template. We are doing this because the template is configured to support a callback function to enable JSONP.

    • Parameter key: ui.modern.form.*.content_type

    • Form name: geojson

    • Value: text/javascript

    exercise return results as geojson 07
    The content type option is set per template name at the profile level - the above option would apply to a template called geojson.ftl, but only when the matching profile is set as part of the query. ui.modern.form.TEMPLATE.content_type would set the content type for a template named TEMPLATE.ftl.
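
    For example, the setting created in this exercise resolves to the following line in the profile configuration:

    ui.modern.form.geojson.content_type=text/javascript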
  10. Rerun the search using the geojson template. This time the browser correctly detects and formats the response as JSON in the browser window.

  11. Observe that the GeoJSON response only contains 10 results - this is because the template formats whatever is returned by Funnelback, which defaults to the first 10 results. To return more results an additional parameter needs to be supplied that tells Funnelback how many results to return. E.g. try adding &num_ranks=30 to the URL and observe that 30 results are now returned. If you wish to return all of the results when accessing the template you will need to set the num_ranks value either to a number equivalent to the number of documents in your index, or link to the template from another template where you can read the number of results.
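
    As a rough guide, the full URL might look something like the following sketch (the host, collection and profile names are placeholders for your own environment):

    https://<SEARCH-HOST>/s/search.html?collection=<COLLECTION-NAME>&profile=_default&form=geojson&query=!showall&num_ranks=30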

If you are creating a template for a result format that you wish to be automatically downloaded (instead of displayed in the browser) you can set an additional HTTP response header for the template to ensure that the browser downloads the file. This header can also be used to define the filename.

If a file name isn’t set the download will save the file using the browser’s default naming.

The name of the file can be set using the Content-Disposition HTTP header. To do this edit the profile.cfg file (located in the profile’s preview folder), add the following line and then save the file:

ui.modern.form.<TEMPLATE-NAME>.headers.1=Content-Disposition: attachment; filename=<FILE-NAME>

This will tell Funnelback to send a custom HTTP header (Content-Disposition) with the <TEMPLATE-NAME>.ftl template, instructing the browser to save the file as <FILE-NAME>.
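
For example, to have the geojson.ftl template download as a file named results.json (the file name used here is just an illustration), the setting would be:

ui.modern.form.geojson.headers.1=Content-Disposition: attachment; filename=results.json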

Extended exercise: Plot search results on a map

Use the code for the Funnelback mapping plugin to add a search-powered map to your search results page that uses a GeoJSON feed to supply the data points.

15. Extra searches

Funnelback has the ability to run a series of extra searches in parallel with the main search, with the extra results data added to the data model for use in the search results display.

The extra searches can be run against any collection that exists on the Funnelback server, and it is possible via the extra searches hook script to modify the query that is submitted to each extra search that is run.

An extra search could be used on an all-of-university search to pull in additional results for staff members and videos and display these alongside the results from the main search. The formatting of the extra results can be controlled per extra search, so for the university example staff results might be presented in a box to the right of the results, with any matching video results returned in a JavaScript carousel.

15.1. Extra searches vs meta collections

The functionality provided by extra searches can sometimes be confused with that provided by meta collections. While both will bring in results from different collections, the main difference is that an extra search is a completely separate set of search results from the main result set.

  • A meta collection searches across multiple repositories or sub collections and merges all the results into a single set of results.

  • An extra search on a collection runs a secondary search providing an additional set of search results and corresponding data model object. This data model object contains a separate question and response object, just as for the main search.

A similar outcome could be achieved by running a secondary search via Ajax from the search results page to populate a secondary set of results such as a video carousel. Note however that the use of an extra search within Funnelback is more efficient than making an independent request via a method such as Ajax.

It is good practice to limit the number of extra searches run for any query, otherwise the performance of the search will be adversely affected. For good performance limit the number of extra searches to a maximum of 2 or 3.

In this exercise an extra search will be configured for the Funnelback search meta collection. The extra search will return any matching videos from the funnelback-youtube collection and present these alongside the main results.

  1. Log in to the administration interface and switch to the funnelback-search collection.

  2. Open the file manager (select browse collection configuration files from the administer tab) then create an extra_search.*.cfg file within the funnelback-search / config section.

    exercise configure an extra search 01
  3. Set the source collection for the extra search and the query processor options that should be applied to the extra search when it runs. The query processor options are in addition to the ones supplied on the main search (and any settings in this configuration will override existing settings for the main search). If you wish to use metadata from the extra search then you need to specify the SF values here. Add the following then save the file as extra_search.youtube.cfg:

    collection=funnelback-youtube
    query_processor_options= -num_ranks=3 -SF=[imageSmall]
    exercise configure an extra search 02
  4. Edit the collection configuration (administer tab, edit collection configuration) adding the following:

    • Parameter key: ui.modern.extra_searches

    • Value: youtube

    exercise configure an extra search 03

    This tells Funnelback to run the extra search as configured in extra_search.youtube.cfg.
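
    In collection.cfg the setting is stored in its resolved form:

    ui.modern.extra_searches=youtube

    If more than one extra search is required, the extra search names can generally be given as a comma-separated list in this value.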

  5. Run a search on the funnelback-search collection for funnelback and observe that the JSON or XML response now includes data beneath the extraSearches node. Within this is an element corresponding to each extra search that is run, and beneath that is a question and response element which mirrors that of the main search (a trimmed-down sketch of this structure is shown below).

    exercise configure an extra search 04
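
    Trimmed right down, the structure of the JSON response looks something like the following sketch (the contents of the question and response objects are abbreviated here - the real objects contain many more fields):

    {
        "question": { ... },
        "response": { ... },
        "extraSearches": {
            "youtube": {
                "question": { ... },
                "response": { ... }
            }
        }
    }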
  6. Make use of the extra search results within your template by adding an <@fb.ExtraResults> block. The code nested within the ExtraResults block is identical to what you would use in your standard results, but the data is drawn from the extra search item within the data model. E.g. print the title for each extra search result. Edit the default template (select edit result templates from the customise tab) and insert the following immediately after the </@s.ContextualNavigation> tag (approx. line 655) then save the template.

    <@fb.ExtraResults name="youtube">
    <div class="well">
      <h3>Related videos</h3>
      <div class="row">
      <@s.Results>
        <#if s.result.class.simpleName != "TierBar">
          <a href="${s.result.liveUrl}">
        <#if s.result.metaData["imageSmall"]??>
              <img class="img-thumbnail pull-left" style="margin-right:0.5em;" src="${s.result.metaData["imageSmall"]?replace("\\|.*$","","r")}" alt="${s.result.title}" title="${s.result.title}"/>
            <#else>
              <strong>${s.result.title}</strong>
            </#if>
          </a>
        </#if>
      </@s.Results>
      </div>
     </div>
    </@fb.ExtraResults>
    exercise configure an extra search 05
  7. Rerun the search and this time view the HTML results, observing the related videos appearing below the related searches panel.

    exercise configure an extra search 06
  8. Run a search for dashboard and confirm that the related videos are related to the search for dashboard. (Hint: hovering over the thumbnail displays the video’s title.)

    exercise configure an extra search 07

16. Content auditor

The content auditor provides a series of reports on various aspects of a collection’s content.

The content auditor is primarily designed for web and file content but could be adapted to other content sources.

Interpretation of content auditor reports is covered in detail in FUNL101.

The content auditor reports can be customised in a number of ways. Customisations include:

  • Reporting on custom metadata

  • Specifying undesirable text

  • Defining acceptable reading grades

Most customisations are applied as soon as the configuration change is saved, however some require an update of the collection for the changes to take effect.

Exercise 20: Content auditor reports
  1. Log in to the marketing dashboard, switch to the foodista service then click on the content auditor tile, or select content auditor from the sidebar menu.

  2. The content auditor - recommendations screen will then load.

    exercise content auditor reports 01
  3. Observe that the report is organised into several sections - recommendations (currently showing), overview (which provides information on the top values for the metadata fields covered by the report), attributes (which breaks down the different metadata fields) and search results (which provides page level information for items that match the current report’s search criteria).

  4. Spend a few minutes exploring the different sections of the content auditor report.

16.1. Customise the undesirable text

Funnelback uses Wikipedia’s common misspellings list to identify undesirable words. This list can be replaced or augmented with custom lists of terms.

Customisation of undesirable text requires a full update of the collection.

Exercise 21: Configuring undesirable text
  1. From the administration interface switch to the foodista collection. Open the file manager by selecting browse collection configuration files from the administer tab.

    exercise configuring undesirable text 01
  2. Create a file called undesirable-text.additional.cfg. Select undesirable-text.*.cfg from the create menu that appears at the bottom of the foodista / config section.

    exercise configuring undesirable text 02
  3. Set the filename to undesirable-text.additional.cfg by editing the text field above the main content editor, then edit the file, add the following and save:

    prawn
    prawns
    coriander
    abt
    exercise configuring undesirable text 03

    This will configure content auditor to identify pages that contain these words. In this example we might want to identify prawn(s) and coriander as non-preferred or banned words and abt as a banned abbreviation. This would allow for these to be updated to say shrimp, cilantro and about if these were the preferred terms in a site’s content guidelines.

  4. Edit the collection configuration and add the following setting, then save. The resolved form of the setting is shown below.

    • Parameter key: filter.jsoup.undesirable_text-source.*

    • Key: additional

    • Value: $SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.additional.cfg

    exercise configuring undesirable text 04
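
    The resolved form of this setting, as it appears in collection.cfg, is:

    filter.jsoup.undesirable_text-source.additional=$SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.additional.cfg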
  5. Run a full update of the collection. Note: a full update is required - an incremental update is not sufficient because filter changes are not applied to content that is not re-downloaded.

  6. After the update completes return to the content auditor report for the foodista collection and observe that occurrences of the words added to the custom undesirable text file are now included in the words listed as undesirable text. Clicking on one of the terms will filter the report to only pages containing the selected word.

    exercise configuring undesirable text 05

16.2. Customise the reading grade chart

The range of acceptable grade levels can be configured with collection configuration settings which control the lower and upper bounds of the reading grades that are rendered in green. ui.modern.content-auditor.reading-grade.lower-ok-limit and ui.modern.content-auditor.reading-grade.upper-ok-limit can be set to appropriate grade levels.

Customisation of the reading grade chart does not require an update of the collection.

Exercise 22: Set the reading grade limits
  1. Log in to the administration interface and change to the foodista collection.

  2. View the content auditor report and observe that the acceptable range for reading grade lies between 4- (grade 4 and below) and 9.

    exercise set the document age threshold 01
  3. Change the acceptable reading grade levels to grades 5-7. Edit the collection configuration and add the following settings:

    ui.modern.content-auditor.reading-grade.lower-ok-limit=5
    ui.modern.content-auditor.reading-grade.upper-ok-limit=7
  4. Return to the content auditor report and observe that the reading graph has updated to display the new acceptable range in green. Observe that the changes take place as soon as the configuration changes are saved - no update of the foodista collection is required.

    exercise set the document age threshold 02

16.3. Customise overview and attributes metadata

Custom metadata fields can be added to the overview and attributes screens.

This is done by adding ui.modern.content-auditor.facet-metadata.(metadata_name)=(facet_name) settings to the profile configuration for each field that should be displayed.

Customisation of the overview and attributes reports does not require an update of the collection, unless the metadata fields are not currently mapped.
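
For example, the setting added in the exercise below maps an authors metadata field to an Authors heading in the reports:

ui.modern.content-auditor.facet-metadata.authors=Authors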

Exercise 23: Add custom metadata to the overview and attributes screens
  1. Log in to the administration interface and switch to the foodista collection.

  2. View the content auditor report and observe the overview and attributes screens.

    exercise add custom metadata to the overview and attributes screens 01
    exercise add custom metadata to the overview and attributes screens 02
  3. Edit the profile configuration and add the following setting then save and publish. This will add an Authors field to the overview and attributes screens.

    • Parameter key: ui.modern.content-auditor.facet-metadata.*

    • Metadata name: authors

    • Value: Authors

    exercise add custom metadata to the overview and attributes screens 03
  4. Return to the content auditor search results and observe that the changes are now reflected in the report.

    exercise add custom metadata to the overview and attributes screens 04
    exercise add custom metadata to the overview and attributes screens 05

16.4. Customise the search results display

The search results screen can be configured to display the values of arbitrary metadata fields for each result.

Customisation of the search results report does not require an update of the collection, unless the metadata fields are not currently mapped.

Exercise 24: Add custom metadata columns to the search results
  1. Log in to the administration interface and switch to the foodista collection.

  2. View the content auditor report and select the search results screen.

    exercise add custom metadata columns to the search results 01
  3. Edit the profile configuration and add the following settings. This will remove the format and subjects columns and add columns that display the author and tags metadata field content. Observe that there is an existing entry for display metadata to show the tags field.

    ui.modern.content-auditor.display-metadata.f=
    ui.modern.content-auditor.display-metadata.keyword=
    ui.modern.content-auditor.display-metadata.authors=Author
  4. Return to the content auditor search results and observe that the changes are now reflected in the report.

    exercise add custom metadata columns to the search results 02

16.5. Customise the marketing dashboard summary tile

The attribute displayed on the content auditor summary tile can be customised by setting a collection configuration parameter.

Customisation of the tile source does not require an update of the collection, unless the metadata fields are not currently mapped.

Exercise 25: Change the content auditor summary tile
  1. Log in to the administration interface and switch to the foodista collection.

  2. Open the marketing dashboard and observe the content auditor summary tile:

    exercise change the content auditor summary tile 01
  3. Edit the profile configuration and observe that there is a configuration option setting the summary tile to the date modified:

    ui.modern.content-auditor.preferred-facets=Date Modified
    Note: this option has already been customised for the foodista collection - when the tile is using the default it is normal for this setting to be missing from the configuration.
  4. Edit the configuration setting to display Tags if available, with second preference for Date Modified. Save and publish the setting.

    ui.modern.content-auditor.preferred-facets=Tags,Date Modified
    exercise change the content auditor summary tile 02
  5. Reload the marketing dashboard and observe that the content auditor is now displaying tags on the summary tile.

    exercise change the content auditor summary tile 03

17. Accessibility auditor

The Funnelback accessibility auditor examines web content for accessibility issues. These issues are cross-referenced against the Web Content Accessibility Guidelines (WCAG 2.0), which is the recognised standard for assessing web content for accessibility.

Interpretation and management of accessibility auditor reports is covered in detail in FUNL101.

Accessibility auditor reports are only available for web collections and will be generated for every service that exists on a collection.

Note: The Funnelback accessibility auditor is a great tool for checking your site for accessibility compliance, but it should not be the only method used to check content. The auditor only checks the accessibility of machine-readable HTML and PDF content. A large number of the checks required for full WCAG compliance require manual checking.
Exercise 26: Enable the accessibility auditor
  1. Open the administration interface and switch to the inventors collection.

  2. Select the administer tab then click edit collection configuration.

    exercise enable the accessibility auditor 01
  3. Click on the accessibility auditor item in the left hand navigation. Enable the accessibility auditor by selecting Accessibility auditor check: true

    exercise enable the accessibility auditor 02
  4. Run a full update of the inventors collection - select update, start advanced update, full update from the administration interface. A full update is required because all of the documents need to be re-filtered to produce the accessibility auditor reports.

  5. When the update finishes switch to the marketing dashboard then select the accessibility auditor tile, or select accessibility auditor from the left hand menu. The accessibility auditor report should have been generated by the update.

    exercise enable the accessibility auditor 03