This course is only available to users of the Vagrant training VM as it requires you to log in via SSH.

Introduction

This course is aimed at frontend and backend developers and takes you through collection creation and advanced configuration of Funnelback implementations. Each exercise in this course includes:

  • A summary: A brief overview of what you will accomplish and learn throughout the exercise.

  • Exercise requirements: A list of requirements, such as files, that are needed to complete the exercise.

  • Detailed step-by-step instructions: instructions that guide you through completing the exercise.

  • Some extended exercises are also provided. These can be attempted if the standard exercises are completed early, or as review exercises in your own time.

Special tips and hints will also appear throughout the exercises, providing extra knowledge or advice for completing the exercise. They will look like:

This box is used to provide links to further reading available on the current topic.
This box is used to provide tips or hints related to the current topic.
This box is used to provide important advice relating to the current topic. This includes advice to be aware of to avoid common errors.
This box contains advice specific to the Linux version of Funnelback.
This box contains advice specific to the Windows version of Funnelback.

What this workshop will cover:

  • Funnelback filesystem

  • Workflow

  • Administration users

  • PADRE binaries

  • Filtering

  • APIs

  • Ranking

  • Custom collections

  • Template localisation

Prerequisites to completing the course:

  • Funnelback 202 training.

  • HTML, JavaScript and CSS familiarity.

  • Familiarity with Freemarker and Groovy.

  • Familiarity with the Linux/UNIX command line interface.

What you will need before you begin:

Funnelback access details

The exercises in this workshop require access to Funnelback’s administration interface and terminal access to the Funnelback server.

Funnelback administration interface

The Funnelback administration interface is accessed via a web browser from the following location:

Access to the administration interface should use the following username and password:

  • Username: admin

  • Password: admin

Funnelback server SSH access

The Funnelback server can be accessed using the following details via SSH, executed from a terminal window or SSH client such as PuTTY.

  • SSH host: localhost

  • SSH port: 2222

  • Username: search

  • Password: se@rch

Under macOS, run the following ssh command from a terminal to access the server:

ssh -p 2222 search@localhost

Under Windows use the details above to configure a PuTTY session.
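Recent versions of Windows also include an OpenSSH client, so as an alternative to PuTTY you can usually run the same command from PowerShell or a command prompt (this assumes the OpenSSH client feature is installed on your machine):

ssh -p 2222 search@localhost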

1. Funnelback filesystem

1.1. Funnelback home folder

The Funnelback home folder, or installation root, is the base folder where the Funnelback software has been installed. This is the folder that’s referred to by the $SEARCH_HOME environment variable, which appears in many configuration files. (If on Windows the environment variable is %SEARCH_HOME%).

The home folder, $SEARCH_HOME, usually points to one of the following locations:

Windows: C:\Funnelback or D:\Funnelback

Linux: /opt/funnelback/
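To confirm where the home folder is located on your server, echo the environment variable from a terminal session on the Funnelback server. On the training VM this should point to /opt/funnelback:

echo $SEARCH_HOME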

1.2. Top level folders

The following folders sit within the Funnelback home folder ($SEARCH_HOME):

  • conf: Global and collection-specific configuration files.

  • data: Collection-specific index, log and data files.

  • admin: Admin-related files, administration users and collection-specific reports (WCAG, analytics, data reports).

  • web: Web server configuration and logs. Also includes the macro libraries and log files related to the modern UI.

  • log: Global log files.

  • bin: Funnelback binaries.

  • lib: Funnelback libraries.

  • share: Shared files such as custom collection templates, security plugins and language files.

  • wbin / linbin: OS-specific helper binaries. linbin contains Linux binaries, wbin contains Windows binaries.

1.3. Funnelback configuration folders

Note: all paths below are relative to $SEARCH_HOME/conf/.

  • <COLLECTION-ID>: Collection-specific configuration files.

  • <COLLECTION-ID>/@workflow: Collection-specific workflow scripts. Note: this folder does not exist by default.

  • <COLLECTION-ID>/@groovy: Collection-specific Groovy filters. Note: this folder does not exist by default.

  • <COLLECTION-ID>/<PROFILE-ID> and <COLLECTION-ID>/<PROFILE-ID>_preview: Collection profile/service folders. These exist in pairs and correspond to the presentation live and preview folders in the file manager. They contain profile/service-specific configuration including templates, best bets and synonyms. Note: only folders for the default profile will exist by default.

  • <COLLECTION-ID>/<PROFILE-ID>/web and <COLLECTION-ID>/<PROFILE-ID>_preview/web: Web resources folders. Used for storage of custom static web resources (such as CSS, JS and image files referenced locally from templates). Corresponding web path: http://funnelback_server/s/resources/<COLLECTION-ID>/<PROFILE-ID>/
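For example, a stylesheet stored and published at <COLLECTION-ID>/<PROFILE-ID>/web/custom.css (custom.css being a hypothetical file name used purely for illustration) would be expected to be served from a URL of the form:

http://funnelback_server/s/resources/<COLLECTION-ID>/<PROFILE-ID>/custom.css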

1.4. Funnelback reporting folders

Note: all paths below are relative to $SEARCH_HOME/admin/.

  • reports/<COLLECTION-ID>: Analytics and collection update history database files.

  • data_report/<COLLECTION-ID>: Data reports (web collections only).

1.5. Funnelback data folders

Note: all paths below are relative to $SEARCH_HOME/data/.

  • <COLLECTION-ID>: Cached data and log files for the collection.

  • <COLLECTION-ID>/archive: Historic query and click logs used for analytics generation.

  • <COLLECTION-ID>/live and <COLLECTION-ID>/offline: Live and offline views. The live view contains all files relating to the currently live search index. The offline view contains all files relating to the previously live search index OR the currently running update OR the last failed update. Note: live and offline are symbolic links to the folders one and two.

  • <COLLECTION-ID>/one and <COLLECTION-ID>/two: The physical folders that contain the live and offline views and are the targets of the live and offline symbolic links.

  • <COLLECTION-ID>/log: Collection-level update log files and modern UI logs. Location of the top level update-<COLLECTION-ID>.log file.

  • <COLLECTION-ID>/databases: Database history files for the accessibility auditor.

  • <COLLECTION-ID>/live/data and <COLLECTION-ID>/offline/data: Live/offline cached data. Used when building indexes, and for supplying cached pages.

  • <COLLECTION-ID>/live/idx and <COLLECTION-ID>/offline/idx: Live/offline index files.

  • <COLLECTION-ID>/live/log and <COLLECTION-ID>/offline/log: Live/offline view log files. Detailed logs for the live and offline views. Includes crawl and index logs.

  • <COLLECTION-ID>/live/databases and <COLLECTION-ID>/offline/databases: Live/offline view database files. Includes the web crawler database (where appropriate), content auditor summaries and recommender databases (where enabled).

  • <COLLECTION-ID>/knowledge_graph
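Because live and offline are symbolic links, you can check which of the physical folders (one or two) currently holds the live view by listing the collection's data folder. For example, for the simpsons collection used later in this course (the exact link targets on your VM may differ, as they alternate when views are swapped):

ls -l $SEARCH_HOME/data/simpsons/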

Exercise 1: Log in to Funnelback via SSH
  1. Open up a terminal window. Log in to the Funnelback VM via SSH, using the settings outlined in the Funnelback server SSH access section above. This logs you in to Funnelback as the search user.

    When working on Funnelback (under Linux) you should always work as the search user (or a user that belongs to the search group). Under Windows you should have a remote desktop session to your Funnelback server with an appropriate user account.
  2. Change to Funnelback’s home folder and list the directory contents:

    [search@localhost ~]$ cd $SEARCH_HOME
    [search@localhost funnelback]$ ls
    admin  conf       lib     log       share   uninstall      web
    bin    data       linbin  run       tools   uninstall.dat
    cache  databases  local   services  tuning  VERSION
  3. Change to the configuration folder and list the directory contents:

    [search@localhost funnelback]$ cd conf
    [search@localhost conf]$ ls
    1524-conf2.tar.gz                mcf-properties.xml
    1524-conf.tar.gz                 metadata-mapping.cfg
    admin-ui.ini.default             metadata-mapping.cfg.dist
    airports                         metamap.cfg.content-auditor
    auto-completion                  meta-names.xml.default
    beatles                          neo4j.conf
    collection.cfg                   neo4j.password
    collection.cfg.default           predirects.cfg
    common-misspellings.txt.default  predirects.cfg.dist
    connector.xsl.default            push-collection
    contextual_navigation.cfg        realm.properties
    contextual_navigation.cfg.bak    redis.conf
    contextual_navigation.cfg.dist   redis.conf.default
    executables.cfg                  redis.conf.dist
    executables.linux                redis.conf.example
    file-manager.ini                 redis.password.conf.default
    file-manager.ini.dist            reporting-stop-words.cfg.default
    foodista                         rss.ftl
    funnelback_documentation         sample-user.ini.dist
    funnelback-search                search.cron
    funnelback-website               silent-films
    funnelback-youtube               simple.cache.ftl
    global.cfg                       simple.cache.ftl.dist
    global.cfg.default               simple.ftl
    inventors                        simple.ftl.dist
    keyset                           simpsons
    log4j2.xml.default               textify.cfg.default
    matrix.ftl                       update-statistics.sqlitedb.dist
    matrix.ftl.dist                  xml-index.cfg
    mcf-connectors.xml               xml-index.cfg.dist
    mcf-logging.ini
  4. Change to the inventors folder and list the directory contents:

    [search@localhost conf]$ cd inventors/
    [search@localhost inventors]$ ls
    collection-20191106033213.cfg  collection.state
    collection-20191106235850.cfg  _default
    collection-20191107000033.cfg  _default_preview
    collection-20191107033938.cfg  metadata-mapping.cfg
    collection-20191113014714.cfg  wikipedia
    collection.cfg                 wikipedia_preview
    collection.cfg.start.urls      xml-index.cfg
  5. Log in to the administration interface and switch to the inventors collection.

  6. Select browse collection configuration files from the administer tab. Compare the listing of files displayed in the file manager to the listing that you have in your SSH session. Explore the _default and _default/web folders and compare with the file manager listing in your browser.

    exercise log in to funnelback via ssh 01
  7. Click on the edit configuration files link. This opens up the new configuration screen which lists the rest of the configuration files. This screen will ultimately replace the previous screen once all the different configuration file types have been transitioned into the new system. The configuration files managed via the new screen can also be accessed and managed via WebDAV.

    exercise log in to funnelback via ssh 02
  8. Change to the data folder and view the contents:

    [search@localhost inventors]$ cd $SEARCH_HOME/data
    [search@localhost data]$ ls
    airports  funnelback_documentation  funnelback-youtube  silent-films
    beatles   funnelback-search         inventors           simpsons
    foodista  funnelback-website        push-collection
  9. Change to the inventors subfolder and explore its contents.

  10. Compare the log subfolders with the log view in the administration interface for the inventors collection (administer tab, view logs).

2. Workflow

2.1. Update cycle

Recall that when a (non-push) collection updates, a series of different update phases is stepped through.

update cycle 01

The main update phases were discussed previously - there are some additional phases that perform maintenance tasks.

update cycle 02

The full update cycle consists of the following update phases:

  1. gather and filter: this is the process used by Funnelback to connect to the data source and gather all the content. Most collection types also filter the content as part of this process.

  2. index: this is the process of building the indexes based on the filtered content.

  3. reporting: Creates data and broken link reports for web collections.

  4. swap: performs a sanity check on the index comparing the number of documents in the new and previous index. If the check passes the new index is published to live.

  5. meta_dependencies: performs a number of tasks to ensure that any meta collections to which this collection belongs are updated. E.g. builds spelling and auto-completion indexes for the meta collection using the updated data.

  6. archive: archives log files

Knowledge graph and analytics are updated independently of the collection update cycle.

2.2. Workflow commands

Funnelback provides a workflow mechanism that allows a series of commands to be executed before and after each update phase that occurs during an update.

This provides a huge amount of flexibility and allows a lot of additional tasks to be performed as part of the update.

These commands can be used to pull in external content, connect to external APIs, manipulate the content or index, or generate output.

Effective use of workflow commands requires a good understanding of the Funnelback update cycle and what happens at each stage of the update. It also requires an understanding of how Funnelback is structured as many of the workflow commands operate directly on the file system.

Each phase in the update has a corresponding pre-phase and post-phase command.

The exact commands that are available will depend on the type of collection. The most commonly used workflow commands are:

  1. Pre gather: This command runs immediately before the gather process begins. Example uses:

    • Pre-authenticate and download a session cookie to pass to the web crawler.

    • Download an externally generated seed list

  2. Pre index: This command runs immediately before indexing starts. Example uses:

    • Download and validate external metadata from an external source

    • Connect to an external REST API to fetch XML content

    • Download and convert a CSV file to XML for a local collection

  3. Post index: This command runs immediately after indexing completes. Example uses:

    • Generate a structured auto-completion CSV file from metadata within the search index and apply this back to the index

  4. Post swap: This command runs immediately after the indexes are swapped. Example uses:

    • Publish index files to a multi-server environment

  5. Post update: this command runs immediately after the update is completed. Example uses:

    • Perform post update cleanup operations such as deleting any temporary files generated by earlier workflow commands

Workflow is not supported when using push collections.

2.2.1. Specifying workflow commands

Workflow commands are specified as collection configuration settings on the edit collection configuration screen.

Each workflow command has a corresponding configuration option. These options take the form pre_<phase>_command or post_<phase>_command - for example: pre_gather_command, post_index_command and post_update_command.

Each of these configuration options takes a single value, which is the command to run.

The workflow command can be one or more system commands (commands that can be executed from a command line such as bash on Linux, or PowerShell or the cmd prompt on Windows). These commands operate on the underlying filesystem and will run as the user that runs the Funnelback service - usually the search user (Linux) or a service account (Windows).

Care needs to be exercised when using workflow commands as it is possible to execute destructive code.
When running commands that interact with Funnelback indexes (either via an index stem, or by calling the search via http) ensure that the relevant view is being queried. Use $CURRENT_VIEW when the view needs to match the context of the update - this will be the case most of the time, unless the command must always operate on a specific (live or offline) view, in which case the view can be hardcoded.

The command that is specified can contain the following collection configuration variables, which should generally be used to assist with collection maintenance:

  1. $SEARCH_HOME: this expands to the home folder for the installation. (e.g. /opt/funnelback or d:\funnelback)

  2. $COLLECTION_NAME: this expands to the collection’s ID.

  3. $CURRENT_VIEW: this expands to the current view (live or offline) depending on which phase is currently running and the type of update. This is particularly useful for commands that operate around the gather and index phases, as $CURRENT_VIEW can change depending on the type of update that is running. If you run a re-index of the live view, $CURRENT_VIEW is set to live for all the phases. For other update types, $CURRENT_VIEW is set to offline for all workflow commands before a swap takes place. The view being queried by a search query can be specified by setting the view CGI parameter (e.g. &view=offline).

  4. $GROOVY_COMMAND: this expands to the binary path for running groovy on the command line.

If multiple commands are required for a single workflow step then it is best practice to create a shell script (Linux) or batch or PowerShell file (Windows) that scripts all the commands, taking the relevant configuration variables as parameters so that you can make use of the values from within your script.
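As a minimal sketch of this wrapper-script approach (assuming a pre_index.sh script stored in the collection's @workflow folder, similar to the one built in exercise 3 below), the corresponding collection configuration setting might look like:

pre_index_command=$SEARCH_HOME/conf/$COLLECTION_NAME/@workflow/pre_index.sh -c $COLLECTION_NAME -v $CURRENT_VIEW

Passing the configuration variables as script arguments means their values can be used from within the script without hardcoding the collection ID or view.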

Exercise 2: Download an XML file for a local collection
  1. Log into the administration interface and create a new local collection. A local collection creates an index from a set of files in a folder on the local filesystem.

  2. Enter the following attributes into the collection creation screen and click create collection:

    • Project group ID: Training 203

    • Collection ID: nobel-prize-winners

    • Collection type: local

    exercise download an xml file for a local collection 01
  3. Complete the initial collection setup by setting the following configuration parameters:

    • service_name: Nobel Prize Winners

    • data_root: /opt/funnelback/data/nobel-prize-winners/offline/data/

    exercise download an xml file for a local collection 02
  4. Add pre-index workflow by adding a pre_index_command configuration setting. Set this to run a curl command that downloads the nobel.xml file and saves it to the local training data folder.

    • Parameter name: pre_index_command

    • Value: curl --connect-timeout 60 --retry 3 --retry-delay 20 'http://training-search.clients.funnelback.com/training/training-data/nobel/nobel.xml' -o $SEARCH_HOME/data/$COLLECTION_NAME/offline/data/nobel.xml || exit 1

    When using external commands such as curl, it is best practice to apply appropriate command arguments to make the call as reliable as possible, and to catch any errors.
  5. An indexer setting needs to be set that disables the URL exclusion check, because the local collection is indexing files stored beneath the Funnelback home folder, which are assigned local file path URLs containing the home folder path.

    When working with local collections it is common to index files that are stored in a folder that exists under the Funnelback home folder.

    If these files are not XML, or if the document url value in the XML processing settings is not set, then a local file path URL containing the home folder in the path will be assigned.

    Funnelback includes a built in security setting to prevent the indexing of documents that include the home folder anywhere in the path - primarily to avoid the indexing of configuration files.

    This setting needs to be turned off to ensure the files for the local collection are indexed.

  6. Edit the collection configuration then select indexer from the left hand menu and add the following to the indexer_options:

    -check_url_exclusion=off
    exercise download an xml file for a local collection 03
  7. Save the changes then update the collection.

  8. After the update completes run a search for nobel - you should have a single result returned. We only have a single result (pointing to the complete XML file) as XML mappings have not yet been configured.

    exercise download an xml file for a local collection 04
  9. View the update log by selecting browse log files from the administer tab in the administration interface and observe the output from the workflow in the collection’s update log.

    Output from workflow commands is always written to the top-level update-<COLLECTION-ID>.log in the collection logs section for the collection that is being updated.
  10. Return to the administration interface and select browse log files from the administer tab. Inspect the update-nobel-prize-winners.log and observe the messages produced by the curl command.

    exercise download an xml file for a local collection 05
  11. Configure the collection to handle the XML file splitting. (Administration interface, administer tab, XML processing):

    Property                 Source

    XML document splitting   /tsvdata/row

  12. Configure the XML field to metadata class mappings. (Administration interface, administer tab, configure metadata mappings):

    Class name     Source                        Type   Search behaviour

    year           /tsvdata/row/Year             text   searchable as content
    category       /tsvdata/row/Category         text   searchable as content
    name           /tsvdata/row/Name             text   searchable as content
    birthDate      /tsvdata/row/Birthdate        text   display only
    birthPlace     /tsvdata/row/Birth_Place      text   searchable as content
    country        /tsvdata/row/Country          text   searchable as content
    residence      /tsvdata/row/Residence        text   searchable as content
    roleAffiliate  /tsvdata/row/Role_Affiliate   text   searchable as content
    fieldLanguage  /tsvdata/row/Field_Language   text   searchable as content
    prizeName      /tsvdata/row/Prize_Name       text   searchable as content
    motivation     /tsvdata/row/Motivation       text   searchable as content

  13. Rebuild the index by running an advanced update and selecting reindex the live view.

  14. Configure the display options so that relevant metadata is returned with the search results. Add the following display options (hint: administer tab, edit collection configuration, then interface tab and query processor options):

    -SF=[year,category,name,birthDate,birthPlace,country,residence,roleAffiliate,fieldLanguage,prizeName,motivation]
  15. Edit the default template to return metadata for each search result. Replace the contents of the <@s.Results> tag with the following then save and publish the template:

              <@s.Results>
                <#if s.result.class.simpleName != "TierBar">
                  <li data-fb-result=${s.result.indexUrl}>
                    <h4>${s.result.metaData["prizeName"]!} (${s.result.metaData["year"]!})</h4>
                    <ul>
                      <li>Winner: ${s.result.metaData["name"]!}</li>
                      <li>Born: ${s.result.metaData["birthDate"]!}, ${s.result.metaData["birthPlace"]!}, ${s.result.metaData["country"]!}</li>
                      <li>Role / affiliate: ${s.result.metaData["roleAffiliate"]!}</li>
                      <li>Prize category: ${s.result.metaData["category"]!}</li>
                      <li>Motivation: ${s.result.metaData["motivation"]!}</li>
                    </ul>
                  </li>
                </#if>
              </@s.Results>
  16. Rerun a search for nobel to confirm that the XML is split into individual records and that the metadata is correctly returned.

    exercise download an xml file for a local collection 06
Exercise 3: Download and convert a TSV file for a local collection

This exercise is almost identical to the previous exercise, except that a tab delimited file will be used as the data source and a local script to convert the file to XML will run as part of the update. The end result should be identical to the previous exercise.

In this exercise you will also work exclusively from the command line to make the necessary changes.

  1. Log in to the Funnelback server via SSH as the search user. (password: se@rch)

    [ ~] ssh search@localhost -p 2222
  2. Change to the configuration folder for the nobel-prize-winners collection

    [search@localhost ~] cd $SEARCH_HOME/conf/nobel-prize-winners
  3. Create a new folder @workflow under the collection’s configuration folder. All workflow scripts should be stored under a folder named @workflow.

    [search@localhost nobel-prize-winners]$ mkdir @workflow
  4. Create a new shell script called pre_index.sh

    [search@localhost nobel-prize-winners]$ cd \@workflow/
    [search@localhost @workflow]$ nano pre_index.sh
  5. Add the following code to the file then save. This script downloads a tab delimited file and converts it to XML.

    #!/bin/bash
    # Pre index workflow script
    
    # Allow the collection.cfg variables to be passed in and made available within this script.
    # Pass the variables in as -c $COLLECTION_NAME -g $GROOVY_COMMAND -v $CURRENT_VIEW
    while getopts ":c:g:v:" opt; do
      case $opt in
        c) COLLECTION_NAME="$OPTARG"
        ;;
        g) GROOVY_COMMAND="$OPTARG"
        ;;
        v) CURRENT_VIEW="$OPTARG"
        ;;
        \?) echo "Invalid option -$OPTARG" >&2
        ;;
      esac
    done
    
    # empty the offline data folder
    rm -f $SEARCH_HOME/data/$COLLECTION_NAME/offline/data/*
    
    # download the TSV data file, catching any errors.
    curl --connect-timeout 60 --retry 3 --retry-delay 20 'http://training-search.clients.funnelback.com/training/training-data/nobel/nobel.tsv' -o $SEARCH_HOME/data/$COLLECTION_NAME/offline/data/nobel.tsv || exit 1
    
    # convert the TSV file to XML
    $SEARCH_HOME/share/training/build/resources/main/training-bin/tsv2xml.pl $SEARCH_HOME/data/$COLLECTION_NAME/offline/data/nobel.tsv
    
    # remove the TSV file
    rm -f $SEARCH_HOME/data/$COLLECTION_NAME/offline/data/nobel.tsv
    The tsv2xml.pl script used in this example is not the best way to index tsv or csv data and is presented here as a simple workflow example. It is not recommended for use in production.
  6. Make the pre_index.sh file executable.

    [search@localhost @workflow] chmod 775 pre_index.sh
  7. Edit the collection.cfg file. This file holds all the settings that are defined on the edit collection configuration screen.

    [search@localhost @workflow] cd ..
    [search@localhost nobel-prize-winners] nano collection.cfg
  8. Update the pre_index_command to the following:

    pre_index_command=$SEARCH_HOME/conf/$COLLECTION_NAME/@workflow/pre_index.sh -c $COLLECTION_NAME
  9. Run the following command to update the collection:

    [search@localhost nobel-prize-winners]$ /opt/funnelback/bin/update.pl /opt/funnelback/conf/nobel-prize-winners/collection.cfg
    Running update.pl without arguments provides usage information and access to the advanced update modes.
  10. The update will start and you will receive a shell prompt once the update is complete. After the update completes run a search - you should see a similar result to before. Note that the XML is split and formatted because this was all configured in the previous exercise.

    exercise download and convert a tsv file for a local collection 01
  11. View the collection’s update log to view the workflow command output. Observe that the log output includes the output of the commands that were specified in the pre_index.sh script.

    [search@localhost nobel-prize-winners]$ cd /opt/funnelback/data/nobel-prize-winners/log/
    [search@localhost log]$ less update-nobel-prize-winners.log
    Detailed log: /opt/funnelback/data/nobel-prize-winners/one/log/update.log
    Detailed output is being logged to: /opt/funnelback/data/nobel-prize-winners/one/log/update.log
    Started at: Wed, 13 Nov 2019 22:39:41 GMT
    Starting: Executing collection update cycle.
    Executing collection update phases. (UpdateCollectionStep)
      Starting: Runs the update cycle steps
      Recording update start time. (RecordUpdateStartTime)
             completed in 0.171s
      Phase: Gathering content. (GatherPhase)
             skipped because gathering has been disabled with collection config option: gather, in 0.0s
      Running pre_reporting_command to provide backward compatibility for fetching remote logs. (PreReportCommand)
        Starting: Running pre_reporting_command to provide backward compatibility for fetching remote logs.
        Executes the command set in the collection.cfg option: 'pre_report_command' (PrePostStep)
             skipped because 'pre_report_command' is not set., in 0.1s
        Finished 'Running pre_reporting_command to provide backward compatibility for fetching remote logs.', in 0s
             completed in 0.50s
      Phase: Archiving query logs. (ArchivePhase)
        Starting: Phase: Archiving query logs.
        Executes the command set in the collection.cfg option: 'pre_archive_command' (PrePostStep)
             skipped because 'pre_archive_command' is not set., in 0.0s
        Archiving query logs. (ArchiveLiveViewLogs)
             completed in 0.8s
        Executes the command set in the collection.cfg option: 'post_archive_command' (PrePostStep)
             skipped because 'post_archive_command' is not set., in 0.0s
        Finished 'Phase: Archiving query logs.', in 0s
             completed in 0.137s
      Phase: Indexing content. (IndexPhase)
        Starting: Index pipeline for collection: nobel-prize-winners
        Executes the command set in the collection.cfg option: 'pre_index_command': /opt/funnelback/conf/nobel-prize-winners/@workflow/pre_index.sh -c nobel-prize-winners (PrePostStep)
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    ^M  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0^M100  303k  100  303k    0     0  24.1M      0 --:--:-- --:--:-- --:--:-- 24.7M
    TSV2XML: Processing files at Wed Nov 13 22:39:42 2019...
    TSV2XML: processing data file: /opt/funnelback/data/nobel-prize-winners/offline/data/nobel.tsv
    TSV2XML: Finished processing files at Wed Nov 13 22:39:42 2019
             completed in 0.214s
        Configuring permission on query lock file for impersonation. (SetWindowsPermissionsOnLock)
             skipped because not running on windows., in 0.0s
        Creating duplicate and redirects database. (CreateDupredrexAndRedirectsDB)
             completed in 0.231s
        Processing click logs for ranking improvements. (ClickLogs)
             completed in 0.99s
        Assembling metadata mappings. (AssembleMetadataMapping)
             completed in 0.318s
        Working out how many documents can be indexed to stay within license limits. (SetMaxDocumentsThatCanBeIndexed)
             completed in 0.192s
        Indexing documents. (Index)
             completed in 0.334s
        Indexing documents. (ParallelIndex)
             skipped because config option:'collection-update.step.ParallelIndex.run' was set to 'false'., in 0.1s
        Killing fully-matching documents. (ExactMatchKill)
             skipped because kill_exact.cfg does not exist, in 0.0s
        Killing partially-matching documents. (PartialMatchKill)
             skipped because kill_partial.cfg does not exist, in 0.1s
        Processing annotations for the collection. (AnnieAPrimaryCollection)
             completed in 0.17s
        Setting Gscopes. (SetGscopes)
             skipped because gscopes.cfg does not exist, in 0.1s
        Setting Query Gscopes. (SetQueryGscopes)
             skipped because query-gscopes.cfg does not exist, in 0.0s
        Applying gscopes derived from Faceted Navigation. (FacetBasedGscopes)
             skipped because no URL Pattern based facet categories., in 0.37s
        Building result collapsing signatures file. (BuildCollapsingSignatures)
             completed in 0.23s
        Creating spelling suggestions. (BuildSpelling)
             completed in 0.36s
        Applying Query Independent Evidence. (QueryIndependentEvidenceCollectionLevel)
             completed in 0.17s
        Building query completion index. (BuildAutoCompletion)
             completed in 0.105s
        Carry over WCAG summary file (CarryOverWCAGSummary)
             skipped because Previous summary file didn't exist, in 0.5s
        Generating Content Auditor profile summaries (ContentAuditorSummary)
             completed in 0.605s
        Creating broken links report. (CreateBrokenLinksReportStep)
             skipped because broken link reports are only generated for web collections, in 0.0s
        Setting the time stamp on the index and data. (AddTimeStamp)
             completed in 0.13s
        Executes the command set in the collection.cfg option: 'post_index_command' (PrePostStep)
             skipped because 'post_index_command' is not set., in 0.0s
        Finished 'Index pipeline for collection: nobel-prize-winners', in 2s
             completed in 2s
      Creating recommendations. (BuildRecommenderDBPhase)
             skipped because recommender=false, in 0.16s
      Deleting content from store and index for instant delete. (InstantDeletePhase)
             skipped because Instant delete is not run in this update type, in 0.1s
      Phase: Swapping views to make the new index live. (SwapViewsPhase)
        Starting: Switching to the newly crawled data and index.
        Executes the command set in the collection.cfg option: 'pre_swap_command' (PrePostStep)
             skipped because 'pre_swap_command' is not set., in 0.0s
        Checking the total number of documents indexed is sufficient. (ChangeOverIndexCountCheck)
             completed in 0.3s
        Record Accessibility Auditor historical data  (RecordAccessibilityAuditorHistoryForSwapViews)
             skipped because Accessibility Auditor history has already been recorded (if it was enabled)., in 0.1s
        Switching to the new index. (SwapViews)
             completed in 0.7s
        Executes the command set in the collection.cfg option: 'post_swap_command' (PrePostStep)
             skipped because 'post_swap_command' is not set., in 0.0s
        Finished 'Switching to the newly crawled data and index. ', in 0s
             completed in 0.77s
      Phase: Updating Meta parent collections (spelling, auto completion, etc). (MetaDependenciesPhase)
        Starting: Update meta parents of: nobel-prize-winners
        Executes the command set in the collection.cfg option: 'pre_meta_dependencies_command' (PrePostStep)
             skipped because 'pre_meta_dependencies_command' is not set., in 0.0s
        Update the parent meta collections. (UpdateMetaParents)
             completed in 0.2s
        Executes the command set in the collection.cfg option: 'post_meta_dependencies_command' (PrePostStep)
             skipped because 'post_meta_dependencies_command' is not set., in 0.0s
        Finished 'Update meta parents of: nobel-prize-winners', in 0s
             completed in 0.48s
      Phase: Archiving query logs. (ArchivePhase)
        Starting: Phase: Archiving query logs.
        Executes the command set in the collection.cfg option: 'pre_archive_command' (PrePostStep)
             skipped because 'pre_archive_command' is not set., in 0.0s
        Archiving query logs. (ArchiveOfflineViewLogs)
             completed in 0.0s
        Executes the command set in the collection.cfg option: 'post_archive_command' (PrePostStep)
             skipped because 'post_archive_command' is not set., in 0.0s
        Finished 'Phase: Archiving query logs.', in 0s
             completed in 0.35s
      Running post updates tasks. (PostUpdateTasks)
        Starting: Running post update tasks
        Executing post update hook scripts. (PostMetaUpdateHookScripts)
             skipped because not a meta collection, in 0.0s
        Triggering post update command. (PostUpdateCommand)
             skipped because No post update command specified, in 0.0s
        Finished 'Running post update tasks', in 0s
             completed in 0.40s
      Recording update finish time. (RecordUpdateFinishTime)
             completed in 0.28s
      Recording update metrics. (RecordMetrics)
             completed in 0.0s
      Sending success email. (SendSuccessfulUpdateMessage)
             skipped because config option mail.on_failure_only=true, in 0.0s
      Marking this update as a success. (SetUpdateSuccessfulState)
             completed in 0.3s
      Finished 'Runs the update cycle steps', in 3s
             completed in 3s
    Finished 'Executing collection update cycle.', in 3s
    Finished at: Wed, 13 Nov 2019 22:39:45 GMT
    
    Update: Exit Status: 0

3. Introduction to administration users

Access to the Funnelback administration interface is controlled via administration interface user accounts.

The administration interface user accounts are Funnelback-specific accounts (LDAP/domain accounts are not currently supported).

Funnelback administration users have access to various administration functions depending on the set of permissions applied to the user. Sets of permissions can be grouped into roles and applied to a user. This allows default permission sets to be defined and reused across many users.

Funnelback ships with five default permission sets:

  • default-administrator: full access to all available permissions.

  • default-analytics: read only access to analytics.

  • default-marketing: provides access to functionality available via the marketing dashboard.

  • default-support: provides access to functionality suitable for users in a support role.

  • default-implementer: provides access to functionality suitable for most implementers.

Roles can also be used to define sets of resources (such as collections, profiles and licences) that may be shared by multiple users.

When creating a user it is best practice to separate the resources and permissions into different roles and combine these. This allows you to create a role that groups the collections, profiles and licences available for a set of users and combine this with one or more roles that define the permission sets that apply for the user.

Exercise 4: Create a user

This exercise assumes that any customised roles will already exist on the server.

  1. Log in to the administration interface using the admin user account

  2. Select manage users and roles from the system menu.

    exercise create a user 01
  3. Click the manage users button.

    exercise create a user 02
  4. Create a user with the name search-analytics. Click on the add new button.

  5. On the general information screen, enter basic user details, then click the Next: create password button.

    • Username: search-analytics

    • Full name: Search Analytics

    • Email: analytics@localhost

      exercise create a user 03
  6. Enter searchanalyticsarecool as the password then click the Next: what roles does this user belong to? button.

    exercise create a user 04
  7. Select the appropriate roles for the user. Roles are sets of permissions that are granted allowing the user to have access to different sets of functionality within Funnelback. Select the default-analytics and resources-foodista roles. Observe how the derived permissions change as the roles are turned on and off.

    exercise create a user 05
  8. The next screen allows you to assign roles that can manage this user. We’ll skip this for now. Click on the review user button to see a summary of the user setup.

    exercise create a user 06
  9. Click the create user button to finish the user setup. The different panels on the screen allow you to edit the various permissions assigned to the user.

    exercise create a user 07
  10. Verify that the user has been created by checking that the user is listed on the manage users screen. Select manage users from the breadcrumb trail.

    exercise create a user 08
  11. Change to the search-analytics user. Log out of the administration interface by selecting logout from the user menu at the top right hand corner of the window. Log back in as the search-analytics user with the password searchanalyticsarecool. You will be logged in to the marketing dashboard.

  12. Observe that the username displayed on the user menu changes from Sample Super User to Search Analytics and that the user only has access to the foodista collection. Click on the foodista tile and observe that only the analytics and reporting functions are available.

  13. Switch back to the admin user. (username and password are both admin).

4. Funnelback PADRE binaries

Funnelback’s core is made up of a number of programs that build and interact with the collection indexes. This suite of programs is called PADRE (PArallel Document Retrieval Engine).

PADRE binaries are stored within the Funnelback binaries folder ($SEARCH_HOME/bin) and can be called via the command line and from collection update workflow commands. Running a padre command without arguments will return usage information. The main padre binaries are:

  • build_autoc: the program used for building auto-completion indexes.

  • padre-di: the program for showing stored document information, including metadata.

  • padre-fl: the program for adjusting flag bits. It is most commonly used to mark documents in the index as being duplicates or deleted.

  • padre-iw: the indexing program. It reads the text files gathered during an update and creates the search indexes.

  • padre-qs: the program used to return auto-completion suggestions from the index.

  • padre-sw: the search program. It parses the CGI parameters and executes the user’s query, returning XML results.

Most padre binaries require an argument called the index stem. The index stem is the file prefix of the set of index files. This is normally $SEARCH_HOME/data/<COLLECTION-ID>/<live or offline>/idx/index - all the index files are contained within that folder and are prefixed with index (e.g. index.bldinfo, index.urls).

Be careful to use the correct view (live or offline) when specifying the index_stem. This will depend on which version of the index needs to be modified (e.g. offline is normally used when you are running an update, but live is needed if you wish to manually modify a live index).
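For example, to print a summary of the document flags for the live view of the simpsons collection (a command that is also used in a later exercise), the live index stem is passed to padre-fl:

/opt/funnelback/bin/padre-fl /opt/funnelback/data/simpsons/live/idx/index -sumry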
Exercise 5: Manually apply gscopes to an index
  1. Log in to the administration interface and switch to the simpsons collection

  2. Create a gscopes.cfg containing the following. The first column contains a gscope ID and the second column contains a URL pattern.

    episodes /episodes/
    articles /articles/
    characters /characters/
    reviews /reviews/
  3. Run padre-gs on the command line without any arguments to see usage instructions:

    [search@localhost simpsons]$ /opt/funnelback/bin/padre-gs
    Purpose: Display or manipulate document gscopes in an index.
    
    Usage0: /opt/funnelback/bin/padre-gs -v|-V|-help   # print version info or detailed help
                  on types of instructions and on program operation.
    Usage1: /opt/funnelback/bin/padre-gs index_stem -clear   # clear all gscopes
    
    Usage2: /opt/funnelback/bin/padre-gs index_stem -show   # show all gscopes
    
    Usage3: /opt/funnelback/bin/padre-gs index_stem file_of_instructions [-separate] [other_gscope]
            [-regex|-url|-docnum] [-verbose] [-quiet] [-dont_backup]
    
    Where:
          * index_stem may also be the name of a collection
          * file_of_instructions may be '-' to accept instructions from stdin
          * -separate indicates that gscope changes should be made to a
            copy of the .dt file first, and then copied over the original file
            when changes are complete. In this mode the number of gscope bits
            can NOT be expanded you will be required to ensure enough is available.
          * other_gscope specifies a gscope to be set on documents which
            end up with no gscopes set.
          * By default instruction patterns are expected to be regexes
            but this may be made explicit with -regex or altered with -url
            or -docnum.  Use /opt/funnelback/bin/padre-gs -help to obtain more information about
            instruction formats and pattern types.
          * gscope names may consist of alphanumeric ascii characters up to a length
            of 64 characters.
          * -dont_backup prevents backing up of the .dt file
          * -quiet don't show the before and after summary of gscopes
  4. Clear any existing gscopes on the live simpsons index by running the following command. Generally it is good practice to clear existing gscopes before applying gscopes to an index unless you wish to add to an existing set. padre-gs will leave existing gscopes when applying new ones.

    [search@localhost simpsons]$ /opt/funnelback/bin/padre-gs /opt/funnelback/data/simpsons/live/idx/index -clear
    Making backup: /bin/cp /opt/funnelback/data/simpsons/live/idx/index.dt /opt/funnelback/data/simpsons/live/idx/index.dt_bak
    
    ---------------- Initial Summary ------------
    No. docs: 911
    Bit no.             Count  GscopeName
    -------------------------------------------
    All gscopes cleared on all documents.
    
    ---------------- Final Summary ------------
    No. docs: 911
    Bit no.             Count  GscopeName
    -------------------------------------------
  5. Apply the gscope definitions from gscopes.cfg by running the following command and note the bit counts in the initial and final summaries:

    [search@localhost simpsons]$ /opt/funnelback/bin/padre-gs /opt/funnelback/data/simpsons/live/idx/index /opt/funnelback/conf/simpsons/gscopes.cfg
    Making backup: /bin/cp /opt/funnelback/data/simpsons/live/idx/index.dt /opt/funnelback/data/simpsons/live/idx/index.dt_bak
    
    ---------------- Initial Summary ------------
    No. docs: 911
    Bit no.             Count  GscopeName
    -------------------------------------------
    Loading instructions from /opt/funnelback/conf/simpsons/gscopes.cfg...
    Pattern[0] ("episodes") (Bit: 0) /episodes/
    Pattern[1] ("articles") (Bit: 1) /articles/
    Pattern[2] ("characters") (Bit: 2) /characters/
    Pattern[3] ("reviews") (Bit: 3) /reviews/
    Patterns loaded: 4
    
    ---------------- Final Summary ------------
    No. docs: 911
    Bit no.             Count  GscopeName
              0       492      episodes
              1        26      articles
              2        19      characters
              3       373      reviews
    -------------------------------------------

    This shows that before the gscopes were applied there were no existing gscopes present in the index. Four gscope definitions were loaded from gscopes.cfg and after padre-gs ran 492 documents were tagged with a gscope name of episodes, 26 documents were tagged with gscope name of articles and so on.

    Running padre-gs adds the matching gscope definitions to those already present in the index - if you need to cleanly apply gscopes then existing gscopes will need to be cleared before the new definitions are applied. This behaviour also means you can run padre-gs multiple times with different gscopes.cfg files to build up the final set of applied gscope rules.
  6. Test that gscopes have been applied by running a search against the simpsons collection specifying the gscope1 CGI parameter (e.g. &gscope1=episodes will limit the search results to episode pages). http://training-search.clients.funnelback.com/s/search.html?collection=simpsons&profile=_default_preview&gscope1=episodes&query=!showall

    exercise manually apply gscopes to an index 01
Exercise 6: Manually kill documents from an index
  1. Run a query against the simpsons collection for simpsons crazy. Observe that the second result is the home page. We will remove this from the index.

    exercise manually kill documents from an index 01
  2. Log in to the administration interface and switch to the simpsons collection

  3. Create a kill_exact.cfg containing the following

    www.simpsoncrazy.com/index
  4. Run padre-fl on the command line without any arguments to see usage instructions:

    [search@localhost simpsons]$ /opt/funnelback/bin/padre-fl
    Purpose: Display or operate on the document flags in an index.
    
    Usage1: /opt/funnelback/bin/padre-fl <index_stem> [-clearall|-clearbits|-clearkill|-killall|-show|-sumry|-quicken]
    
    Usage2: /opt/funnelback/bin/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -unkill
    
    Usage3: /opt/funnelback/bin/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -kill
    
    Usage4: /opt/funnelback/bin/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -bits hexbits OR|AND|XOR
    
    Usage5: /opt/funnelback/bin/padre-fl <index_stem> -kill-docnum-list <file_of_docnums>
    
    Usage6: /opt/funnelback/bin/padre-fl -v
    
    Note: Specify '-' as the file of url patterns to supply a single URL to standard input.
    Running padre-fl sets the matching document flags within the index without resetting any existing flags. As with gscopes, you may need to reset the document flags or unkill previously killed documents if you need the definitions to be freshly applied.
  5. View a summary of the current document flags - run padre-fl on the live index of the simpsons collection specifying the -sumry option. Observe that the index does not contain any killed documents:

    [search@localhost simpsons]$ /opt/funnelback/bin/padre-fl /opt/funnelback/data/simpsons/live/idx/index -sumry
    {
     "total_documents": 911,
     "expired_documents": 0,
     "killed_documents": 0,
     "duplicate_documents": 0,
     "noindex_documents": 0,
     "filtered_binary_documents": 0,
     "documents_without_an_early_binding_security_lock": 911,
     "documents_with_paid_ads": 0,
     "unfiltered_binary_documents": 0,
     "documents_matching_admin_specified_regex": 0,
     "noarchive_documents": 0,
     "nosnippet_documents": 0
    }
  6. Remove the URLs defined in kill_exact.cfg by running the following command and note the killed_documents value before and after running the command:

    [search@localhost simpsons]$ /opt/funnelback/bin/padre-fl /opt/funnelback/data/simpsons/live/idx/index /opt/funnelback/conf/simpsons/kill_exact.cfg -exactmatch -kill
    Making backup: /bin/cp /opt/funnelback/data/simpsons/live/idx/index.dt /opt/funnelback/data/simpsons/live/idx/index.dt_bak
    
    Showing summary before changes, if any
    {
     "total_documents": 911,
     "expired_documents": 0,
     "killed_documents": 0,
     "duplicate_documents": 0,
     "noindex_documents": 0,
     "filtered_binary_documents": 0,
     "documents_without_an_early_binding_security_lock": 911,
     "documents_with_paid_ads": 0,
     "unfiltered_binary_documents": 0,
     "documents_matching_admin_specified_regex": 0,
     "noarchive_documents": 0,
     "nosnippet_documents": 0
    }
    URL Patterns: 1 found and sorted.
    Document URLs sorted: 911
    Performing specified operation (bittz = 0, bitop = 4)...
       num_docs = 911.  num_pats = 1
    Showing summary after changes if any:
    {
     "total_documents": 911,
     "expired_documents": 0,
     "killed_documents": 1,
     "duplicate_documents": 0,
     "noindex_documents": 0,
     "filtered_binary_documents": 0,
     "documents_without_an_early_binding_security_lock": 911,
     "documents_with_paid_ads": 0,
     "unfiltered_binary_documents": 0,
     "documents_matching_admin_specified_regex": 0,
     "noarchive_documents": 0,
     "nosnippet_documents": 0
    }
  7. Test that the URL has been removed by re-running the search against the simpsons collection for simpsons crazy. The home page result that was at rank 2 should disappear. (http://training-search.clients.funnelback.com/s/search.html?collection=simpsons&profile=_default_preview&query=simpsons+crazy).

  8. Unkill the URLs defined in kill_exact.cfg by running the following command observing the killed_documents value before and after:

    [search@localhost simpsons]$ /opt/funnelback/bin/padre-fl /opt/funnelback/data/simpsons/live/idx/index /opt/funnelback/conf/simpsons/kill_exact.cfg -exactmatch -unkill
    Making backup: /bin/cp /opt/funnelback/data/simpsons/live/idx/index.dt /opt/funnelback/data/simpsons/live/idx/index.dt_bak
    
    Showing summary before changes, if any
    {
     "total_documents": 911,
     "expired_documents": 0,
     "killed_documents": 1,
     "duplicate_documents": 0,
     "noindex_documents": 0,
     "filtered_binary_documents": 0,
     "documents_without_an_early_binding_security_lock": 911,
     "documents_with_paid_ads": 0,
     "unfiltered_binary_documents": 0,
     "documents_matching_admin_specified_regex": 0,
     "noarchive_documents": 0,
     "nosnippet_documents": 0
    }
    URL Patterns: 1 found and sorted.
    Document URLs sorted: 911
    Performing specified operation (bittz = 0, bitop = 5)...
       num_docs = 911.  num_pats = 1
    Showing summary after changes if any:
    {
     "total_documents": 911,
     "expired_documents": 0,
     "killed_documents": 0,
     "duplicate_documents": 0,
     "noindex_documents": 0,
     "filtered_binary_documents": 0,
     "documents_without_an_early_binding_security_lock": 911,
     "documents_with_paid_ads": 0,
     "unfiltered_binary_documents": 0,
     "documents_matching_admin_specified_regex": 0,
     "noarchive_documents": 0,
     "nosnippet_documents": 0
    }
  9. Test that the URL has been unkilled by re-running the search against the simpsons collection for simpsons crazy. The home page should reappear as the second result. (http://training-search.clients.funnelback.com/s/search.html?collection=simpsons&profile=_default_preview&query=simpsons+crazy)

5. Bringing it all together

The following exercise is an advanced example of how to use a number of features that have been discussed so far to achieve a single goal.

The autocompletion CSV template, hook script and workflow files used in this exercise can be downloaded from GitHub and used in your own projects: https://github.com/funnelback/funnelback-concierge/tree/master/helper_files
Exercise 7: Use Funnelback to generate structured auto-completion

In this exercise the Funnelback search index will be used to generate a structured auto-completion CSV file and structured auto-completion will be rebuilt using this generated file.

This exercise uses a custom profile that is configured to return the search results in the auto-completion CSV file format. The padre binary that creates the auto-completion indexes is then used to generate rich auto-completion using the generated CSV file.

Recall that the auto-completion CSV file format contains eight fields:

KEY,WEIGHT,DISPLAY,DISPLAY_TYPE,CATEGORY,CATEGORY_TYPE,ACTION,ACTION_TYPE

A template needs to be defined that returns the results in this format, with no extra formatting or line breaks.
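For illustration only, a single suggestion row in this format might look something like the following hypothetical example, which (like the template used in this exercise) uses a JSON display payload (DISPLAY_TYPE of J) and a URL action (ACTION_TYPE of U):

"marie curie",900,"{\"title\": \"Marie Curie\"}",J,"",,"https://example.com/winners/curie",U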

  1. Return to the administration interface and switch to the nobel-prize-winners collection.

  2. Create a new profile named autocomplete.

  3. Create a new custom template within the autocomplete profile that formats the results as auto-completion CSV.

    The custom template will read the triggers from the collection configuration and produce a JSON object for each suggestion. A hook script will be used (this will be added after the template is saved) to normalise accented characters in the triggers, as well as load a list of stop words into a custom data field in the data model so that these can be stripped from triggers.

    Paste in the following code then save and publish the template as autocomplete.ftl

    <#ftl encoding="utf-8" /><#compress>
    <#--
    auto-completion.ftl: This template is used to generate auto-completion CSV from the Funnelback index. This file is included as part of the funnelback-concierge GitHub project. See: https://github.com/funnelback/funnelback-concierge/
    Author: Peter Levan, 2017
    -->
    <#import "/web/templates/modernui/funnelback_classic.ftl" as s/>
    <#import "/web/templates/modernui/funnelback.ftl" as fb/>
    <@s.Results>
    <#assign displayJson>
    <@compress single_line=true>
    {
        "title": "${s.result.title!"No title"?json_string}",
            <#if s.result.date?exists>"date": "${s.result.date?string["dd MMM YYYY"]?json_string}",</#if>
        "summary": "${s.result.summary!?json_string}",
        "fileSize": "${s.result.fileSize!?json_string}",
        "fileType": "${s.result.fileType!?json_string}",
        "exploreLink": "${s.result.exploreLink!?json_string}",
            "metaData": {
            <#if s.result.metaData??><#list s.result.metaData?keys as md>
                "${md?json_string}": "${s.result.metaData[md]?json_string}"<#if md_has_next>,</#if>
            </#list></#if>
        },
        "displayUrl": "${s.result.liveUrl!?json_string}",
        "cacheUrl": "${s.result.cacheUrl!?json_string}"
    }
    </@compress>
    </#assign>
    <#--Check to see if the action has been configured, URL mode directing to the ClickUrl is the default-->
    <#assign configActionMode>question.currentProfileConfig.get("auto-completion.${question.profile?replace("_preview","")}.action-mode")</#assign>
    <#assign configActionModeEval=configActionMode?eval!"U">
    <#if configActionModeEval == "Q">
        <#assign actionmode = "Q">
    <#else>
        <#assign actionmode = "U">
    </#if>
    <#if s.result.class.simpleName != "TierBar">
        <#-- read in a comma separated list of triggers from collection.cfg for each auto-completion profile.  Each trigger can be made up of multiple words sourced from
            different fields.  Profile is read from the profile CGI parameter when the auto-completion is generated.
            e.g. Configure three triggers for a staff record (profile=staff)
            auto-completion.staff.triggers=s.result.metaData["firstname"] s.result.metaData["lastname"],s.result.metaData["lastname"] s.result.metaData["firstname"],s.result.metaData["department"]
            e.g. Configure a single trigger for a news entry (profile=news)
            auto-completion.news.triggers=s.result.title
        -->
        <#assign triggerConfig>question.currentProfileConfig.get("auto-completion.${question.profile?replace("_preview","")}.triggers")</#assign>
        <#-- several compound triggers can be defined in the profile or collection configuration, separated with commas.  Split these and process each compound trigger -->
        <#if triggerConfig?eval??>
            <#assign triggerConfigList = triggerConfig?eval/>
        <#else>
            <#assign triggerConfigList = "s.result.title"/>
        </#if>
        <#list triggerConfigList?split(",") as triggerList>
            <#assign trigger = "">
            <#list triggerList?split(" ") as triggerVars>
                <#-- each (compound) trigger can be made from a set of values that are combined from different metadata.  Eval these vars and join with a space -->
                <#assign triggerClean = triggerVars?eval?lower_case?replace("[^A-Za-z0-9\\s]"," ","r")?replace("\\s+"," ","r")>
                <#assign trigger += triggerClean+" ">
            </#list>
            <#assign trigger=trigger?replace("\\s+$","","r")?replace("^\\s+","","r")>
            <#-- set up the action -->
            <#if actionmode == "Q">
                <#assign action = trigger>
            <#else>
                <#assign action = s.result.clickTrackingUrl>
            </#if>
            <#list trigger?split(" ") as x>
                <#-- process each trigger, stripping out stop words -->
                <#if response.customData["stopwords"]?? && response.customData["stopwords"]?seq_contains(x)>
                    <#assign trigger>${trigger?replace("^"+x+"\\s+","","r")}</#assign>
                <#elseif trigger??>
                    "${trigger}",900,${escapeCsv(displayJson)},J,"",,"${action}",${actionmode}
                    <#assign trigger>${trigger?replace("^"+x+"\\s+","","r")}</#assign>
                </#if>
            </#list>
        </#list>
    </#if>
    </@s.Results>
    </#compress>
    
    <#function escapeCsv str>
        <#return str!?chop_linebreak?trim?replace("\"", "\\\"")?replace(",","\\,") />
    </#function>

    The template is configured to return multiple rows of CSV for each search result. Recall that each suggestion must have a trigger defined; several triggers are defined for each suggestion, based on the year of the award and the name of the person receiving the award.

  4. The template is designed to read the triggers from a configuration setting. Configure the items to use for triggers by setting the following in your profile configuration:

    auto-completion.autocomplete.triggers=s.result.metaData["year"]!,s.result.metaData["nameNormalized"]!
    Freemarker templates can use the question.currentProfileConfig.get(COLLECTION_CFG_KEY) function to access configuration values.
  5. Configure the metadata fields that should be returned for auto-completion. The template will include all the metadata fields that are configured to be available with the search results. Update the display options for the autocomplete profile to ensure that the required metadata fields are returned with the result packet. Create a padre_opts.cfg for the autocomplete profile and add the following:

    -SM=meta -SF=[year,category,name,birthDate,birthPlace,country,residence,roleAffiliate,fieldLanguage,prizeName,motivation]
  6. Create a post-process hook script (reminder - if creating from the command line this needs to be in the collection’s configuration folder and not the autocomplete profile) to add stop words to the data model and clean the input. Add the following code:

    // Library imports required for the normalisation
    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    if (transaction.question.form == "autocomplete") {
        transaction?.response?.resultPacket?.results.each() {
            // Do this for each result item

            // Create a normalised version of the name metadata field that removes diacritics.
            // This calls a Java function to normalize the name metadata field and writes the
            // normalised version of the name into a new metadata field called nameNormalized
            it.metaData["nameNormalized"] = Normalizer.normalize(it.metaData["name"], Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        }

        // Read the stop words into the data model's customData element.
        // The stop words file lives under the Funnelback install directory;
        // use the Windows path instead if running on Windows.
        // Linux
        def stop_file = "/opt/funnelback/share/lang/en_stopwords"
        // Windows
        // def stop_file = "c:\\funnelback\\share\\lang\\en_stopwords"
        def stop = new File(stop_file).readLines()
        transaction.response.customData["stopwords"] = stop
    }

    This script does two things:

    • Loads Funnelback’s built-in stop words list (words that should be removed from a query, such as a, and, the, of) into a custom part of the data model - this is for use in the autocomplete template; and

    • Clones the name metadata value to another metadata field and replaces any diacritic (accented) characters with non-accented versions - this is so that auto-completion triggers can be defined that can be typed easily on a standard keyboard, while still triggering the correctly spelt names.

  7. Test the autocomplete.ftl to ensure that correct CSV is generated. Run a query for Hjalmar using the autocomplete profile and autocomplete template and examine the output by viewing the page source in the browser. This shows the output for the first 10 results. (http://training-search.clients.funnelback.com/s/search.html?collection=nobel-prize-winners&query=Hjalmar&profile=autocomplete&form=autocomplete)

    Also observe that triggers have had special characters normalised (e.g. look for the trigger for the Dag Hjalmar Agne Carl Hammarskjöld entry and observe that the trigger is dag hjalmar agne carl hammarskjold).

    Hint: when testing the CSV it is a good idea to download the CSV file then open it up in a program like Microsoft Excel to ensure the file is correctly delimited and that the correct number of fields are being returned.
  8. View the XML or JSON output and observe the name and nameNormalized metadata values in the results, and also the stop words that have been loaded into the customData element of the response packet.

  9. Update padre_opts.cfg to increase num_ranks and disable logging. Increase num_ranks to 1000 to ensure enough results are returned for the auto-completion.

    Note: large values of num_ranks will result in long response times and could cause the web server to run out of memory if a value too large is chosen. Optimising the options set in padre_opts.cfg can also assist in reducing the load and memory requirements caused by running a large query. The settings can be optimised by turning off any unused features (for this profile) and also minimising the amount of metadata returned (so limit the SF values to only those used in the template). Set the following options in the padre_opts.cfg:
    -SM=meta -SF=[year,category,name,birthDate,birthPlace,country,residence,roleAffiliate,fieldLanguage,prizeName,motivation] -log=false -num_ranks=1000
    nameNormalized isn’t required in the list of summary fields because it is injected into the data model in the post process hook script, based on the existence of the name field.

    Don’t forget to publish the padre_opts.cfg file after saving the changes.

  10. Optional: configure profile.cfg to send the mime type for the autocomplete template. This isn’t required for Funnelback to work but will cause your browser to correctly identify the type when you view it.

    ui.modern.form.autocomplete.content_type=text/csv
  11. Create a post-index workflow script at $SEARCH_HOME/conf/nobel-prize-winners/@workflow that does the following:

    • Runs a Funnelback query to produce auto-completion, using the autocomplete template. Observe that the query operates on the $CURRENT_VIEW. This means that for a normal update auto-completion will be created from the offline view, but for a reindex of the live view the auto-completion will be generated from the live index.

    • Writes the output to auto-completion.csv in the autocomplete profile’s configuration folder

    • Runs build_autoc to build auto-completion, and replaces the default auto-completion index file of the $CURRENT_VIEW with the newly generated file.

      The file will need to be created from the command line.

      [search@localhost ~]$ cd /opt/funnelback/conf/nobel-prize-winners/@workflow
      [search@localhost @workflow]$ nano post_index.sh
  12. Add the following code to post_index.sh:

    #!/bin/bash
    # Post index workflow script
    
    # Allow the collection.cfg variables to be passed in and made available within this script.
    # Pass the variables in as -c $COLLECTION_NAME -g $GROOVY_COMMAND -v $CURRENT_VIEW -p <comma separated list of profiles>
    while getopts ":c:g:v:p:" opt; do
      case $opt in
        c) COLLECTION_NAME="$OPTARG"
        ;;
        g) GROOVY_COMMAND="$OPTARG"
        ;;
        v) CURRENT_VIEW="$OPTARG"
        ;;
        p) PROFILES="$OPTARG"
        ;;
        \?) echo "Invalid option -$OPTARG" >&2
        ;;
      esac
    done
    
    # Generate autocompletion for each profile
    
    IFS=',' read -r -a PROFILE <<< ${PROFILES}
    
    for p in "${PROFILE[@]}"
    do
            # Run the Funnelback query to return the CSV, catching any errors.
            echo "Generating autocompletion CSV for $p"
            curl --connect-timeout 60 --retry 3 --retry-delay 20 'http://localhost:9080/s/search.html?collection='$COLLECTION_NAME'&query=!generate_autoc&profile='$p'&form=autocomplete&view='$CURRENT_VIEW'' -o $SEARCH_HOME/conf/$COLLECTION_NAME/$p/auto-completion.csv || exit 1
            # Build auto-completion using the generated auto-completion.csv
            $SEARCH_HOME/bin/build_autoc $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index $SEARCH_HOME/conf/$COLLECTION_NAME/$p/auto-completion.csv -collection $COLLECTION_NAME -profile $p
    
    done
  13. Change the permissions to make the script executable:

    chmod 775 $SEARCH_HOME/conf/nobel-prize-winners/\@workflow/post_index.sh
  14. Test the post-index command by running the script on the live index

    [search@localhost nobel-prize-winners]$ /opt/funnelback/conf/nobel-prize-winners/\@workflow/post_index.sh -c nobel-prize-winners -v live -p autocomplete
    Generating autocompletion CSV for autocomplete
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 4530k    0 4530k    0     0  1372k      0 --:--:--  0:00:03 --:--:-- 1372k
    Completions added from /opt/funnelback/conf/nobel-prize-winners/autocomplete/auto-completion.csv: 3059
  15. Add the post index command to collection.cfg

    post_index_command=$SEARCH_HOME/conf/$COLLECTION_NAME/@workflow/post_index.sh -c $COLLECTION_NAME -v $CURRENT_VIEW -p autocomplete
  16. Run an update of the collection from the administration interface.

  17. Examine the update log and observe the messages from the workflow scripts and check that no errors are being produced.

    Detailed log: /opt/funnelback/data/nobel-prize-winners/two/log/update.log
    Detailed output is being logged to: /opt/funnelback/data/nobel-prize-winners/two/log/update.log
    Started at: Thu, 14 Nov 2019 22:57:43 GMT
    Starting: Executing collection update cycle.
    Executing collection update phases. (UpdateCollectionStep)
      Starting: Runs the update cycle steps
      Recording update start time. (RecordUpdateStartTime)
      	 completed in 0.187s
      Phase: Gathering content. (GatherPhase)
      	 skipped because gathering has been disabled with collection config option: gather, in 0.0s
      Running pre_reporting_command to provide backward compatibility for fetching remote logs. (PreReportCommand)
        Starting: Running pre_reporting_command to provide backward compatibility for fetching remote logs.
        Executes the command set in the collection.cfg option: 'pre_report_command' (PrePostStep)
        	 skipped because 'pre_report_command' is not set., in 0.1s
        Finished 'Running pre_reporting_command to provide backward compatibility for fetching remote logs.', in 0s
      	 completed in 0.47s
      Phase: Archiving query logs. (ArchivePhase)
        Starting: Phase: Archiving query logs.
        Executes the command set in the collection.cfg option: 'pre_archive_command' (PrePostStep)
        	 skipped because 'pre_archive_command' is not set., in 0.0s
        Archiving query logs. (ArchiveLiveViewLogs)
        	 completed in 0.33s
        Executes the command set in the collection.cfg option: 'post_archive_command' (PrePostStep)
        	 skipped because 'post_archive_command' is not set., in 0.0s
        Finished 'Phase: Archiving query logs.', in 0s
      	 completed in 0.140s
      Phase: Indexing content. (IndexPhase)
        Starting: Index pipeline for collection: nobel-prize-winners
        Executes the command set in the collection.cfg option: 'pre_index_command': /opt/funnelback/conf/nobel-prize-winners/@workflow/pre_index.sh -c nobel-prize-winners (PrePostStep)
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    
      0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    100  303k  100  303k    0     0  26.0M      0 --:--:-- --:--:-- --:--:-- 26.9M
    TSV2XML: Processing files at Thu Nov 14 22:57:44 2019...
    TSV2XML: processing data file: /opt/funnelback/data/nobel-prize-winners/offline/data/nobel.tsv
    TSV2XML: Finished processing files at Thu Nov 14 22:57:44 2019
        	 completed in 0.222s
        Configuring permission on query lock file for impersonation. (SetWindowsPermissionsOnLock)
        	 skipped because not running on windows., in 0.0s
        Creating duplicate and redirects database. (CreateDupredrexAndRedirectsDB)
        	 completed in 0.229s
        Processing click logs for ranking improvements. (ClickLogs)
        	 completed in 0.125s
        Assembling metadata mappings. (AssembleMetadataMapping)
        	 completed in 0.440s
        Working out how many documents can be indexed to stay within license limits. (SetMaxDocumentsThatCanBeIndexed)
        	 completed in 0.242s
        Indexing documents. (Index)
        	 completed in 0.348s
        Indexing documents. (ParallelIndex)
        	 skipped because config option:'collection-update.step.ParallelIndex.run' was set to 'false'., in 0.0s
        Killing fully-matching documents. (ExactMatchKill)
        	 skipped because kill_exact.cfg does not exist, in 0.1s
        Killing partially-matching documents. (PartialMatchKill)
        	 skipped because kill_partial.cfg does not exist, in 0.1s
        Processing annotations for the collection. (AnnieAPrimaryCollection)
        	 completed in 0.18s
        Setting Gscopes. (SetGscopes)
        	 skipped because gscopes.cfg does not exist, in 0.0s
        Setting Query Gscopes. (SetQueryGscopes)
        	 skipped because query-gscopes.cfg does not exist, in 0.1s
        Applying gscopes derived from Faceted Navigation. (FacetBasedGscopes)
        	 skipped because no URL Pattern based facet categories., in 0.26s
        Building result collapsing signatures file. (BuildCollapsingSignatures)
        	 completed in 0.13s
        Creating spelling suggestions. (BuildSpelling)
        	 completed in 0.34s
        Applying Query Independent Evidence. (QueryIndependentEvidenceCollectionLevel)
        	 completed in 0.24s
        Building query completion index. (BuildAutoCompletion)
        	 completed in 0.146s
        Carry over WCAG summary file (CarryOverWCAGSummary)
        	 skipped because Previous summary file didn't exist, in 0.1s
        Generating Content Auditor profile summaries (ContentAuditorSummary)
        	 completed in 1s
        Creating broken links report. (CreateBrokenLinksReportStep)
        	 skipped because broken link reports are only generated for web collections, in 0.0s
        Setting the time stamp on the index and data. (AddTimeStamp)
        	 completed in 0.10s
        Executes the command set in the collection.cfg option: 'post_index_command': /opt/funnelback/conf/nobel-prize-winners/@workflow/post_index.sh -c nobel-prize-winners -v $CURRENT_VIEW -p autocomplete (PrePostStep)
    Generating autocompletion CSV for autocomplete
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    
      0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    100 16049    0 16049    0     0  46746      0 --:--:-- --:--:-- --:--:-- 46654
    100 1833k    0 1833k    0     0  1495k      0 --:--:--  0:00:01 --:--:-- 1495k
    100 4363k    0 4363k    0     0  1923k      0 --:--:--  0:00:02 --:--:-- 1923k
    100 4750k    0 4750k    0     0  1896k      0 --:--:--  0:00:02 --:--:-- 1896k
    Completions added from /opt/funnelback/conf/nobel-prize-winners/autocomplete/auto-completion.csv: 3059
    Autocompletion file written: 3059 entries
        	 completed in 2s
        Finished 'Index pipeline for collection: nobel-prize-winners', in 5s
      	 completed in 6s
      Creating recommendations. (BuildRecommenderDBPhase)
      	 skipped because recommender=false, in 0.15s
      Deleting content from store and index for instant delete. (InstantDeletePhase)
      	 skipped because Instant delete is not run in this update type, in 0.0s
      Phase: Swapping views to make the new index live. (SwapViewsPhase)
        Starting: Switching to the newly crawled data and index.
        Executes the command set in the collection.cfg option: 'pre_swap_command' (PrePostStep)
        	 skipped because 'pre_swap_command' is not set., in 0.0s
        Checking the total number of documents indexed is sufficient. (ChangeOverIndexCountCheck)
        	 completed in 0.3s
        Record Accessibility Auditor historical data  (RecordAccessibilityAuditorHistoryForSwapViews)
        	 skipped because Accessibility Auditor history has already been recorded (if it was enabled)., in 0.1s
        Switching to the new index. (SwapViews)
        	 completed in 0.6s
        Executes the command set in the collection.cfg option: 'post_swap_command' (PrePostStep)
        	 skipped because 'post_swap_command' is not set., in 0.0s
        Finished 'Switching to the newly crawled data and index. ', in 0s
      	 completed in 0.114s
      Phase: Updating Meta parent collections (spelling, auto completion, etc). (MetaDependenciesPhase)
        Starting: Update meta parents of: nobel-prize-winners
        Executes the command set in the collection.cfg option: 'pre_meta_dependencies_command' (PrePostStep)
        	 skipped because 'pre_meta_dependencies_command' is not set., in 0.1s
        Update the parent meta collections. (UpdateMetaParents)
        	 completed in 0.3s
        Executes the command set in the collection.cfg option: 'post_meta_dependencies_command' (PrePostStep)
        	 skipped because 'post_meta_dependencies_command' is not set., in 0.0s
        Finished 'Update meta parents of: nobel-prize-winners', in 0s
      	 completed in 0.46s
      Phase: Archiving query logs. (ArchivePhase)
        Starting: Phase: Archiving query logs.
        Executes the command set in the collection.cfg option: 'pre_archive_command' (PrePostStep)
        	 skipped because 'pre_archive_command' is not set., in 0.0s
        Archiving query logs. (ArchiveOfflineViewLogs)
        	 completed in 0.1s
        Executes the command set in the collection.cfg option: 'post_archive_command' (PrePostStep)
        	 skipped because 'post_archive_command' is not set., in 0.0s
        Finished 'Phase: Archiving query logs.', in 0s
      	 completed in 0.42s
      Running post updates tasks. (PostUpdateTasks)
        Starting: Running post update tasks
        Executing post update hook scripts. (PostMetaUpdateHookScripts)
        	 skipped because not a meta collection, in 0.1s
        Triggering post update command. (PostUpdateCommand)
        	 skipped because No post update command specified, in 0.0s
        Finished 'Running post update tasks', in 0s
      	 completed in 0.30s
      Recording update finish time. (RecordUpdateFinishTime)
      	 completed in 0.85s
      Recording update metrics. (RecordMetrics)
      	 completed in 0.1s
      Sending success email. (SendSuccessfulUpdateMessage)
      	 skipped because config option mail.on_failure_only=true, in 0.1s
      Marking this update as a success. (SetUpdateSuccessfulState)
      	 completed in 0.2s
      Finished 'Runs the update cycle steps', in 7s
    	 completed in 7s
    Finished 'Executing collection update cycle.', in 7s
    Finished at: Thu, 14 Nov 2019 22:57:50 GMT
    
    Update: Exit Status: 0
  18. Verify that auto-completion.csv has been created in the autocomplete profile’s folder and that it contains sensible looking CSV.

    exercise use funnelback to generate structured auto completion 01
  19. Configure auto-completion to display the autocomplete dataset. Edit the simple.ftl for the _default_preview profile and edit the auto-completion JavaScript configuration near the bottom of the file. Replace the auto-completion code block with the following code:

    <#if question.currentProfileConfig.get('auto-completion') == 'enabled'>
    <script src="${GlobalResourcesPrefix}thirdparty/typeahead-0.11.1/typeahead.bundle.min.js"></script>
    <script src="${GlobalResourcesPrefix}thirdparty/handlebars-4.1/handlebars.min.js"></script>
    <script src="${GlobalResourcesPrefix}js/funnelback.autocompletion-2.6.0.js"></script>
    <script>
      jQuery(document).ready(function() {
        jQuery('input.query').autocompletion({
          datasets: {
            <#if question.currentProfileConfig.get('auto-completion.standard.enabled')?boolean>
            organic: {
              collection: '${question.collection.id}',
              profile : '${question.profile}',
              format: '<@s.cfg>auto-completion.format</@s.cfg>',
              alpha: '<@s.cfg>auto-completion.alpha</@s.cfg>',
              show: '<@s.cfg>auto-completion.show</@s.cfg>',
              sort: '<@s.cfg>auto-completion.sort</@s.cfg>',
              group: true
            },
            </#if>
            autocomplete: {
              collection: '${question.collection.id}',
              profile : 'autocomplete',
              show: '3',
              template: {
                suggestion: '<div><strong>{{label.metaData.prizeName}} ({{label.metaData.year}})</strong><br/>Winner: {{label.metaData.name}} ({{label.metaData.country}})</div>'
              },
              group: true
            },
            <#if question.currentProfileConfig.get('auto-completion.search.enabled')?boolean>
            facets: {
              collection: '${question.collection.id}',
              itemLabel: function(suggestion) { return suggestion.query + ' in ' + suggestion.label; },
              profile : '${question.profile}',
              program: '<@s.cfg>auto-completion.search.program</@s.cfg>',
              queryKey: 'query',
              transform: $.autocompletion.processSetDataFacets,
              group: true,
              template: {
                suggestion: '<div>{{query}} in {{label}}</div>'
              }
            },
            </#if>
          },
          program: '<@s.cfg>auto-completion.program</@s.cfg>',
          horizontal: true,
          typeahead: {hint: true},
          length: <@s.cfg>auto-completion.length</@s.cfg>
        });
      });
    </script>
    </#if>
  20. Add the following style to the style block (approx. line 40) to increase the width of the search box. This will make the auto-completion popup width wider as well.

    .twitter-typeahead .query {width: 600px;}
  21. Run a search against the collection, observing that auto-completion now uses the structured auto-completion that was generated via the workflow script. Ensure you test against the _default profile (not _default_preview) as the workflow above writes the generated autocomplete index to the live profile for the collection. http://training-search.clients.funnelback.com/s/search.html?collection=nobel-prize-winners&profile=_default

    exercise use funnelback to generate structured auto completion 02

6. Document filtering

Filtering is the process of transforming gathered content into content suitable for indexing by Funnelback.

This is achieved using a series of document filters which work together to transform the document. The raw document is the input to the filter process and filtered text is the output. Each filter transforms the document in some way. e.g. extracting the text from a PDF which is then passed on to the next filter which might alter the title or other metadata stored within the document.

Filters operate on the data that is downloaded by Funnelback and any changes made by filters affect the index.

Most Funnelback collection types can use filters. However collection types which never gather content (local and meta collections) can’t use document filtering. Refer to the documentation for the specific collection type for information on how to make use of filters.

A full update is required after making any changes to filters as documents that are copied during an incremental update are not re-filtered. For push collections all existing documents in the index will need to be re-added to the index so that the content is re-filtered.

Full updates are started from the advanced update screen.

6.1. The filter chain

During the filter phase the document passes through a series of general document filters with the modified output being passed through to the next filter. The series of filters is referred to as the filter chain.

There are a number of preset filters that are used to perform tasks such as extracting text from a binary document, and cleaning the titles.

A typical filter process is shown below. A binary document is converted to text using the Tika filters. This extracts the document text and outputs the document as HTML. This HTML is then passed through the JSoup filter which runs a separate chain of JSoup filters which allow targeted modification of the HTML content and structure. Finally a custom filter performs a number of modifications to the content.

the filter chain 01

JSoup filters should be used for HTML documents when making modifications to the document structure, or performing operations that select and transform the document’s DOM. Custom JSoup filters can be written to perform operations such as:

  • Injecting metadata

  • Cleaning titles

  • Scraping content (e.g. extracting breadcrumbs to metadata)

The filter chain is made up of chains and choices, separated using two types of delimiters. These control whether the content passes through a single filter from a set of filters (a choice, indicated by commas), or through each filter in turn (a chain, indicated by colons).

The set of filters below would be processed as follows: The content would pass through either Filter3, Filter2 or Filter1 before passing through Filter4 and Filter5.

Filter1,Filter2,Filter3:Filter4:Filter5

There are some caveats when specifying filter chains which are covered in more detail in the documentation.

There is also support for custom general document filters written in Groovy. Custom filters receive the document’s URL and text as input and must return the transformed document text, ready to pass on to the next filter. A custom filter can do pretty much anything to the content, using Groovy (and Java) code.

Custom filters should be used when a JSoup filter is not appropriate. Custom filters offer more flexibility but are more expensive to run. Custom filters can be used for operations such as:

  • Manipulating complete documents as binary or string data

  • Splitting a document into multiple documents

  • Modifying the document type or URL

  • Removing documents

  • Transforming HTML or JSON documents

  • Implementing document conversion for binary documents

  • Processing/analysis of documents where structure is not relevant

6.1.1. General document filters

General document filters make up the main filter chain within Funnelback. A number of built-in filters ship with Funnelback and the main filter chain includes the following filters by default:

  • TikaFilterProvider: converts binary documents to text using Tika

  • ExternalFilterProvider: uses external programs to convert documents. In practice this is rarely used.

  • JSoupProcessingFilterProvider: converts the document to and from a JSoup object and runs an extra chain of JSoup filters.

  • DocumentFixerFilterProvider: analyses the document title and attempts to replace it if the title is not considered a good title.
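
Taken together, these defaults correspond to a filter.classes setting in collection.cfg along the following lines (shown as a sketch; check your installation for the exact default value):

filter.classes=TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider

Here the comma makes Tika and the external filter a choice, while the colons then chain the content through the JSoup and document fixer filters.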

There are a number of other built-in filters that can be added to the filter chain, the most useful being:

  • JSONToXML and ForceJSONMime: Enables Funnelback to index JSON data.

  • CSVToXML and ForceCSVMime: Enables Funnelback to index CSV data.

  • InjectNoIndexFilterProvider: automatically inserts noindex tags based on CSS selectors.

  • MetadataNormaliser: used to normalise and replace metadata fields.

Custom filters can also be written in Groovy that operate on the document content. However, for HTML documents most custom filtering needs are best served by writing a JSoup filter. Custom filters are appropriate when filtering is required on non-HTML documents, or to process the document as a whole piece of unstructured content.

The documentation includes some detailed examples of general document filters.

6.1.2. JSoup filtering

JSoup filtering allows for a series of micro-filters to be written that can perform targeted modification of the HTML/XML document structure and content.

The main JSoup filter, which is included in the main filter chain, takes the HTML document and converts it into a structured DOM object that the JSoup filters can then work with using DOM traversal and CSS-style selectors, which select on things such as element name, class and ID.

A series of JSoup filters can then be chained together to perform a series of operations on the structured object - this includes modifying content, injecting/deleting elements and restructuring the HTML/XML.

The structured object is serialised at the end of the JSoup filter chain returning the text of the whole data structure to the next filter in the main filter chain.

Exercise 8: Simple JSoup filter

In this exercise a simple JSoup filter will be developed to scrape some page content for additional metadata.

Scraping of content is not generally recommended as it depends on the underlying code structure. Any change to the code structure has the potential to cause the filter to break. Where possible avoid scraping and make any changes at the content source.
  1. The source data must be analysed before any filters can be written - this informs what filtering is possible and how it may be implemented. Examine an episode page from the source data used by the simpsons collection. Examine http://training-search.clients.funnelback.com/training/training-data/simpsons/www.simpsoncrazy.com/episodes/dancin-homer.html and observe that there is potential useful metadata contained in the vitals box located as a menu on the right hand side of the content. Inspect the source code and locate the HTML code to determine if the code can be selected. Check a few other episode pages on the site and see if the code structure is consistent.

  2. The vitals information is contained within the following HTML code:

    <div class="sidebar vitals">
    	<h2>Vitals</h2>
    
    	<p class="half">PCode<br>
    		7F05</p>
    	<p class="half">Index<br>
    		2&#215;5</p>
    
    	<p class="half">Aired<br>
    		8 Nov, 1990</p>
    	<p class="half">Airing (UK)<br>
    		unknown</p>
    
    	<p>Written by<br>
    					Ken Levine<br>David Isaacs						</p>
    
    	<p>Directed by<br>
    					Mark Kirkland
    
    	<p>Starring<br>Dan Castellaneta<br>Julie Kavner<br>Nancy Cartwright<br>Yeardley Smith<br>Harry Shearer</p><p>Also Starring<br>Hank Azaria<br>Pamela Hayden<br>Daryl L. Coley<br>Ken Levine</p><p>Special Guest Voice<br>Tony Bennett as Himself<br>Tom Poston as Capitol City Goofball
    </div>

    There appears to be enough information and consistency available within the markup to write JSoup selectors to target the content. The elements can be selected using the JSoup selector div.vitals p (i.e. select("div.vitals p")). Once selected, the content can be extracted and written out as metadata.

  3. Log in to the Funnelback server via SSH and change to the simpsons collection

    [search@localhost ~]$ cd /opt/funnelback/conf/simpsons/
  4. Create the filter folders (if they don’t already exist). Create a @groovy/com/funnelback/training/simpsons folder beneath the simpsons configuration folder. Add your JSoup filters into this folder.

    [search@localhost simpsons]$ mkdir -p /opt/funnelback/conf/simpsons/\@groovy/com/funnelback/training/simpsons
  5. Create the filter. The filename and class name used by the filter script determine the filter name to use in the JSoup filter chain. In the example below a filter will be created that maps to the following name within the JSoup filter chain: com.funnelback.training.simpsons.scrapeMetadata. Create a file inside the @groovy/com/funnelback/training/simpsons folder called scrapeMetadata.groovy

    [search@localhost simpsons]$ nano \@groovy/com/funnelback/training/simpsons/scrapeMetadata.groovy
  6. Cut and paste the following Groovy code into the file then save the filter. When filtering with JSoup a class needs to be defined that implements the IJSoupFilter interface; its processDocument method implements the actual filter logic. This filter is applied to each document after it is downloaded by Funnelback. The filter receives a JSoup object that can be read and modified by the filter. This object is then passed on to the next filter in the JSoup filter chain.

    package com.funnelback.training.simpsons
    
    import com.funnelback.common.filter.jsoup.*
    
    // Imports required for logging
    @groovy.util.logging.Log4j2
    
    /**
     * scrapes the vitals box content and inserts it as custom metadata
     */
    
    public class ScrapeMetadata implements IJSoupFilter {
    
        public void processDocument(FilterContext context) {
    
            // get the document as a Jsoup object
            def doc = context.getDocument()
    
            // run some Jsoup selects. eg select the vitals div element in the document
            def vitals = doc.select("div.vitals p")
    
            // get the document <head> as an object - this will be used for appending new elements
            def head = doc.select("head").first()
    
            // For each paragraph within the vitals div
            vitals.each {
               if (it != null) {
                    def item = it;
    
                    // split the item using '<br>' as the delimiter
                    def splitItem = item.html().split('<br>')
                    def key ="";
                    def val ="";
                    if (splitItem.length > 1) {
                        // The metadata key will be the first item in the array
                        key = splitItem[0];
                        // remove the key from the array leaving just the values for the metadata field
                        def vals = splitItem - splitItem[0]
                        // clean each remaining value (collect returns the trimmed values)
                        vals = vals.collect { it.trim() }
                        // replace whitespace in the key with '-' characters and convert to lowercase
                        key = key.replaceAll(~/\s+/,"-").toLowerCase()
                        // join the values together using the pipe symbol as a delimiter
                        val = vals.join('|').trim()
    
                        // print out some logging to crawler.central.log
                        log.error("Adding metadata: simpsons."+key+": "+val)
                        // append a new metadata element to the bottom of the head section
                        head.appendElement("meta").attr("name","simpsons."+key).attr("content",val)
    
                        // special custom processing
                        // if the key is the episode index extract the season and episode number and write out metadata
                        if (key =~ /index/) {
                            def se = val.split("\u00D7")
                            log.error("Adding metadata: simpsons.season: "+se[0])
                            log.error("Adding metadata: simpsons.episode: "+se[1])
                            head.appendElement("meta").attr("name","simpsons.season").attr("content",se[0])
                            head.appendElement("meta").attr("name","simpsons.episode").attr("content",se[1])
                        }
                        // if the key is the original aired date then extract the year
                        if (key =~ /aired$/) {
                            if (val =~ /\d{4}/) {
                                def year = (val =~ /\d{4}/)
                                log.error("Adding metadata: simpsons.year: "+year[0])
                                head.appendElement("meta").attr("name","simpsons.year").attr("content",year[0])
                            }
                        }
                    }
                }
            }
        }
    }
  7. Add the filter to the JSoup filter chain. Edit the collection configuration and update the filter.jsoup.classes value, adding the following to the end of the filter chain:

    com.funnelback.training.simpsons.scrapeMetadata
  8. Run a full update of the collection (filter changes always require a full update to ensure everything is re-filtered).

  9. While the update is running, tail the crawler.central.log and observe the log messages produced by the filter’s log.error lines. Use Ctrl-C to cancel the tailing.

    [search@localhost simpsons]$ tail -f /opt/funnelback/data/simpsons/offline/log/crawler.central.log
    ...
    2016-06-15 04:06:00,253 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.pcode: EABF13
    2016-06-15 04:06:00,253 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.index: 14×18
    2016-06-15 04:06:00,253 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.season: 14
    2016-06-15 04:06:00,253 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.episode: 18
    2016-06-15 04:06:00,253 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.aired: 27 Apr, 2003
    2016-06-15 04:06:00,253 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.year: 2003
    2016-06-15 04:06:00,253 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.aired-(uk): 17 Aug, 2003
    2016-06-15 04:06:00,253 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.written-by: Ian Maxtone-Graham
    2016-06-15 04:06:00,254 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.directed-by: Chris Clements
    2016-06-15 04:06:00,254 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.starring: Dan Castellaneta|Julie Kavner|Nancy Cartwright|Yeardley Smith|Hank Azaria|Harry Shearer
    2016-06-15 04:06:00,254 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.also-starring: Pamela Hayden|Tress MacNeille|Karl Wiedergott
    2016-06-15 04:06:00,254 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.special-guest-voice: Andy Serkis as Cleanie|David Byrne as Himself|Jonathan Taylor Thomas as Luke Stetson
    2016-06-15 04:06:00,285 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.pcode: EABF15
    2016-06-15 04:06:00,286 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.index: 14×20
    2016-06-15 04:06:00,286 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.season: 14
    2016-06-15 04:06:00,286 [com.funnelback.crawler.NetCrawler 0] ERROR jsoup.ScrapeMetadata - Adding metadata: simpsons.episode: 20
    ...
  10. Search for Homer the heretic and view the cached version of the episode guide.

    exercise simple jsoup filter 01
  11. View the page source and observe that a number of custom metadata fields have been written (by the JSoup filter) into the source document. Remember that the filter is modifying the content that is stored by Funnelback, which is reflected in what is inside the cached version (it won’t modify the source web page).

    exercise simple jsoup filter 02

    Various log levels (INFO, DEBUG, ERROR etc.) are defined in the logging system. By default, INFO and DEBUG messages won’t be displayed unless the configuration is modified to increase the log level.

    When developing filters printing to the ERROR level will ensure messages are displayed without changing the log level, but these should be removed or set to the correct level once testing is complete.

7. Funnelback APIs

Funnelback includes a number of REST APIs and web services that can be used to integrate with other systems.

The Push API was briefly covered in the FUNL202 training course. Some of the other available APIs or API-like interfaces that Funnelback offers will be examined in the next section.

7.1. Push API

The push API includes a set of API calls for updating and managing push collections.

7.2. Administration API

Funnelback’s administration API provides access to a set of Funnelback administrative functions.

7.2.1. API tokens

Most API calls to the administration API require an API token to be provided along with the request.

The administration API login API call can be used to generate this API token.

Non-expiring API tokens can be generated for use with trusted applications. See: Non-expiring application tokens.
Exercise 9: Generate an API token
  1. Log in to the administration interface and select view API-UI from the system menu.

    exercise generate an api token 01
  2. Click on the Admin API button then click on the user-account-management section.

    exercise generate an api token 02
    exercise generate an api token 03
  3. Click on the /account/v1/login API heading

    exercise generate an api token 04
  4. Select remember me=false and enter your username and password, then click the execute button.

    exercise generate an api token 05
  5. The security token is returned in the response headers section. The value of the x-security-token HTTP header can be copied and passed into the API token input box on the menu bar when an API call requiring a token is executed.
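
    The same token can be fetched from the command line. The snippet below is a sketch only (not part of the course material): it assumes the login call accepts HTTP basic authentication and that ADMIN_API has been set to the admin API base URL shown in the API-UI.

    # Extract the x-security-token value from the login call's response headers
    TOKEN=$(curl -sk -u admin:admin -D - -o /dev/null "$ADMIN_API/account/v1/login" \
      | grep -i '^x-security-token:' | cut -d ' ' -f 2 | tr -d '\r')
    echo "$TOKEN"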

Exercise 10: Administration API

In this exercise the administration API will be used to obtain the top queries for a specified time period.

  • Log in to the administration interface and select view API-UI from the system menu.

  • Select the administration API

  • Locate the top queries API call under the analytics heading

  • Determine the top 10 queries for the foodista collection for last month.

    • collection: foodista

    • profile: _default

    • pageSize: 10

    • pageNumber: 1

    • earliestDate: Set this to the date of the first day of the previous month (in YYYY-MM-DD format).

    • latestDate: Set this to the date of the last day of the previous month (in YYYY-MM-DD format).

    exercise administration api 01
  • Observe the JSON response (note the values will vary as the analytics are automatically generated):

    {
      "errorMessage": null,
      "data": {
        "lastUpdatedDate": "2017-06-30T02:02:08Z",
        "nextPageNumber": 2,
        "queries": [
          {
            "query": "vegetarian",
            "count": 548
          },
          {
            "query": "side dish",
            "count": 548
          },
          {
            "query": "mackarel",
            "count": 548
          },
          {
            "query": "snacks",
            "count": 547
          },
          {
            "query": "scandinavian",
            "count": 547
          },
          {
            "query": "nibbles",
            "count": 546
          },
          {
            "query": "salami",
            "count": 546
          },
          {
            "query": "ice cream",
            "count": 545
          },
          {
            "query": "chocolate",
            "count": 544
          },
          {
            "query": "wine",
            "count": 543
          }
        ]
      }
    }
  • When accessing the URL from your own programs the security token will need to be set as an HTTP header when the request is made. Paste your security token (from the previous exercise) into the paste your API token here box and then resubmit the API call, observing the curl command. This shows how the API token needs to be set.
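
    From the command line this looks something like the following sketch (the path and parameters should be copied from the curl command shown by the API-UI; TOKEN and ADMIN_API are as in the earlier sketch):

    curl -sk -H "X-Security-Token: $TOKEN" "$ADMIN_API/<path-copied-from-the-API-UI>"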

7.3. Search results API

Funnelback’s search results can be queried by making an HTTP GET request to Funnelback’s web server.

This GET request normally returns HTML, templated using Freemarker templates.

The underlying search data model can also be accessed as JSON (or XML) by passing the parameters to the search.json (or search.xml) endpoint. The viewing the data model for a search as JSON or XML exercise from the FUNL201 training provided an introduction to accessing the JSON or XML data directly.

These raw endpoints allow for close integration with other systems, however there are a number of things that should be considered before integrating directly with one of these endpoints.

  • A lot of Funnelback’s front-end functionality is implemented in Funnelback’s template layer. If the JSON (or XML) is accessed directly then it is up to the client to interpret the data structure and implement any front-end logic. For some features (such as faceted navigation) this can be quite complicated as it involves building the feature from a number of different raw elements and placing a series of rules around how this information is interpreted.

  • This also means that additional work is required from any client to make use of new features that are added to Funnelback

  • The JSON (and XML) end points are not currently versioned so there is a risk that the data structure may change over time, potentially breaking existing integrations.

These endpoints are accessed using REST style GET requests, returning JSON (or XML) data packets containing the question and response data.

The JSON endpoint is generally more useful as it clearly indicates the type (string, number, list etc.) of the fields.
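
As a quick illustration of consuming the JSON endpoint programmatically, the following fetches a result set and extracts the result titles (a sketch assuming the jq utility is available; jq is not part of Funnelback):

curl -s 'http://training-search.clients.funnelback.com/s/search.json?collection=simpsons&query=homer&num_ranks=5' | jq -r '.response.resultPacket.results[].title'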

7.3.1. Common parameters for the search results API

The JSON and XML endpoints accept the same set of parameters as the standard search interface (search.html).

There are a huge number of display and ranking parameters that will be accepted. See the query processor options documentation for a full list. Commonly used parameters are listed below.

Default values for ranking and display parameters can be set at the collection level (as query processor options in collection.cfg) or profile/service level (as parameters set on the profile’s padre_opts.cfg). See exercises in the FUNL201 and FUNL202 courses that cover the setting of display options.

Parameter Description

collection

Mandatory. This parameter must be set to the collection id of the collection that is to be queried.

query

Mandatory. This parameter must be set to a string containing the search query, specified using Funnelback’s query language.

num_ranks

Integer indicating the number of results to return.

start_rank

Integer indicating the rank of the first result to be returned. (e.g. start_rank=11 means return results starting from result no. 11).

SF

Value indicating metadata fields to return in the results.

sort

String indicating the type of sorting to apply to the results.

A tool for interacting with the public UI APIs is included in the training resources website. This tool provides a Swagger-UI interface for interacting with the more common parameters that are available when accessing search.json or search.xml.

If you are only interested in the contents of the response.results element of the data model consider using the all-results endpoint which is designed to return all the search results as either a single JSON or CSV response. This endpoint is great for exporting data but is also useful for integration if only the search results themselves are required. Note: this endpoint does not include anything beyond the result-level data (so no counts, faceted navigation etc.)
Exercise 11: Search results as JSON or XML

In this exercise we will construct queries to return results from the simpsons collection.

For these exercises the URL bases search.html, search.xml, search.json and all-results.json (all under http://training-search.clients.funnelback.com/s/) are required.

  1. Construct a query to return the first 30 results for homer as XML. Use the XML URL base with query=homer and num_ranks=30

    http://training-search.clients.funnelback.com/s/search.xml?collection=simpsons&query=homer&num_ranks=30
  2. Compare this to the same results returned in JSON

    http://training-search.clients.funnelback.com/s/search.json?collection=simpsons&query=homer&num_ranks=30
  3. Compare this to the same results returned in HTML

    http://training-search.clients.funnelback.com/s/search.html?collection=simpsons&query=homer&num_ranks=30
  4. Compare this to the same results returned using the all results endpoint. Observe that the all results endpoint only returns the result items. The response is limited to 30 results because of the num_ranks parameter.

    http://training-search.clients.funnelback.com/s/all-results.json?collection=simpsons&query=homer&num_ranks=30
  5. Update this query so that metadata for season and episode is returned and the results are returned in XML. Add parameters to set the summary field (SF) value. The metadata class names for the SF parameter can be looked up in the configure metadata mappings screen.

    http://training-search.clients.funnelback.com/s/search.xml?collection=simpsons&query=homer&num_ranks=30&SF=[sSeason,sEpisode]
  6. Construct a query to return the 3rd page of results in JSON for a query of treehouse of horror, with 20 results per page. Use the JSON URL base. Specify the correct start_rank and num_ranks parameters

    http://training-search.clients.funnelback.com/s/search.json?collection=simpsons&query=treehouse+of+horror&num_ranks=20&start_rank=41
Exercise 12: The public API testing tool

Use the public API testing tool to redo the searches in the previous exercise.

The tool can be accessed from the training resources website on your VM: http://training-search.clients.funnelback.com/training/public-api/index.html

  1. Open the public API testing tool and expand the search.json endpoint.

  2. Enter the following information into the form, then click the try it out! button.

    • collection: simpsons

    • query: homer

    • num_ranks: 30

Repeat for the other steps above and compare the returned output in the tool with the browser source from the previous exercise.

The public API testing tool is a good way to start interacting with the JSON endpoints but is not comprehensive. It is not a part of Funnelback and is provided 'as is' to assist with your training.

7.3.2. Data model elements for selected features

Question element

The question element defines all the data objects that are used to describe the user’s search query.

Input parameters (CGI parameters)

These elements hold the set of parameters that are passed in to describe the user’s query.

  • question.inputParameterMap[]

  • question.rawInputParameters[]

  • question.additionalParameters[]

These maps are mostly equivalent - the inputParameterMap is a simplified version of the rawInputParameters and can be used for unique CGI parameters (where only a single value is supplied for a CGI parameter), with the element containing a single string value (e.g. question.inputParameterMap["collection"] would contain the value of the collection CGI parameter).

The rawInputParameters contain arrays and should be used when a CGI parameter can hold multiple values. (e.g. question.rawInputParameters["subject"] might contain ["keyword 1","keyword 2","keyword 3"]). additionalParameters includes environment variables and other special parameters such as the origin.
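
One way to see the difference is to repeat a parameter on the JSON endpoint and inspect the question element (a sketch assuming jq is available and that these question maps appear in the JSON output under the names listed above):

curl -s 'http://training-search.clients.funnelback.com/s/search.json?collection=simpsons&query=homer&subject=one&subject=two' | jq '{single: .question.inputParameterMap.subject, all: .question.rawInputParameters.subject}'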

The additionalParameters element should be used for any parameters that need to be passed directly to padre as query processor options. This includes the following:

  • origin

  • maxdist

  • sort

  • numeric query parameters (e.g. lt_x)

  • SM

  • SF

  • num_ranks

Collection

This contains the ID of the collection that is being queried. Mandatory.

  • question.collection.id

Profile

This contains the ID of the profile that is being queried. Default value is "_default".

  • question.profile

Query

This contains the query, defined using the Funnelback query language.

  • question.query

Template to use

This specifies which template file to use to format the results when using search.html. The default value is simple, which uses simple.ftl.

  • question.form

Custom data

Custom data is a special map that can be populated with custom values using a hook script. The custom data element can be accessed from Groovy hook scripts and Freemarker templates.

  • question.customData

Query string

This element contains the browser’s query string containing all the CGI parameters as passed to Funnelback.

  • QueryString

Search results
  • response.resultPacket.results []

Search result summary
  • response.resultPacket.resultsSummary {}

Contextual navigation
  • response.resultPacket.contextualNavigation {}

Faceted navigation
  • response.facets []

Best bets and curator exhibits
  • response.curator.exhibits []

Query
  • response.resultPacket.query

  • response.resultPacket.queryAsProcessed

  • response.resultPacket.queryCleaned

Query string
  • QueryString

Pagination

These elements contain the necessary fields used to construct the pagination control.

  • response.resultPacket.resultsSummary.currStart

  • response.resultPacket.resultsSummary.currEnd

  • response.resultPacket.resultsSummary.numRanks

  • response.resultPacket.resultsSummary.prevStart

  • response.resultPacket.resultsSummary.nextStart

  • response.resultPacket.resultsSummary.totalMatching
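
For example, a client paging through JSON results might read nextStart and request the following page with it (a sketch assuming jq is available and that a further page exists; BASE holds a search.json URL including the collection and query parameters):

NEXT=$(curl -s "$BASE" | jq -r '.response.resultPacket.resultsSummary.nextStart')
curl -s "$BASE&start_rank=$NEXT"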

Spelling suggestions
  • response.resultPacket.spell

Query blending
  • response.resultPacket.qsup

Extra searches

The data model of the extra search question and response matches the data model for the main search.

  • extra.question

  • extra.response

Custom data

Custom data is a special map that can be populated with custom values using a hook script. The custom data element can be accessed from Groovy hook scripts and Freemarker templates.

  • response.customData

Search session
  • session

Extended exercises: search results API
  1. Construct a URL to return the first 5 results about Charlie Chaplin from the silent-films collection as JSON.

  2. Construct a URL to return the 2nd page of results about egg from the foodista collection. Return tags metadata in the result set, sort the results by title and return the results as XML.

7.4. Auto-completion API

Funnelback’s auto-completion web service is queried by making an HTTP GET request to Funnelback’s suggest.json endpoint.

This GET request returns a JSON response containing auto-completions for a partial query.
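
For example, a request along the following lines returns suggestions for the partial query egg against the foodista collection (the partial_query and show parameter names should be confirmed against the auto-completion documentation for your version):

http://training-search.clients.funnelback.com/s/suggest.json?collection=foodista&partial_query=egg&show=10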

The public API testing tool is a good way to test the suggest.json endpoints.

7.5. Recommender API

Funnelback has the ability to recommend a list of items that are related to a URL.

This feature is designed to be a web service provided by Funnelback where the recommendation results are embedded in content pages.

This can be used to supplement content pages with functionality akin to ‘people who are interested in this product are also interested in these products'.

Recommendations are based on usage of the search and ideally should not be enabled until there is sufficient click data to produce sensible recommendations. In the absence of click data recommender falls back to using Funnelback’s explore functionality.

It is also possible to convert other sources of usage (such as e-commerce system purchase data) for use with recommender.

In order to generate the recommendations database, analytics updates must have run at least once for the collection.
Exercise 13: Enabling recommendations
  1. Log in to the administration interface and switch to the foodista collection

  2. Edit the collection configuration and add:

    recommender=true

    then save the file. The recommender database will now be built automatically whenever the collection is updated.

  3. Select view analytics now to ensure some analytics are returned - if you receive a message saying analytics have never run then keep refreshing the page periodically until some data is returned.

  4. The recommender database can be rebuilt without doing a full update of the collection. To build the database without updating the collection first swap the views by selecting swap live and offline view from the list of advanced update options. Then restart the update from the index phase for the collection by running another advanced update by selecting index from the restart update options. (Note: A reindex of the live view doesn’t build recommendations).

  5. View the main collection update log (update-foodista.log from the collection logs) and observe messages relating to the recommender build.

  6. Log in to the Funnelback server via SSH and verify that the recommender database has been built and has been written to $SEARCH_HOME/data/foodista/live/databases/foodista_recommender*.

Exercise 14: Retrieving recommendations

Recommendations are returned by the recommendations API. To retrieve a set of recommendations you need to supply a minimum of a seed URL (which must be within the search index) as well as the collection.

For this example we will retrieve a set of recommendations for: http://www.foodista.com/recipe/5DF22BMY/apricot-glazed-chicken.html

  1. Access the following API endpoint: http://training-search.clients.funnelback.com/s/recommender/similarItems.json with the following parameters:

  2. Observe the JSON response:

    exercise retrieving recommendations 01

    There are additional API parameters that can be supplied - see the documentation link below for more information.

  3. Use the public API testing tool to rerun the recommender query above.

8. Configuring ranking

8.1. Ranking options

Funnelback’s ranking algorithm determines which results are retrieved from the index and how the order of relevance is determined.

The ranking of results is a complex problem, influenced by a multitude of document attributes. It’s not just about how many times a word appears within a document’s content.

  • Ranking options are a subset of the query processor options which also control other aspects of query time behaviour (such as display settings).

  • Ranking options are applied at query time - this means that different services and profiles can have different ranking settings applied, on an identical index. Ranking options can also be changed via CGI parameters at the time the query is submitted.

8.2. Automated tuning

Tuning is a process that can be used to determine which attributes of a document are indicative of relevance and adjust the ranking algorithm to match these attributes.

The default settings in Funnelback are designed to provide relevant results for the majority of websites. Funnelback uses a ranking algorithm that is influenced by many different weighted factors that scores each document in the index when a search is run. These individual weightings can be adjusted and tuning is the recommended way to achieve this.

The actual attributes that inform relevance will vary from site to site and can depend on the way in which the content is written and structured on the website, how often content is updated and even the technologies used to deliver the website.

For example, the following attributes can inform relevance:

  • How many times the search keywords appear within the document content

  • If the keywords appear in the URL

  • If the keywords appear in the page title, or headings

  • How large the document is

  • How recently the document has been updated

  • How deep the document is within the website’s structure

Tuning allows for the automatic detection of attributes that influence ranking. The tuning process requires training data from the content owners. This training data is made up of a list of possible searches - keywords paired with the URL deemed to be the best answer for each keyword, as determined by the content owners.

A training set of 50-100 queries is a good size for most search implementations. Too few queries will not provide adequately broad coverage and will skew the optimal ranking settings suggested by tuning. Too many queries will place considerable load on the server for a sustained length of time as the tuning tool runs each query with different combinations of ranking settings. It is not uncommon to run in excess of 1 million queries when running tuning.

Funnelback uses this list of searches to optimise the ranking algorithm, by running each of the searches with different combinations of ranking settings and analysing the results for the settings that provide the closest match to the training data.

It is very important to understand that tuning does not guarantee that any of the searches provided in the training data will return as the top result - but this information should result in improved results for all searches.

The tuning tool consists of two components - the training data editor and the components to run tuning.

Any user with access to the marketing dashboard has the ability to edit the tuning data.

Only an administrator can run tuning and apply the optimal settings to a search.

The running of tuning is restricted to administrators as the tuning process can place a heavy load on the server and the running of tuning needs to be managed.

8.2.1. Editing training data for tuning

The training data editor is accessed from the marketing dashboard by clicking on the tuning tile, or by selecting tuning from the left hand menu.

A blank training data editor is displayed if tuning has not previously been configured.

editing training data for tuning 01

Clicking the add new button opens the editor screen.

editing training data for tuning 02

The tuning requires 50-100 examples of desirable searches. Each desirable search requires the search query and one or more URLs that represent the best answer for the query.

Two methods are available for specifying the query:

  1. Enter the query directly into the keyword(s) field, or

  2. Click the suggest keyword(s) button, then click on one of the suggestions that appear in a panel below the keyword(s) form field. The suggestions are randomised based on popular queries in the analytics. Clicking the button multiple times will generate different lists of suggestions.

editing training data for tuning 03

Once a query has been input the URLs of the best answer(s) can be specified.

URLs for the best answers are added by clicking either the suggest URLs to add or manually add a URL buttons.

Clicking the suggest URLs to add button opens a panel of the top results (based on current rankings).

editing training data for tuning 04

Clicking on a suggested URL adds the URL as a ‘best answer'.

editing training data for tuning 05

Additional URLs can be optionally added to the best URLs list - however the focus should be on providing additional query/best URL combinations over a single query with multiple best URLs.

A manual URL can be entered by clicking the manually add a URL button. Manually added URLs are checked as they are entered.

editing training data for tuning 06

Clicking the save button adds the query to the training data. The tuning screen updates to show the available training data. Hovering over the error status icon shows that there is an invalid URL (the URL that was manually added above is not present in the search index).

editing training data for tuning 07

Once all the training data has been added tuning can be run.

Tuning is run from the tuning history page. This is accessed by clicking the history sub-item in the menu, or by clicking the tuning runs button that appears in the start a tuning run message.

The tuning history shows the previous tuning history for the service and also allows users with sufficient permissions to start the tuning process.

Recall that only certain users are granted the permissions required to run tuning.

editing training data for tuning 08

Clicking the start tuning button initiates the tuning run and the history table provides updates on the possible improvement found during the process. These numbers will change as more combinations of ranking settings are tested.

editing training data for tuning 09

When the tuning run completes a score over time graph will be updated and the tuning runs table will hold the final values for the tuning run.

editing training data for tuning 10

Once tuning has been run a few times additional data is added to both the score over time chart and tuning runs table.

editing training data for tuning 11

The tuning tile on the marketing dashboard main page also updates to provide information on the most recent tuning run.

editing training data for tuning 12
The improved ranking is not automatically applied to the search. An administrator must log in to apply the optimal settings as found by the tuning process.
Exercise 15: Tuning search results
  1. Log in to the marketing dashboard and switch to the foodista collection.

  2. Access the tuning section by selecting tuning from the left hand menu, or by clicking on the tuning tile.

    editing training data for tuning 01
  3. Click on the add new button to open up the tuning editor screen and start defining the training data. An empty edit screen loads.

    exercise tuning search results 01
  4. Enter a query by adding a word or phrase to the keyword(s) field. Click on the suggest keyword(s) button to receive a list of suggested keywords and click on one of the suggestions observing that the value populates the keyword(s) field. Edit the value in the keyword(s) field and enter the word carrot.

    exercise tuning search results 02
  5. Observe that the best URLs panel updates with two buttons allowing the best answers to be defined. Click on the suggest URLs to add button to open a list of pages to choose from.

    exercise tuning search results 03
  6. Select the page that provides the best answer for a query of carrot. Note that scrolling to the bottom of the suggested URLs allows further suggestions to be loaded. Click on one of the suggested URLs to set it as the best answer for the search. Observe that the selected URL appears beneath the Best URLs heading.

    exercise tuning search results 04
  7. Save the sample search by clicking on the save button. The training data overview screen reloads showing the suggestion that was just saved.

    exercise tuning search results 05
  8. Run tuning by switching to the history screen. The history screen is accessed by selecting history from the left hand menu, or by clicking on the tuning runs button contained within the information message at the top of the screen.

    exercise tuning search results 06
  9. The history screen is empty because tuning has not been run on this service. Start the tuning by clicking the start tuning button. The screen refreshes with a table showing the update status. The table shows the number of searches performed and the possible improvement (and current score) for the optimal set of ranking settings (based on the combinations that have been tried so far during this tuning run).

    exercise tuning search results 07
  10. When the tuning run completes the display updates with a score over time chart that shows the current (in green) and optimised scores (in blue) over time.

    exercise tuning search results 08
  11. Return to the main services screen by clicking the Foodista dashboard item in the left hand menu and observe the tuning tile shows the current performance.

    exercise tuning search results 09
  12. To apply the optimal tuning settings return to the administration interface and select view tuning results from the tune tab

    exercise tuning search results 10
  13. The tuning results screen will be displayed showing the optimal set of ranking settings found for the training data set.

    exercise tuning search results 11
  14. To apply these tuning options click the apply new options button. The settings are written to the padre_opts.cfg for the preview profile allowing you to compare the search before and after tuning was run.

  15. Open the file manager by selecting browse collection configuration files from the administer tab and observe the padre_opts.cfg file in the presentation (preview) section contains the set of ranking options listed above.

    exercise tuning search results 12
  16. Return to the administration interface home screen and run a search for carrot against the live profile. This will run the search with current ranking settings.

    exercise tuning search results 13
  17. Observe the results noting the first few results.

    exercise tuning search results 14
  18. Return to the administration interface home screen and rerun the search for carrot, this time against the preview profile. This will run the search with the tuned ranking settings.

    exercise tuning search results 15
  19. Observe the results noting the first few results and that the URL you selected previously in the Best URLs is now listed as the first result.

    exercise tuning search results 16
  20. To make the ranking settings live return to the file manager display and publish the padre_opts.cfg file. Retest the live search to ensure that the settings have been applied successfully.

The padre_opts.cfg contains values that are appended to the query_processor_options collection.cfg parameter. An alternative way of making the settings live is to add all the options listed in padre_opts.cfg to the collection.cfg query_processor_options value then remove them from the padre_opts.cfg. See also applying ranking options below.
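
For example, the following profile-level padre_opts.cfg and collection-level query_processor_options apply the same sort of options at different levels (the values are illustrative only):

padre_opts.cfg (profile level):

-cool.5=0.4 -cool.21=0.24

collection.cfg (collection level):

query_processor_options=-stem=2 -cool.5=0.4 -cool.21=0.24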

8.3. Setting ranking indicators

Funnelback has an extensive set of ranking parameters that influence how the ranking algorithm operates.

This allows for customisation of the influence provided by 73 different ranking indicators.

Automated tuning should be used (where possible) to set ranking influences, as manually altering influences can fix a specific problem at the expense of the rest of the content.

The main ranking indicators are:

  • Content: This is controlled by the cool.0 parameter and is used to indicate the influence provided by the document’s content score.

  • On-site links: This is controlled by the cool.1 parameter and is used to indicate the influence provided by the links within the site. This considers the number and text of incoming links to the document from other pages within the same site.

  • Off-site links: This is controlled by the cool.2 parameter and is used to indicate the influence provided by the links outside the site. This considers the number and text of incoming links to the document from external sites in the index.

  • Length of URL: This is controlled by the cool.3 parameter and is used to indicate the influence provided by the length of the document’s URL. Shorter URLs generally indicate a more important page.

  • External evidence: This is controlled by the cool.4 parameter and is used to indicate the influence provided via external evidence (see query independent evidence below).

  • Recency: This is controlled by the cool.5 parameter and is used to indicate the influence provided by the age of the document. Newer documents are generally more important than older documents.

A full list of all the cooler ranking options is provided in the documentation link below.

8.4. Applying ranking options

Ranking options are applied in one of three ways:

  • Set as a default for the collection by adding the ranking option to the query_processor_options parameter in the collection.cfg. This can be done either by editing the collection.cfg file directly, or via the administration interface’s interface editor screen which is accessible via the edit collection settings option.

  • Set as a default for the profile by adding the ranking option to the list of options defined in the profile’s padre_opts.cfg. This can be done by editing the padre_opts.cfg file within the relevant profile folder directly, or by editing padre_opts.cfg for the relevant profile from the file manager screen in the administration interface for the collection.

  • Set at query time by adding the ranking option as a CGI parameter. This is a good method for testing but should be avoided in production unless the ranking factor needs to be dynamically set for each query, or set by a search form control such as a slider.

Many ranking options can be set simultaneously, with the ranking algorithm automatically normalising all of the supplied ranking factors. E.g.

query_processor_options=-stem=2 -cool.1=0.7 -cool.5=0.3 -cool.21=0.24

Automated tuning is the recommended way of setting these ranking parameters as it uses an optimisation process to determine the optimal set of factors. Manual tuning can result in an overall poorer end result as improving one particular search might impact negatively on a lot of other searches.

Exercise 16: Manually set ranking parameters
  1. Run a search against the foodista collection for sugar. (http://training-search.clients.funnelback.com/s/search.html?collection=foodista&profile=_default_preview&query=sugar). Observe the order of search results.

    exercise manually set ranking parameters 01
  2. Provide maximum influence to document recency by setting cool.5=1.0. This can be added as a CGI parameter, in the query processor options of the collection’s collection.cfg or in the padre_opts.cfg defined for a profile on a collection. Add &cool.5=1.0 to the URL and observe the change in result ordering. Observe that despite the order changing to increase the influence of date, the results are not returned sorted by date (because there are still other factors that influence the ranking).

    exercise manually set ranking parameters 02
  3. Log in to the administration interface and select the foodista collection.

  4. Select edit collection configuration from the administer tab, then the interface option from the left hand menu. Locate the query processor heading and add -cool.5=1.0 to the query processor options.

    exercise manually set ranking parameters 03
  5. Rerun the first search against the foodista collection for sugar (http://training-search.clients.funnelback.com/s/search.html?collection=foodista&query=sugar&profile=_default_preview) and observe that the result ordering reflects the search that was run when recency was upweighted. This is because the cool.5 setting has been set as a default for the collection. It can still be overridden by setting cool.5 in the URL string, or at profile level in the padre_opts.cfg but will be set to 1.0 when it’s not specified elsewhere.

8.5. Meta collection component weighting

When different collections are combined into a meta collection it is often beneficial to weight the collections differently. This can be for a number of reasons, the main ones being:

  • Some collections are simply more important than others. E.g. a university’s main website is likely to be more important than a department’s website.

  • Some collection types naturally rank better than others. E.g. web collections generally rank better than other collection types as there is a significant amount of additional ranking information that can be inferred from attributes such as the number of incoming links, the text used in these links and page titles. XML and database collections generally have few attributes beyond the record content that can be used to assist with ranking.

Meta collection component weighting is controlled using the cool.21 parameter.
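
For example, to give component weighting a moderate default influence for the meta collection, add the option to the query processor options (the value shown is illustrative):

query_processor_options=-cool.21=0.3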

Exercise 17: Apply component weighting to a meta collection
  1. If you have completed FUNL202 and the Funnelback - combined search and Funnelback - website collections that you created are still present in your administration interface then skip to step 4.

  2. Create a web collection with the following details and update the collection immediately:

  3. Create a meta collection with the following details:

    • collection ID: funnelback-search

    • collection title: Funnelback - combined search

    • Sub-collections: Funnelback - website; Funnelback Documentation

  4. Log in to the Funnelback server via SSH and change to the live index folder for the funnelback-search collection.

    [search@localhost ~]$ cd /opt/funnelback/data/funnelback-search/live/idx
  5. Edit the index.sdinfo file for the funnelback-search meta collection. The index.sdinfo file lists the index stem, alias and component weighting for each sub-collection that is part of the meta collection. Each index component has a default weighting of 0.5. The weighting values range between 0.0 and 1.0. The alias cannot currently be changed from the default value.

    [search@localhost idx]$ nano index.sdinfo
    /opt/funnelback/data/funnelback-website/live/idx/index funnelback-website 0.5
    /opt/funnelback/data/funnelback_documentation/live/idx/index funnelback_documentation 0.5
  6. Define the relative weightings for each collection. 0.5 is the default weighting. To upweight a collection set a value between 0.5 and 1.0. To downweight a collection set a value below 0.5. Set the Funnelback website collection to have a large upweight and the Funnelback documentation collection to have a moderate downweight. Note: the relative weightings will have no effect until an influence is set on the component weighting factor in the ranking algorithm. Edit the index.sdinfo file with the following weightings and save the file.

    /opt/funnelback/data/funnelback-website/live/idx/index funnelback-website 0.9
    /opt/funnelback/data/funnelback_documentation/live/idx/index funnelback_documentation 0.7
  7. Run a search for content auditor against the Funnelback - combined search collection and observe the results without any influence from the component weighting. The top results are all returned from the Funnelback documentation. (http://training-search.clients.funnelback.com/s/search.html?collection=funnelback-search&query=content+auditor)

    exercise apply component weighting to a meta collection 01
  8. Apply an influence by setting the cool.21 query processor option. The influence is a value between 0 and 1. A higher value of cool.21 indicates more influence. Set cool.21 to a value of 0.3 by adding &cool.21=0.3 to the URL. Observe that a website result appears at rank 7.

    exercise apply component weighting to a meta collection 02
  9. Increase the influence to 0.9 and observe the change on the results.

    exercise apply component weighting to a meta collection 03
  10. Set the influence as a collection default or profile default by adding -cool.21=0.9 to the query_processor_options (in the collection configuration) or padre_opts.cfg.

8.6. Result diversification

There are a number of ranking options that are designed to increase the diversity of the result set. These options can be used to reduce the likelihood of result sets being flooded by results from the same website, collection etc.

8.6.1. Same site suppression

Each website has a unique information profile and some sites naturally rank better than others. Search engine optimisation (SEO) techniques assist with improving a website’s natural ranking.

Same site suppression can be used to downweight consecutive results from the same website resulting in a more diverse set of search results.

Same site suppression is configured by setting the following query processor options:

  • SSS: controls the depth of comparison (in the URL) used to determine what a site is. This corresponds to the depth of the URL (or the number of subfolders in a URL).

    • Range: 0-10

    • SSS=0: no suppression (default for non-web collections)

    • SSS=2: default for web and meta collections (site name + first level folder)

    • SSS=10: special meaning for big web applications.

  • SameSiteSuppressionExponent: Controls the downweight penalty applied. Larger values result in greater downweight.

    • Range: 0.0 - unlimited (default = 0.5)

    • Recommended value: between 0.2 and 0.7

  • SameSiteSuppressionOffset: Controls how many documents are displayed beyond the first document from the same site before any downweight is applied.

    • Range: 0-1000 (default = 0)

  • sss_defeat_pattern: URLs matching the simple string pattern are excluded from same site suppression.
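
For example, the following options apply the default comparison depth with a moderate penalty, delay the penalty until a second result from the same site has been displayed and exempt a particular site from suppression (the values and the pattern are illustrative only):

query_processor_options=-SSS=2 -SameSiteSuppressionExponent=0.5 -SameSiteSuppressionOffset=1 -sss_defeat_pattern=intranet.example.com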

8.6.2. Same meta suppression

Downweights subsequent results that contain the same value in a specified metadata field. Same meta suppression is controlled by the following ranking options:

  • same_meta_suppression: Controls the downweight penalty applied for consecutive documents that have the same metadata field value.

    • Range: 0.0-1.0 (default = 0.0)

  • meta_suppression_field: Controls the metadata field used for the comparison. Note: only a single metadata field can be specified.

8.6.3. Same collection suppression

Downweights subsequent results that come from the same collection. This provides similar functionality to the meta collection component weighting above and could be used in conjunction with it to provide an increased influence. Same collection suppression is controlled by the following ranking options:

  • same_collection_suppression: Controls the downweight penalty applied for consecutive documents that reside in the same collection.

    • Range: 0.0-1.0 (default = 0.0)

8.6.4. Same title suppression

Downweights subsequent results that contain the same title. Same title suppression is controlled by the following ranking options:

  • title_dup_factor: Controls the downweight penalty applied for consecutive documents that have the same title value.

    • Range: 0.0-1.0 (default = 0.5)
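
These suppression options can be combined with each other. For example, the following applies a mild penalty to consecutive results that share the same value in an author metadata field, come from the same collection or have the same title (the author field and the values are illustrative only):

query_processor_options=-same_meta_suppression=0.3 -meta_suppression_field=author -same_collection_suppression=0.3 -title_dup_factor=0.5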

Exercise 18: Same site suppression
  1. Run a search against the simpsons collection for homer. Observe the order of the search results noting that the results are spread quite well with consecutive results coming from different folders. Funnelback uses same site suppression to achieve this and the default setting (SSS=2, which corresponds to hostname and first folder) is applied to mix up the results a bit.

    exercise same site suppression 01
  2. Turn off same site suppression by adding &SSS=0 to the URL observing the effect on the search result ordering.

    exercise same site suppression 02
  3. Remove the SSS=0 parameter from the URL to re-enable the default suppression and add SameSiteSuppressionOffset=2 to change the suppression behaviour so that it kicks in after a few results. This allows several consecutive results from the reviews section to remain unaffected by any penalty for being from the same section as the previous result.

    exercise same site suppression 03

8.6.5. Result collapsing

While not a ranking option, result collapsing can be used to effectively diversify the result set by grouping similar result items together into a single result.

Results are considered to be similar if:

  • They share near-identical content

  • They have identical values in one or a set of metadata fields.

Result collapsing requires configuration that affects both the indexing and query time behaviour of Funnelback.

Exercise 19: Configure result collapsing
  1. Log in to the administration interface and change to the nobel-prize-winners collection

  2. Configure three keys to collapse results on - year of prize, prize name and a combination of prize name and country. Edit the collection settings (administer tab) then select the indexer item from the left hand menu.

  3. Update the result collapsing fields to add collapsing on year, prize name and prize name+country.

    [$],[H],[year],[prizeName],[prizeName,country]
    exercise configure result collapsing 01
    $ is a special value that collapses on the document content (can be useful for collapsing different versions or duplicates of the same document). H contains an MD5 sum of the document content which is used for collapsing of duplicate content.
  4. Rebuild the index by selecting reindex the live view from the advanced options.

  5. Ensure the template is configured to display collapsed results. Locate the section of the results template where each result is printed and add the following code just below the closing </ul> tag for each result (approx. line 520). This checks to see if the result element contains any collapsed results and prints a message.

    <#if s.result.collapsed??>
      <div class="search-collapsed"><small><span class="glyphicon glyphicon-expand text-muted"></span>&nbsp; <@fb.Collapsed /></small></div>
    </#if>
    exercise configure result collapsing 02
  6. Test the result collapsing by running a query for prize and adding the following to the URL: &collapsing=on&collapsing_sig=[year]. (http://training-search.clients.funnelback.com/s/search.html?collection=nobel-prize-winners&profile=_default&query=prize&collapsing=on&collapsing_sig=%5Byear%5D) Observe that results contain an additional link indicating the number of very similar results and the result summary includes a message indicating that collapsed results are included.

    exercise configure result collapsing 03
  7. Clicking on this link will return all the similar results in a single page.

    exercise configure result collapsing 04
  8. Return to the previous search listing by pressing the back button and inspect the JSON (or XML) view of the search observing that the result summary contains a collapsed count and that each result item contains a collapsed field. This collapsed field indicates the number of matching items and the key on which the match was performed and can also include some matching results with metadata. Observe that there is a results sub-item and it’s currently empty:

    exercise configure result collapsing 05
  9. Return some results into the collapsed sub-item by adding the following to the URL: &collapsing_num_ranks=3&collapsing_SF=[prizeName,name] (http://training-search.clients.funnelback.com/s/search.json?collection=nobel-prize-winners&profile=_default&query=prize&collapsing=on&collapsing_sig=%5Byear%5D&collapsing_num_ranks=3&collapsing_SF=%5BprizeName)

    The first option, collapsing_num_ranks=3 tells Funnelback to return 3 collapsed item results along with the main result. These can be presented in the result template as sub-result items. The second option, collapsing_SF=[prizeName,name] controls which metadata fields are returned in the collapsed result items.

    exercise configure result collapsing 06
  10. Return to the initial search collapsing on year (http://training-search.clients.funnelback.com/s/search.html?collection=nobel-prize-winners&profile=_default&query=prize&collapsing=on&collapsing_sig=%5Byear%5D) and change the collapsing_sig parameter to collapse on prizeName: (http://training-search.clients.funnelback.com/s/search.html?collection=nobel-prize-winners&profile=_default&query=prize&collapsing=on&collapsing_sig=%5BprizeName%5D). Observe that the results are now collapsed by prize name.

    exercise configure result collapsing 07
  11. Finally change the collapsing signature to collapse on [prizeName,country] (http://training-search.clients.funnelback.com/s/search.html?collection=nobel-prize-winners&profile=_default&query=prize&collapsing=on&collapsing_sig=%5BprizeName). This time the results are collapsed grouping all the results for a particular prize category won by a specific country. E.g. result item 1 below groups all the results where the Nobel Peace Prize was won by someone from the USA.

    exercise configure result collapsing 08
  12. Click on the 24 very similar results link and confirm that the 25 results returned are all for the Nobel Peace Prize and that each recipient was born in the USA.

    exercise configure result collapsing 09
  13. The values for collapsing, collapsing_sig, collapsing_SF and collapsing_num_ranks can be set as defaults in the same way as other display and ranking options and can be set either at the collection level (in the collection configuration) or profile/service level (in padre_opts.cfg).

8.7. Metadata weighting

It is often desirable to up (or down) weight a search result when search keywords appear in specified metadata fields.

The following ranking options can be used to apply metadata weighting:

  • sco=2: Setting the -sco=2 ranking option allows specification of the metadata fields that will be considered as part of the ranking algorithm. By default link text, clicked queries and titles are included. The list of metadata fields to use with sco=2 is defined within square brackets when setting the value. E.g. -sco=2[k,K,t,customField1,customField2] tells Funnelback to apply scoring to the default fields as well as customField1 and customField2. Default: -sco=2[k,K,t]

  • wmeta: Once the scoring mode is set to 2, any of the defined fields can have individual weightings applied. Each defined field can have a wmeta value specifying the weighting to apply to the metadata field. The weighting is a value between 0.0 and 1.0. A weighting of 0.5 is the default; a value >0.5 will apply an upweight and a value <0.5 will apply a downweight. E.g. -wmeta.t=0.6 applies a slight upweighting to the t metadata field while -wmeta.customField1=0.2 applies a strong downweighting to customField1.

Exercise 20: Custom metadata weighting
  1. Switch to the foodista collection and perform a search for pork (http://training-search.clients.funnelback.com/s/search.html?collection=foodista&query=pork).

    exercise custom metadata weighting 01
  2. Change the settings so that the maximum upweight is provided if the search query terms appear within the tags field. Remove the influence provided by the title metadata. Add the following to the query processor options:

    -sco=2[k,K,t,tags] -wmeta.tags=1.0 -wmeta.t=0.0

    or as CGI parameters (http://training-search.clients.funnelback.com/s/search.html?collection=foodista&query=pork&sco=2%5bt). Observe the effect on the ranking of the results.

    exercise custom metadata weighting 02

8.8. Query independent evidence

Query independent evidence (QIE) allows certain pages or groups of pages within a website (based on a regular expression match to the document’s URL) to be upweighted or downweighted without any consideration of the query being run.

This can be used for some of the following scenarios:

  • Provide upweight to globally important pages, such as the home page.

  • Weighting different file formats differently (e.g. upweight PDF documents).

  • Apply up or down weighting to specific websites.

Query independent evidence is applied in two steps:

  • Defining the relative weights of items to upweight or downweight by defining URL patterns in the qie.cfg configuration file. This generates an extra index file that records the relative weights of documents.

  • Applying an influence within the ranking algorithm to query independent evidence by setting a weighting for cool.4. The weighting for cool.4 can be adjusted dynamically (e.g. to disable it via a CGI parameter by setting &cool.4=0.0) or per profile.

Exercise 21: Query independent evidence
  1. Log in to the administration interface and switch to the simpsons collection

  2. Open the file manager and create a qie.cfg file

  3. Add the following URL weightings to the qie.cfg and then save the file. A weight of 0.25 provides a moderate downweight, while 1.0 is the maximum upweight that can be provided via QIE. Items default to a weight of 0.5.

    # down-weight reviews and upweight episodes
    0.25  ^www.simpsoncrazy.com/reviews/
    1.0   ^www.simpsoncrazy.com/episodes/
  4. Re-index the collection.

  5. Run a query for homer against the collection with cool.4=0 and then cool.4=1.0 to observe the effect of QIE when it has no influence and when it has the maximum influence. Applying the maximum influence from QIE pushes episode pages to the top of the results (and this is despite the default same site suppression being applied).

    exercise query independent evidence 01
    exercise query independent evidence 02

9. Custom collections

A custom collection provides a way to extend Funnelback to gather content from unsupported data sources. The same underlying mechanism is used to gather content for supported social media repositories.

This is achieved by providing a mechanism that allows the administrator or implementer to write a custom gather script in Groovy.

This is an advanced topic and requires knowledge of the Java and Groovy programming languages.

This custom gather script is responsible for connecting to the remote data source, gathering/filtering the content and writing the result of this to Funnelback’s data store.

Two data store types are available - an XML store, which is optimised for storage of XML records, and a raw bytes store for storage of other data.

The custom gatherer has access to a configuration object, providing access to all the items defined within the collection configuration (including inherited values).

Custom collection configuration settings can also be defined and should be used to configure the custom gatherer so that custom gatherers are reusable once built.

Funnelback provides several data store types which the custom gatherer writes the transformed content to.

Funnelback ships with several custom gatherers that implement connections to a number of common social media repositories (YouTube, Facebook, Twitter, Flickr).

The basic structure of a custom gatherer is:

  1. Library imports

  2. Create a collection object

  3. Clear out the offline view (optional).

  4. Open the data store.

  5. Gather the content, perform any filtering and write each item to the data store.

  6. Close the data store and clean up.

Custom gatherers can call the Funnelback filter chain and built-in or custom filters can be run across the downloaded content before it is stored.
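
The following is a minimal skeleton showing how these steps map onto the gatherer API used in the exercise below. It is a sketch only, using an XML store and a single placeholder record; a real gatherer would loop over the items returned by the remote data source.

import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;

// 1. and 2. Library imports and a configuration object for this collection
def config = new NoOptionsConfig(new File(args[0]), args[1]);

// 3. (optional) clearing the offline view is omitted in this sketch

// 4. Open the data store (an XML store here)
def store = new XmlStoreFactory(config).newStore();
store.open()

// 5. Gather the content and write each item to the data store (placeholder record shown)
def xml = XMLUtils.fromString("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<item><title>Example</title></item>")
store.add(new XmlRecord(xml, "http://example.com/doc1"))

// 6. Close the data store so the gathered content is flushed
store.close()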

Exercise 22: Custom gatherer to download and index a CSV file

In this exercise the same dataset that was used for the previous nobel prize winners example will be gathered directly using a custom gatherer.

  1. Create a new custom collection called nobel-custom.

    • Project group ID: Training 203

    • Collection ID: nobel-custom

    • Collection title: Nobel prize winners (custom)

    • Template: none

  2. Create a new custom_gather.groovy from the file manager. Select view collection configuration files from the administer tab, then edit configuration files and create a new custom_gather.groovy.

    exercise custom gatherer to download and index a csv file 01
  3. Insert the following code into the file then save.

    import com.funnelback.common.*;
    import com.funnelback.common.config.*;
    import com.funnelback.common.io.store.*;
    import com.funnelback.common.io.store.xml.*;
    import com.funnelback.common.utils.*;
    import java.net.URL;
    
    /*
     * Custom gatherer for use with a custom collection for download of csv/tsv delimited files.
     *
     * Supports the following collection.cfg settings:
     * Collection.cfg options:
     * csv.format (NOTE: only tested with csv and tsv)
     *   values: These map to the available CSVFormat types csv xls rfc4180 tsv mysql
     * csv.encoding values: Java character encoding - eg: UTF-8, ISO-8859-1
     * csv.sourceurl set this to the URL of where to fetch the CSV document
     * csv.header values: true (CSV has a header line) / false (CSV does not have a header line)
     * csv.header.custom Value: comma separated list of header titles.
     *   Assumes the number of items matches the number of columns in CSV.
     *   Non word chars are converted to underscores
     *   eg. csv.header.custom=field1,field2,field3,field4 Required if csv.header=false
     */
    
    //CSV imports
    import org.apache.commons.csv.CSVParser
    import static org.apache.commons.csv.CSVFormat.*
    
    // Create a configuration object to read collection.cfg
    def config = new NoOptionsConfig(new File(args[0]), args[1]);
    
    // Create a Store instance to store gathered data
    def store = new XmlStoreFactory(config).newStore();
    
    // set the fileformat and encoding based on supported types.  see https://commons.apache.org/proper/commons-csv/archives/1.0/apidocs/org/apache/commons/csv/CSVFormat.html
    def format = config.value("csv.format")
    def csvFormat = ["csv":DEFAULT,"xls":EXCEL,"rfc4180":RFC4180,"tsv":TDF,"mysql":MYSQL]
    def csvEncoding = config.value("csv.encoding")
    
    // Open the XML store
    store.open()
    
    // Fetch the CSV file and convert it to XML
    // read CSV as text
    def csvText = new URL(config.value("csv.sourceurl")).getText(csvEncoding)
    
    // Parse the CSV text
    CSVParser csv = null
    if (Boolean.parseBoolean(config.value("csv.header"))) {
        // use the header row to define fields
            csv = CSVParser.parse(csvText, csvFormat[format].withHeader())
    }
    else {
        // use field definitions
        csv = CSVParser.parse(csvText, csvFormat[format].withHeader(config.value("csv.header.custom").split(",")))
    }
    
    // Convert each row into an XML record using the column headings as XML element names
    def i=0;
    for (record in csv.iterator()) {
        def fields = record.toMap();
            def xmlString = new StringWriter();
            def xmlRecord = new groovy.xml.MarkupBuilder(xmlString);
            xmlRecord.item() {
                    fields.each() {key, value ->
                            // create an element named after the CSV column (non-word chars replaced with _) containing the field value
                            "${key.replaceAll(/\W/, '_')}"("${value}")
                    }
            }
    
        // Add the XML record to an XML store (add an XML declaration line before inserting)
        def xmlContent = XMLUtils.fromString("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"+xmlString)
        store.add(new XmlRecord(xmlContent, config.value("csv.sourceurl")+"/doc"+i))
        i++
    }
    
    // Close the XML store
    // close() required for the store to be flushed
    store.close()
  4. Edit the collection configuration and add custom configuration settings that are defined within the custom gatherer code. Also add a configuration option that indicates what type of store the custom gatherer uses. This must be matched to the store type used in your custom gatherer. Ensure you save all the settings once you are done.

    csv.sourceurl: http://training-search.clients.funnelback.com/training/training-data/nobel/nobel.csv
    csv.format: csv
    csv.encoding: UTF-8
    csv.header: true
    store.record.type: XmlRecord

    exercise custom gatherer to download and index a csv file 02
  5. Run an update of the collection to perform an initial gather of the XML (and detection of XML fields).

  6. Configure the XML field to metadata class mappings. (Administration interface, administer tab, configure metadata mappings):

    Class name     Source                 Type   Search behaviour
    year           /item/Year             text   searchable as content
    category       /item/Category         text   searchable as content
    name           /item/Name             text   searchable as content
    birthDate      /item/Birthdate        text   display only
    birthPlace     /item/Birth_Place      text   searchable as content
    country        /item/County           text   searchable as content
    residence      /item/Residence        text   searchable as content
    roleAffiliate  /item/Role_Affiliate   text   searchable as content
    fieldLanguage  /item/Field_Language   text   searchable as content
    prizeName      /item/Prize_Name       text   searchable as content
    motivation     /item/Motivation       text   searchable as content

  7. Run an advanced update of the nobel-custom collection to reindex the live view to apply the metadata changes.

  8. Copy the default template used by the nobel-prize-winners collection to the nobel-custom collection:

    1. Change to the nobel-prize-winners collection.

    2. Click edit result templates to list the templates defined for the profile.

    3. Locate the simple.ftl file and click the filename to open the template editor. Save the file to your desktop by choosing the download file option from the tools menu.

      exercise custom gatherer to download and index a csv file 03
    4. Switch to the nobel-custom collection and select edit result templates from the customise tab. Click the upload files button and upload the simple.ftl that you just saved. You will receive a warning about the file existing. Overwrite the simple.ftl then publish the file.

      exercise custom gatherer to download and index a csv file 04
  9. Set the display and ranking options to the following. (Select edit collection configuration from the administer tab then update the query processor options on the interface screen).

    -stem=2 -SM=meta -SF=[year,category,name,birthDate,birthPlace,country,residence,roleAffiliate,fieldLanguage,prizeName,motivation]
  10. Run a search for nobel against the nobel-custom collection and observe that similar results are returned to those from the nobel-prize-winners collection.

    exercise custom gatherer to download and index a csv file 05

10. Template localisation

Funnelback natively handles documents and queries in non-English languages and in non-Latin character sets. Templates can also be configured to support localisation. This can be useful when providing a search interface designed to target several regions or languages.

Template localisation allows the definition of one or more alternate sets of labels for use within the Funnelback templates.

Additional configuration can be created defining the label text for different languages - and the templates configured to use the labels from the appropriate localisation file.

The localisation is selected by providing an additional CGI parameter that defines the translation file to apply to the template.
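
For example, a French localisation could be defined in a ui.fr.cfg file (the file name and keys here are illustrative):

search=Rechercher
results=Résultats

The template would then reference the labels as ${response.translations.search!"Search"} (falling back to the default label when no translation is defined), and the translation is selected by adding &lang.ui=fr to the search URL.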

Exercise 23: Create a simple localisation file

In this exercise an English (simplified) localisation file will be created that defines some less technical labels for some of the headings used on the advanced search form.

  1. Log in to the administration interface and select the foodista collection

  2. Create a new localisation file from the file manager screen. (Select browse collection configuration files from the administer tab, then create a new ui.*.cfg file from the foodista / presentation (preview) section.)

    exercise create a simple localisation file 01
  3. Define two alternate labels - add the following to the file then save and publish the file as ui.en_SI.cfg:

    metadata=Properties
    origin=Position
    exercise create a simple localisation file 02
  4. The default Funnelback search form hardcodes the labels in the template. Edit the default template and update the hardcoded labels to translation variables, defining a fallback to a default value. Locate the Metadata and Origin labels in the advanced search form (approx. lines 223 and 307) and replace these with ${response.translations.metadata!"Metadata"} and ${response.translations.origin!"Origin"}. Save and publish the template once the changes have been made.

    exercise create a simple localisation file 03
  5. View the advanced search form - run a query for eggs (http://training-search.clients.funnelback.com/s/search.html?collection=foodista&query=eggs) then view the advanced search form by selecting advanced search from the menu underneath the cog.

    exercise create a simple localisation file 04
  6. Observe the metadata heading and origin items are displaying the default values.

    exercise create a simple localisation file 05
  7. Modify the URL to specify the language by adding &lang.ui=en_SI to the URL (http://training-search.clients.funnelback.com/s/search.html?collection=foodista&query=eggs&lang.ui=en_SI). Observe that the labels for metadata and origin update to Properties and Position, as defined in the localisation file.

    exercise create a simple localisation file 06