WhamSearch Intelligent Spider User Guide

Intelligent Spider System Configuration

Spidering, Initializing and Using Web Content Indexes

Step 2: Build Initial Web Content Indexes

Step 3: Register Web Content Index with the EIQ Server as Data Source

WhamSearch Intelligent Spider User Guide

Introduction

This guide describes the functionality of the WhamSearch Intelligent Spider and provides instructions on how to configure the tools to work with it.

WhamSearch and EIQ Product tools combine to help data analysts find relevant content on the web through advanced intelligent spidering directed by relevant keywords, watch lists, and information geometry models (find similar documents). Analysts can automatically extract entities out of the content (structured data out of unstructured text), combine web content with other sources of data through the EIQ Federation Server, and get real-time alerts when the intelligent spider encounters material of interest.

Note: If the WhamSearch Intelligent Spider is not included in the EIQ Product Suite Installation Media, contact WhamTech at devsupport@whamtech.com.

Intelligent Spider Components

WhamSearch: WhamSearch is an intelligent spider that finds web content meeting user-defined criteria (based on keywords, watch lists, and information-geometry-based models). The output from WhamSearch is processed by other tools to provide text search and complex event processing.

JavaGateway configured as WhamEE: WhamEE (WhamTech Entity Extraction) identifies and extracts entities from WhamSearch-generated web content using the third-party, open-source GATE software. WhamEE passes the extracted entities, along with the original web content, to EIQ RTIS.

EIQ RTIS: EIQ RTIS monitors the output from WhamEE and passes new content to the EIQ Server to update web content search indexes in near-real-time. The new web content is immediately available for querying by clients.

EIQ Server: EIQ Server provides a SQL query interface to client applications. Clients connect to EIQ Server to query the indexes for WhamSearch-generated web content.

EIQ Server RTI Tool: The EIQ Server RTI Tool helps admins initialize empty web content indexes. It provides users with options to build various types of indexes including word stems, synonyms, SOUNDEX, Metaphone, and Proximity. It also includes options to build indexes that consider word weighting.

System Requirements

This guide assumes the WhamSearch systems have the following configuration:

Intelligent Spider System Configuration

· Microsoft Windows 2003 or 2008 Server with the latest service packs for medium scale deployments

· Microsoft Windows XP with the latest service packs for developer/trial versions and small scale deployments

· High-speed Internet access

For performance and scalability, the following system configuration is strongly recommended:

· Multiple physical disks: one for running WhamSearch projects and another for temporary files.

· SATA drives with RAID configuration.

Spidering, Initializing and Using Web Content Indexes

Step 1: Spider the domain URLs

WhamSearch takes the initial domain URLs (given in a CSV file), crawls the web to find relevant documents, and writes content bucket files to a specified output directory.

Open the WhamSearch Tool (Start -> All Programs -> WhamTech EIQ Server Suite -> WhamSearch) and select 'File->New project', or click on the 'blank page' icon.

Enter the project name (for example, 'project1'), then specify the Seed URL(s) (starting URLs) in one of two ways:

a. If you want to import the Seed URL(s) from a file:

Select the first radio button and click "Import" to import a comma separated value file (.CSV or .TXT) that contains the Seed URLs.

b. If you want the spider to continuously monitor two folders for seed URLs (one folder for adding seed URLs to the index or re-spidering existing URLs, and the other for deleting specified seed URLs and their crawled child URLs from the index):

Select the second radio button and Browse for the folders you want to specify for Addition and Deletion. You can also specify the polling interval time in minutes and seconds.

Click "Next".

Seed URLs File Format: The seed URL file must be a comma-separated values file (either .CSV or .TXT). Each seed URL is specified as "userdata, seedURL, priority", where:

· userdata is generally some form of ID field; for example, an account ID.

· seedURL is the starting URL that will be spidered. It must begin with "http://".

· priority is an optional single-character flag, where 'Y' denotes high priority and 'N' denotes normal priority. URLs with priority 'Y' are spidered the moment they are submitted, even if normal-priority URLs are currently being processed.
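The file format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name format_seed_rows, the account IDs, and the example.com URLs are hypothetical, and the sketch writes fields without spaces after the commas because this guide does not say whether surrounding whitespace is accepted.

```python
import csv
import io


def format_seed_rows(seeds):
    """Render (userdata, seed_url, priority) tuples as seed-file CSV text.

    Hypothetical helper for illustration; it is not part of WhamSearch.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for userdata, seed_url, priority in seeds:
        # The guide requires every seed URL to begin with "http://".
        if not seed_url.startswith("http://"):
            raise ValueError(f"seed URL must begin with http://: {seed_url!r}")
        # Priority is optional: 'Y' (high), 'N' (normal), or omitted.
        if priority not in ("", "Y", "N"):
            raise ValueError(f"priority must be 'Y', 'N', or empty: {priority!r}")
        row = [userdata, seed_url]
        if priority:
            row.append(priority)
        writer.writerow(row)
    return buf.getvalue()


# Made-up account IDs and URLs, used only to show the row layout.
seed_text = format_seed_rows([
    ("ACCT-001", "http://www.example.com", "Y"),
    ("ACCT-002", "http://www.example.org", "N"),
    ("ACCT-003", "http://www.example.net", ""),
])
print(seed_text)
```

Save the resulting text with a .CSV or .TXT extension before importing it into a project.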

On the Internal Domains screen, specify domains that can be treated as "Internal domains." By default, the sleep time for spider threads before revisiting pages in the same domain is 100 milliseconds for Internal domains. It is 5 seconds for external domains. These times can be changed on the same screen. The domain names (in the format www.domainname.com) can either be typed one-at-a-time by clicking "Add" or can be imported from a flat file.

Note: The distinction in thread sleep times for internal and external domains significantly improves spidering speed. Also, specifying internal domains is an optional step.
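To make the effect of the two sleep times concrete, the following Python sketch picks a revisit delay from the host name of a URL. It only illustrates the idea and is not how WhamSearch is implemented. The function name revisit_delay and the exact-match comparison of host names are assumptions; the 100 millisecond and 5 second values are the defaults stated above.

```python
from urllib.parse import urlparse

INTERNAL_SLEEP_SECONDS = 0.1  # default for internal domains (100 milliseconds)
EXTERNAL_SLEEP_SECONDS = 5.0  # default for external domains (5 seconds)


def revisit_delay(url, internal_domains):
    """Return the sleep time to apply before revisiting the domain of `url`.

    Hypothetical helper: matches the host name exactly against the
    configured internal domains (an assumption made for this sketch).
    """
    host = (urlparse(url).hostname or "").lower()
    if host in internal_domains:
        return INTERNAL_SLEEP_SECONDS
    return EXTERNAL_SLEEP_SECONDS


internal = {"www.example.com"}  # made-up internal domain list
print(revisit_delay("http://www.example.com/page", internal))  # 0.1
print(revisit_delay("http://www.other.org/", internal))        # 5.0
```

With these defaults, a thread can revisit an internal domain fifty times in the time it waits once on an external domain, which is why listing internal domains speeds up spidering.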

Select the file types you would like to index (.htm, .html, .php, .PDF, .DOC, etc.) and click "Next".

WhamSearch Project Properties is used to specify the output folders for WhamSearch and to change various project settings. WhamSearch uses the folder name entered in "Project Folder" to store files for internal use. It uses the "Bulkfile Folder" to store the output bucket files (.tsi). These are used by the EIQ Server RTI Tool and EIQ RTIS to build and update the URL indexes. Make changes to the other configuration settings if necessary and click "Finish".

WhamSearch will now start spidering the input URLs. This may take a while depending on the number of input URLs, server load, and the number of crawled URLs. WhamSearch displays the spidering activity information for each project on the right side. This information includes the number of URLs tried for download, the number of URLs written to bucket file, and the number of failed URLs.

At the end of spidering the current set of URLs, the "URLs in the queue" section in the status bar at the bottom displays 0 (zero).

Logging: WhamSearch creates a log file that records each failed URL and the reason for the failure. The log file is located in the "C:\Program Files\WhamTech\Log" folder, where "C:\Program Files\WhamTech" is the WhamSearch installation folder.

Adding Seed URLs while the Spider is running: WhamSearch spiders the domains in the order given in the input file. Once spidering starts, users can add new domains that have not been previously spidered by right-clicking "Start Seed URL(s)" for the relevant project in the left pane. Users can mark domains as high priority by selecting the "high-priority" checkbox. These domains will be spidered almost immediately and will be available for EIQ RTIS to update the indexes. Users can also add new domains or re-spider previously spidered domains by placing a new seed URL CSV file in the "Addition" folder. This option is available only if the 'folders' option was chosen while creating the project.
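Placing a seed file in the "Addition" folder can be scripted. The Python sketch below writes the file in a separate staging folder first and then moves it into place, so that a folder poller does not pick up a half-written file. Everything named here is an assumption made for illustration: the function submit_seed_file, the staging folder, the file name, and the UTF-8 encoding are not taken from WhamSearch, and temporary directories stand in for the real project folders. The staging folder should be on the same volume as the "Addition" folder so that the move is a single rename.

```python
import os
import tempfile
from pathlib import Path


def submit_seed_file(staging_dir, addition_dir, file_name, csv_text):
    """Write a seed file in staging_dir, then move it into addition_dir.

    Hypothetical helper; the encoding is an assumption for this sketch.
    """
    staging_path = Path(staging_dir) / file_name
    final_path = Path(addition_dir) / file_name
    staging_path.write_text(csv_text, encoding="utf-8")
    # os.replace is a single rename when both folders are on the same volume.
    os.replace(staging_path, final_path)
    return final_path


# Temporary folders stand in for the real staging and Addition folders.
with tempfile.TemporaryDirectory() as root:
    staging = Path(root) / "staging"
    addition = Path(root) / "Addition"
    staging.mkdir()
    addition.mkdir()
    submit_seed_file(
        staging,
        addition,
        "new_seeds.csv",
        "ACCT-004,http://www.example.com/news,Y\n",
    )
    dropped_names = sorted(p.name for p in addition.iterdir())
    print(dropped_names)  # ['new_seeds.csv']
```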

To close the WhamSearch tool while a project is being spidered, or to release the current set of output bulk files that the threads are writing to (without waiting till the "idle interval to release bulkfiles" elapses), the spider threads have to be manually stopped by the user. This can be done by right-clicking "Spider Thread(s)" and selecting "Stop". The spider threads can be started again manually by right-clicking "Spider Thread(s)" and selecting "Start" or automatically by placing a seed URL .CSV (or .TXT) file in the Addition/Deletion folder.

Saving and Opening Existing Projects:

Project settings are saved automatically as an .XML file under the project folder with the name <project_name>.xml.

To open a saved project, click "File" and select "Open". Browse to the relevant XML project file and click "Open". You can also click the "Open" icon on the toolbar.

To see properties for a current project, right-click the project name and select "Properties".

This opens the "Properties" dialog, where you can change properties such as the folder polling interval, scan options, and maximum bulkfile size.

Note: WhamSearch waits for the user-specified socket time-out period before proceeding to the next URL in the queue. If a user stops and restarts the spider threads, WhamSearch retries the timed-out URLs first before processing any other URLs given in the Addition/Deletion folders.

Step 2: Build Initial Web Content Indexes

Using the EIQ Server RTI Tool, build initial web content indexes on bucket files from a WhamSearch output directory.

Note: Before using the EIQ Server RTI Tool to build web content indexes on the spidered bulkfiles (.TSI), make sure to stop the spider threads in the WhamSearch tool. This releases the locks on the .TSI files so that index building can proceed properly. You can resume spidering once the index building is completed.
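Before opening the EIQ Server RTI Tool, it can be useful to confirm that the bulkfile folder actually contains bucket files. The Python sketch below lists files with a .tsi extension, ignoring case. The function name list_bucket_files and the sample file names are hypothetical, and a temporary directory stands in for the real bulkfile folder. The sketch does not check whether WhamSearch still holds locks on the files.

```python
import tempfile
from pathlib import Path


def list_bucket_files(bulkfile_dir):
    """Return the bucket files (.tsi, any letter case) found in a folder.

    Hypothetical helper for illustration only.
    """
    folder = Path(bulkfile_dir)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".tsi"
    )


# A temporary folder with made-up file names stands in for the BulkFile folder.
with tempfile.TemporaryDirectory() as bulk_dir:
    for name in ("bucket1.tsi", "bucket2.TSI", "notes.txt"):
        (Path(bulk_dir) / name).write_bytes(b"")
    bucket_names = [p.name for p in list_bucket_files(bulk_dir)]
    print(bucket_names)  # ['bucket1.tsi', 'bucket2.TSI']
```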

Open the EIQ Server RTI Tool and select 'Create a new EIQ Index map'.
Select 'Document Folder' as the data source. Then enter the path or click 'Browse' to locate the WhamSearch output folder that contains the bucket files to be indexed. For example, 'C:\Program Files\WhamTech\Projects\project1\BulkFile'.

For 'Index Location', enter the path where you wish to store the web content index files, for example, C:\Program Files\WhamTech\Data\URLINDEX. A folder with the name entered in 'EIQ Index Name' is created under the 'EIQ Index Path'.

In the 'Select document Types to Index' dialog, unselect the 'Text File (TXT)' option and select the 'Text Search Index (TSI)' option.

Click 'Finish'.
All .tsi files from the selected folder are displayed.
Right-click the folder name (BulkFile) in the left pane and select "Add All". This will add all the files to the index map on the right.
Right-click 'BulkFile' in the right pane and select 'Text Search Options'. Here you can specify the types of Text Search indexes to build on the input .tsi files. Select the desired build options and click 'OK'.

Note: Make sure to check "Build Indexes for searching words in the proximity of 1 or 2 words" to enable phrase search. Without these indexes, WhamSearch cannot find phrases.

DO NOT select weighted indexes; they are not applicable to WhamSearch and might increase indexing time and disk space needed to store the index files.

Finally, click 'Build EIQ Index' on the toolbar.

Step 3: Register Web Content Index with the EIQ Server as Data Source

Register the web content index as a data source with the EIQ Server. Make sure that "WhamTech EIQ Server" and "WhamTech EIQ RTIS" are running.

Create a Thunderbolt Data Source associated with the sample web content index using the EIQ Server Configuration Tool by following these steps:

· Go to the 'Data Sources' tab and select 'Add Data Source.'

· In the 'Data Source Registration' window, enter 'WHAMSEARCHDS' as the Alias, select 'THUNDERBOLT' as the Data Source Type, and Browse for 'URLINDEX.DBD' as the index file.

· Select 'Test Connection' and click 'OK'.

· Close the EIQ Server Configuration Tool.

Now, you can query the web content indexes using the EIQ Server Query Tool or other client tools using EIQ Server drivers (ODBC, OLEDB, JDBC, and Web Services).

See detailed search syntax here.

Step 4: EIQ RTIS Configuration

Configure EIQ RTIS to poll a WhamSearch or WhamEE output directory for any new files to update the specified web content index in near-real-time.
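The polling behavior described above can be pictured with a short Python sketch. It shows the general technique of detecting files that have appeared in a folder since the previous poll; it is not the EIQ RTIS implementation. The function name find_new_files and the sample file names are hypothetical, and a temporary directory stands in for the WhamSearch or WhamEE output directory.

```python
import tempfile
from pathlib import Path


def find_new_files(watch_dir, seen):
    """Return names of files that appeared since the last call.

    `seen` is a set of already-reported names, updated in place.
    Hypothetical helper for illustration only.
    """
    current = {p.name for p in Path(watch_dir).iterdir() if p.is_file()}
    new_names = sorted(current - seen)
    seen.update(new_names)
    return new_names


# A temporary folder with made-up file names stands in for the output directory.
with tempfile.TemporaryDirectory() as out_dir:
    seen_names = set()
    (Path(out_dir) / "a.tsi").write_bytes(b"")
    first_poll = find_new_files(out_dir, seen_names)
    (Path(out_dir) / "b.tsi").write_bytes(b"")
    second_poll = find_new_files(out_dir, seen_names)
    third_poll = find_new_files(out_dir, seen_names)
    print(first_poll, second_poll, third_poll)  # ['a.tsi'] ['b.tsi'] []
```

A real poller would call such a function in a loop, sleeping for the configured polling interval between calls and handing each new file to the index update step.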

See EIQ Update Configuration Tool for more information.

See Entity Extraction Help for details regarding WhamEE configuration.