Entity Extraction Help
Version 7.11

Entity Extraction in EIQ Product Suite

Installation And Configuration Guidelines

Installing Prerequisite Software

Installing and Running WhamEE with JavaGateway

Configuring JavaGateway Properties

Installing and Running Standalone WhamEE

Configuring WhamEE Properties

Configuring GATE for Standalone WhamEE

Using Entity Extraction: Extracting Entities from Structured Data Source Text Columns

Extract Entities and Build Indexes in RTI Tool

Configure the Virtual Data Source for Entity Queries

Query the Virtual Data Source

GATE Quick Start Guide

Running a Sample GATE Application

Step 1: Initialize GATE GUI

Step 2: Add Language Resources

Step 3: Load ANNIE System

Step 4: Run ANNIE application

Step 5: View ANNIE annotations in the documents

Adding a New Entity Type to GATE for Extraction

Step 1: Add JAPE rule for the new entity

Step 2: Add lookup values list for Gazetteer

Step 3: Add new entity name for extraction

Step 4: Extract new entities

Entity Extraction in EIQ Product Suite

The EIQ Product Suite comes with the following components to extract entities out of text content:

 

 

·         WhamEE uses the open source software GATE (General Architecture for Text Engineering) for text analytics. See http://gate.ac.uk/ for more information on GATE.

 

 

Installation And Configuration Guidelines

Installing Prerequisite Software

·         Download and install the latest version of Java runtime.

 

For advanced users who want to customize GATE:

·         Download the Java SDK and set the system variable JAVA_HOME to the JDK installation directory (e.g. C:\Program Files\Java\jdk1.6.0_17).

http://www.java.com/en/download/index.jsp

 

·         Download and install the latest version (6.0) of GATE. If possible, choose the Windows version of the installer as this document refers to the Windows version of GATE. GATE provides a flexible, powerful, open source framework to process textual data and identify entities using customizable rules and lookup lists.

http://gate.ac.uk/download/

 

·         Download and install the latest version of Apache Ant. Ant is used for running JavaGateway, which loads WhamEE and GATE.

http://ant.apache.org/
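
Before moving on, it can help to confirm that the prerequisites are visible from the command line. The following is a minimal sketch and not part of the EIQ Product Suite: the function name missing_prerequisites is made up for illustration, and it only checks that java and ant can be found on PATH and that JAVA_HOME is set. It does not check the GATE installation.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed.

    `env` and `which` are injectable so the check can be exercised without
    touching the real environment (hypothetical helper, not an EIQ tool).
    """
    env = os.environ if env is None else env
    problems = []
    if which("java") is None:
        problems.append("java not found on PATH")
    if which("ant") is None:
        problems.append("ant not found on PATH")
    if not env.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set (only needed to customize GATE)")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```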

Installing and Running WhamEE with JavaGateway

·         Copy the JavaGateway package to the C:\javagateway folder.

 

The JavaGateway package comes with WhamEE libraries and several configuration files.

 

·         Edit the build.xml file and change the path of gate.home to point to the GATE installation folder on your local computer. There may be multiple places in the file where you need to make this change.

·         Edit the extraction.properties file and set properties for WhamEE including entities to extract. For further information on the properties file, see "Configuring JavaGateway properties" below.

·         Edit the logging.properties file to set the path for log files.

·         At the command prompt, change to the javagateway folder path, and type "ant run" to start JavaGateway.

 

JavaGateway loads WhamEE, initializing GATE.
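
Because gate.home may appear in more than one place in build.xml, a quick scan can list every line that needs attention. The sketch below is illustrative only: the helper name and the SAMPLE fragment are hypothetical, and the real build.xml shipped with JavaGateway may be structured differently.

```python
def find_gate_home_lines(build_xml_text):
    """Return (line_number, line) pairs for every line that mentions gate.home."""
    return [
        (number, line.strip())
        for number, line in enumerate(build_xml_text.splitlines(), start=1)
        if "gate.home" in line
    ]


# Hypothetical fragment for demonstration; not copied from the real build.xml.
SAMPLE = """<project name="javagateway" default="run">
  <property name="gate.home" value="C:/Program Files/GATE-6.0"/>
  <target name="run">
    <java classname="Main" fork="true">
      <sysproperty key="gate.home" value="${gate.home}"/>
    </java>
  </target>
</project>"""

for number, line in find_gate_home_lines(SAMPLE):
    print(number, line)
```

To scan the actual file, read C:\javagateway\build.xml into a string and pass it to the function.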

Configuring JavaGateway Properties

The following files under the JavaGateway install path allow users to configure JavaGateway and WhamEE.

 

The extraction.properties file contains settings for GATE such as the entities to extract and the processing batch size.

 

########################################################################

# GATE SETUP

########################################################################

# Maximum number of files/records to process in a batch

GATE.MaxFileProcess=30

 

# Entity types to extract

GATE.ExtractionEntities=Person,Location,Organization,Date,Address,Product

 

# Maximum corpus size

GATE.MaxCorpusSize = 500000
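
To illustrate how these settings are structured, the sketch below reads the three keys shown above from a string. It is a simplified reader written for this example (parse_properties is a made-up name) and handles only plain key=value lines and # comments. It does not implement the full Java properties syntax, such as backslash escapes or line continuations.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


SAMPLE = """
# Maximum number of files/records to process in a batch
GATE.MaxFileProcess=30
# Entity types to extract
GATE.ExtractionEntities=Person,Location,Organization,Date,Address,Product
# Maximum corpus size
GATE.MaxCorpusSize = 500000
"""

props = parse_properties(SAMPLE)
entities = [name.strip() for name in props["GATE.ExtractionEntities"].split(",")]
batch_size = int(props["GATE.MaxFileProcess"])
print(entities, batch_size, int(props["GATE.MaxCorpusSize"]))
```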

 

 

The server.properties file contains JavaGateway server properties such as the thread pool size to use.

 

########################################################################

# JavaGateway SETUP

########################################################################

#--JavaGateway ServerBase Thread Pool Size

ServerBase.ThreadPoolSize=6

 

 

The logging.properties file contains JavaGateway logging properties.

 

########################################################################

# LOG SETUP

########################################################################

#Default Logging File.

 

# DEBUG property produces more logging information than INFO

# INFO produces minimal logging information from JavaGateway, WhamEE and GATE

# DEBUG produces more debugging information even from GATE.

log4j.rootLogger= INFO, A2

#log4j.rootLogger= DEBUG, A2

 

# Appender A2 writes to a file that rolls over daily

log4j.appender.A2=org.apache.log4j.DailyRollingFileAppender

log4j.appender.A2.DatePattern='.'yyyyMMdd

log4j.appender.A2.append=true

 

# Log path

log4j.appender.A2.File=c:\\javagateway\\logs\\javaGateWay.log

 

# Appender A2 uses the PatternLayout.

log4j.appender.A2.layout=org.apache.log4j.PatternLayout

log4j.appender.A2.layout.ConversionPattern=%d %5p [%t] (%F:%L) - %m%n
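
With the daily rolling appender and the date pattern '.'yyyyMMdd configured above, each completed day's log is archived under the active file name plus a date suffix. The sketch below only illustrates that naming; rolled_log_name is a made-up helper, not a log4j function. Also note that the doubled backslashes in the log path are properties-file escaping for a single backslash.

```python
from datetime import date


def rolled_log_name(log_path, day):
    """Name of the archived log for `day`, mirroring DatePattern '.'yyyyMMdd."""
    return log_path + day.strftime(".%Y%m%d")


print(rolled_log_name(r"c:\javagateway\logs\javaGateWay.log", date(2019, 1, 31)))
```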

 

Installing and Running Standalone WhamEE

·         Extract the files from whamee_Setup.zip to a local temporary folder. The zip file is located on the EIQ Product Suite installation media under the "WhamEE" folder.

·         From the command line, go to the local temporary folder and run ‘whameesetup.bat’.

·         ‘whameesetup.bat’ creates the Whamee folder at "C:\Program Files\Whamtech\Whamee".

·         Verify that the configuration files (extraction.properties) are set up under the ‘Whamee’ folder. Set the source directory, destination directory, and output directory. For further information on the configuration file, see "Configuring WhamEE properties" below.

·         Edit the ‘whamee_run.bat’ file located in "C:\Program files\whamtech\whamee" and modify the property "gate.home" to point to the GATE installation directory. Make sure to use double-quotes around the path, for instance, "C:\Program Files\GATE".

·         Run "whamee_run.bat" from the command prompt to launch WhamEE server.

Configuring WhamEE Properties

server.properties - contains the socket and thread pool settings to use

logging.properties - logging properties and tags

extraction.properties - defined below

 

########################################################################

# Extraction Setups

########################################################################

 

# ExtractionConfig.Count is used to define the number of configs below

# The below configs will be numbered 0 to n-1.

ExtractionConfig.Count=2

 

 

########################################################################

# Multiple configurations can be specified below using this format

# Extraction.<<configid>>.<<property>>=<<value>>

# configid: configuration id starting from 0

# property: configuration property name

# value: configuration property value

 

 

########################################################################

# Configuration ID=0

########################################################################

Extraction.0.SourceDir=C:\/wham\/Projects\/BulkFiles\/IGISWEBDS1

Extraction.0.DestinationDir=C:\/wham\/Projects\/BulkFiles\/whamee_output

Extraction.0.OutputDir=C:\/wham\/Projects\/BulkFiles\/whamee_output

 

# Unprocessed folder goes under the outputdir with the below directory

# name - full path is not needed for this as the outputdir is appended as

# a prefix

Extraction.0.UnprocessedDir=whamee_failed

 

 

# Output Format supported: CSV, UpdateSrvFile, UpdateSrvSingleFile

Extraction.0.OutputFormat=UpdateSrvSingleFile

 

# Single Pass/ Multiple Passes

Extraction.0.Continuous=TRUE

 

#Interval in seconds

Extraction.0.Interval=30

 

#Max Batch Size of collecting files

Extraction.0.BatchSize=100

 

#ExtractionSoftware = GATE is currently the only supported extraction software

Extraction.0.ExtractionSoftware=GATE

Extraction.0.ExtractSoftwareHashName=GATE0

 

#Entities to extract (comma delimited)

#Person,Location,Organization,Date,Address

Extraction.0.EntitiesToExtract=Person,Location,Organization

 

#File mapping for the above entities

Extraction.0.OutputFiles=Person,Location,Organization

 

#Character separating each field value within the same record in the text file

Extraction.0.OutputFile.ColumnDelimiter=,

 

#character used to identify beginning and end of a string

Extraction.0.OutputFile.StringQualifier='

 

#FileName to be put in OutputDir for db update

Extraction.0.Person.File.Name.Prefix=

Extraction.0.Person.File.Name.Suffix=txt

 

#CSV file only needs columns to dump data

#DocName will be a constant for Document Name from which entity was retrieved.

#e.g. Extraction.0.Person.File.Values=DocName,Person

#Values format is entity:ColumnName

#If update server load file format we will need to have the column value

#equivalent and a line for table name and schema

Extraction.0.Person.File.Name.Prefix=Person

Extraction.0.Person.File.Name.Suffix=txt

Extraction.0.Person.File.Values=DocName:Document,DocPageHash:PageHash,Person:Name,DocLoc:documentLocation

Extraction.0.Person.File.Schema=NULL

Extraction.0.Person.File.Database=NULL

Extraction.0.Person.File.Table=PersonEntity

 

Extraction.0.Location.File.Name.Prefix=Location

Extraction.0.Location.File.Name.Suffix=txt

Extraction.0.Location.File.Values=DocName:Document,DocPageHash:PageHash,Location:Place,DocLoc:documentLocation

Extraction.0.Location.File.Schema=NULL

Extraction.0.Location.File.Database=NULL

Extraction.0.Location.File.Table=LocationEntity

 

Extraction.0.Organization.File.Name.Prefix=Organization

Extraction.0.Organization.File.Name.Suffix=txt

Extraction.0.Organization.File.Values=DocName:Document,DocPageHash:PageHash,Organization:Org,DocLoc:documentLocation

Extraction.0.Organization.File.Schema=NULL

Extraction.0.Organization.File.Database=NULL

Extraction.0.Organization.File.Table=OrganizationEntity
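
The Extraction.<<configid>>.<<property>> format described above can be pictured as one group of settings per configuration id. The following sketch assumes the properties have already been read into a dictionary. The helper names group_configs and parse_values are made up for illustration and are not part of WhamEE.

```python
from collections import defaultdict


def group_configs(props):
    """Group Extraction.<id>.<property> keys into {id: {property: value}}."""
    configs = defaultdict(dict)
    for key, value in props.items():
        parts = key.split(".", 2)
        if len(parts) == 3 and parts[0] == "Extraction" and parts[1].isdigit():
            configs[int(parts[1])][parts[2]] = value
    return dict(configs)


def parse_values(spec):
    """Split 'entity:ColumnName,...' into (entity, column) pairs."""
    return [tuple(item.split(":", 1)) for item in spec.split(",")]


props = {
    "ExtractionConfig.Count": "2",
    "Extraction.0.EntitiesToExtract": "Person,Location,Organization",
    "Extraction.0.Person.File.Table": "PersonEntity",
    "Extraction.0.Person.File.Values":
        "DocName:Document,DocPageHash:PageHash,Person:Name,DocLoc:documentLocation",
}
configs = group_configs(props)
print(configs[0]["Person.File.Table"])
print(parse_values(configs[0]["Person.File.Values"]))
```

Keys that do not follow the Extraction.<id>. prefix, such as ExtractionConfig.Count, are left out of the grouping.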

 

Configuring GATE for Standalone WhamEE

These steps are needed only when WhamEE is invoked by the command line. Skip this section if you are running WhamEE and GATE from JavaGateway.

 

WhamEE must load certain GATE plugins to use their processing resources.

·         Load the plugins by launching GATE and selecting "Manage CREOLE Plugins" from the "File" menu.

·         Select the "Load now" and "Load always" options for the plugins given below.

 

See http://gate.ac.uk/sale/tao/splitch3.html#x6-550003.5 for further information.

 

The required plug-ins:

·         ANNIE

·         Ontology

·         Gazetteer_Ontology_Based

·         Tools

·         Ontology_Tools

 


 

Using Entity Extraction: Extracting Entities from Structured Data Source Text Columns

Many structured data sources contain vast amounts of unstructured information in text columns. Applications benefit from applying structured queries on this unstructured information. Use entity extraction to identify and extract entities from unstructured text and build structured indexes on the entities.

 

Make sure that JavaGateway, WhamEE, and GATE are set up properly and that JavaGateway is running.
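
One quick way to confirm that JavaGateway is listening is to attempt a TCP connection to its address and port. This is a minimal sketch: gateway_reachable is a made-up helper, the host and port must be taken from your own JavaGateway configuration, and a successful connection only shows that something accepts connections there, not that WhamEE and GATE initialized correctly.

```python
import socket


def gateway_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (the port number here is a placeholder, not a documented default):
# gateway_reachable("localhost", 9000)
```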

Extract Entities and Build Indexes in RTI Tool

In the EIQ Server RTI Tool, enable the entity extraction feature:

 

·         Connect to the structured data source and switch to RTI mode.

·         Select ‘Options’ from the ‘Tools’ menu and select ‘Entity Extraction Settings’.

·         Select 'Enable entity extraction'.

·         Enter the server address and port for the JavaGateway server where WhamEE and GATE are configured for entity extraction.

·         Select one or more entity types to extract.

 

Note: These options are global and apply to all columns designated as Entity Fields. See below for more information on Entity Field designation.

 

 

·         Designate the columns containing the unstructured text information as Entity Fields by right-clicking the column and selecting Modify Flags->Entity Field from the context menu.

 

 

Entity Fields tell the EIQ Server RTI tool to generate an entity table for each entity type and the corresponding association tables. The association tables relate the entity tables to the data source table containing the Entity Field column. These tables store the extracted entity data and allow SQL queries that relate and join the entity data with the source table.

 

The RTI generated tables have the following naming convention:

·         D#_E_Location – the table name for Location entity type ('D#' for derived table; E for entity type; Location for the type of entity)

·         D#_EA_Location_Person – the table name for the association table relating the D#_E_Location table with the data source Person table.
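
The naming convention can be written out as two small functions. This is only an illustration of the pattern described above; the function names are made up and the RTI Tool generates the tables itself.

```python
def entity_table(entity_type):
    """Derived entity table name, e.g. D#_E_Location."""
    return f"D#_E_{entity_type}"


def association_table(entity_type, source_table):
    """Association table relating an entity table to a data source table."""
    return f"D#_EA_{entity_type}_{source_table}"


print(entity_table("Location"))
print(association_table("Location", "Person"))
```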

 

 

·         Proceed to build EIQ indexes as usual.

 

Configure the Virtual Data Source for Entity Queries

 

This step is required to configure an EIQ SuperAdapter VDS and is unnecessary for an EIQ TurboAdapter VDS.

 

While configuring the EIQ SuperAdapter, the only additional step is to map the entity table columns to a virtual schema view (SuperSchema). Each entity type table contains a text column named EntityValue. This column contains the extracted values for that entity type.

 

·         Map the EntityValue columns to a virtual schema view.

 

 

Query the Virtual Data Source

·         Connect to the VDS and make queries involving the entity table values.
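
As a rough picture of what such a query looks like, the sketch below joins a source table to an entity table through an association table. It runs against an in-memory SQLite database purely for demonstration. Only the table names and the EntityValue column come from this document. The key columns (PersonID, EntityID), the Notes column, and the sample rows are assumptions, and a real query would be issued against the VDS rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: key column names are assumed for illustration only.
conn.executescript("""
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Notes TEXT);
CREATE TABLE "D#_E_Location" (EntityID INTEGER PRIMARY KEY, EntityValue TEXT);
CREATE TABLE "D#_EA_Location_Person" (EntityID INTEGER, PersonID INTEGER);
INSERT INTO Person VALUES (1, 'Met the supplier in Dallas.'), (2, 'Flew to Paris.');
INSERT INTO "D#_E_Location" VALUES (10, 'Dallas'), (11, 'Paris');
INSERT INTO "D#_EA_Location_Person" VALUES (10, 1), (11, 2);
""")

# Find source rows whose text column mentioned a given location entity.
QUERY = """
SELECT p.PersonID, p.Notes
FROM Person AS p
JOIN "D#_EA_Location_Person" AS a ON a.PersonID = p.PersonID
JOIN "D#_E_Location" AS e ON e.EntityID = a.EntityID
WHERE e.EntityValue = ?
"""
rows = conn.execute(QUERY, ("Dallas",)).fetchall()
print(rows)
```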

 

GATE Quick Start Guide

Running a Sample GATE Application

This section describes a sample scenario for running GATE to annotate sample documents for entity extraction.

Step 1: Initialize GATE GUI

·         Select GATE 6.0 GUI from the Start Menu.

 

This opens a workspace window.

 

Certain GATE plugins need to be loaded first.

 

·         Load the plugins by selecting ‘Manage CREOLE Plugins’ from the File menu.

·         Select the "Load now" option for the required plug-ins listed under "Configuring GATE for Standalone WhamEE".

 

See http://gate.ac.uk/sale/tao/splitch3.html#x6-550003.5 for further information.

 

Step 2: Add Language Resources

In this sample, Language resources are documents that you want GATE to process.

 

From GATE->Language Resources:

·         Right-click on Language Resources and select ‘New -> GATE Document’ to add HTML documents from your local system.

·         Right-click on Language Resources and select ‘New -> GATE Corpus’ and name the corpus.

·         Double-click on the newly created corpus under Language Resources, and add the above documents to the corpus on the right by clicking the '+' button.

 

 

Step 3: Load ANNIE System

ANNIE is the default information extraction system application that comes with GATE. It contains a collection of plugins to process the documents in the corpus created above.

 

·         Select File -> Load ANNIE System - with defaults

 

GATE loads various processing resources such as tokenizers, gazetteers, sentence splitters, taggers, etc. See http://gate.ac.uk/releases/gate-6.0-build3764-ALL/doc/tao/splitch6.html#x9-1260021 for details on ANNIE processing resources.

Step 4: Run ANNIE application

·         Double-click on ‘GATE->ANNIE’.

 

GATE opens ANNIE on the right side and shows the loaded and selected processing resources. The order in which they are selected matters, because later resources use annotations produced by earlier ones:

o   Document Reset PR

o   ANNIE English Tokenizer

o   ANNIE Gazetteer

o   ANNIE Sentence Splitter

o   ANNIE POS Tagger

o   ANNIE NE Transducer

o   ANNIE OrthoMatcher

 

 

·         Click 'Run this Application' at the bottom (or run it through GATE->Applications->ANNIE->Run this Application)

Step 5: View ANNIE annotations in the documents

·         Double-click a document under GATE->Language Resources.

 

The document content is shown on the right side.

 

·         Click 'Annotation Sets' and 'Annotation Lists' to view the corresponding information.

·         In the right-most panel, expand the arrows to open the original and ANNIE-created new markups (Address, Date, etc.).

·         Select the markups to see the corresponding text highlighted with matching colors. The annotation lists are shown at the bottom.

 

 

Adding a New Entity Type to GATE for Extraction

GATE comes with support for several default entity types such as Person, Organization, Address, etc. Users can create their own entity types for extraction by GATE. The following steps show an example of creating a 'Product' entity type.

Step 1: Add JAPE rule for the new entity

·         Under the “GATE-6.0\plugins\ANNIE\resources\NE\” directory, make a copy of an existing file, for example jobtitle.jape, and rename it for the new entity (product.jape).

·         Open and change the file by defining JAPE rules for the new entity.

 

JAPE file contents for the 'Product' entity type:

 

Rule: Product1

(

 {Lookup.majorType == product}

 (

  {Lookup.majorType == product}

 )?

)

:product

-->

 :product.Product = {rule = "Product1"}

 

·         Add an entry for 'product' in the main.jape file.

 

JAPE rules are used by ANNIE NE Transducer. Verify that it can load the new file.

 

·         From GATE->Processing Resources, double-click on 'ANNIE NE Transducer' and verify that the new entity type is listed.

·         If the new name is not listed, try reinitializing the transducer.

 

 

Step 2: Add lookup values list for Gazetteer

·         Under the “GATE-6.0\plugins\ANNIE\resources\gazetteer\” directory, make a copy of an existing file, for example jobtitles.lst, and rename it for the new entity type (product.lst).

·         Open the file and delete all existing entries.

·         Add a couple of lookup values for the new entity; one value per line.

 

EIQ Product Suite

GATE

SQL Server 2008

 

·         Add an entry for 'product' in the lists.def file as follows:

 

product.lst:product

 

ANNIE Gazetteer uses lists for lookups. Verify that it can access the new list.

 

·         From GATE GUI, double-click on 'ANNIE Gazetteer' and verify that the new entity type is listed.

·         If the new type is not listed, try reinitializing the gazetteer.
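
The file edits in this step can also be scripted. The sketch below writes a list file with one value per line and registers it in lists.def using the entry format shown above. It is an illustration only: add_gazetteer_list is a made-up helper, and the demonstration writes to a temporary folder rather than the real gazetteer directory. Reinitializing the gazetteer in GATE is still required afterwards.

```python
import tempfile
from pathlib import Path


def add_gazetteer_list(gazetteer_dir, list_name, major_type, values):
    """Write <list_name> with one value per line and register it in lists.def."""
    gazetteer_dir = Path(gazetteer_dir)
    (gazetteer_dir / list_name).write_text("\n".join(values) + "\n", encoding="utf-8")
    lists_def = gazetteer_dir / "lists.def"
    entry = f"{list_name}:{major_type}"
    existing = lists_def.read_text(encoding="utf-8").splitlines() if lists_def.exists() else []
    if entry not in existing:  # avoid duplicate entries on repeated runs
        existing.append(entry)
        lists_def.write_text("\n".join(existing) + "\n", encoding="utf-8")
    return entry


# Demonstration in a temporary folder standing in for the gazetteer directory.
with tempfile.TemporaryDirectory() as tmp:
    add_gazetteer_list(tmp, "product.lst", "product",
                       ["EIQ Product Suite", "GATE", "SQL Server 2008"])
    print((Path(tmp) / "lists.def").read_text(encoding="utf-8"))
```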

 

 

 

Step 3: Add new entity name for extraction

In the extraction.properties file, add the new entity name as follows:

 

# Entity types to extract

GATE.ExtractionEntities=Person,Location,Organization,Date,Address,Product
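
If the properties file is maintained by a script, the same change can be made programmatically. The sketch below appends an entity name to the GATE.ExtractionEntities line only when it is not already present. add_entity is a made-up helper that works on the file contents as a string and leaves all other lines untouched.

```python
def add_entity(properties_text, entity, key="GATE.ExtractionEntities"):
    """Append `entity` to the comma-separated list under `key` if it is missing."""
    out = []
    for line in properties_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            entities = [item.strip() for item in value.split(",") if item.strip()]
            if entity not in entities:
                entities.append(entity)
            line = f"{key}={','.join(entities)}"
        out.append(line)
    return "\n".join(out)


before = "# Entity types to extract\nGATE.ExtractionEntities=Person,Location,Organization,Date,Address"
print(add_entity(before, "Product"))
```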

 

Step 4: Extract new entities

While building indexes for text search columns in the EIQ Server RTI Tool, select ‘Options’ from the ‘Tools’ menu and choose the new entity type under ‘Entity Extraction Settings’. The EIQ Server RTI Tool gets the extracted entities from JavaGateway and builds indexes for the new entity in a new derived table. For more details on using entity extraction with EIQ Product Suite tools, see "Using Entity Extraction: Extracting Entities from Structured Data Source Text Columns" above.

 

 

Copyright © 2019 , WhamTech, Inc.  All rights reserved. This document is provided for information purposes only and the contents hereof are subject to change without notice. Names may be trademarks of their respective owners.