Entity Extraction in EIQ Product Suite
Installation and Configuration Guidelines
Installing Prerequisite Software
Installing and Running WhamEE with JavaGateway
Configuring JavaGateway Properties
Installing and Running Standalone WhamEE
Configuring GATE for Standalone WhamEE
Using Entity Extraction: Extracting Entities from Structured Data Source Text Columns
Extract Entities and Build Indexes in RTI Tool
Configure the Virtual Data Source for Entity Queries
Running a Sample GATE Application
Step 2: Add Language Resources
Step 5: View ANNIE annotations in the documents
Adding a New Entity Type to GATE for Extraction
Step 1: Add JAPE rule for the new entity
Step 2: Add lookup values list for Gazetteer
Step 3: Add new entity name for extraction
The EIQ Product Suite comes with the following components to extract entities out of text content:
· WhamEE uses the open source software GATE (General Architecture for Text Engineering) for text analytics. See http://gate.ac.uk/ for more information on GATE.
· Download and install the latest version of the Java runtime (http://www.java.com/en/download/index.jsp).
For advanced users who want to customize GATE:
· Download the Java SDK and set the system variable JAVA_HOME to the JDK installation directory (e.g. C:\Program Files\Java\jdk1.6.0_17).
· Download and install the latest version (6.0) of GATE. If possible, choose the Windows version of the installer as this document refers to the Windows version of GATE. GATE provides a flexible, powerful, open source framework to process textual data and identify entities using customizable rules and lookup lists.
· Download and install the latest version of Apache Ant. Ant is used for running JavaGateway, which loads WhamEE and GATE.
· Copy the JavaGateway package to the C:\javagateway folder.
The JavaGateway package comes with WhamEE libraries and several configuration files.
· Edit the build.xml file and change the path of gate.home to point to the GATE installation folder on your local computer. There may be multiple places in the file where you need to make this change.
· Edit the extraction.properties file and set properties for WhamEE including entities to extract. For further information on the properties file, see "Configuring JavaGateway properties" below.
· Edit the logging.properties file to set the path for log files.
· At the command prompt, change to the javagateway folder path, and type "ant run" to start JavaGateway.
JavaGateway loads WhamEE, initializing GATE.
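Before typing "ant run", it can help to confirm that the prerequisites above are in place. The following is a minimal sketch, not part of the EIQ Product Suite: the function name check_prerequisites is hypothetical, and the GATE path is only the example path used elsewhere in this guide.

```python
import os
import shutil


def check_prerequisites(gate_home, env=None, which=shutil.which):
    """Return a list of problems found; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    # Java and Ant must be reachable from the command prompt.
    for tool in ("java", "ant"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # JAVA_HOME is only needed when customizing GATE, so check it only if set.
    java_home = env.get("JAVA_HOME")
    if java_home and not os.path.isdir(java_home):
        problems.append(f"JAVA_HOME does not exist: {java_home}")
    # gate.home in build.xml must point to an existing GATE installation.
    if not os.path.isdir(gate_home):
        problems.append(f"GATE folder not found: {gate_home}")
    return problems


if __name__ == "__main__":
    # Example path only; use the GATE installation folder on your computer.
    for problem in check_prerequisites(r"C:\Program Files\GATE"):
        print(problem)
```

The sketch only checks that the tools and folders exist; it does not read build.xml or verify the installed versions.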
The following files under the JavaGateway install path allow users to configure JavaGateway and WhamEE.
The extraction.properties file contains settings for GATE such as the entities to extract and the processing batch size.
########################################################################
# GATE SETUP
########################################################################
# Maximum number of files/records to process in a batch
GATE.MaxFileProcess=30
# Entity types to extract
GATE.ExtractionEntities=Person,Location,Organization,Date,Address,Product
# Maximum corpus size
GATE.MaxCorpusSize = 500000
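To illustrate how these settings are structured, the following sketch reads the three properties shown above into typed values. It is an assumption-laden illustration only: parse_properties is a hypothetical helper that handles plain key=value lines and '#' comments, not the full Java properties format, and this guide does not describe how WhamEE itself reads the file.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Whitespace around '=' is tolerated, as in 'GATE.MaxCorpusSize = 500000'.
            props[key.strip()] = value.strip()
    return props


SAMPLE = """\
# Maximum number of files/records to process in a batch
GATE.MaxFileProcess=30
# Entity types to extract
GATE.ExtractionEntities=Person,Location,Organization,Date,Address,Product
# Maximum corpus size
GATE.MaxCorpusSize = 500000
"""

props = parse_properties(SAMPLE)
entities = [e.strip() for e in props["GATE.ExtractionEntities"].split(",")]
batch_size = int(props["GATE.MaxFileProcess"])
max_corpus = int(props["GATE.MaxCorpusSize"])
print(entities, batch_size, max_corpus)
```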
The server.properties file contains JavaGateway server properties such as the thread pool size to use.
########################################################################
# JavaGateway SETUP
########################################################################
#--JavaGateway ServerBase Thread Pool Size
ServerBase.ThreadPoolSize=6
The logging.properties file contains JavaGateway logging properties.
########################################################################
# LOG SETUP
########################################################################
#Default Logging File.
# INFO produces minimal logging information from JavaGateway, WhamEE and GATE.
# DEBUG produces more detailed logging information, including output from GATE.
log4j.rootLogger= INFO, A2
#log4j.rootLogger= DEBUG, A2
# Appender A2 writes to file.. rolls daily
log4j.appender.A2=org.apache.log4j.DailyRollingFileAppender
log4j.appender.A2.DatePattern='.'yyyyMMdd
log4j.appender.A2.append=true
# Log path
log4j.appender.A2.File=c:\\javagateway\\logs\\javaGateWay.log
# Appender A2 uses the PatternLayout.
log4j.appender.A2.layout=org.apache.log4j.PatternLayout
log4j.appender.A2.layout.ConversionPattern=%d %5p [%t] (%F:%L) - %m%n
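With the DailyRollingFileAppender in log4j 1.x, the formatted DatePattern is appended to the active log file name when the log rolls over. The sketch below shows the file name that the '.'yyyyMMdd pattern above would produce for a given day; rolled_log_name is a hypothetical helper written for illustration and does not call log4j.

```python
from datetime import date


def rolled_log_name(active_file, day):
    """Name of the rolled log file for `day`.

    The pattern '.'yyyyMMdd is a literal dot followed by the
    four-digit year, two-digit month and two-digit day.
    """
    return active_file + day.strftime(".%Y%m%d")


# The doubled backslashes in the properties file represent single backslashes.
print(rolled_log_name(r"c:\javagateway\logs\javaGateWay.log", date(2019, 3, 5)))
# prints c:\javagateway\logs\javaGateWay.log.20190305
```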
· Extract the files from whamee_Setup.zip to a local temporary folder. The archive is located on the EIQ Product Suite installation media under the "WhamEE" folder.
· From the command line, go to the local temporary folder and run ‘whameesetup.bat’.
· ‘whameesetup.bat’ creates the WhamEE folder at "C:\Program Files\Whamtech\Whamee".
· Verify that the configuration files (extraction.properties) are set up under the ‘Whamee’ folder. Set the source directory, destination directory, and output directory. For further information on the configuration file, see "Configuring WhamEE properties" below.
· Edit the ‘whamee_run.bat’ file located in "C:\Program Files\Whamtech\Whamee" and modify the property "gate.home" to point to the GATE installation directory. Make sure to use double quotes around the path, for instance, "C:\Program Files\GATE".
· Run ‘whamee_run.bat’ from the command prompt to launch the WhamEE server.
server.properties - contains the socket and thread pool settings to use
logging.properties - logging properties and tags
extraction.properties - defined below
########################################################################
# Extraction Setups
########################################################################
# ExtractionConfig.Count is used to define the number of configs below.
# The configs below are numbered 0 to n-1.
ExtractionConfig.Count=2
########################################################################
# Multiple configurations can be specified below using this format
# Extraction.<<configid>>.<<property>>=<<value>>
# configid: configuration id starting from 0
# property: configuration property name
# value: configuration property value
########################################################################
# Configuration ID=0
########################################################################
Extraction.0.SourceDir=C:\/wham\/Projects\/BulkFiles\/IGISWEBDS1
Extraction.0.DestinationDir=C:\/wham\/Projects\/BulkFiles\/whamee_output
Extraction.0.OutputDir=C:\/wham\/Projects\/BulkFiles\/whamee_output
# The unprocessed folder goes under the OutputDir with the directory name below.
# A full path is not needed because the OutputDir is added as a prefix.
Extraction.0.UnprocessedDir=whamee_failed
# Output formats supported: CSV, UpdateSrvFile, UpdateSrvSingleFile
Extraction.0.OutputFormat=UpdateSrvSingleFile
#
Extraction.0.Continuous=TRUE
#Interval in seconds
Extraction.0.Interval=30
#Max Batch Size of collecting files
Extraction.0.BatchSize=100
#ExtractionSoftware: GATE is currently the only supported extraction software
Extraction.0.ExtractionSoftware=GATE
Extraction.0.ExtractSoftwareHashName=GATE0
#Entities to extract (comma delimited)
#Person,Location,Organization,Date,Address
Extraction.0.EntitiesToExtract=Person,Location,Organization
#File mapping for the above entities
Extraction.0.OutputFiles=Person,Location,Organization
#Character separating each field value within the same record in the text file
Extraction.0.OutputFile.ColumnDelimiter=,
#character used to identify beginning and end of a string
Extraction.0.OutputFile.StringQualifier='
#FileName to be put in OutputDir for db update
Extraction.0.Person.File.Name.Prefix=
Extraction.0.Person.File.Name.Suffix=txt
#CSV file only needs columns to dump data
#DocName will be a constant for Document Name from which entity was retrieved.
#e.g. Extraction.0.Person.File.Values=DocName,Person
#Values format is entity:ColumnName
#If update server load file format we will need to have the column value
#equivalent and a line for table name and schema
Extraction.0.Person.File.Name.Prefix=Person
Extraction.0.Person.File.Name.Suffix=txt
Extraction.0.Person.File.Values=DocName:Document,DocPageHash:PageHash,Person:Name,DocLoc:documentLocation
Extraction.0.Person.File.Schema=NULL
Extraction.0.Person.File.Database=NULL
Extraction.0.Person.File.Table=PersonEntity
Extraction.0.Location.File.Name.Prefix=Location
Extraction.0.Location.File.Name.Suffix=txt
Extraction.0.Location.File.Values=DocName:Document,DocPageHash:PageHash,Location:Place,DocLoc:documentLocation
Extraction.0.Location.File.Schema=NULL
Extraction.0.Location.File.Database=NULL
Extraction.0.Location.File.Table=LocationEntity
Extraction.0.Organization.File.Name.Prefix=Organization
Extraction.0.Organization.File.Name.Suffix=txt
Extraction.0.Organization.File.Values=DocName:Document,DocPageHash:PageHash,Organization:Org,DocLoc:documentLocation
Extraction.0.Organization.File.Schema=NULL
Extraction.0.Organization.File.Database=NULL
Extraction.0.Organization.File.Table=OrganizationEntity
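The Extraction.<<configid>>.<<property>> format above can be read as one set of properties per configuration id, and each File.Values entry as a list of entity:ColumnName pairs. The following sketch shows that interpretation on a few of the properties listed above. The helper names group_configs and parse_file_values are hypothetical, and the sketch assumes the properties have already been loaded into a dictionary; it is not how WhamEE is known to process the file.

```python
from collections import defaultdict


def group_configs(props):
    """Group 'Extraction.<id>.<property>' keys by configuration id."""
    configs = defaultdict(dict)
    for key, value in props.items():
        parts = key.split(".", 2)
        if len(parts) == 3 and parts[0] == "Extraction" and parts[1].isdigit():
            configs[int(parts[1])][parts[2]] = value
    return dict(configs)


def parse_file_values(spec):
    """Split 'entity:Column,entity:Column' into (entity, column) pairs."""
    return [tuple(item.split(":", 1)) for item in spec.split(",")]


props = {
    "ExtractionConfig.Count": "2",
    "Extraction.0.OutputFormat": "UpdateSrvSingleFile",
    "Extraction.0.EntitiesToExtract": "Person,Location,Organization",
    "Extraction.0.Person.File.Table": "PersonEntity",
    "Extraction.0.Person.File.Values":
        "DocName:Document,DocPageHash:PageHash,Person:Name,DocLoc:documentLocation",
}

configs = group_configs(props)
cfg0 = configs[0]
print(cfg0["OutputFormat"])
for source, column in parse_file_values(cfg0["Person.File.Values"]):
    print(f"{source} -> {cfg0['Person.File.Table']}.{column}")
```

Keys that do not follow the Extraction.<<configid>> pattern, such as ExtractionConfig.Count, are left out of the grouped result.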
These steps are needed only when WhamEE is invoked by the command line. Skip this section if you are running WhamEE and GATE from JavaGateway.
WhamEE must load certain GATE plugins to use their processing resources.
· Load the plugins by launching GATE and selecting "Manage CREOLE Plugins" from the "File" menu.
· Select the "Load now" and "Load always" options for the plugins given below.
See http://gate.ac.uk/sale/tao/splitch3.html#x6-550003.5 for further information.
The required plugins:
· ANNIE
· Ontology
· Gazetteer_Ontology_Based
· Tools
· Ontology_Tools
Many structured data sources contain vast amounts of unstructured information in text columns. Applications benefit from applying structured queries on this unstructured information. Use entity extraction to identify and extract entities from unstructured text and build structured indexes on the entities.
Make sure that JavaGateway, WhamEE, and GATE are set up properly and that JavaGateway is running.
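One quick way to confirm that JavaGateway is reachable is to test whether its server address and port accept a TCP connection. The sketch below is a generic connectivity check, not an EIQ tool. The port number 9000 is a placeholder, since this guide does not state which port JavaGateway listens on; use the address and port configured for your JavaGateway server.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    # Placeholder address and port; replace with your JavaGateway settings.
    print(is_listening("localhost", 9000))
```

A successful connection only shows that something is listening on that port; it does not confirm that WhamEE and GATE initialized correctly, so check the JavaGateway log as well.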
In the EIQ Server RTI Tool, enable the entity extraction feature:
· Connect to the structured data source and switch to RTI mode.
· Select ‘Options’ from the ‘Tools’ menu and select ‘Entity Extraction Settings’.
· Select 'Enable entity extraction'.
· Enter the server address and port for the JavaGateway server where WhamEE and GATE are configured for entity extraction.
· Select one or more entity types to extract.
Note: These options are global and apply to all columns designated as Entity Fields. See below for more information on Entity Field designation.
· Designate the columns containing the unstructured text information as Entity Fields by right-clicking the column and selecting Modify Flags->Entity Field from the context menu.
Entity Fields tell the EIQ Server RTI tool to generate an entity table for each entity type and the corresponding association tables. The association tables relate the entity tables to the data source table containing the Entity Field column. These tables store the extracted entity data and allow SQL queries that relate and join the entity data with the source table.
The RTI generated tables have the following naming convention:
· D#_E_Location – the table name for Location entity type ('D#' for derived table; E for entity type; Location for the type of entity)
· D#_EA_Location_Person – the table name for the association table relating the D#_E_Location table with the data source Person table.
· Proceed to build EIQ indexes as usual.
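The table naming convention described above can be sketched as two small helpers. This is inferred only from the two example names given (D#_E_Location and D#_EA_Location_Person); the function names are hypothetical, and the RTI Tool may apply additional rules that this guide does not describe.

```python
def entity_table(entity_type):
    """Derived entity table: 'D#' prefix, 'E' for entity, then the entity type."""
    return f"D#_E_{entity_type}"


def association_table(entity_type, source_table):
    """Association table relating the entity table to the data source table."""
    return f"D#_EA_{entity_type}_{source_table}"


# Tables expected for two entity types extracted from a source table named Person.
for entity in ["Location", "Organization"]:
    print(entity_table(entity), association_table(entity, "Person"))
```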
This step is required to configure an EIQ SuperAdapter VDS and is unnecessary for an EIQ TurboAdapter VDS.
While configuring the EIQ SuperAdapter, the only additional step is to map the entity table columns to a virtual schema view (SuperSchema). Each entity type table contains a text column named EntityValue. This column contains the extracted values for that entity type.
· Map the EntityValue columns to a virtual schema view.
· Connect to the VDS and make queries involving the entity table values.
This section describes a sample scenario for running GATE to annotate sample documents for entity extraction.
· Select GATE 6.0 GUI from the Start Menu.
This opens a workspace window.
Certain GATE plugins need to be loaded first.
· Load the plugins by selecting ‘Manage CREOLE Plugins’ from the File menu.
· Select the "Load now" option for the plugins given below.
See http://gate.ac.uk/sale/tao/splitch3.html#x6-550003.5 for further information.
In this sample, language resources are the documents that you want GATE to process.
From GATE->Language Resources:
· Right-click on Language Resources and select ‘New -> GATE Document’ to add HTML documents from your local system.
· Right-click on Language Resources and select ‘New -> GATE Corpus’ and name the corpus.
· Double-click on the newly created corpus under Language Resources, and add the above documents to the corpus on the right by clicking the '+' button.
ANNIE is the default information extraction system application that comes with GATE. It contains a collection of plugins to process the documents in the corpus created above.
· Select File -> Load ANNIE System - with defaults
GATE loads various processing resources such as tokenizers, gazetteers, sentence splitters, and taggers. See http://gate.ac.uk/releases/gate-6.0-build3764-ALL/doc/tao/splitch6.html#x9-1260021 for details on ANNIE processing resources.
· Double-click on ‘GATE->ANNIE’.
GATE opens ANNIE on the right side and shows the loaded and selected processing resources. The order in which they are selected is very important:
o Document Reset PR
o ANNIE English Tokenizer
o ANNIE Gazetteer
o ANNIE Sentence Splitter
o ANNIE POS Tagger
o ANNIE NE Transducer
o ANNIE OrthoMatcher
· Click 'Run this Application' at the bottom (or run it through GATE->Applications->ANNIE->Run this Application).
· Double-click a document under GATE->Language Resources.
The document content is shown on the right side.
· Click 'Annotation Sets' and 'Annotation Lists' to view the corresponding information.
· In the right-most panel, expand the arrows to open the original and ANNIE-created new markups (Address, Date, etc.).
· Select the markups to see the corresponding text highlighted with matching colors. The annotation lists are shown at the bottom.
GATE comes with support for several default entity types such as Person, Organization, Address, etc. Users can create their own entity types for extraction by GATE. The following steps show an example of creating a 'Product' entity type.
· Under the “GATE-6.0\plugins\ANNIE\resources\NE\” directory, make a copy of an existing file, for example jobtitle.jape, and rename it for the new entity (product.jape).
· Open and change the file by defining JAPE rules for the new entity.
JAPE file contents for the 'Product' entity type:
Rule: Product1
(
{Lookup.majorType == product}
(
{Lookup.majorType == product}
)?
)
:product
-->
:product.Product = {rule = "Product1"}
· Add an entry for 'product' in the main.jape file.
JAPE rules are used by the ANNIE NE Transducer. Verify that it can load the new file.
· From GATE->Processing Resources, double-click on 'ANNIE NE Transducer' and verify that the new entity type is listed.
· If the new name is not listed, try reinitializing the transducer.
· Under the “GATE-6.0\plugins\ANNIE\resources\gazetteer\” directory, make a copy of an existing file, for example jobtitles.lst, and rename it for the new entity type (product.lst).
· Open the file and delete all existing entries.
· Add a couple of lookup values for the new entity; one value per line.
EIQ Product Suite
GATE
SQL Server 2008
· Add an entry for 'product' in the lists.def file as follows:
product.lst:product
The ANNIE Gazetteer uses lists for lookups. Verify that it can access the new list.
· From the GATE GUI, double-click on 'ANNIE Gazetteer' and verify that the new entity type is listed.
· If the new type is not listed, try reinitializing the gazetteer.
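To make the role of the list file and the lists.def entry concrete, the sketch below imitates a list-based lookup in plain Python. It reads the lists.def entry as a list file name followed by its major type, then marks whole-word, case-sensitive occurrences of the list values in a text. This is a naive stand-in written for illustration; it is not GATE's gazetteer implementation, and the function names are hypothetical.

```python
def parse_lists_def(text):
    """Map each list file to its major type, e.g. 'product.lst:product'."""
    mapping = {}
    for line in text.splitlines():
        fields = line.strip().split(":")
        if len(fields) >= 2 and fields[0]:
            mapping[fields[0]] = fields[1]
    return mapping


def find_lookups(text, entries, major_type):
    """Return (start, end, major_type, value) for each whole-word match."""
    # Try longer values first so 'SQL Server 2008' wins over a shorter entry.
    ordered = sorted((e for e in entries if e), key=len, reverse=True)
    hits, i = [], 0
    while i < len(text):
        for entry in ordered:
            end = i + len(entry)
            if (text.startswith(entry, i)
                    and (i == 0 or not text[i - 1].isalnum())
                    and (end == len(text) or not text[end].isalnum())):
                hits.append((i, end, major_type, entry))
                i = end
                break
        else:
            i += 1
    return hits


major_types = parse_lists_def("product.lst:product")
product_list = ["EIQ Product Suite", "GATE", "SQL Server 2008"]
sample = "The EIQ Product Suite uses GATE and SQL Server 2008."
for start, end, major, value in find_lookups(sample, product_list, major_types["product.lst"]):
    print(start, end, major, value)
```

In GATE, the gazetteer produces Lookup annotations with a majorType feature, which is what the Product1 JAPE rule in Step 1 matches on.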
In the extraction.properties file, add the new entity name as follows:
# Entity types to extract
GATE.ExtractionEntities=Person,Location,Organization,Date,Address,Product
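If the properties file is maintained by a script, the same change can be made programmatically. The sketch below appends an entity name to the GATE.ExtractionEntities value only when it is missing and leaves comments and other lines untouched. The add_entity function is a hypothetical helper that assumes one key=value pair per line.

```python
def add_entity(properties_text, entity, key="GATE.ExtractionEntities"):
    """Return the properties text with `entity` added to the comma-separated key."""
    out = []
    for line in properties_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and not line.lstrip().startswith("#") and name.strip() == key:
            entities = [e.strip() for e in value.split(",") if e.strip()]
            if entity not in entities:
                entities.append(entity)
            line = f"{key}={','.join(entities)}"
        out.append(line)
    return "\n".join(out) + "\n"


before = (
    "# Entity types to extract\n"
    "GATE.ExtractionEntities=Person,Location,Organization,Date,Address\n"
)
after = add_entity(before, "Product")
print(after)
```

Running the function a second time leaves the text unchanged, so it is safe to apply repeatedly.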
While building indexes for text search columns in the EIQ Server RTI Tool, select the new entity type under ‘Entity Extraction Settings’ (select ‘Options’ from the ‘Tools’ menu). The EIQ Server RTI Tool gets the extracted entities from JavaGateway and builds indexes for the new entity in a new derived table. For more details on using entity extraction with EIQ Product Suite tools, see "Using Entity Extraction: Extracting Entities from Structured Data Source Text Columns" above.
Copyright © 2019, WhamTech, Inc. All rights reserved. This document is provided for information purposes only and the contents hereof are subject to change without notice. Names may be trademarks of their respective owners.