Classification Module 6.1.3

Advanced text extractors enable you to define the patterns that you want to locate in the extracted text based on matches to entries in grammar\library (.ecr) files.

To create a text extractor using the web portal

Select Governed Data | Categorization Manager | Extractors.
Select Advanced to create a new text extractor.
Enter a unique identifier.
The identifier is used by the classification system. Once created, you cannot change this value. It is recommended you use a naming convention that reflects the purpose of the text extractor.
Enter a name and description.
The name and description are useful when building rules to ensure that the proper text extractors are included, so provide all necessary information.
You are ready to define the settings through the Definition tab.
Click Add new to enter the desired grammar files.
These files contain the library of entities that will form the basis for the text to be extracted from your resources. For a listing of Grammar files available with Data Governance Edition, see Sample Advanced Text Extractors Details.

Ensure that you have entered the file name correctly. A formatting issue may cause the Classification workers to become unresponsive.
Refine the entities that you are interested in by manipulating the patterns in the Grammar xml.
For a list of the match patterns available for each grammar file included with Data Governance Edition, see Sample Advanced Text Extractors Details. For details on formatting the xml, see Writing Grammar XML for Advanced Text Extractors.
Once you are satisfied with the entity patterns, you can further refine the text extractor by clicking Add new to enter the exact matches that you want to extract from your resources.
For a list of the match patterns available for each grammar file included with Data Governance Edition, see Sample Advanced Text Extractors Details.
If desired, select Allow overlap to let a value be used more than once when determining pattern matches, and select Match whole word to ensure the exact term is used and not a substring of a larger word.
Carefully review your settings and save your changes.
Click Validate to ensure the text extractor can be processed by the classification system.

To add an advanced text extractor with PowerShell

Run the Add-QAdvancedTextExtractor command with the following mandatory parameters:
1. ServerAddress
  Provide the name of the computer hosting the Data Governance server, and the port. Enter in the form computername:port number. The default port is 8723.
2. Id
  Provide an ID for this text extractor. The identifier is used by the classification system and once created, it cannot be changed. It is recommended you use a naming convention that reflects the purpose of the text extractor.
3. Name
  The name should reflect the purpose of the text extractor.
4. Matches
  Indicates the entities to match as defined in the grammar xml.
5. GrammarXML
  Refine your entities through patterns. Patterns can consist of one or more regular expressions and pre-defined entities, or combinations of the two.
If desired, use the following optional parameters:
1. Description
  Provide a description for the text extractor. This is useful when building rules to ensure the proper text extractors are being included, so provide all necessary information.
2. AllowOverlap
  This option allows the patterns specified by Matches to be used more than once when determining pattern matches. For example, a search for “01/01” in a resource that includes “01/01/01/01” returns 3 matches when this option is enabled and 2 matches when it is disabled.
3. MatchWholeWord
  When set to $true, this ensures that the word is not a substring of a larger text value.
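The counting behavior described for AllowOverlap can be reproduced with ordinary regular expressions. The following Python sketch only illustrates that behavior; it is not the classification engine's implementation. The function names are hypothetical, and treating a "whole word" as a term that is not adjacent to a letter, digit, or underscore is an assumption.

```python
import re


def count_matches(text: str, term: str, allow_overlap: bool) -> int:
    """Count occurrences of a literal term, with or without overlap."""
    pattern = re.escape(term)
    if allow_overlap:
        # A zero-width lookahead matches at every start position,
        # so characters can be shared between neighboring matches.
        return len(re.findall(f"(?={pattern})", text))
    # A plain search consumes each match before looking for the next one.
    return len(re.findall(pattern, text))


def is_whole_word_match(text: str, term: str) -> bool:
    """Assumed meaning of MatchWholeWord: the term is not part of a larger word."""
    pattern = rf"(?<!\w){re.escape(term)}(?!\w)"
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    print(count_matches("01/01/01/01", "01/01", allow_overlap=True))
    print(count_matches("01/01/01/01", "01/01", allow_overlap=False))
    print(is_whole_word_match("concatenate", "cat"))
```

With overlap, matches start at positions 0, 3, and 6 of “01/01/01/01”; without overlap, the match at position 3 is skipped because its first characters were already consumed.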

To edit an advanced text extractor using the web portal

Select Governed Data | Categorization Manager | Extractors.
Select the required text extractor and click Edit.
From the General tab, you can edit the name and description.
The name and description are visible to all users who build rules. Including detailed information helps to ensure that the proper text extractors are included.
Select the Definition tab.
1. Add and remove grammar files as required.
2. Refine the entities that you are interested in by manipulating the patterns in the Grammar xml.
3. Once you are satisfied with the entity patterns, you can further refine the text extractor by adding and removing the required match details.
4. Select Match whole word to ensure the exact term is used and not a substring of a larger word.
Carefully review your settings and save your changes.
Click Validate to ensure the text extractor can be processed by the classification system.

To edit an advanced text extractor with PowerShell

Make sure you know the ID of the desired text extractor. For more information, see Finding a Taxonomy, Category, or Extractor ID using PowerShell.
Run the Set-QAdvancedTextExtractor command with the following required parameters:
1. ServerAddress
  Provide the name of the computer hosting the Data Governance server, and the port. Enter in the form computername:port number. The default port is 8723.
2. Id
Adjust any of the following parameters as required: Name, Description, Matches, GrammarXML, AllowOverlap, and MatchWholeWord.
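Both PowerShell procedures expect ServerAddress in the form computername:port. The following Python sketch shows one way to check a value before passing it to the command. The function name and the host name in the usage lines are hypothetical, and falling back to port 8723 when no port is typed is an assumption based only on the default port stated in this guide.

```python
from typing import Tuple

DEFAULT_PORT = 8723  # default port stated in this guide


def parse_server_address(address: str) -> Tuple[str, int]:
    """Split 'computername:port' into its parts and validate the port."""
    host, separator, port_text = address.strip().partition(":")
    if not host:
        raise ValueError("computer name is missing")
    if not separator:
        # Assumption: no port given means the default port is intended.
        return host, DEFAULT_PORT
    if not port_text.isdecimal():
        raise ValueError(f"port is not a number: {port_text!r}")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port is out of range: {port}")
    return host, port


if __name__ == "__main__":
    print(parse_server_address("dgeserver01:8723"))  # hypothetical host name
    print(parse_server_address("dgeserver01"))
```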

Writing Grammar XML for Advanced Text Extractors

You can write grammar xml in any XML editor that supports UTF-8 encoding, using the format described here. Once you have written your grammar xml and entered at least one match criterion, you can associate it with a rule, add it to the system, and test it. When your rule is performing as desired, you can associate it with a category. If you plan to reuse rules or text extractors across more than one category, take this into account when developing them. Do not refine a text extractor in a way that meets the needs of one situation but not the others.

Elements of Grammar XML

By understanding the grammar elements, and examining the sample text extractors included in Quest One Identity Manager Data Governance Edition, you can write your own text extractors, or edit existing ones. The following are the elements in the grammar XML.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "edk.dtd">
<grammars>
  <grammar name>
    <entity name>
      <pattern>
      <pattern>
</grammars>
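Before pasting grammar xml into the Definition tab, it can help to confirm that it is well-formed and follows the element nesting shown above. The following Python sketch uses only the standard library. The checks are derived from the outline in this guide, not from the actual edk.dtd, so passing them does not guarantee that the classification system accepts the grammar. The function name, grammar name, entity name, and regular expression in the sample are placeholders.

```python
import xml.etree.ElementTree as ET
from typing import List

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "edk.dtd">
<grammars>
  <grammar name="example">
    <entity name="two_digits">
      <pattern>[0-9][0-9]</pattern>
    </entity>
  </grammar>
</grammars>"""


def check_grammar_structure(xml_text: str) -> List[str]:
    """Return a list of structural problems; an empty list means none found."""
    try:
        # Encoding to bytes lets the parser honor the UTF-8 declaration.
        root = ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != "grammars":
        problems.append(f"root element is <{root.tag}>, expected <grammars>")
    for grammar in root.findall("grammar"):
        if not grammar.get("name"):
            problems.append("<grammar> is missing its name attribute")
        for entity in grammar.findall("entity"):
            label = entity.get("name") or "(unnamed)"
            if not entity.get("name"):
                problems.append("<entity> is missing its name attribute")
            if entity.find("pattern") is None:
                problems.append(f"entity {label} has no <pattern>")
    return problems


if __name__ == "__main__":
    print(check_grammar_structure(SAMPLE))
```

The external edk.dtd file is not read by this parser, so the sketch runs without access to the product installation.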

<!DOCTYPE grammars SYSTEM "edk.dtd">

<!DOCTYPE grammars SYSTEM "edk.dtd"> specifies that this is an advanced text extractor that will base its matches on entries in grammar\library (.ecr) files.

<grammars>

<grammars> Represents the patterns to be evaluated against content. The <grammar> element allows you to define custom grammars. You can combine grammars in a text extractor by adding multiple <grammar> tags. Only Advanced text extractors use this tag.

For a list of grammars included in Quest One Identity Manager, see Sample Advanced Text Extractors Details. The <grammar> tag can be written inline or referenced externally, as long as the external file follows the <grammar> structure.

<grammar name>

<grammar name> This is a mandatory field that names the grammar you are using. This allows you to share entities between text extractors. Entities have full path names that include the grammar name. For example, a grammar named “number” may contain a custom entity called “cc/delim”, which details a delimited credit card. The full name of the entity, if referenced from another text extractor, is “number/cc/delim”. Within the same text extractor, the full path is not necessary.

Note: The grammar name cannot begin with a number.
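The naming rules above amount to a simple composition: the full entity name is the grammar name, a slash, and the entity name. The following Python sketch illustrates this together with the restriction in the note. The function names are hypothetical, and the sketch checks only the rules stated in this section; the product may enforce others.

```python
def validate_grammar_name(name: str) -> None:
    """Raise ValueError if the name breaks a rule stated in this guide."""
    if not name:
        raise ValueError("the grammar name is mandatory")
    if name[0].isdigit():
        raise ValueError("the grammar name cannot begin with a number")


def full_entity_name(grammar_name: str, entity_name: str) -> str:
    """Build the path used to reference an entity from another text extractor."""
    validate_grammar_name(grammar_name)
    return f"{grammar_name}/{entity_name}"


if __name__ == "__main__":
    print(full_entity_name("number", "cc/delim"))
```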

<entity name>

<entity> Names the entity that is being built. You can reference entities from other grammars by using the full path.

<pattern>

<pattern> Identifies the exact patterns that make up the entity. You can use multiple patterns within an entity. You can use patterns that are included with Quest One Identity Manager, regular expressions, or a combination.

Each <pattern> tag can contain more than one element. Within one <pattern> tag, all elements must match. If there is more than one <pattern> tag, only one of them needs to match.
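The matching rule above combines a logical AND within a pattern and a logical OR across patterns. The following Python sketch models only that logic. It treats each element as an independent true/false check on the text, which is a simplification; how the classification engine actually evaluates the elements of a pattern is not described here. All names and sample checks are hypothetical.

```python
import re
from typing import Callable, List

Element = Callable[[str], bool]  # one check inside a <pattern>
Pattern = List[Element]          # all elements of a pattern must match


def pattern_matches(elements: Pattern, text: str) -> bool:
    # Within one <pattern>, every element must match.
    # Note that all() is True for an empty list, so an empty pattern
    # would match any text in this model.
    return all(element(text) for element in elements)


def entity_matches(patterns: List[Pattern], text: str) -> bool:
    # Across several <pattern> tags, a single matching pattern is enough.
    return any(pattern_matches(elements, text) for elements in patterns)


def has_keyword(text: str) -> bool:
    return "card" in text.lower()


def has_four_digits(text: str) -> bool:
    return re.search(r"\d{4}", text) is not None


def mentions_iban(text: str) -> bool:
    return "iban" in text.lower()


# Hypothetical entity: (keyword AND four digits) OR (IBAN mention).
ENTITY = [[has_keyword, has_four_digits], [mentions_iban]]

if __name__ == "__main__":
    for sample in ("Card ending 1234", "card on file", "IBAN enclosed"):
        print(sample, entity_matches(ENTITY, sample))
```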

Sample Advanced Text Extractor

The following example illustrates how to write an Advanced text extractor in the web portal using XML, grammar files, and specific match criteria. In this case, the extractor is designed to find instances of US and Canadian companies.

To create this advanced text extractor

Select Governed Data | Categorization Manager | Extractors.
Select Create new text extractor.
Select Advanced as the type of text extractor to create.
Enter a unique identifier. For example, Extractor.USCACompany.Name.
Enter a name and description. For example, Major North American Companies.
You are ready to define the settings through the Definition tab.
Select the Definition tab.
1. Click Add new to enter the following grammar files:
  - company_engus.ecr
  - company_engca.ecr
  These files contain the library of entities that will form the basis for the text to be extracted from your resources.
2. Refine the entities that you are interested in by manipulating the patterns in the Grammar xml. For example,
```
<![CDATA[ <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "edk.dtd">
<grammars>
  <grammar name="names_all">
    <entity name="company/all" type="public">
      <pattern>(?A^company/all/engca)</pattern>
      <pattern>(?A^company/major_company/engus)</pattern>
      <pattern>(?A^company/fortune_500_2011/engus)</pattern>
      <pattern>(?A^company/forbes_largest_private_companies_2010/engus)</pattern>
    </entity>
  </grammar>
</grammars>]]>
```
3. Once you are satisfied with the entity patterns, you can further refine the text extractor by clicking Add new to enter the exact matches that you want to extract from your resources. For example, names_all/company/all.
4. Select Match whole word to ensure the exact term is used and not a substring of a larger word.
Carefully review your settings and save your changes.
Click Validate to ensure the text extractor can be processed by the classification system.
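The match entered in step 3, names_all/company/all, is the grammar name followed by the entity name from the sample grammar xml. The following Python sketch parses the sample (without its CDATA wrapper) and derives the candidate match names and the referenced library patterns. It is an illustration using the standard library only; the function names are hypothetical, and the sketch does not contact the Data Governance server.

```python
import xml.etree.ElementTree as ET
from typing import List

SAMPLE_GRAMMAR = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammars SYSTEM "edk.dtd">
<grammars>
  <grammar name="names_all">
    <entity name="company/all" type="public">
      <pattern>(?A^company/all/engca)</pattern>
      <pattern>(?A^company/major_company/engus)</pattern>
      <pattern>(?A^company/fortune_500_2011/engus)</pattern>
      <pattern>(?A^company/forbes_largest_private_companies_2010/engus)</pattern>
    </entity>
  </grammar>
</grammars>"""


def list_match_names(xml_text: str) -> List[str]:
    """Return 'grammar/entity' names that could be entered as matches."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        f"{grammar.get('name')}/{entity.get('name')}"
        for grammar in root.findall("grammar")
        for entity in grammar.findall("entity")
    ]


def list_patterns(xml_text: str) -> List[str]:
    """Return the text of every <pattern> element in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [pattern.text or "" for pattern in root.iter("pattern")]


if __name__ == "__main__":
    print(list_match_names(SAMPLE_GRAMMAR))
    for pattern in list_patterns(SAMPLE_GRAMMAR):
        print(pattern)
```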

Testing and Reviewing Automated Classification

Before you make your category available to the classification system, you should test that the rules and category are behaving as desired. You can use the following diagnostics:

Test a rule against a resource
Test a resource against all rules
See what text is extracted from a resource
See why a particular resource is categorized the way it is
Browse all categorized resources to see the results of the system

