Elasticsearch Amazon Comprehend NLP Ingest Processor

Elasticsearch ingest processors using Amazon Comprehend for various NLP analysis. All Comprehend detection features are supported via separate processors. Topic Modeling is not supported, although it would be an interesting project to hook up Elasticsearch as a data source for AWS Comprehend topic modeling.

Each field that is sent through the ingest process will result in an AWS Comprehend API call, so this system is not meant for clusters with large workloads. There is no support for batch processing. For better performance, your Elasticsearch ingest nodes should not only be hosted in AWS, but should also be in the region used in the AWS Comprehend API (configurable).

AWS Comprehend Pricing

$0.0001 PER UNIT Up to 10M units

$0.00005 PER UNIT From 10M - 50M units

$0.000025 PER UNIT Over 50M units

NLP requests are measured in units of 100 characters, with a 3 unit (300 character) minimum charge per request.

Supported Features

Building

There is no downloadable version of the plugin for two reasons:

It is difficult to release a plugin for each minor version of Elasticsearch. You can only run plugins built for the exact version of Elasticsearch.
Due to the warning at the very top regarding cost and performance, it prefered that the plugin is built and not blindly installed so that users are aware.

Only Elasticsearch 5.6+ is supported in order to take advantage of the secure keystore.

Installation

Only basic credentials are supported. The AWS access and secret keys are added to Elasticsearch keystore, before the node is started.

Plugin Settings

Setting	Description
ingest.aws-comprehend.credentials.access_key	AWS Acesss key
ingest.aws-comprehend.credentials.secret_key	AWS Secret Key
ingest.aws-comprehend.region	AWS region used to the API call. Default region is us-east-1

AWS Credentials are not configured in elasticsearch.yml, or in the plugin settings, but in the keystore. Settings must be in place before Elasticsearch is started.

Processor settings

Name	Required	Default	Description
field	yes	-	The field to analyze
target_field	no	A new field with the name of the source field with a processor specific suffix appended	The field to assign the converted value to.
language_code	no	en	The language of the analyzed text. Not used in the Language Detection processor.
min_score	no	0 (all returned)	The minimum score threshold of values to be returned
max_values	no	0 (all returned)	The number of values to return. If max_value is 1, a single value is returned and not an array. Not used in the Sentiment Analysis processor.
ignore_missing	no	false	If true and field does not exist or is null, the processor quietly exits without modifying the document
types	no	empty (all returned)	Filter the values returned by the entity types

Feature	Processor Name	Default suffix
Keyphrase Extraction	detect-key-phrases	_keyphrases
Sentiment Analysis	detect-sentiment	_sentiment
Entity Recognition	detect-entities	_entities
Language Detection	detect-dominant-language	_language

Examples

After each pipeline is configured, the same document is indexed

PUT /my-index/my-type/1?pipeline=aws-comprehend-pipeline
{
  "my_field" : "It is raining today in Seattle. Good thing I live in California."
}

Language detection

PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline to test AWS Comprehend",
   "processors": [
      {
         "detect-dominant-language": {
            "field": "my_field"
         }
      }
   ]
}

Result

{
   "_index": "my-index",
   "_type": "my-type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "my_field": "It is raining today in Seattle. Good thing I live in California.",
      "my_field_language": [
         "en"
      ]
   }
}

Entity detection

PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline to test AWS Comprehend",
   "processors": [
      {
         "detect-entities": {
            "field": "my_field"
         }
      }
   ]
}

Result

{
   "_index": "my-index",
   "_type": "my-type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "my_field_entities": [
         {
            "text": "today",
            "type": "DATE"
         },
         {
            "text": "Seattle",
            "type": "LOCATION"
         },
         {
            "text": "California",
            "type": "LOCATION"
         }
      ],
      "my_field": "It is raining today in Seattle. Good thing I live in California."
   }
}

Keyphrase Extraction

PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline to test AWS Comprehend",
   "processors": [
      {
         "detect-key-phrases": {
            "field": "my_field"
         }
      }
   ]
}

Result

{
   "_index": "my-index",
   "_type": "my-type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "my_field": "It is raining today in Seattle. Good thing I live in California.",
      "my_field_keyphrases": [
         "today",
         "Seattle",
         "Good thing",
         "California"
      ]
   }
}

Sentiment Analysis

PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline to test AWS Comprehend",
   "processors": [
      {
         "detect-sentiment": {
            "field": "my_field"
         }
      }
   ]
}

Result

{
   "_index": "my-index",
   "_type": "my-type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "my_field": "It is raining today in Seattle. Good thing I live in California.",
      "my_field_sentiment": "POSITIVE"
   }
}

Change the target field to run different processors on the same field.

PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline to test AWS Comprehend",
   "processors": [
      {
         "detect-key-phrases": {
            "field": "my_field"
         }
      },
      {
         "detect-key-phrases": {
            "field": "my_field",
            "target_field": "my_field_strict",
            "min_score": 0.9
         }
      },
      {
         "detect-key-phrases": {
            "field": "my_field",
            "target_field": "my_field_trimmed",
            "max_values": 1
         }
      }
   ]
}

Result

{
   "_index": "my-index",
   "_type": "my-type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "my_field_keyphrases": [
         "today",
         "Seattle",
         "Good thing",
         "California"
      ],
      "my_field_trimmed": "today",
      "my_field_strict": [
         "today",
         "Seattle",
         "California"
      ],
      "my_field": "It is raining today in Seattle. Good thing I live in California."
   }
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
config		config
src		src
.gitignore		.gitignore
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
build.gradle		build.gradle
gradle.properties		gradle.properties
settings.gradle		settings.gradle

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Elasticsearch Amazon Comprehend NLP Ingest Processor

AWS Comprehend Pricing

Supported Features

Building

Installation

Plugin Settings

Processor settings

Examples

About

Releases

Packages

Languages

License

brusic/elasticsearch-ingest-aws-comprehend

Folders and files

Latest commit

History

Repository files navigation

Elasticsearch Amazon Comprehend NLP Ingest Processor

AWS Comprehend Pricing

Supported Features

Building

Installation

Plugin Settings

Processor settings

Examples

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages