How do I implement named entity detection

Named entity recognition

Recognizes named entities in a column of text

Category: Text Analytics

Note

Applies to: Machine Learning Studio (classic)

This content applies only to Studio (classic). Similar drag-and-drop modules have been added to Azure Machine Learning designer. For more information, see this article comparing the two versions.

Module overview

This article describes how to use the Named Entity Recognition module in Azure Machine Learning Studio (classic) to identify the names of things, such as people, companies, or locations, in a column of text.

Named entity recognition is an important research area in machine learning and natural language processing (NLP), because it can be used to answer many real-world questions, such as:

  • Does a tweet contain the name of a person? Does the tweet also provide that person's current location?

  • Which companies were mentioned in news articles?

  • Have any specified products been mentioned in complaints or reviews?

To get a list of named entities, provide as input a dataset that contains a column of text. The Named Entity Recognition module then identifies three types of entities: people (PER), locations (LOC), and organizations (ORG).

The module also records where in the text each entity was found, so that you can use the terms in further analysis.

For example, the following table shows a simple input sentence and the terms and values generated by the module:

Input text | Module output
"Boston is a great place to live." | 0, Boston, 0, 6, LOC

The output can be interpreted as follows:

  • The first 0 means that this string is the first article input to the module.

    Because a single article can contain several entities, including the article row number in the output is important for mapping features to articles.

  • Boston is the recognized entity.

  • The 0 that follows Boston means that the entity starts at the first letter of the input string. Indexes are zero-based.

  • 6 means that the length of the entity Boston is 6.

  • LOC means that the entity Boston is a place, or location. Other supported named entity types are person (PER) and organization (ORG).
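
The interpretation above can be sketched in Python. This is only an illustration, assuming each output row has five comma-separated fields (article, entity, offset, length, type) and that the entity text itself contains no comma; the names `Entity` and `parse_entity` are hypothetical and not part of the module.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    article: int      # zero-based row number of the input text
    mention: str      # the recognized entity text
    offset: int       # zero-based start position within the input string
    length: int       # number of characters in the entity
    entity_type: str  # PER, LOC or ORG


def parse_entity(row: str) -> Entity:
    """Parse one comma-separated output row such as '0, Boston, 0, 6, LOC'."""
    article, mention, offset, length, entity_type = (
        field.strip() for field in row.split(",")
    )
    return Entity(int(article), mention, int(offset), int(length), entity_type)


text = "Boston is a great place to live."
entity = parse_entity("0, Boston, 0, 6, LOC")

# The offset and length should slice the mention back out of the input text.
span = text[entity.offset:entity.offset + entity.length]
print(entity)
print(span == entity.mention)
```

Slicing the input with the offset and length is a quick way to check that a row has been matched to the right input string.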

How to configure Named Entity Recognition

  1. Add the Named Entity Recognition module to your experiment in Studio (classic). You can find the module in the Text Analytics category.

  2. On the input named Story, connect a dataset that contains the text to analyze.

    The "story" should contain the text from which named entities will be extracted.

    The column used as Story should contain multiple rows, each consisting of a string. The string can be short, like a sentence, or long, like a news article.

    You can connect any dataset that contains a text column. However, if the input dataset contains multiple columns, use Select Columns in Dataset to choose only the column that contains the text to analyze.

    Note

    The second input, Custom Resources (Zip), is currently not supported.

    In the future, you will be able to add custom resource files here to identify different entity types.

  3. Run the experiment.
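
If you prepare the data before uploading it, the shape of the Story input described in step 2 can be sketched as follows. This is a minimal illustration and not part of the module: the column names `id`, `headline`, and `story` are hypothetical, and inside Studio (classic) the same selection is done with Select Columns in Dataset.

```python
import csv
import io

# Hypothetical multi-column source data; the column names are illustrative.
source = io.StringIO(
    "id,headline,story\n"
    "1,Offices,Microsoft has two office locations in Boston.\n"
    "2,Living,Boston is a great place to live.\n"
)

# Keep only the text column, one string per row.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["story"])
for row in csv.DictReader(source):
    writer.writerow([row["story"]])

print(out.getvalue())
```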

Results

The module outputs a dataset with one row for each recognized entity, together with its offsets.

Because each row of input text can contain multiple named entities, an article ID is automatically generated and included in the output to identify the input row that contained the named entity. The article ID is based on the natural order of the rows in the input dataset.

You can convert this output dataset to a CSV file for download, or save it as a dataset for reuse.
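
As a sketch of how the article ID ties entities back to input rows, the following Python snippet groups a downloaded CSV by article. The header names (`Article`, `Mention`, `Offset`, `Length`, `Type`) and the sample rows are assumptions made for illustration; check the header of your own export.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV export of the module output; the header names are assumed.
csv_text = """Article,Mention,Offset,Length,Type
0,Microsoft,0,9,ORG
0,Boston,38,6,LOC
1,Boston,0,6,LOC
"""

# Input rows in their natural order; the article ID is the row index.
stories = [
    "Microsoft has two office locations in Boston.",
    "Boston is a great place to live.",
]

entities_by_article = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    entities_by_article[int(row["Article"])].append((row["Mention"], row["Type"]))

for article, story in enumerate(stories):
    print(story, "->", entities_by_article.get(article, []))
```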

Use named entity recognition in a web service

If you publish a web service from Azure Machine Learning Studio (classic) and want to consume it using C#, Python, or another language such as R, you must first implement the service code provided on the web service's help page.

If your web service returns multiple rows of output, the web service URL that you add to your C#, Python, or R code should use the multi-row suffix instead of the default suffix.

For example, suppose you use the following URL for your web service:

Change the suffix of this URL to enable multi-row output.

To publish this web service, add an Execute R Script module after the Named Entity Recognition module to transform the multi-row output into a single output delimited by semicolons (;). Consolidating the rows in this way lets the service return multiple entities per input row.

For example, assume an input sentence with two named entities. Instead of returning two rows for that one row of input, you can return a single row with multiple entities separated by semicolons, as shown here:

Input text | Output of the web service
"Microsoft has two office locations in Boston." | 0, Microsoft, 0, 9, ORG,;, 0, Boston, 38, 6, LOC,;

The following code example illustrates how to do this:
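
As a minimal sketch of the consolidation logic, written in Python rather than as Execute R Script code, the snippet below joins all entity rows for one article into a single semicolon-delimited string. The function name `consolidate` and the tuple layout (article, mention, offset, length, type) are assumptions made for illustration.

```python
from itertools import groupby

# (article, mention, offset, length, type) rows for one input sentence.
rows = [
    (0, "Microsoft", 0, 9, "ORG"),
    (0, "Boston", 38, 6, "LOC"),
]


def consolidate(rows):
    """Return {article: 'fields,;,fields,;'} with one string per input row."""
    result = {}
    # sorted() is stable, so entities keep their original order per article.
    ordered = sorted(rows, key=lambda r: r[0])
    for article, group in groupby(ordered, key=lambda r: r[0]):
        parts = [",".join(str(field) for field in row) + ",;" for row in group]
        result[article] = ",".join(parts)
    return result


print(consolidate(rows)[0])
# 0,Microsoft,0,9,ORG,;,0,Boston,38,6,LOC,;
```

The same grouping would need to be expressed in R inside the Execute R Script module before the web service is published.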

Examples

This blog provides a detailed explanation of how named entity recognition works, its background, and its possible uses:

For demonstrations of text classification methods commonly used in machine learning, see the following sample experiments in the Azure AI Gallery:

Technical notes

Language support

Currently, the Named Entity Recognition module supports English text only. It can recognize names of organizations, people, and places in English sentences. If you use the module on text in other languages, you might not get an error message, but the results will not be as good as for English text.

In the future, support for additional languages may be enabled by integrating the multilingual components provided in the Office Natural Language Toolkit.

Expected inputs

Name | Type | Description
Story | Data Table | An input dataset (DataTable) that contains the column of text that you want to analyze.
CustomResources | Zip | (Optional) A zip file that contains additional, custom resources. This option is currently unavailable and is provided for forward compatibility only.

Outputs

Name | Type | Description
Entities | Data Table | A list of character offsets and entities.

See also

Text Analytics
Feature Hashing
Score Vowpal Wabbit 7-4 Model
Train Vowpal Wabbit 7-4 Model