Extracting Entities with Automated Data Wrappers



Sharing is caring!

Extracting Entity Information From Web Pages with Data Wrappers

One of the areas of SEO that is worth exploring involves Semantic SEO, which I wrote about in more detail in the post What is Semantic SEO?.

And one of the important patents I wrote about in that post involved Entity Extraction for knowledge graphs and which I covered in more detail in the post Answering Questions Using Knowledge Graphs covers association scores which tell us about how confidence is measured about the relationship between entities and their attributes, entities and their types, and entities and their classes.

A new patent from Google filed in April 2016, and granted in April 2021 adds to how knowledge found on the web is further understood and should be explored in more detail.

The description of this automated data wrappers patent starts by telling us:

“Large-scale knowledge bases, such as the FREEBASE knowledge base and YAGO knowledge base store data and rules that describe knowledge about the data in a form that provides for deductive reasoning.”

It’s worth paying attention to the information in this patent which focuses more on knowledge bases. Here we are told:

For example, in a knowledge base, entities, such as people, places, things, concepts, etc., may be stored as nodes, and the edges between nodes may indicate the relationship between the nodes.

This is important for Google since when Google launched the knowledge graph, which focuses on collecting facts and information about entities. We see information about these such as this:

In such a knowledge base, the nodes “Maryland” and “United States” may be linked by the edges of in-country and/or has stated.

The basic unit of such a knowledge base is a tuple that includes two entities and a relationship between the entities.

Some tuples are also referred to as tuples.

A tuple may represent a real-world fact, such as “Maryland is a state in the United States.”

I am reminded of a post I wrote about knowledge graphs which are probably worth a look. The post is about Google Knowledge Graph Reconciliation. In saying that, it is interesting to have this automated data wrappers patent tell us more about some of the information that is collected in a knowledge graph about the tuples it contains:

The tuple may also include other information, such as context information and other metadata. Adding entities and relationships to a knowledge base has typically been a manual process, making large knowledge bases difficult and slow.

As most patents do, this automated data wrappers patent identifies a problem that an algorithm from the patent is intended to find and solve. The problem that this patent identifies is how it can be incomplete as a knowledge base:

Thus, while the knowledge base may include information about millions of entities, the knowledge base may still be incomplete.

Such missing entities and relationships reduce the usefulness of querying the knowledge base.

The entity extraction post I wrote about tells is about one patent aimed at making it possible for Google to find information about entities on the Web on pages that aren’t considered knowledge bases and extract that information and facts associated with it into a form where it can become part of Google’s knowledge graph. The patent about using association scores to weigh confidence between entities and attributes associated with them is a way to work on making knowledge graphs stronger and more complete.

The entity extraction approach described in that previous patent isn’t the only way Google can learn about an entity on the Web and facts associated with it. So when a new patent is granted and covers extracting entities from content on the web, it is worth taking a good look at.

This automated data wrappers patent is summarized in the patent as:

Implementations provide robust systems and methods for distantly supervised wrapper induction for semi-structured documents.

Data Wrappers are not new to Google. In local search pages about a local business entity often contained an address for that business in a postal code format that would go like this:

Business Name
Street Address 1
Street Address 2
City, State. Zip Code

This patent describes data wrappers in this way:

A wrapper is an extraction engine that can map data in a document to a set of data items.

Wrapper induction is a way of learning (generating) a wrapper from a set of labeled examples.

The wrapper, once generated, is an extraction engine that can extract information from a web page.
But labeling examples by humans (supervised learning) is time-consuming. Thus, implementations automatically generate and annotate training documents in a distantly supervised manner and use the training documents to induce the wrapper.

This may also be referred to as training the machine-learning extraction model.

The trained extraction model can then identify new entities and new facts for a knowledge base from other similarly semi-structured documents.

Implementations may take as input a document source and an entity type or a group of related entity types (e.g., a collection).

While supervised data wrappers induction can collect data from a page so that we can learn more about attributes associated with an entity, site-owners can also provide information about entities using fact information in a script using schema vocabulary using markup such as JSON-LD.

The postal code data wrappers for addresses of businesses have been being used at Google for a while, and local business schema can also be used.

This automated data wrappers patent approach provides something new as well. This one goes beyond the address wrapper induction approach and tells us that wrappers can include more information:

In some implementations, the input may also include a vertical.

The vertical represents a general area (e.g., movies, baseball, cricket, national parks).

In some implementations, the document source may be specified as a regular expression of a uniform resource locator (URL).

The document source represents a set of documents that share a similar, but not necessarily identical, structure (e.g., are semi-structured).

The document source may identify a location when documents with a similar structure can be found.

In some implementations, the repository identifier may be a domain, a URL pattern, a directory name, etc.

I have written about patents with Google presenting metadata about TV episodes and songs from specific songwriters, which I am reminded of in this Automated data wrappers patent. And using a data wrappers induction approach to collect such information makes sense. This patent provides more information about the kind of sources of information that can provide specialized search results about a specific entity.

Sources that provide information like this probably do so in a semistructured format that makes it possible to collect such data. They are likely in the form of databases where data can be extracted using data wrappers induction. The patent goes on to tell us:

Implementations may use one or more knowledge bases to identify and annotate training documents from the repository.

The annotations include positive as well as negative examples.

The negative examples may include various levels of certainty, represented by a weight.

Various levels of certainty may be adjusted using a bootstrapping training process.

Once the model is trained, implementations may run the model against pages in the document source that were not used as training pages, extracting new entities and new facts for insertion into the knowledge base.

This sounds like the association score ranking for relationships and knowledge I mentioned was included in a previous Google patent. We weren’t given a lot of information about that association scoring method for knowledge, and this patent talks about machine learning and training sets:

In one aspect, a method includes identifying, from a set of semi-structured web pages, a training set of the semi-structured web pages, a semi-structured web page in the training set being a training page, and having a subject entity that exists in a knowledge base.

For each training page in the training set, the method also includes identifying target objects in the training page, identifying predicates in the knowledge base that connects the subject entity of the training page to one of the target objects identified in, and annotating the training page.

Annotating the training pages includes, for at least some target objects identified in the training page, generating a feature set for a mention of the target object, generating predicate-target object pairs for the mention, and labeling each predicate-target object pair with a corresponding example type and weight.

The method also includes training, using the annotated training pages, a machine-learning model to extract new subject entities and new facts from the set of semi-structured web pages.

This training approach is for data wrappers induction and details how that may be done.

It would likely involve a system that includes at least one processor and memory that stores a training set generation engine.

We are told that the training set generation engine will likely identify, from a set of semi-structured web pages for a domain, a training set of the semi-structured web pages, and a semi-structured web page in the training set being a training page that has a subject entity that exists in a knowledge base.

This training set generation engine may also be configured to, for each training page in the training set:

  • Identify target objects in the training page
  • Identify predicates in the knowledge base that connect the subject entity of the training page to one of the target objects identified in the training page
  • Annotate the training page by generating predicate-target object pairs for at least some target object mentions, wherein each predicate-target object pair is one of a positive example type, a probable negative example type, or a weak negative example type

It appears to be even more detailed than that and includes the following steps as well:

The memory may also store a wrapper induction engine configured to train, using the annotated training pages, an extractor model, and a missing fact generation engine configured to use the trained extractor model on non-training pages in the domain to extract new subject entities and new facts from the set of semi-structured web pages.

In one aspect, a method includes dividing a set of training pages into a first set and a second set, training a first model using the first set, training a second model using the second set, running the first set through the second model to obtain probabilities for features in the first set, and running the second set through the first model to obtain probabilities for features in the second set.

The method also includes adjusting weights of the features according to the probabilities and training a final model using the training pages with adjusted weights

Advantages Of Using a Data Wrappers Induction System

As many patents do, this one provides a list of what it refers to as “advantages” to be gained from using the algorithm involved to solve the problem identified in the patent.

The advantages cited in this description include:

  • Enabling automatic identification of new entities and facts about the new entities
  • Identifying facts about existing entities to add to existing knowledge bases
  • Providing highly robust learning with a high recall due to robustness to label noise
  • Differentiating between different kinds of negative examples in the training pages allows for handling missing labels and incomplete knowledge base predicates
  • Bootstrapped training of the model with iterative reweighting of the different categories of negative examples also enables the model to more accurately identify informative entity mentions and ignore less useful, possibly incorrect, mentions
  • Annotation of training pages is automatic, these methods and systems are easily scalable, and wrappers can easily be kept up-to-date
  • Leveraging document source knowledge from the knowledge base can improve extraction effectiveness by fixing incorrect entity links and enforcing document source consistency constraints

The Data Wrappers Induction patent can be found at:

Distantly supervised wrapper induction for semi-structured documents
Inventors: Jeffrey Dalton, Karthik Raman, Evgeniy Gabrilovich, Kevin Patrick Murphy, and Wei Zhang
Assignee: Google LLC
US Patent: 10,977,573
Granted: April 13, 2021
Filed: April 15, 2016

Abstract

Systems and methods provide distantly supervised wrapper induction for semi-structured documents, including automatically generating and annotating training documents for the wrapper.

Training of the wrapper may occur in two phases using the training documents.

An example method includes identifying a training set of semi-structured web pages having a subject entity that exists in a knowledge base and, for each training page, identifying target objects, identifying predicates in the knowledge base that connect the subject entity to a target objects identified in the training page and annotating the training page.

Annotating a training page includes generating a feature set for the target object, generating predicate-target object pairs for the mention, and labeling each predicate-target object pair with a corresponding example type and weight.

The annotated training pages are used to train the wrapper to extract new subject entities and new facts from the set of semi-structured web pages.

A Data Induction Wrapper System

This patent is about an automatic wrapper induction system with examples.

The Wikipedia page on Wrapper (data mining) tells us that a wrapper is “…a function from a page to the set of tuples it contains.”

That Wikipedia page also points out the value behind using wrappers in this way:

Extraction of such data enables one to integrate data/information from multiple Websites to provide value-added services, e.g., comparative shopping, object search, and information integration.

To be able to use an automated data wrappers can result in information integration and the generation of knowledge graphs that can answer questions.

This patent tells us of this value with these words:

The system may be used to automatically generate high-quality annotated documents used to train an extraction engine for semi-structured web pages in a distantly supervised manner.

The extraction engine may identify new entities and new facts about entities to expand a knowledge base.

Because the systems and methods described result in a trained extraction model with almost no input from a human user, the systems and methods are scalable. They can train many extraction models, which can significantly enhance the knowledge represented in the knowledge base.

We are also told that the patent shows this off as a system for annotating entity-centric semi-structured web-based documents. But it also says that other configurations and applications may be used.

It then provides the example that the source documents need not be web-based.

Also, it may not cover Web-type implementations, and the term document or page may refer to a well-defined section of information that occurs in a single physical file.

Instead of a document focusing on many attributes of one entity, the semi-structured documents may include a few attributes for many entities, with each section within the document being considered an independent page.

Formats for An Automated Data Wrappers Induction System

We are told that the automatic wrapper induction system may include a graph-structured knowledge base.

This knowledge base may store information (facts) about entities.

Those Entities likely represent a:

  • Person
  • Place
  • Item
  • Idea
  • Topic
  • Abstract concept
  • Concrete element
  • Other suitable thing
  • Any combination of these
  • May be represented by a node in the knowledge base

An Automated Data Wrappers Induction System that is a Knowledge Graph

The patent provides an example of how a knowledge graph can contain a data wrapper induction system.

Entities in the knowledge base may be related to each other by edges.

The edges may represent relationships between entities, i.e., facts about entities. (this is referred to as knowledge in many Google Patents.)

For example, the data graph may have an entity that corresponds to the actor Humphrey Bogart and the data graph may have an acted in the relationship between the Humphrey Bogart entity and entities representing movies that Humphrey Bogart has acted in.

automated data wrappers

The facts may be stored in a tuple, such as .

A tuple may be considered to have a subject, predicate, object format, with the first entity representing the subject, the relationship the predicate, and the last entity the object.

A knowledge base with many entities and even a limited number of relationships may have billions of connections.

This knowledge base may be stored externally and be accessible from the system through a network connection.

How This Training Set for the Data Wrappers Induction System Might Work

The modules may include a training set generation engine, a wrapper induction engine, and a missing fact generation engine.

The training set generation engine may receive as input a semi-structured document source and an entity type or collection of related entity types and, using this input, generate training pages for the document source annotated with positive and negative examples.

Annotated training pages with positive and negative examples may be stored in memory, for example, in positive and negative examples.

The document source may represent a set of semi-structured documents such as semi-structured web pages of a particular domain.

The web pages of a domain may have a URL that matches a regular expression for the domain.

Accordingly, in some implementations, the document source may be a sub-set of web pages.

Web pages may represent any documents accessible via the Internet.

The document source may be identified via a URL, or uniform resource locator, for example, www.foo.net or www.foo.net/*people/.

The training set generation engine may take a set of document sources for a topic or vertical.

A vertical (i.e., topic) may represent a category that relates to various document sources.

A video game vertical may be represented by the first set of document sources. A baseball player vertical or entertainment vertical may be represented by the second set of document sources.

While these sources may not have the same structure, the types of facts about the verticals are similar. As will be discussed herein, information obtained from a first document source (e.g., a first domain) can help annotate other sources in the vertical.

Thus, a vertical is not necessary but can improve the wrapper induction process and quality of facts extracted by the wrapper.

Of course, the document source may be a non-Internet source of semi-structured documents.

For example, the document source may be accessible only via an intranet or other private network, and the documents need not be web pages.

However, each document in the document source should have a similar structure.

The documents in the document source (e.g., web pages in the domain) include several profile pages.

At least some of the web pages include information (e.g., facts) about a single entity or information about multiple entities in well-defined segments with one main entity per segment.

A profile page can refer to an entire document or a segment within an entire document.

The profile pages may include many entities having a particular fact.

Those profile pages may be semi-structured because they are often generated using a template and can include similar information in a similar format for the subject entity.

The training set generation engine may use the document source (e.g., domain) and entity type(s) to identify training pages from among the documents in the document source.

A training page is a document that.

1) has at least one main entity that matches the entity type(s) from the parameters and
2) where the entity exists in the knowledge base.

In other words, the training page may have one main entity per page or one main entity per segment in a single page.

Entities and entity attribute in a document may be identified using conventional entity identification techniques (entity linking).

Entity identification may use natural language processing tools, such as part-of-speech tagging, dependency parsing, noun-phrase extraction, and coreference resolution.

The training set generation engine may estimate the topicality of each entity.

The main entity may be the most topical.

The topicality score may be based on the entity’s location in the title or the URL of the page, or the segment.

The training set generation engine may then determine whether the main entity exists in the knowledge base.

If it does not, the page is not a training page but may be used for extraction later or used in a test set.

If the entity is in the knowledge base and the correct entity type (or types), the training set generation engine may consider the page a training page.

Not all pages with the main entity of the correct entity type in the knowledge base may become annotated with training examples.

The training set generation engine may filter out some of these pages.

Generating Seed Facts for Main Entities

The training set generation engine may then identify seed facts for the main entities.

Given a semi-structured web page for a first main entity, the training set generation engine may identify target objects in the semi-structured web page.

These Target objects can be any entity type, including dates, measures, names, etc.

The target objects may be identified by using the natural language processing tools that support entity linking.

A training set generation engine may explore the neighborhood of the main entity in the knowledge base to determine facts known about the main entity, for example, by determining the paths in the knowledge base that connect the main entity to the target objects identified in the document for the main entity.

The training set generation engine may do this for each training document.

The training set generation engine may be starting from each main entity in the knowledge base, traverse the edges in the knowledge base to determine which edges end at a target object found in the document.

The training set generation engine may use a maximum path length of two to explore the neighborhood.

Some main entity-target object pairs have more than one path in the knowledge base.

Such as a person playing for and coaching the same team.

The training set generation engine may look for all paths between the subject and a target object, not just stop once one path is found. This way, it may try to build as complete a knowledge graph as it can.

Natural Language Processining to Build a Knowledge Graph

The training set generation engine may use paths in the neighborhood of the main entity to identify target objects in the training document.

The natural language processing techniques used in entity linking may not identify or resolve some target objects in the document.

To create a more robust extraction system, the training set generation engine may use string matching on the names of known relationships in the knowledge base to identify target objects on the training document.

Thus, the training set generation engine may use the paths to find target objects.

The training set generation engine may aggregate the paths across the training documents.

For examples, after the training set generation engine finds the paths in the knowledge base that link a first subject entity (i.e., main entity) with target objects in a first semi-structured training document, the training set engine may determine the number of times the same path links other main entities with other target objects.

This can occur across all target pages in a document source (e.g., a domain) or across all target pages in a vertical.

The training set generation engine may filter out (or discard) paths that do not occur with sufficient frequency.

When a path fails to meet or exceed a frequency threshold, the path may be discarded.

The threshold may be based on a percentage of the training pages.

If a path appears in 10% of the training pages in the domain, the path may fail to meet the threshold.

The path must appear a minimum number of times across the training pages, or the training set generation engine may consider the path too infrequent.

The minimum number of times may be adjusted and may depend on the structural complexity of the webpage and the machine learning algorithm used.

The aggregation may occur across the vertical, not just the domain. Thus, a path that does not occur with sufficient frequency in the domain may occur with sufficient frequency in the vertical.

The paths that occur with sufficient frequency may be referred to as seed paths or seed facts.

The training set generation engine may filter the target pages based on the seed facts.

If a particular target page does not include a sufficient number of seed facts for the main entity, the training set generation engine may remove the page from the training pages.

This is done because the page may be considered noise that does not help train the extraction engine.

Negative and Positive Training Examples

The training set generation engine may then annotate the training documents (or the training pages left after filtering) to generate positive and negative training examples.

Automatically labeling examples is difficult because there are no obvious negative training examples.

While a fact (e.g., a tuple) that exists both on the training page and in the knowledge base is a positive example, a fact on the training page that does not match the knowledge base may reflect incompleteness in the knowledge base, not an incorrect fact.

Also, the target object mentions may be ambiguous.

For example, as illustrated in the patent, India may represent the player’s citizenship as well as a team the player plays for.

The relationship between each target object with the main entity is different for the different object mentions.

For example, the plays for a relationship should only apply to the team, not the country, while the relationship should only apply to the country, not the team.

Similarly, a target object mention in a training page does not always relate to the main entity.

When a training page includes free text, the target object may be mentioned in a context unrelated to any of the seed facts.

If the semi-structured page about Indian Players included free text that mentioned a particular game played in Delhi, India, the mention of India in the free text is not related to the subject entity by any of the seed facts.

All of these examples represent noise that the training set generation engine needs to address.

To annotate the training pages, the training set generation engine may map target objects to the page context.

In other words, each target object may be mapped to the mention of the object on the training page.

Of course, because of entity ambiguity, one mention may be mapped to two target objects (if the ambiguity cannot be resolved sufficiently).

Besides, two entities may be mapped to overlapping mentions.

The text “South Africa” may be linked to the entity representing the country of South Africa, and the “Africa” portion of “South Africa” may be linked to an entity for the continent of Africa.

The training set generation engine may also generate features for each mention.

Features are based on the web page itself and describe the context of the entity mention, wherein the page the mention occurs, any identifiers associated with the mention, etc.

For example, the string itself may be a feature; the string context may be a feature; the document object model (DOM) node for the object may be a feature; the DOM siblings may be features, etc.

The training set generation engine may determine how often each feature occurs over the set of training pages and drop features that do not occur with sufficient frequency.

Sometimes the features may be derived from positive examples only.

The training set generation engine also generates predicate-target object pairs for the entity mentions.

The predicate-target object pairs represent the cross product of every seed fact that expects an entity type corresponding to the target object.

In other words, a “born on” predicate expects a date entity for the target object, and it makes no sense to pair this predicate with a target object that is a person entity type.

However, the predicate may be paired with each date target object on the training page.

Thus, if the seed facts include born on, died on, and released, each date target object found on a training page may be paired with each of these three seed facts.

Similarly, a directed by predicate may expect a person entity type, as does a played by predicate and a married to a predicate.

Each target object representing a person may be paired with the directed by, married to, and played by predicates, generating three predicate-target object pairs.

The predicate-target object pairs are mapped to a target mention and, thus, a set of features.

The target set generation engine may then label each predicate-target object pair.

The label indicates whether the predicate-target object pair is a negative or positive training example.

To address the noise discussed above, the training set generation engine may classify the training examples and assign a confidence score or weight to each training example.

The weight may be initially based on the class of the training example, with probable negative examples having a high weight and less certain or weak negative examples having a low weight.

Positive examples may be given the highest weight.

Different Classifications or Labels

The training set generation engine may use five different classifications or labels.

The first classification may be for positive examples.

Positive examples may represent a predicate-target object pair for the main entity that exists in the knowledge base.

In other words, the main entity.fwdarw.predicate.fwdarw.target object represents a tuple that exists in the knowledge base.

A predicate may link the main entity to another entity or an entity attribute, for example, a single-value attribute like a date.

Such positive examples may be labeled as positive and assigned the highest weight, for example 1.0.

Next, the training set generation engine may determine and label known negative examples.

A known negative example occurs when a predicate-target object pair contradicts a fact in the knowledge base.

This occurs when the seed fact (i.e., the predicate) is functional, or in other words, can have only a single value object.

Such examples that contradict the knowledge base may be classified as negative and given a high weight.

For example, as explained above, every date in a training page may be paired with the “born on” predicate.

But only one of the dates matches what is in the knowledge base.

The matching “born on” target object is a positive example and any other “born on” target object pairs are categorized as known negative examples.

Known negative examples may also be given a high weight (e.g., at or close to 1.0).

The training set generation engine may then look for and label probable negative predicate-target object pairs.

A probable negative occurs when a training page contains a positive example for two different predicate-target object pairs.

The training set generation engine can assume that a tuple with the predicates switched is probably a negative example.

For example, if the tuples and are both on the same training page and are positive examples (e.g., the tuples exist in the knowledge base), the training set generation engine may guess that Sue was not born in Ireland.

In other words, born in Ireland is a probable negative.

Similarly, the training set generation engine may assume that Sue did not die in England.

This may not always be correct, but it represents a reasonable assumption. Thus, the weight for probable negatives may be high (e.g., at or close to 1.0).

The training set generation engine may also look for uncertain negative predicate-target object pairs.

An uncertain negative is a weak negative label (or classification) used for one-to-many predicates.

For example, an actor entity may have many acted in relationships, for example, one for each role the actor played.

Thus, the discrepancy of an acted in-target object pair with a tuple in the knowledge base does not necessarily mean the pair is incorrect.

The training set generation engine may label a predicate-target object pair as an uncertain negative when the instance (i.e., the target object mention) has not been labeled as positive for any other predicate, and other instances (i.e., other target object mentions) on the training page are marked as positive examples for the predicate.

For example, if a sports profile page includes five team names and three of the team name mentions match facts (tuples) in the knowledge base, the remaining two team name mentions may be labeled as uncertain negatives.

The weight for an uncertain negative may depend on the number of facts the knowledge base includes for the subject-predicate.

For example, if the knowledge base includes eight teams for a player and the training page includes two additional teams, the weight of the two additional teams may be larger than if the knowledge base only includes two teams, and the training page mentions two additional teams.

In some implementations, the weighting function may be expressed as 1-e.sup.-cn, where c is low (e.g., 0.2) and n is the number of entries in the knowledge base for the subject predicate.

Another label the training set generation engine may use is not-missing-not-found.

For example, as indicated above, the training set generation engine may obtain all seed facts from the knowledge base for the subject entity.

If a seed fact exists in the knowledge base but is not found on the page (e.g., has no matching mention), the training set generation engine may consider this a weak negative.

The predicate-target object pair is weak because there may have been an entity resolution error or alias problem that prevented the correct identification of the fact on the training page.

The weight for a not-missing-not-found label may also be low, for example 0.1.

In some implementations, the training set generation engine may use all training pages for the same entity in a vertical to determine when a particular predicate-target object pair is incorrect.

If a vertical includes three training pages with Sue as the subject entity, the training set generation engine may use a voting scheme to determine which single value facts on the training pages are correct.

In some implementations, when Sue is a subject entity on three training pages, the training set generation engine may decrease the weight of a fact that appears in all three training pages but does not appear in the knowledge base, or change the label on such a fact from negative to positive.

For example, if the three pages each indicate Sue acted in a particular role or movie, the system may label these as positive examples rather than uncertain negative.

As another example, the training set generation engine may extract several compatible positive examples across a vertical for a given subject entity and weigh more specific facts more highly.

Thus, if the predicate-target object pairs for a particular entity include and the training set generation engine may lower the weight of the more general () for a positive example.

As another example, the training set generation engine may use repository compatibility to adjust the weights.

In a basketball domain, birthdates for players do not reach much before the 20th century, when the game was invented.

Because of this, a birth date before the 20th century would be doubtful given the overall distribution of birthdates for players.

Thus, a fact that appears to make a birth date outside the overall distribution can be adjusted to reflect the likelihood that the fact is false.

Also, other external knowledge systems can be used to estimate the likelihood of a fact.

If a predicate-target object pair cannot be labeled with one of the labels above, it may be labeled as missing.

A missing label is another uncertain negative and may be given a low weight (e.g., 0.0.1).

In some implementations, the training set generation engine may adjust the weight assigned to a predicate-target object pair by a confidence factor.

This may occur when the same predicate-entity pair appears multiple times (i.e., with different mentions) on the training page.

Each instance may represent the same fact (same subject-predicate-object), and the system may adjust the weight to favor the most confident instance.

The confidence factor may represent confidence that the entity linking is correct.

The system may use an entity linking the engine to identify the target objects on the training page.

The entity linking may include a confidence score for each possible target object.

The text India may refer to the country or a team.

The entity linking may give a first confidence score with the country entity and a second confidence score with the team entity, based on the context.

The training set generation engine may adjust a weight based on the confidence score.

The confidence score may have a value between one and zero inclusive.

As one example, the predicate-target object pair of “born in”–India may be associated with two different entity mentions on a training page.

The first mention may have the confidence of c.sub.1, and the second mention may have the confidence of c.sub.2.

In some implementations, the training set generation engine may adjust the weight of the born-in-India pair for the first mention according to .times. ##EQU00001## where w.sub.1 is the weight assigned to the pair based on the label for the mention (e.g., positive, uncertain negative, missing, etc.).

Similarly, the training set generation engine may adjust the weight of the born in-India pair for the second mention according to

.times. ##EQU00002##

The training set generation engine may thus automatically generate various training examples for each page.

These training examples may be stored in the positive and negative examples. The positive and negative examples may be used to train a set of binary classifiers (one per predicate).

The automatic wrapper induction system may also include a wrapper induction engine.

The wrapper induction engine may use the positive and negative training examples to train the extractor model, generating the wrapper.

In some implementations, the wrapper induction engine may use conventional training techniques.

In some implementations, the wrapper induction engine may use a bootstrapping training method that includes iteratively re-weighting the negative examples.

For example, the wrapper induction engine may divide the training pages into a first group and a second group and use the first group to train a first model and the second group to train a second model.

The wrapper induction engine may then run the second model on the training pages in the first group and run the first model on the training pages in the second group.

The result of running the models is a probability for each labeled training example. The wrapper induction engine may adjust the weights assigned to some predicate-target object pairs based on the probabilities.

For example, if the born-in-India pair occurs three times on the page, each instance receives a probability.

If the probability for a first mention is less than or equal to 50% but the probability for another mention is greater than 50%, the wrapper induction engine may set the weight of the first mention to zero.

This is because another mention on the page is more relevant.

If the probability of the first mention is greater than 50%, the system may normalize the weights of all positive (greater than 50%) examples.

Any non-positive (less than 50% are set to zero. If no mention is positive, the system may normalize the weight overall mentions. In some implementations, the reweighting may be expressed as

.times..times..ltoreq..times..times..times..times..E-backward.>>.ti- mes..times..times.>.times..times..times..times..E-backward.>.times. ##EQU00003## where w.sub.i is the weight for the instance i, and p.sub.i is the probability for the instance i of the predicate-target object pair and p.sub.j is the probability of another instance j of the predicate-target object pair. The reweighting allows less relevant mentions to fall out.

The wrapper induction engine may repeat the training, running, and reweighting at least a second time.

Thus, for example, the first model is trained with the first set of training pages (with reweighted examples), and the second model is trained with the second set of reweighted training pages.

The models are then run against the set of training pages used to train the other model, producing a second round of probabilities, and the system adjusts the weights of the mentions as described above.

The iterative reweighting improves recall significantly. After the second iteration, the system may use the reweighted training pages to train a new model.

In other words, the two sets of reweighted training pages are used to train a new model. In some implementations, the system may use a random forest model, although linear and non-linear SVMs and boosted trees can also be used.

The wrapper induction results in a trained extractor model, which can be run against other documents structured similar to the training pages.

The trained extractor model (or wrapper) is specific to pages with a structure similar to the training pages.

In other words, if a vertical has five different document sources (e.g., domains), the automatic wrapper induction system may generate five different wrappers, one for each domain.

However, because the wrapper induction is distantly supervised, such generation is fast and can be easily scaled to many domains.

In other words, the system has no difficulty generating many trained extractor models.

Finding Missing Facts About Entities

The automatic wrapper induction system may also include a missing fact generation engine.

The missing fact generation engine may run the trained extractor model (wrapper) against pages in the domain to extract new facts to expand the knowledge base.

At an initial time, the pages may include the training pages (as there may be facts on the training pages that were not already in the knowledge base that can now be identified) and any other pages not used as training pages in the domain that fit the template.

Later, the wrapper may be used against new pages or updated pages so that the knowledge base may be kept current.

The missing fact generation engine may continue to use the knowledge base to refine the extracted facts and remove errant facts.

The missing fact generation engine may also use knowledge about the domain to help identify and remove errant facts.

In some implementations, after using the wrapper (trained extractor model) to extract facts from the pages in the domain, the missing fact generation engine may group predictions for new facts from the mention-level to the object-level by taking the maximum probability across tuples with the same predicate and object value, but different mentions.

The missing fact generation engine may also eliminate functional relations (or predicates that can only have one value) based on prediction probabilities.

For example, if the missing fact generation engine finds a tuple and a tuple, the missing fact generation engine may select the tuple that has a higher probability and discard the other. In some implementations, functional predicates may be identified by computing statistics from the seed fact identification process.

In other words, even if the predicate is not functional, it may be functional for the domain because the web pages only list one value of the predicate per page.

The missing fact generation engine may also employ subject linking.

For example, some profile pages may be for an entity not yet in the knowledge base.

However, if the subject entity has the same name as an existing entity in the knowledge base, the system may incorrectly identify the mention with the existing entity.

The missing fact generation engine can avoid such a false positive by checking to see if any of the extracted facts from the page conflict with known facts for the candidate subject.

When they do conflict, or when enough of the conflict, the missing fact generation engine may assume the entity is new and can add the entity and the extracted facts to the knowledge base.

In some implementations, the missing fact generation engine may enforce non-overlapping constraints.

For example, during entity linking, the system may have treated the text, South Africa, as one entity or may have considered south an adjective for Africa.

Thus, the system may identify the entity South Africa corresponding to “South Africa” and Africa corresponding to just “Africa.”

Because the two entities overlap, the missing fact generation engine may, if the probability associated with each is the same, break a tie by picking South Africa because it spans more text.

An automatic wrapper induction system may be in communication with the client(s) over a network.

Clients may allow a user to provide parameters to the training set generation engine and direct the training and/or evaluation of the extractor models.

The network may be, for example, the Internet or the network can be a wired or wireless local area network (LAN), wide area network (WAN), etc., implemented using, for example, gateway devices, bridges, switches, and/or so forth.

Via the network, the automatic wrapper induction system may communicate with and transmit data to/from clients.

In some implementations, an automatic wrapper induction system may communicate with or include other computing devices that provide updates to the knowledge base 150 and web pages.

For example, an automatic wrapper induction system may include or be in communication with an indexing engine that crawls web servers for documents and indexes the contents of the documents.

An automatic wrapper induction system represents one example configuration, and other configurations are possible.

Besides, components of the system may be combined or distributed in a manner differently than illustrated.

For example, in some implementations, one or more of the missing fact generation engine, the training set generation engine, and the wrapper induction engine may be combined into a single module or engine.

Also, components or features of the missing fact generation engine, the training set generation engine, and the wrapper induction engine may be distributed between two or more modules or engines.

A Semi-Structured Web Page in a Domain for a Sports Vertical

A domain refers to a source of documents available or generated from the same source.

For example, www.example.com represents a domain, but a directory can also be a domain.

In some implementations, the domain may include additional paths, such as www.example.com/players/or www.example.com/movies/profiles/

In some implementations, the domain may include a regular expression, like www.example.com/*/profiles.

As used herein, the domain or document source refers to an identifier used to locate documents with a similar structure.

The example is an example of a single entity profile page for the main entity.

In other words, most of the facts on the page are attributes of the main entity and not other entities.

The main entity may be identified as such due to the font size, presence at the top of the information, etc.

Example facts illustrated for the main entity can include gender, birthdate, country of citizenship, position, teams that the entity has played for, etc.

The main entity may also be referred to as the subject entity in a single entity profile page.

The semi-structured web page may be considered semi-structured because the domain (e.g., bigbashboard.com/player/) includes web pages with a similar structure for other entities.

Such pages can be generated using a template. The drawing of the team roster is an example of another semi-structured web page in a domain for a sport vertical.

The example of the team roster is a document that includes multiple profile pages, where each profile page is a defined segment in the document.

In other words, the webpage of the roster has multiple entities per page, with each entity sharing a similar set of attributes.

While a single entity profile page is used herein as an example, the automatic wrapper induction system can be modified to work in a similar manner using a web page that includes multiple entities segmented into multiple profile pages, such as the rows of players in the pictured roster.

Automatic Data Wrappers Induction and Use

This process may begin with the automatic wrapper induction system identifying and annotating training pages (e.g., positive and negative examples in training pages) for a document source using information from the knowledge base.

In some implementations, the system may be provided with a document source and an entity type or collection of related entities and may automatically identify pages in the document suitable for use as training pages and may automatically annotate the training pages with positive and negative examples, as explained in more detail concerning FIG. 5.

As part of annotating the training pages, the system may determine seed facts for the document source, or in other words relationships (links in the knowledge base) that are found in the semi-structured web pages associated with the document source.

The system may then use the training pages for wrapper induction, or in other words, training an extraction model for the document source.

In some implementations, the training may use conventional methods.

In other implementations, the training may use a bootstrapped method with iterative reweighting of the negative training examples.

The trained model (or wrapper) may be used on similar semi-structured web pages associated with the document source to extract information from those pages.

In other words, the system may identify new facts that can be added to the knowledge base by using the extraction model on the other semi-structured pages associated with the document source.

The facts can be new seed facts for an entity or can even include a new entity.

The system may perform steps many times after the model has been trained. For example, steps may be performed once the model has been trained and periodically for new pages added to the document source.

The wrapper for the document source may continue to be used to extract new facts from the document source as long as the structure of the semi-structured web pages remains similar.

Automatically Generating Training Examples for Wrapper Induction

entity training pages

The process may be performed, for example, by a training set generation engine in an automatic wrapper induction system.

The process may use a knowledge base to identify pages in a document source for which facts are already known, identify seed facts for the document source, and use this information to annotate the training pages.

The seed facts are facts relevant to the document source and, in particular, found across many of the training pages.

The training annotations, or examples, may be positive or negative.

The negative examples may be categorized or labeled and may be associated with various weights, depending on the category or label for the example, as explained below.

While the process described below describes web pages in a domain, it is understood that process operates on any set of similarly structured documents, including non-web-based documents associated with a particular document source.

The system may receive a document set and an entity type (or types) that the wrapper will apply to.

In some implementations, a vertical may also be provided.

The vertical represents domains for a similar topic, such as sports, movies, restaurants, etc.

In the example below, the document set is a domain, but the process applies to other kinds of document sets with documents of similar structure.

The system may use the domain and the entity type to identify training pages in the vertical.

Each training page may be associated with a respective subject entity when the training pages are single entity profile pages.

As discussed above, the profile pages may be a web page or maybe a section of a web page.

For example, a movie wiki may include a table-like structure for movies, with each row (or a collection of rows) representing a single movie.

Each row (or collection of rows) can be considered a different training page.

Thus, for example, each row from the team roster may be considered a different training page.

Washington Redskins

For example, the system may crawl the domain to determine each document (web page) for the domain.

The pages associated with a domain may include single entity profile pages as well as other pages.

The system may use a heuristic to eliminate non-profile pages and to determine which of the remaining pages have a subject entity that exists in the knowledge base, as explained in more detail in FIG. 6.

Pages associated with the domain that are profile pages for an entity that exists in the knowledge base are training pages.

The system may then inspect each training page and identify target objects found in the web page or, in other words, in the content of the web page.

Target objects can include entities, names, and single-value attributes, such as dates, measures (height, weight, etc.), etc.

The single-value attributes are attributes of an entity (e.g., a particular person’s birth date) that can have one value per entity.

For example, a person may have two phone numbers but may only have one birth date. Sometimes single-value attributes can also be referred to as entities.

Entity linking using natural language processing tools may be used to identify the target objects.

An example of natural language processing tools is the Stanford Core NLP package, available at http://nlp.stanford.edu/software/corenlp.shtml.

That step may be performed on each training page.

In other words, following that step may result in a subject entity and corresponding target objects for each training page.

The system may use the subject entity and target objects and find paths in the knowledge base that connect the subject entity to any target objects.

The system may perform this for each training page (e.g., finding paths from each subject entity to corresponding target objects).

In some implementations, the path may directly connect the subject entity to a target object.

In some implementations, the path may have a length of two.

The paths may thus represent predicates from the knowledge base.

The paths represent seed facts or, in other words, seed predicates.

Because the system identifies seed predicates for each training page and repeats step 520 for all training pages for a domain, the system generates a set of seed predicates.

In some implementations, the system may reverse.

For example, the system may use paths in the neighborhood of the subject entity to identify target objects in the training document.

For example, the natural language processing techniques used in entity linking may not identify or resolve some target objects in the document.

To create a more robust extraction system, the system may start from the subject entity in the knowledge base and use string matching on the names of relationships in the knowledge base that also appear in the training page to identify target objects in the training page that may not have been tagged as a target object originally.

Thus, the system may use the paths to find target objects.

The system may filter the set of seed predicates to eliminate those that are outliers.

For example, if a seed predicate does not occur with sufficient frequency in the set of training pages, the system may remove the seed predicate from the set.

In other words, if a particular seed predicate fails to occur in a minimum percentage of the training pages, the system may drop the particular seed predicate.

This ensures that the seed predicates in the seed predicates are consistently expressed on the training pages.

In some implementations, the system may also compute observed functionality across all training pages.

Observed functionality occurs when the same seed predicate occurs once per subject entity.

In other words, the seed predicate has a one-to-one relationship across all (or most of) the training pages.

This information can help select extracted facts and control for noise, as explained in more detail below.

In some implementations, the filtering of paths may be performed across a vertical.

As explained above, a vertical is a collection of domains that all have common subject matter.

The seed facts may, thus, be similar across the set of domains.

Accordingly, if a seed fact does not occur frequently enough in the domain, it may still occur with sufficient frequency in the training pages of all domains for the vertical.

The system may also optionally filter the training pages.

Some of the training pages may be noisy and unlikely to be true training pages.

The system may identify a training page as noisy when the training page has few of the seed paths for the entity.

Few may be determined as a percentage of the seed predicates.

For example, if a training page only includes 20% of the possible seed predicates, the system may consider the training page noisy.

The system can also filter training pages based on the absence of key attributes.

For example, in some implementations, the system may require a birth date attribute for a person entity, and if a training page lacks the birthdate, it may be considered noisy.

The system may remove the noisy training page from the set of training pages.

In other words, the noisy training page may not be annotated and used to train the extraction model.

The system may then obtain facts (tuples) from the knowledge base that include the subject entity as the subject and any of the seed predicates as the predicate.

This ensures that the system has all relevant facts for the subject entities, relevance being determined by the seed predicates.

This information can assist the system in identifying and labeling training examples.

The process then ends, having generated several thousand training examples for use in wrapper induction or training an extraction engine for the domain.

Identifying Training Pages and Their Subject Entities

topical entities

This process may use a knowledge base to identify pages in a domain that have a subject entity that exists in the knowledge base.

The pages identified by the process are training pages for the domain.

The system may then use the training pages to identify seed facts for the domain and to annotate the training pages with positive and negative training examples, as described above.

The process is illustrated for a single page in the domain, but the system may execute the process for each page in the domain to determine which pages are possible training pages.

Although the system may identify a page as a training page via process, the page may be filtered out later.

Thus, the training pages identified by process 600 may also be referred to as potential training pages or possible training pages.

First, the system may perform entity linking on the content of the web page.

As discussed above, the system may use any conventional entity linking (entity identification) techniques.

Entity linking uses natural language processing tools, such as speech tagging, dependency parsing, noun-phrase extraction, and coreference resolution to match a mention with an entity in the knowledge base.

The mention can be text-based and, in some implementations, may be image-based.

For example, some knowledge bases may include images associated with an entity, and entity linking may be accomplished via image similarity.

As part of entity linking, the system may associate a confidence score with each entity found on the page.

The confidence score represents how certain the system is that the entity in fact corresponds with the text or image that represents the entity mention.

For example, the text Jaguar may refer to either the car, the animal, or the operating system.

The entity linking process uses context to determine which entity is the most likely match, and the confidence in the match may be reflected by the confidence score.

The system may score the entities by topicality (610).

In other words, the system may calculate a topicality score for each entity.

The system may use features such as:

  • Location of the entity mention in a title
  • Location of the entity mention in a URL
  • The font size used for the entity mention
  • The number of times the entity is mentioned on the page (entity frequency)
  • The presence of the entity mention in anchor text from other websites to the webpage
  • Relationship connectedness in the knowledge base
  • Etc.

The system may consider the main or subject entity to be the most topical, or in other words, the entity with the highest topical score.

The system may then attempt to locate the subject entity in the knowledge base (615) using conventional techniques.

For example, the knowledge base often has one or more descriptions for an entity that can match the entity mention on the web page.

If an entity in the knowledge base does not match the main entity on the web page (620 No), the web page is not a training page.

In other words, the web page is not used for wrapper induction but can be used by the wrapper to extract new facts.

If an entity matching the main entity is found in the knowledge base (620, Yes), the system may determine whether the entity is of the correct type.

For example, if the entity type is a person, but the main entity for the web page is a movie, the main entity is not the right type (630, No), and the page is not used as a training page (625) but may be used for extraction later, or may be used in a test set.

If the entity in the knowledge base is of the expected type (630, Yes), the page is a possible training page with a main entity of the right type.

Of course, the training page can be filtered out at a later stage, for example, because it does not include a sufficient number of seed facts or includes a sufficient number of facts that disagree with the facts in the knowledge base for the entity.

This can occur, for example, when a new entity (e.g., not known in the knowledge base) has the same description as a known entity.

As discussed above, the system may discover that all facts or nearly all facts on the page are negative examples and remove the page from the training set. It likely includes a new entity and not information on a known entity.

The process then ends.

Identifying and Labeling Training Examples in Training Pages for Wrapper Induction

target objects

This process may use facts in a knowledge base to automatically label training examples on a training page.

The training examples can be positive or negative, and the negative examples can be one of several categories, each category having a different weight.

The weight represents the confidence or uncertainty that the example is actually negative. The process is illustrated for a single training page, but the system executes a process for each training page.

The process starts by getting context from the page for each target object found on the page.

In other words, the system may map the target object to the place or places in the web page where the target object occurs.

As indicated above, the target object may occur in multiple locations within the training page and may be associated with a confidence score for the location.

Each location is considered a mention or instance of the target object.

Each mention or instance is also associated with context-specific for that mention.

The context includes words surrounding the mention, the portion of the DOM the instance occurs in, etc.

As a result of mapping target objects to locations, each mention in the training page may have more than one target object mapped.

This may occur, for example, when more than one target object has the same text description.

The system may generate features for each of the mentions.

In some implementations, the features may include the mention string itself, or in other words, the text that represents the instance.

The features may also include the text that occurs before and after the mention string.

For example, the system may include several characters that occur before the mentioned text and many characters that occur after the mentioned text.

For example, the quantity maybe 80 characters, 40 characters, 100 characters, etc.

Another feature the system may extract is the DOM node for the mention.

This can include the top-down and bottom-up paths (XPath) from the root or other stable node in the DOM, the element ID, and class.

In some implementations, the system may also extract the DOM siblings; for example, the left and leftmost sibling and the left and leftmost table row siblings of the object are in a table.

In some implementations, the system may use CSS (cascading style sheet) style features, such as the name, class, identifier, and font information as features.

In some implementations, the system may use table position (where applicable), such as a first row, a particular column, etc., as a feature.

In some implementations, the system may use the text features, such as size, alignment within the surrounding block element, font, capitalization, etc.

In some implementations, the system may use mention features, such as the size and/or length of the mention, word embeddings from word2vec, the entity type of a target object, etc.

Of course, these are given as non-limiting examples, and the system may use other features not listed.

The system may generate predicate-target object pairs using the mapping and features.

As explained above, the system may determine a set of seed predicates applicable to the document source.

The system may pair each seed predicate with a target object identified in a training page.

Because each target object is also mapped to one or more instances or mentions, and each mention has a set of features, the outcome of generating the predicate-target object pairs may be a training example in the form where the subject is the main entity, the predicate is one of the seed predicates for the document source, a target is a target object, and Features are the features associated with a particular mention.

Thus, for example, if a training page has four seed predicates, a target object that occurs in two different mentions, generating the predicate-target object pairs may result in eight training examples (i.e., the four seed predicates paired with the target object at each of the two mentions).

Some of the examples may not map to a mention.

For example, the system obtains all seed facts from the knowledge base known about a subject entity.

The objects of these facts may not exist on the web page. Thus, a training example may not be associated with a mention or with a set of features.

Such examples are used, as explained below, in categorizing other examples and in training the model. In some implementations, the system may filter out any training examples where the target object does not fit the type expected by the seed predicate. For example, some seed predicates may expect a date, and the system may discard any pairings where the target object is not a date.

Similarly, some seed predicates may expect a person, and the system may filter out any pairs where the target object is not a person.

In some implementations, the system may keep such pairs but categorize them as not applicable or irrelevant.

The system may categorize the training examples, assigning each example to a category.

The system may first find and identify positive examples.

Positive examples represent examples where the subject-predicate-object of the example matches a fact in the knowledge base.

Such examples are categorized as positive and given the highest weight, for example 1.0.

In some implementations, this weight can be adjusted based on features for a mention.

For example, a training page may include the text “India,” which can refer to either the country or the cricket team.

The knowledge base may include two facts for the subject entity of the training pages, one being a “born in” relationship with India and another being a “played in” relationship with the team. Both facts are in the knowledge base, but the particular mention can only be a positive example of one of the two facts.

Thus, for example, if the text appears in a position where a team name is expected, the played-in-India pair may be marked as positive, and the born-in-India pair may not.

In some implementations, adjusting may occur after processing all training pages in the document source so that the system learns which types of entities occur in various locations.

In some implementations, the system may adjust the weight based on the particular vertical or document source.

Thus, for example, the team entity may be given a boost in a sport vertical, where otherwise the country would be the more likely interpretation.

Once labeled, the training example can be expressed as where the positive example is “Positive” or some value representing a positive example.

Once an example has been labeled, it is not considered again (e.g., a Positive example does not later change its category to one of the negative labels).

Once positive examples are identified, the system may find and label known negative examples.

Known negative examples are examples where the subject-predicate-object directly contradicts a fact in the knowledge base.

For example, some predicates are functional or single value, in that these predicates can only have one target object (a one-to-one relationship).

Examples of such predicates are birthdates, gender, etc. If an example includes one of these predicates paired with a target object that does not match the target object in the knowledge base, the system categorizes the example as Known Negative and may assign the highest weight to the example, for example 1.0.

The system may then find and label probable negative examples.

A probable negative exists where the page has two positive examples, one for predicate-target object pair p1-o1 and another for predicate-target object pair p2-o2.

When p1 and p2 are distinct (not equal), and o1 and o2 are distinct, the system assumes that any pairing of p1 and o2 is a negative example and a pairing of p2 and o1 is a negative example.

While this assumption may not always be true, it is a reasonable guess.

In some implementations, probable negatives are given a high weight.

In some implementations, the weight maybe 1.0.

The system may then find and label uncertain negative examples.

Uncertain negative examples are an example of a weak negative example that applies to non-functional or one-to-many predicates.

With such predicates, a subject entity may be linked to many objects via this predicate.

An example of such non-functional predicates is played for, which represents the teams a person has played or is playing for or acted in, which represents the movies an actor has acted in.

When an instance (a mention) is not associated with a positive example (e.g., another predicate represents a positive example for this instance) and some other instance is labeled as positive for the predicate on the training page, the system may label the example as an uncertain negative.

For example, in FIG. 2, the knowledge base may include facts for three of the five teams listed on the page.

Thus, the three known teams may be labeled as positive examples for the predicate plays.

The remaining two teams, which are not known to the knowledge base, maybe categorized as an uncertain negative, as the example may actually be a correct fact.

Because uncertain negative examples are weak, they may be given a lower weight.

In some implementations, the weight may depend on the completeness of the knowledge base.

For example, if the knowledge base includes n instances of a predicate for the subject entity, the weight of the uncertain negative may increase with n.

In some implementations, this may be expressed as 1-e.sup.-cn, where c is low (e.g., 0.2).

The system may then look for and categorize not-missing-not-found predicate-target pairs.

Not-missing-not-found predicate-target pairs represent facts known in the knowledge base for the subject entity that is not associated with a mention on the page.

This is another example of a weak negative, as the fact may be expressed on the page. Still, entity linking errors or aliasing problems may result in not identifying the target object.

Such examples may be given a small weight, such as 0.1. Any remaining examples the system may categorize as missing.

Typically, the remaining examples are functional predicates because the multi-valued predicates are addressed by the other uncertain negative categories.

For example, a subject entity may lack a birth date in the knowledge base; the system may pair each date with the born predicate (when this is a seed fact).

Such examples are weak negatives. The weight assigned to such an example may vary.

For example, it may be proportional to the number of times the missing fact for that predicate occurs on a page (e.g., 1/M where the fact occurs M times).

In some implementations, the weight may be based on vertical-specific logic (e.g., how much it deviates from an overall distribution for the vertical), or even how compatible the fact is with positive facts (e.g., a person cannot be born after their date of death). In some implementations where the system has not filtered out examples where the target object does not match the type expected by the predicate and has not already labeled these pairs as not applicable, the system may give such pairs the label of not applicable before labeling remaining pairs as missing.

Finally, in some implementations, the system may adjust the weights assigned to examples.

In some implementations, the adjustment may be based on the confidence score associated with the target object in the example.

As indicated above, each target object may be given a confidence score as part of the entity linking step. In some implementations, the weight of an example may be multiplied by the confidence score for the target object.

In some implementations, the system may normalize the weight of a fact that appears multiple times on the training page.

As indicated above, each instance of a target object may be associated with a confidence score from the entity linking.

Thus, when a target object appears multiple times, each instance may have a different confidence score, but each instance will have the same weight w.

The system may normalize the weight w across all the instances, for example, by using the following formulae:

.times. ##EQU00004## and

.times. ##EQU00005## where c.sub.1 is the commence score for the first instance, c.sub.2 is the confidence score for the second instance, and the first formula is used to adjust the weight w.sub.1 of the first instance and the second formula is used to adjust the weight w.sub.2 of the second instance.

In some implementations, the system may adjust the weights based on “approximate” or fuzzy matching for numeric and location values.

For example, the system may use Geographic containment. Geographic containment may boost the weight of a negative fact (make it a weaker negative) when the fact is geographically similar.

For example, if the knowledge base includes the fact and the training page includes , the system may make the fact a weaker negative.

As another example, the system may use numeric mismatch for single-value or functional predicates, especially those that can change slightly over time or vary slightly in how they are reported, such as height and weight.

Thus, the system may use a fuzzy matching technique. For example, the system may match a numeric value height 121 cm with 122 cm, but not 135 cm.

In some implementations, the system may use a simple percentage-based matching (within a given percentage, say +/-10%).

In some implementations, the system may also leverage vertical-specific population distribution to create tighter bonds.

For example, to estimate differences between actual vs. expected attribute values where the system has high confidence in the values. The resulting positive (or negative) examples can be weighted by their specificity proportionately to how close they match the observed knowledge base values.

The process then ends, having automatically generated positive and negative training examples for the training page.

Of course, the system may repeat the process for each training page in the document source (and for each source across the vertical).

Iterative Reweighting of Negative Training Examples in a Bootstrapped Training Process

Training Pages

This process may be performed by a wrapper induction engine in an automatic wrapper induction system.

This process uses a bootstrapped training process to reweight negative training examples to increase recall in the model.

The bootstrapped training process can improve recall by 5-10%, especially for multi-valued attributes (e.g., a list of all teams a player has ever played for).

The reweighting occurs over a predetermined number of training and application cycles.

The reweighted examples are then used to train the final extraction model.

The process begins with the system dividing the annotated training pages into groups or portions.

The system divides the training pages into two portions, a first portion, and a second portion.

However, the system may divide the training pages into three, four, or more portions.

This process is understood to represent k-fold cross-validation and is not limited to the number of portions illustrated.

The system may then use the first portion to train a first test extraction model and use the second portion to train a second test extraction model.

The system may then apply the second test model to the first portion of the pages and apply the first test model to the second portion of the pages.

In applying the model, the system generates a probability or confidence for each mention. The system may use the generated probabilities to reweight the examples.

For example, if a “born in” India predicate-target object pair (i.e., example) occurs three times on the page, each instance receives a probability.

If the probability for a first mention is less than or equal to 50%, but the probability for another mention is greater than 50%, the system may adjust the weight of the first mention to zero.

This is because the pair has a negative probability, and another mention on the training page is more relevant (i.e., has a positive probability).

If the probability of the first mention is greater than 50% and the second mention is also greater than 50%, the system may normalize the weights of all positive (greater than 50%) examples.

In this case, any negative (less than or equal to 50%) examples are set to zero.

If no mention is positive, the system may normalize the weight overall mentions.

In some implementations, this reweighting step may be expressed as

.times..times..ltoreq..times..times..times..times..E-backward.>>.ti- mes..times..times.>.times..times..times..times..E-backward.>.times. ##EQU00006## where w.sub.i is the weight for the instance i, and p.sub.i is the probability for the instance i of the predicate-target object pair and p.sub.j is the probability of another instance j of the predicate-target object pair.

The reweighting allows less relevant mentions to fall out.

The system may then determine whether to repeat the training and model application steps.

The system may repeat the steps a predetermined quantity of times.

In some implementations, the predetermined quantity is two.

If the predetermined quantity has not been reached, the system may use the reweighted training pages to train the model and apply the trained model.

If the predetermined quantity has been met, the system may use the reweighted examples in the training pages to train a final model.

In training the final extraction model, the system may use all the annotated training pages. The process then ends, and the final extraction model is ready to be applied to other similarly structured pages in the domain.

Adding and Removing Facts Extracted Using Data Wrappers

entity facts

This process may be performed by a missing fact generation engine in an automatic wrapper induction system.

This process demonstrates how the system may use domain knowledge and information about the knowledge base schema to refine extracted facts and correct errors in the entity linking.

The process also uses statistics of observed predicate functionality to remove errant facts.

Thus, the process helps ensure that facts extracted using the wrapper are suitable for insertion into the knowledge base.

The process may begin by applying the wrapper (or final trained extraction model) to the pages in the domain.

The pages may be training pages and other pages that have a structure similar to the training pages.

The pages may be all pages associated with a domain, or in other words, all pages that satisfy a regular expression used to represent the domain.

The application of the wrapper to the pages generates a probability for various facts identified in the pages, the fact being represented by a subject-predicate-object pairing.

The system may group predictions from the mention-level to the object level by taking the maximum probability across identical facts.

The system may also identify any functional (i.e., one-to-one) predicates. If there are multiple facts with this predicate for the same subject, the system may keep the fact with the highest probability.

The system may identify functional predicates by computing statistics from the seed fact identification process.

Some predicates may be non-functional in the knowledge base, but the system may determine that the predicate is functional within the domain.

Thus, the system may consider the predicate as functional for the domain only, eliminating facts with lower probabilities to increase the chance that the system saves the correct fact.

The system may also use a fact’s probability to select from between facts where the subject and predicate are the same, but the object differs. Still, the different objects are linked to a textual mention that overlaps (920).

For example, the text “Bob was born in South Africa” may lead to the two facts and (the latter because the entity linking views South as an adjective of Africa, not as a noun phrase.

If one fact has a higher probability than the other, the system may discard the one with a lower probability.

In a tie, the system may select the fact that represents the longest text span.

The system may determine whether the facts for a subject entity conflict with facts in the knowledge base.

When most of the facts actually conflict, the system may have incorrectly tagged the subject entity as an existing entity rather than a new entity.

For example, a lesser-known sports figure may have the same name as a well-known sports figure.

The well-known sports figure may be represented by an entity in the knowledge base, while the lesser-known sports figure is not.

If the system incorrectly identifies a mention of the lesser-known sports figure with the well-known figure, the system can catch this misidentification by checking the facts in the knowledge base.

If the facts (or a majority of the facts) contradict what is known about the subject entity, the system may assume the subject is actually a new entity and proceed to add the new entity to the knowledge base and add the extracted facts to the new entity.

If the knowledge base does not include an entity matching the subject entity, or if the facts for a known subject entity do not conflict, the system may add the facts, including adding a new subject entity when needed) to the knowledge base.

The system may use data wrappers to add facts that have a positive probability (e.g., greater than 50%).

The threshold for adding a fact maybe even higher, for example, 70%. Process 900 then ends, having extracted new facts from the web pages in the domain using a wrapper trained using automatically annotated training examples.

Of course, the system may perform a process periodically.

As new pages are discovered (e.g., identified during a web crawling process), the system may apply data wrappers to those newly discovered pages.

Thus, the system may add facts to the knowledge base as additional semi-structured pages are added to the domain.

Sharing is caring!





Source link

https://www.cupbord.com/extracting-entities-with-automated-data-wrappers/

Comments

Popular posts from this blog

Biden inauguration prayers: O’Donovan and Beaman represent the new president’s priorities

تحاور الحكومة الصينية فى أكثر من 45000 تطبيق من متجرها Apple بسبب اللوائح

قد يبدأ الفيسبوك Facebook في مراقبة سلوكيات الممثلين السيئين من خلال النظام الأساسي المخفي القائم على الروبوت