Knowledge representation homework 2016

KR homework as a part of a two-phase project

The knowledge representation (KR) homework is the first part of a project to create a small and simple system for understanding natural language (English). The second part of the project is the homework for the reasoning and deduction (RD) block. The pairs formed for the KR homework will be used for the RD homework as well.

We assume the English texts contain fictional or non-fictional information of the kind you would expect to find in generic worldwide news articles. There is no need to analyze complex sentences, and it is OK to make mistakes.

Administrative

Homework defence deadline: 3 November. Presentations after this deadline earn half of the points. The absolute deadline is the end of November: no submissions will be accepted after December 1.

Work should be submitted to git at the latest one day before the deadline.

New groups and repositories are available: they will be the same for both this lab (2) and the next (3).

What you have to do

In the first (KR) part you have to make a command line program capable of reading a plain text file in English and printing the information it has read and understood in a logic-based formalism: you have to be able to output the result in RDF. The exact syntax is of your own choosing.

The text should be in one of three domains, choose the one you like:

Movies
Flights
Weather reports

You do not have to understand complicated texts: start by inventing your own small example texts in the chosen domain.

You do not have to determine the exact meaning of ambiguous words and phrases: instead you should output a list of possible meanings.

The second part of the project - RD homework - will focus on removing ambiguity and on answering questions about the information in the text.

Steps and examples

The final result of the KR lab is:

a program which takes an English text file from the command line and prints the result of parsing as RDF-based representation of triples
several example texts showcasing what your program is capable of and what it is not capable of
a quick presentation of what your program does and how it is built

There is no "standard set" of sentences you should be able to understand. Write your own small and simple texts: there should be some variety.

We will now look at the actual steps you have to do. I'll present examples in Python/json syntax.

Parse the text into sentences and words

This is a very simple programming task. Example input:

"Barack Obama went to China yesterday. He lives in Grand Hyatt Beijing. This is a superb hotel."

Result:

[["Barack","Obama","went","to","China","yesterday"],["He","lives",...],...]
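A minimal sketch of this step, using only the Python standard library. The function name split_text and the splitting rules are assumptions made for illustration: a sentence is taken to end at ".", "!" or "?" followed by whitespace, so abbreviations like "Mr." would be split incorrectly.

```python
import re

def split_text(text):
    """Split text into sentences, and each sentence into a list of words."""
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sentence in sentences:
        # Keep runs of letters, digits, apostrophes and hyphens; drop punctuation.
        words = re.findall(r"[A-Za-z0-9'-]+", sentence)
        if words:
            result.append(words)
    return result

text = ("Barack Obama went to China yesterday. "
        "He lives in Grand Hyatt Beijing. This is a superb hotel.")
print(split_text(text))
```

A tokenizer from a toolkit such as NLTK would handle abbreviations and other edge cases better than this regular expression.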

Perform NER and word identification on the text and annotate text

This is an involved task. Example input:

[["Barack","Obama","went","to","China","yesterday"],["He","lives",...],...]

Example output (choose your own representation):

[[{"word": "Barack Obama", "ner": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"}, "went", "to", {"word": "China", "ner": "https://en.wikipedia.org/wiki/China", "type":"country"}, "yesterday"], ["He","lives",...],...]

The main point of the step is to identify recognized words and phrases with some ID of a known entity for which we can potentially find more information elsewhere. It is also very useful to add a type identifier like "person", "country", "organization" to a recognized phrase.

It is OK to make mistakes in identification and to assume the most popular or obvious choice: there are certainly many different Barack Obamas, just pick the obvious.

You do not have to use the Wikipedia urls: other suitable databases are also OK.

Leave unrecognized words as they are.

How to find the ID-s and types? Options from simple to fancy:

Build your own short dictionary like {"Barack Obama": {"url": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"},...} etc.
Find a way to use some large suitable database, like conceptnet or wikipedia (you may want to try dbpedia), either in a downloaded form or through an API (just trying out wikipedia urls is also an OK approach).
Use a NER tool like Stanford NER

You get more points when you take a more sophisticated approach and manage to recognize a large set of phrases. However, building your own short list is also OK, although it gives fewer points.

It is also OK to combine a NER tool with a dataset like conceptnet.
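The simplest option, your own short dictionary, could be sketched as follows. The dictionary contents, the names ENTITIES, MAX_PHRASE_LEN and annotate_entities, and the type "hotel" are assumptions for illustration. The lookup tries the longest phrase first, so that "Barack Obama" is matched as one entity rather than two words.

```python
# Hypothetical hand-made dictionary; keys are lower-cased phrases.
ENTITIES = {
    "barack obama": {"ner": "http://en.wikipedia.org/wiki/Barack_Obama",
                     "type": "person"},
    "china": {"ner": "https://en.wikipedia.org/wiki/China",
              "type": "country"},
    "grand hyatt beijing": {"ner": "https://en.wikipedia.org/wiki/Grand_Hyatt_Beijing",
                            "type": "hotel"},
}
MAX_PHRASE_LEN = 3  # longest phrase (in words) stored in ENTITIES

def annotate_entities(words):
    """Replace known phrases with annotated dicts; leave other words as they are."""
    result = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            entry = ENTITIES.get(phrase.lower())
            if entry is not None:
                result.append({"word": phrase, **entry})
                i += n
                break
        else:
            # No phrase starting here is known: keep the word unchanged.
            result.append(words[i])
            i += 1
    return result

print(annotate_entities(["Barack", "Obama", "went", "to", "China", "yesterday"]))
```

The same function shape works if the dictionary lookup is later replaced by a query to a larger database or by the output of a NER tool.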

Recognize and categorize generic words

Process the output of the NER recogniser and focus on common words like "went", "to", "yesterday", "he", etc. Annotate these words with the type of the word and - potentially - additional information.

How to recognize and annotate words:

Again, the simplest way is to create your own short dictionary like {"went": {"root": "go", "url": "http://conceptnet5.media.mit.edu/web/c/en/go", "time": "past"}, ...}
You can use conceptnet or wordnet to find words and their properties
You can use a specialized tool, for example, in the Stanford NLP toolkit
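The short-dictionary approach for generic words might look like the sketch below. The "pos" key, the dictionary entries and the name annotate_common_words are assumptions for illustration; tokens already annotated in the NER step are dicts and are passed through untouched.

```python
# Hypothetical hand-made dictionary of common words; keys are lower-cased.
COMMON_WORDS = {
    "went": {"root": "go",
             "url": "http://conceptnet5.media.mit.edu/web/c/en/go",
             "pos": "verb", "time": "past"},
    "to": {"root": "to", "pos": "preposition"},
    "yesterday": {"root": "yesterday",
                  "url": "http://conceptnet5.media.mit.edu/web/c/en/yesterday",
                  "pos": "adverb"},
    "he": {"root": "he", "pos": "pronoun"},
}

def annotate_common_words(tokens):
    """Annotate plain-string tokens found in COMMON_WORDS; keep the rest."""
    result = []
    for token in tokens:
        if isinstance(token, str) and token.lower() in COMMON_WORDS:
            result.append({"word": token, **COMMON_WORDS[token.lower()]})
        else:
            # Entity dicts from the NER step and unknown words stay as they are.
            result.append(token)
    return result

print(annotate_common_words(["He", "lives", "in", "Beijing"]))
```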

Replace ambiguous words/phrases with lists of potential candidates

Example: replace "He" or an annotated form of "He" with a list of potential candidates from the previous sentence: ["Barack Obama", "China"], replace "This" with ["He", "Grand Hyatt Beijing"].

Doing the final selection for the list is a task for the third lab.
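One possible way to build the candidate lists is sketched below. It assumes entities are dicts with a "ner" key (as in the NER example output) and pronouns are still plain strings; the PRONOUNS set and the function names are hypothetical. Candidates are taken from the previous sentence only.

```python
PRONOUNS = {"he", "she", "it", "this", "that", "they"}  # assumed short list

def is_pronoun(token):
    return isinstance(token, str) and token.lower() in PRONOUNS

def candidates_from(sentence):
    """Collect recognized entities and pronouns of one sentence, in order."""
    found = []
    for token in sentence:
        if isinstance(token, dict) and "ner" in token:
            found.append(token["word"])
        elif is_pronoun(token):
            found.append(token)
    return found

def replace_pronouns(sentences):
    """Replace each pronoun with the candidate list from the previous sentence."""
    result = []
    for index, sentence in enumerate(sentences):
        previous = sentences[index - 1] if index > 0 else []
        candidates = candidates_from(previous)
        new_sentence = []
        for token in sentence:
            if is_pronoun(token) and candidates:
                new_sentence.append(list(candidates))
            else:
                new_sentence.append(token)
        result.append(new_sentence)
    return result

obama = {"word": "Barack Obama", "ner": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"}
china = {"word": "China", "ner": "https://en.wikipedia.org/wiki/China", "type": "country"}
hotel = {"word": "Grand Hyatt Beijing", "ner": "https://en.wikipedia.org/wiki/Grand_Hyatt_Beijing", "type": "hotel"}
sentences = [
    [obama, "went", "to", china, "yesterday"],
    ["He", "lives", "in", hotel],
    ["This", "is", "a", "superb", "hotel"],
]
resolved = replace_pronouns(sentences)
print(resolved[1][0])
print(resolved[2][0])
```

Using the entity types (for example, allowing only "person" candidates for "He") would shorten the lists, but the final choice can be left to the RD part.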

Convert the annotated sentence to RDF and/or logic

The first sentence could be transformed to RDF like this:

"http://en.wikipedia.org/wiki/Barack_Obama", "http://conceptnet5.media.mit.edu/web/c/en/go", "https://en.wikipedia.org/wiki/China".

or better yet, to

"http://en.wikipedia.org/wiki/Barack_Obama", "myid:action", "myid:10".

"myid:10","myid:type","myid:action".

"myid:10","myid:activity","http://conceptnet5.media.mit.edu/web/c/en/go".

"myid:10","myid:time","http://conceptnet5.media.mit.edu/web/c/en/yesterday".

"myid:10","myid:location", "https://en.wikipedia.org/wiki/China".

In case you have a list of potential candidate meanings for the word, just output the list of suitable id-s in its place, like

["http://en.wikipedia.org/wiki/Barack_Obama", "https://en.wikipedia.org/wiki/China"]
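The event-style representation and the candidate lists above could be produced with a sketch like the following. The function names event_triples and format_triple are assumptions; the "myid:" identifiers follow the example in the text. Using json.dumps is just a convenient way to quote plain strings and to print candidate lists in the bracketed form.

```python
import json

BARACK = "http://en.wikipedia.org/wiki/Barack_Obama"
CHINA = "https://en.wikipedia.org/wiki/China"
GO = "http://conceptnet5.media.mit.edu/web/c/en/go"
YESTERDAY = "http://conceptnet5.media.mit.edu/web/c/en/yesterday"
LIVE = "http://conceptnet5.media.mit.edu/web/c/en/live"
HOTEL = "https://en.wikipedia.org/wiki/Grand_Hyatt_Beijing"

def event_triples(event_id, actor, activity, time=None, location=None):
    """Describe one action as a set of triples around a local event id."""
    triples = [
        (actor, "myid:action", event_id),
        (event_id, "myid:type", "myid:action"),
        (event_id, "myid:activity", activity),
    ]
    if time is not None:
        triples.append((event_id, "myid:time", time))
    if location is not None:
        triples.append((event_id, "myid:location", location))
    return triples

def format_triple(triple):
    # Strings are printed in double quotes, candidate lists as [...].
    return ", ".join(json.dumps(part) for part in triple) + "."

triples = event_triples("myid:10", BARACK, GO, time=YESTERDAY, location=CHINA)
for triple in triples:
    print(format_triple(triple))

# A triple whose subject is still ambiguous: a list of candidate id-s.
ambiguous = ([BARACK, CHINA], LIVE, HOTEL)
print(format_triple(ambiguous))
```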

What would the optional multi-arity representation look like? Like this:

"http://conceptnet5.media.mit.edu/web/c/en/go"("http://en.wikipedia.org/wiki/Barack_Obama","https://en.wikipedia.org/wiki/China","http://conceptnet5.media.mit.edu/web/c/en/yesterday")

where the predicate "http://conceptnet5.media.mit.edu/web/c/en/go" has actor, location and time arguments. Other predicates may have just one argument or, conversely, many.
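Printing such a multi-arity fact is a small formatting task. A minimal sketch, where the function name multiarity_fact is hypothetical and the argument order (actor, location, time) is the one used in the example above:

```python
def multiarity_fact(predicate, *arguments):
    """Format a predicate with any number of arguments as "pred"("a1","a2",...)."""
    quoted = ",".join('"%s"' % argument for argument in arguments)
    return '"%s"(%s)' % (predicate, quoted)

print(multiarity_fact(
    "http://conceptnet5.media.mit.edu/web/c/en/go",
    "http://en.wikipedia.org/wiki/Barack_Obama",   # actor
    "https://en.wikipedia.org/wiki/China",         # location
    "http://conceptnet5.media.mit.edu/web/c/en/yesterday",  # time
))
```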

What about sentences like "to China Barack Obama went yesterday"? It is a plus if you can parse and represent these properly as well! However, if you do not manage to do this and are only able to handle very simple sentences, you will also pass.

Useful links

Different popular toolkits for NLP:

Google SyntaxNet (see in github)
CoreNLP: the main Stanford NLP tool, part of a larger set of Stanford NLP toolkits such as Stanford NER
NLTK: the main Python toolkit, see also this tutorial and this NER tutorial
Pattern toolkit for Python
opennlp
PyNLP for Python
NER tutorial for Linux in the context of a larger practical tutorial

Web APIs:

Google cloud natural language API
opencalais (free registration required)

Important general ontologies:

Passing and grading

In case you manage to take a small variety of texts and give reasonable RDF-inspired output with some object id-s determined OK, you will pass.

The grade - i.e. the number of points - you get for the lab depends mostly on how wide a variety of objects and sentences you manage to handle. Hence:

using conceptnet/wordnet/wikipedia is a plus
using NER and POS tools is a plus
parsing nontrivial sentences like "Yesterday to China Barack Obama went" is a plus

It is not OK, however, to simply run an existing tool and present the output as is: you have to replace phrases like "Barack Obama", "China" etc with usable ID-s with extra information attached and you have to be able to output the result as triplets.

Example code

The following are extremely simplistic partial solutions in Python 3, using no tools (bad):

example NL extractor 1 does not create the rdf
example NL extractor 2 is extended to create the rdf with options for pronouns (lists of possible noun values), but only in trivial cases and losing some information.
example NL extractor 3 is extended to use adverbs and adjectives to create several triplets from one sentence. Single-element lists are also dropped, just using the single element inside.

Result of example code

The example NL extractor 3 above outputs these triplets


  http://en.wikipedia.org/wiki/Barack_Obama
  id:action
  id:local_1

  id:local_1
  id:isactivity
  http://conceptnet5.media.mit.edu/web/c/en/go

  id:local_1
  id:extrainfo
  http://conceptnet5.media.mit.edu/web/c/en/yesterday
 
  ['http://en.wikipedia.org/wiki/Barack_Obama', 'http://en.wikipedia.org/wiki/China']
  http://conceptnet5.media.mit.edu/web/c/en/live
  https://en.wikipedia.org/wiki/Grand_Hyatt_Beijing
 
  https://en.wikipedia.org/wiki/Grand_Hyatt_Beijing
  http://conceptnet5.media.mit.edu/web/c/en/type/v/identify_as_belonging_to_a_certain_type
  id:local_2
 
  id:local_2
  id:isobject
  http://conceptnet5.media.mit.edu/web/c/en/hotel
 
  id:local_2
  id:extrainfo
  http://conceptnet5.media.mit.edu/web/c/en/superb

where:

every triplet has the form object-property-value, except where we have lists like ['http://en.wikipedia.org/wiki/Barack_Obama', 'http://en.wikipedia.org/wiki/China']: the list means that one of the options is correct, but we do not yet know which one.

id-s containing "local", like id:local_2, are invented during parsing to identify objects without a known external id: these objects typically have several properties.