Knowledge representation homework 2015

KR homework as a part of a two-phase project

The knowledge representation (KR) homework is the first part of a project to create a small and simple system for understanding natural language (English). The second part of the project is the homework for the reasoning and deduction (RD) block. The pairs formed for the KR homework will be used for the RD homework as well.

We assume the English texts contain fictional or non-fictional information of the kind you would expect to find in generic worldwide news articles. There is no need to analyze complex sentences and it is OK to make mistakes.

Administrative

Homework defence deadline: 3. November. Presentations after this deadline will get half of the points. The absolute deadline is the end of November: no submissions will be accepted in December.

Work should be submitted to git at the latest one day before the deadline.

New groups and repositories are available: they will be the same for both this lab (2) and the next (3).

What you have to do

In the first, KR part you have to make a command line program capable of reading a plain text file in English and printing the information read and understood in a logic-based formalism: you have to be able to output the result in RDF and you may optionally - additionally - output the result in a varied-arity first order representation. The exact syntax is of your own choosing.
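A minimal sketch of such a command-line skeleton is given here. The script name kr.py, the function names and the placeholder analyze step are assumptions made for illustration only: the real analysis is what the rest of this homework is about.

```python
import sys

def read_text(path):
    # Read the whole input file as one string.
    with open(path, encoding="utf-8") as f:
        return f.read()

def analyze(text):
    # Placeholder (hypothetical): a real solution would parse the text here.
    # For now every non-empty line simply becomes one dummy triple.
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    return [("myid:line%d" % n, "myid:text", line) for n, line in enumerate(lines)]

def main(argv):
    if len(argv) != 2:
        print("usage: python kr.py TEXTFILE")
        return 1
    try:
        text = read_text(argv[1])
    except OSError as err:
        print("cannot read %s: %s" % (argv[1], err))
        return 1
    for s, p, o in analyze(text):
        print('"%s", "%s", "%s".' % (s, p, o))
    return 0

if __name__ == "__main__":
    main(sys.argv)
```

Keeping the file handling in main and the analysis in its own function makes it easy to add the optional second output format later.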

You do not have to determine the exact meaning of ambiguous words and phrases: instead you should output a list of possible meanings.

The second part of the project - RD homework - will focus on removing ambiguity and on answering questions about the information in the text.

Steps and examples

The final result of the KR lab is:

a program which takes an English text file from the command line and prints the result of parsing in two different representations:
- RDF-based representation of triples
- Optionally and additionally, a logic-based representation with multiple-arity predicates
several example texts showcasing what your program is capable of and what it is not capable of
a quick presentation of what your program does and how it is built

There is no "standard set" of sentences you should be able to understand. Write your own small and simple texts: there should be some variety.

We will now look at the actual steps you have to do. I'll present examples in Python/json syntax.

Parse the text into sentences and words

This is a very simple programming task. Example input:

"Barack Obama went to China yesterday. He lives in Grand Hyatt Beijing. This is a superb hotel."

Result:

[["Barack","Obama","went","to","China","yesterday"],["He","lives",...],...]
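One possible way to do the splitting, sketched with regular expressions from the standard library. The function names are my own choice, and the approach is deliberately naive: it will split wrongly on abbreviations like "Mr. Smith" and it only keeps ASCII letters and digits.

```python
import re

def split_sentences(text):
    # Split at whitespace that follows ".", "!" or "?".
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    # Keep runs of letters/digits (with inner apostrophes or hyphens), drop punctuation.
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", sentence)

def parse_text(text):
    return [split_words(s) for s in split_sentences(text)]

print(parse_text("Barack Obama went to China yesterday. He lives in Grand Hyatt Beijing. "
                 "This is a superb hotel."))
```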

Perform NER and word identification on the text and annotate text

This is an involved task. Example input:

[["Barack","Obama","went","to","China","yesterday"],["He","lives",...],...]

Example output (choose your own representation):

[[{"word": "Barack Obama", "ner": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"}, "went", "to", {"word": "China", "ner": "https://en.wikipedia.org/wiki/China", "type":"country"}, "yesterday"], ["He","lives",...],...]

The main point of the step is to identify recognized words and phrases with some ID of a known entity for which we can potentially find more information elsewhere. It is also very useful to add a type identifier like "person", "country", "organization" to a recognized phrase.

It is OK to make mistakes in identification and to assume the most popular or obvious choice: there are certainly many different Barack Obamas, just pick the obvious.

You do not have to use the Wikipedia urls: other suitable databases are also OK.

Leave unrecognized words as they are.

How to find the ID-s and types? Options from simple to fancy:

Build your own short dictionary like {"Barack Obama": {"url": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"},...} etc.
Find a way to use some large suitable database, like conceptnet or wikipedia (you may want to try dbpedia), either in a downloaded form or through an API (just trying out wikipedia urls is also an OK approach).
Use a NER tool like Stanford NER

You get more points when you take a more sophisticated approach and manage to recognize a large set of phrases. However, building your own short list is also OK, although it gives fewer points.

It is also OK to combine a NER tool with a dataset like conceptnet.
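The simplest option, your own short dictionary, could be sketched roughly like this. The table contents and the names ENTITIES and annotate_entities are assumptions for illustration. The lookup tries the longest phrase first, so "Barack Obama" wins over a possible shorter entry like "Barack".

```python
# Hypothetical hand-made entity table: keys are tuples of words.
ENTITIES = {
    ("Barack", "Obama"): {"ner": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"},
    ("China",): {"ner": "https://en.wikipedia.org/wiki/China", "type": "country"},
}
MAX_LEN = max(len(key) for key in ENTITIES)

def annotate_entities(words):
    out = []
    i = 0
    while i < len(words):
        # Try the longest possible phrase starting at position i first.
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in ENTITIES:
                entry = {"word": " ".join(key)}
                entry.update(ENTITIES[key])
                out.append(entry)
                i += n
                break
        else:
            # No phrase matched: leave the unrecognized word as it is.
            out.append(words[i])
            i += 1
    return out

print(annotate_entities(["Barack", "Obama", "went", "to", "China", "yesterday"]))
```

A database or a NER tool would replace the ENTITIES lookup, while the output format could stay the same.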

Recognize and categorize generic words

Process the output of the NER recogniser and focus on common words like "went", "to", "yesterday", "he", etc. Annotate these words with the type of the word and - potentially - additional information.

How to recognize and annotate words:

Again, the simplest way to go is to create your own short dictionary like {"went": {"root": "go", "url": "http://conceptnet5.media.mit.edu/web/c/en/go", "time": "past"}, ..}
You can use conceptnet or wordnet to find words and their properties
You can use a specialized tool, for example, in the Stanford NLP toolkit
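The own-dictionary variant could look roughly like the sketch below. The table contents, the field names and the function name annotate_words are my assumptions. The conceptnet url prefix is copied from the examples on this page and is not checked against the live service.

```python
CONCEPTNET = "http://conceptnet5.media.mit.edu/web/c/en/"

# Hypothetical hand-made table of common words, keyed by lowercase form.
WORDS = {
    "went": {"root": "go", "type": "verb", "time": "past"},
    "to": {"root": "to", "type": "preposition"},
    "yesterday": {"root": "yesterday", "type": "adverb"},
    "he": {"root": "he", "type": "pronoun"},
}

def annotate_words(sentence):
    out = []
    for item in sentence:
        # Entities found by the NER step are not plain strings: skip them.
        info = WORDS.get(item.lower()) if isinstance(item, str) else None
        if info is None:
            out.append(item)
            continue
        entry = {"word": item, "url": CONCEPTNET + info["root"]}
        entry.update(info)
        out.append(entry)
    return out

print(annotate_words([{"word": "Barack Obama"}, "went", "to", {"word": "China"}, "yesterday"]))
```

Looking words up in lowercase means a sentence-initial "He" is found as well.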

Replace ambiguous words/phrases with lists of potential candidates

Example: replace "He" or an annotated form of "He" with a list of potential candidates from the previous sentence: ["Barack Obama", "China"], replace "This" with ["He", "Grand Hyatt Beijing"].

Doing the final selection for the list is a task for the third lab.
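A rough sketch of this replacement step, assuming recognized entities are dictionaries with a "word" field and everything else is a plain string. The pronoun list and the function names are my own assumptions. Candidates are simply all entities and pronouns of the previous sentence, with no attempt to rank them.

```python
PRONOUNS = {"he", "she", "it", "this", "that", "they"}

def is_pronoun(item):
    return isinstance(item, str) and item.lower() in PRONOUNS

def candidates_of(sentence):
    # Names of the entities and pronouns of a sentence, in order of appearance.
    names = []
    for item in sentence:
        if isinstance(item, dict):
            names.append(item["word"])
        elif is_pronoun(item):
            names.append(item)
    return names

def add_candidates(sentences):
    result = []
    previous = []
    for sentence in sentences:
        cands = candidates_of(previous)
        new = []
        for item in sentence:
            if is_pronoun(item) and cands:
                new.append({"word": item, "candidates": cands})
            else:
                new.append(item)
        result.append(new)
        previous = sentence  # the original sentence, so "He" stays a candidate
    return result

text = [
    [{"word": "Barack Obama"}, "went", "to", {"word": "China"}, "yesterday"],
    ["He", "lives", "in", {"word": "Grand Hyatt Beijing"}],
    ["This", "is", "a", "superb", "hotel"],
]
for sentence in add_candidates(text):
    print(sentence)
```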

Convert the annotated sentence to RDF and/or logic

The first sentence could be transformed to RDF like this:

"http://en.wikipedia.org/wiki/Barack_Obama", "http://conceptnet5.media.mit.edu/web/c/en/go", "https://en.wikipedia.org/wiki/China".

or better yet, to

"http://en.wikipedia.org/wiki/Barack_Obama", "myid:action", "myid:10".

"myid:10","myid:type","myid:action".

"myid:10","myid:activity","http://conceptnet5.media.mit.edu/web/c/en/go".

"myid:10","myid:time","http://conceptnet5.media.mit.edu/web/c/en/yesterday".

"myid:10","myid:location", "https://en.wikipedia.org/wiki/China".

In case you have a list of potential candidate meanings for the word, just output the list of suitable id-s in its place, like

["http://en.wikipedia.org/wiki/Barack_Obama", "https://en.wikipedia.org/wiki/China"]
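Generating such triples could be sketched as follows. The function names and the id "myid:11" are assumptions, and so is the conceptnet url for "live", which just follows the pattern of the urls above. Each action gets its own id, and a term may be either a single id or a list of candidate ids.

```python
import json

def action_triples(action_id, actor, activity, time=None, location=None):
    # Describe one action as triples hanging off its own id, e.g. "myid:10".
    triples = [
        (actor, "myid:action", action_id),
        (action_id, "myid:type", "myid:action"),
        (action_id, "myid:activity", activity),
    ]
    if time is not None:
        triples.append((action_id, "myid:time", time))
    if location is not None:
        triples.append((action_id, "myid:location", location))
    return triples

def format_triple(triple):
    # json.dumps quotes a single id and prints a candidate list as [...].
    return ", ".join(json.dumps(term) for term in triple) + "."

OBAMA = "http://en.wikipedia.org/wiki/Barack_Obama"
CHINA = "https://en.wikipedia.org/wiki/China"
CN = "http://conceptnet5.media.mit.edu/web/c/en/"

# "Barack Obama went to China yesterday."
for t in action_triples("myid:10", OBAMA, CN + "go", time=CN + "yesterday", location=CHINA):
    print(format_triple(t))

# "He lives ...": the actor is still ambiguous, so a candidate list takes its place.
for t in action_triples("myid:11", [OBAMA, CHINA], CN + "live"):
    print(format_triple(t))
```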

What would the optional multi-arity representation look like? Like this:

"http://conceptnet5.media.mit.edu/web/c/en/go"("http://en.wikipedia.org/wiki/Barack_Obama","https://en.wikipedia.org/wiki/China","http://conceptnet5.media.mit.edu/web/c/en/yesterday")

where the predicate "http://conceptnet5.media.mit.edu/web/c/en/go" has actor, location and time arguments. Other predicates may have just one argument or, conversely, many.
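Printing this form takes only a few lines. In the sketch below the function name format_predicate is my own, and the order of the arguments (actor, location, time) is simply the one used in the example above.

```python
def format_predicate(predicate, args):
    # Quote the predicate and every argument, whatever the number of arguments.
    quoted = ",".join('"%s"' % a for a in args)
    return '"%s"(%s)' % (predicate, quoted)

CN = "http://conceptnet5.media.mit.edu/web/c/en/"
print(format_predicate(CN + "go", [
    "http://en.wikipedia.org/wiki/Barack_Obama",  # actor
    "https://en.wikipedia.org/wiki/China",        # location
    CN + "yesterday",                             # time
]))
```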

What about sentences like "to China Barack Obama went yesterday"? It is a plus if you can parse and represent these properly as well! However, if you do not manage to do this and are only able to handle very simple sentences, you will also pass.

Useful links

Different popular toolkits for NLP:

CoreNLP: the main Stanford NLP tool in the context of a larger set of Stanford NLP toolkits like stanford NER etc
NLTK: the main Python toolkit, see also this tutorial and this NER tutorial
Pattern toolkit for Python
opennlp
PyNLP for Python
NER tutorial for Linux in the context of a larger practical tutorial

Web APIs:

opencalais (free registration required)

Important general ontologies:

Passing and grading

In case you manage to take a small variety of texts and give reasonable RDF-inspired output with some object id-s determined OK, you will pass.

The grade, i.e. the number of points you get for the lab, depends mostly on how wide a variety of objects and sentences you manage to handle. Hence:

using conceptnet/wordnet/wikipedia is a plus
using NER and POS tools is a plus
parsing nontrivial sentences like "Yesterday to China Barack Obama went" is a plus
Outputting a varied-arity logical formalization is also a plus.

It is not OK, however, to simply run an existing tool and present the output as is: you have to replace phrases like "Barack Obama", "China" etc with usable ID-s with extra information attached and you have to be able to output the result as triples.

Example code version 1

The following is an extremely simplistic partial solution in Python, using no tools (bad) and not creating the final rdf (yet ...).


intxt="""Barack Obama went to China yesterday. 
He lives in Grand Hyatt Beijing. This is a superb hotel.""" 

nertable=[
  [["Barack","Obama"],"Barack Obama","ner_noun","http://en.wikipedia.org/wiki/Barack_Obama","person"],
  [["China"],"China","ner_noun","http://en.wikipedia.org/wiki/China","country"],
  [["Grand","Hyatt","Beijing"],"Grand Hyatt Beijing","ner_noun","https://en.wikipedia.org/wiki/Grand_Hyatt_Beijing","company"]
]  

postable=[
  [["went"],"go","verb","http://conceptnet5.media.mit.edu/data/5.3/c/en/go","past"],
  [["to"],"to","preposition","http://conceptnet5.media.mit.edu/data/5.3/c/en/to",None],
  [["yesterday"],"yesterday","adverb","http://conceptnet5.media.mit.edu/data/5.3/c/en/yesterday",None],
  [["this"],"this","pronoun","http://conceptnet5.media.mit.edu/data/5.3/c/en/this",None]
]  

# [barack,action1,china]   "to china", "went ... yesterday"
# [action1,activity,moveto]
# [action1,time,past]

# [he,action2, grandhyattbeijing]
# [action2,activity,live_in]
# [action2,time,current]

# TODO:
#sentencetable=[
#  [["noun","verb","noun"],[[0,1,2]]]
  
def main(txt):
  splitted=split_text(txt)
  print("splitted:")
  print(splitted)
  nerred=ner_text(splitted)
  print("nerred:")
  print(nerred)
  posed=pos_text(nerred)
  print("posed:")
  print(posed)
  pretty_print(posed)
  
def ner_text(slst):
  rlst=[]
  for sent in slst:
    srlst=[]
    i=0
    while i<len(sent):
      tmp=sent_has_name_at(sent,i)
      if tmp:
        srlst.append(tmp[0])
        i=tmp[1]
      else:
        srlst.append(sent[i])
      i+=1  
    rlst.append(srlst)
  return rlst

def sent_has_name_at(sent,i):
  if not sent: return 0
  if i>=len(sent): return 0
  for known in nertable:
    phrase=known[0]
    j=0
    while j<len(phrase):
      if i+j>=len(sent): break
      if sent[i+j]!=phrase[j]:
        break
      j+=1
    if j==len(phrase):
      res=[known,i+len(phrase)-1]
      return res



def pos_text(slst):
  rlst=[]
  for sent in slst:
    srlst=[]
    i=0
    while i<len(sent):
      if isinstance(sent[i],list): # already recognized by the NER step
        srlst.append(sent[i])
        i+=1
        continue
      tmp=sent_has_pos_at(sent,i)
      if tmp:
        srlst.append(tmp[0])
        i=tmp[1]
      else:
        srlst.append(sent[i])
      i+=1  
    rlst.append(srlst)
  return rlst

def sent_has_pos_at(sent,i):
  if not sent: return 0
  if i>=len(sent): return 0
  for known in postable:
    phrase=known[0]
    j=0
    while j<len(phrase):
      if i+j>=len(sent): break
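      # compare in lowercase so that a sentence-initial "This" matches "this"
      if not isinstance(sent[i+j],str) or sent[i+j].lower()!=phrase[j]: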
        break
      j+=1
    if j==len(phrase):
      res=[known,i+len(phrase)-1]
      return res

def split_text(txt):
  sentlst=txt.replace(","," ").split(".")
  wlst=[]
  for s in sentlst:
    if not s.strip(): continue # skip empty or whitespace-only fragments
    sp=s.replace("."," ").replace("\n"," ").split(" ")
    tmp=[]
    for w in sp:
      w1=w.strip()
      if w1: tmp.append(w1)      
    wlst.append(tmp)
  return wlst

def pretty_print(sentlst):
  for sent in sentlst:
    print("sentence: ")
    for phrase in sent:
      print("  "+str(phrase)) 

main(intxt)