Erinevus lehekülje "Knowledge representation homework 2015" redaktsioonide vahel
118. rida: | 118. rida: | ||
In case you manage to take a small variety of texts and give reasonable RDF-inspired output with some object id-s determined OK, you will pass. | In case you manage to take a small variety of texts and give reasonable RDF-inspired output with some object id-s determined OK, you will pass. | ||
− | The grade - ie the amount of points - you get for the lab depends mostly on how wide a variety of objects and sentences you manage to handle. Hence | + | The grade - ie the amount of points - you get for the lab depends mostly on how wide a variety of objects and sentences you manage to handle. Hence: * using conceptnet/wordnet/wikipedia is a plus |
+ | * using NER and POS tools is a plus | ||
+ | * parsing nontrivial sentences like "Yesterday to China Barack Obama went" is a plus | ||
+ | * Outputting a varied-arity logical formalization is also a plus. | ||
It is not OK, however, to simply run an existing tool and present the output as is: you have to replace phrases like "Barack Obama", "China" etc with usable ID-s with extra information attached and you have to be able to output the result as triplets. | It is not OK, however, to simply run an existing tool and present the output as is: you have to replace phrases like "Barack Obama", "China" etc with usable ID-s with extra information attached and you have to be able to output the result as triplets. |
Redaktsioon: 6. oktoober 2015, kell 14:41
KR homework as a part of a two-phase project
The knowledge representation (KR) homework is a first part of the project to create a small and simple system for understanding natural language (English). The second part of the project is a homework for the reasoning and deduction (RD) block. The pairs formed for the KR) homework will be used for the RD howework as well.
We assume the English texts depict fictional or non-fictional information what you would expect to get in generic worldwide news articles. There is no need to analyze complex sentences and it is OK to make mistakes.
What you have to do
In the first, KR part you have to make a command line program capable of reading a plain text file in English and printing the information read and understood in a logic-based formalism: you have to be able to output the result in RDF and uou may optionally - additionally - output the result in a varied-arity first order representation. The exact syntax is of your own choosing.
You do not have to determine the exact meaning of ambigous words and phrases: instead you should output a list of possible meanings.
The second part of the project - RD homework - will focus on removing ambiguity and on answering questions about the information in the text.
Steps and examples
The final result of the KR lab is:
- a program which takes an English text file from the command line and prints the result of parsing in two different representations
- RDF-based representation of triples
- Optionally and additionally a logic-based representation with multiple-arities predicates
- several example texts showcasing what your program is capable of and what it is not capable of
- doing an actual quick presentation of what your program does and how it is built
We will now look at the actual steps you have to do. I'll present examples in Python/json syntax.
Parse the text into sentences and words
This is a very simple programming task. Example input:
"Barack Obama went to China yesterday. He lives in Grand Hyatt Beijing. This is a superb hotel."
Result:
[["Barack","Obama","went","to","China","yesterday"],["He","lives",...],...]
Perform NER and word identification on the text and annotate text
This is an involved task. Example input:
[["Barack","Obama","went","to","China","yesterday"],
Example output (choose your own representation):
[[{"word": "Barack Obama", "ner": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"}, "went", "to", {"word": "China", "ner": "https://en.wikipedia.org/wiki/China", "type":"country"}], ["He","lives",...],...]
The main point of the step is to identify recognized words and phrases with some ID of a known entity for which we can potentially find more information elsewhere. It is also very useful to add a type identifier like "person", "country", "organization" to a recognized phrase.
It is OK to make mistakes in identification and to assume the most popular or obvious choice: there are certainly many different Barack Obamas, just pick the obvious.
You do not have to use the Wikipedia urls: other suitable databases are also OK.
Leave unrecognized words as they are.
How to find the ID-s and types? Options from simple to fancy:
- Build your own short dictionary like {"Barack Obama": {"url": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"},...} etc.
- Find a way to use some large suitable database, like conceptnet or wikipedia (you may want to try dbpedia) or either in a downloaded form or through an API (just trying out wikipedia urls is also an OK approach).
- Use a NER tool like Stanford NER
You get more points when you take a more sophisticated approach and manage to recognize a large set of phrases. However, building your own short list is also OK, although it gives fewer points.
It is also OK to combine a NER tool with a dataset like conceptnet.
Recognize and categorize generic words
Process the output of the NER recogniser and focus on common words like "went", "to", "yesterday", "he", etc. Annotate these words with the type of the word and - potentially - additional information.
How to recognize and annotate words:
- Again, the simplest way to go is to create your own short dictionary like {"went": {"root": "go", "url": "http://conceptnet5.media.mit.edu/web/c/en/go", "time": "past"}, ..}
- You can use conceptnet or wordnet to find words and their properties
- You can use a specialized tool, for example, in the Stanford NLP toolkit
Replace ambigous words/phrases with lists of potential candidates
Example: replace "He" or a annotated form of "He" with a list of potential candidates from the previous sentence: ["Barack Obama", "China"], replace "This" with ["He", "Grand Hyatt Beijing"].
Doing the final selection for the list is a task for the third lab.
Convert the annotated sentence to RDF and/or logic
The first sentence could be transformed to RDF like this:
"http://en.wikipedia.org/wiki/Barack_Obama", "http://conceptnet5.media.mit.edu/web/c/en/go", "https://en.wikipedia.org/wiki/China".
or better yet, to
"http://en.wikipedia.org/wiki/Barack_Obama", "myid:action", "myid:10".
"myid:10","myid:type","myid:action".
"myid:10","myid:activity","http://conceptnet5.media.mit.edu/web/c/en/go".
"myid:10","myid:time","http://conceptnet5.media.mit.edu/web/c/en/yesterday".
"myid:10","myid:location", "https://en.wikipedia.org/wiki/China".
In case you have a list of potential candidate meanings for the word, just output the list of suitable id-s in its place, like
["http://en.wikipedia.org/wiki/Barack_Obama", "https://en.wikipedia.org/wiki/China"]
What would be the optional multiarity representation? Like this:
"http://conceptnet5.media.mit.edu/web/c/en/go"(""http://en.wikipedia.org/wiki/Barack_Obama","https://en.wikipedia.org/wiki/China","http://conceptnet5.media.mit.edu/web/c/en/yesterday")
where the predicate "http://conceptnet5.media.mit.edu/web/c/en/go" has an actor, location and time arguments. Other predicates may have just one or, vice versa, a lot of arguments.
What about sentences like "to China Barack Obama went yesterday"? It is a plus if you can parse and represent these properly as well! However, if you do not manage to do this and are only able to handle very simple sentences, you will also pass.
Passing and grading
In case you manage to take a small variety of texts and give reasonable RDF-inspired output with some object id-s determined OK, you will pass.
The grade - ie the amount of points - you get for the lab depends mostly on how wide a variety of objects and sentences you manage to handle. Hence: * using conceptnet/wordnet/wikipedia is a plus
- using NER and POS tools is a plus
- parsing nontrivial sentences like "Yesterday to China Barack Obama went" is a plus
- Outputting a varied-arity logical formalization is also a plus.
It is not OK, however, to simply run an existing tool and present the output as is: you have to replace phrases like "Barack Obama", "China" etc with usable ID-s with extra information attached and you have to be able to output the result as triplets.