Knowledge representation homework 2017
Revision as of 5 October 2017, 20:32
KR homework as a part of a two-phase project
The overall project built during the second and third homework is a small and simple system for understanding natural language (English): more concretely, a question answering system.
- The knowledge representation (KR) homework is the first part of the project: you have to convert English sentences into RDF triplets or triplet-like structures, which may contain additional metainformation.
- The second part of the project is a homework for the reasoning and deduction (RD) block, where your system should actually be able to answer questions using new facts coming from the texts converted in the first part of the project.
The pairs formed for the KR homework will be used for the RD homework as well.
We assume the English texts depict information about geography: places, their relative positions, containment inside each other, various factual knowledge about places etc. There is no need to analyze complex sentences and it is OK to make mistakes.
Administrative
Homework defence deadline: ? . Presenting after this deadline earns half of the points. The absolute deadline is the end of November: no submissions will be accepted after ?.
Work should be submitted to git, at the latest one day before the deadline.
New groups and repositories are available: they will be the same for both this lab (2) and the next (3).
What you have to do
In the first, KR part you have to make a command line program capable of reading a plain text file in English and printing the information read and understood in a logic-based formalism: you have to be able to output the result in RDF. The exact syntax is of your own choosing.
The text should give new information about places like countries, cities, villages, houses, lakes, seas etc.
You do not have to understand complicated texts: start by inventing your own small example texts.
You do not have to determine the exact meaning of ambiguous words and phrases: instead you should output a list of possible meanings.
The second part of the project - RD homework - will focus on removing ambiguity and on answering questions about the information in the text.
Steps and examples
The final result of the KR lab is:
- a program which takes an English text file from the command line and prints the result of parsing as RDF-based representation of triples
- several example texts showcasing what your program is capable of and what it is not capable of
- a quick presentation of what your program does and how it is built
There is no "standard set" of sentences you should be able to understand. Write your own small and simple texts: there should be some variety.
We will now look at the actual steps you have to do. I'll present examples in Python/json syntax.
Parse the text into sentences and words
This is a very simple programming task. Example input is from a generic news domain, not really from a geography domain:
"Barack Obama went to China yesterday. He lives in Grand Hyatt Beijing. This is a superb hotel."
Result:
[["Barack","Obama","went","to","China","yesterday"],["He","lives",...],...]
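A minimal sketch of this step, using only the standard library. The function names `split_sentences` and `tokenize` are illustrative, not prescribed by the assignment, and the splitting rule is deliberately naive: it assumes that every ".", "!" or "?" followed by whitespace ends a sentence, so abbreviations such as "U.S. troops" would be split incorrectly.

```python
import re

def split_sentences(text):
    """Split text at '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(text):
    """Return a list of sentences, each a list of word strings."""
    # Keep runs of letters, digits, apostrophes and hyphens; punctuation is dropped.
    return [re.findall(r"[A-Za-z0-9'-]+", s) for s in split_sentences(text)]

text = ("Barack Obama went to China yesterday. "
        "He lives in Grand Hyatt Beijing. This is a superb hotel.")
print(tokenize(text))
```

A toolkit tokenizer (for example from NLTK or Spacy) would handle abbreviations and quotes better; the regular expressions above are only a starting point.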
Perform NER and word identification on the text and annotate text
This is an involved task. Example input:
[["Barack","Obama","went","to","China","yesterday"],["He","lives",...],...]
Example output (choose your own representation):
[[{"word": "Barack Obama", "ner": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"}, "went", "to", {"word": "China", "ner": "https://en.wikipedia.org/wiki/China", "type":"country"}, "yesterday"], ["He","lives",...],...]
The main point of the step is to identify recognized words and phrases with some ID of a known entity for which we can potentially find more information elsewhere. It is also very useful to add a type identifier like "person", "country", "organization" to a recognized phrase.
It is OK to make mistakes in identification and to assume the most popular or obvious choice: there are certainly many different Barack Obamas, just pick the obvious.
You do not have to use the Wikipedia urls: other suitable databases are also OK.
Leave unrecognized words as they are.
How to find the ID-s and types? Options from simple to fancy:
- Build your own short dictionary like {"Barack Obama": {"url": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"},...} etc.
- Find a way to use some large suitable database, like conceptnet or wikipedia (you may want to try dbpedia), either in a downloaded form or through an API (just trying out wikipedia urls is also an OK approach).
- Use a NER tool like Stanford NER
You get more points when you take a more sophisticated approach and manage to recognize a large set of phrases. However, building your own short list is also OK, although it gives fewer points.
It is also OK to combine a NER tool with a dataset like conceptnet.
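The simplest option, a hand-made dictionary, could be sketched as follows. The dictionary contents, the function name `annotate_entities` and the "hotel" type are assumptions made for illustration. The lookup tries the longest phrase first, so that "Grand Hyatt Beijing" is matched as one entity instead of stopping at a shorter match.

```python
# Hypothetical hand-made entity dictionary; a real solution would use a larger source.
ENTITIES = {
    "Barack Obama": {"ner": "http://en.wikipedia.org/wiki/Barack_Obama", "type": "person"},
    "China": {"ner": "https://en.wikipedia.org/wiki/China", "type": "country"},
    "Grand Hyatt Beijing": {"ner": "https://en.wikipedia.org/wiki/Grand_Hyatt_Beijing",
                            "type": "hotel"},  # the type name is an assumption
}
MAX_PHRASE_LEN = max(len(name.split()) for name in ENTITIES)

def annotate_entities(words):
    """Replace known phrases in a list of words with annotated dictionaries."""
    result = []
    i = 0
    while i < len(words):
        # Try the longest possible phrase starting at position i first.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in ENTITIES:
                result.append({"word": phrase, **ENTITIES[phrase]})
                i += n
                break
        else:
            # No phrase matched: leave the unrecognized word as it is.
            result.append(words[i])
            i += 1
    return result

print(annotate_entities(["Barack", "Obama", "went", "to", "China", "yesterday"]))
```

Swapping the dictionary lookup for a database query or a NER tool only changes the `phrase in ENTITIES` test; the surrounding loop can stay the same.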
Recognize and categorize generic words
Process the output of the NER step and focus on common words like "went", "to", "yesterday", "he", etc. Annotate these words with the type of the word and - potentially - additional information.
How to recognize and annotate words:
- Again, the simplest way to go is to create your own short dictionary like {"went": {"root": "go", "url": "http://conceptnet5.media.mit.edu/web/c/en/go", "time": "past"}, ..}
- You can use conceptnet or wordnet to find words and their properties
- You can use a specialized tool, for example, in the Stanford NLP toolkit
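The dictionary approach from the first option might look like the sketch below. The `WORDS` table, the "pos" key and the function name `annotate_words` are assumptions for illustration; the url prefix is the one used in the examples on this page. Entities already annotated by the NER step (dictionaries) are passed through unchanged.

```python
# Hypothetical short dictionary of common words; "pos" is an assumed extra key.
WORDS = {
    "went": {"root": "go", "pos": "verb", "time": "past"},
    "lives": {"root": "live", "pos": "verb", "time": "present"},
    "to": {"root": "to", "pos": "preposition"},
    "yesterday": {"root": "yesterday", "pos": "adverb"},
    "he": {"root": "he", "pos": "pronoun"},
}
CONCEPTNET_PREFIX = "http://conceptnet5.media.mit.edu/web/c/en/"

def annotate_words(sentence):
    """Annotate known common words; keep entities and unknown words as they are."""
    out = []
    for item in sentence:
        if isinstance(item, str) and item.lower() in WORDS:
            info = WORDS[item.lower()]
            out.append({"word": item, "url": CONCEPTNET_PREFIX + info["root"], **info})
        else:
            out.append(item)
    return out

print(annotate_words(["He", "went", "to", {"word": "China", "type": "country"}]))
```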
Replace ambiguous words/phrases with lists of potential candidates
Example: replace "He" or an annotated form of "He" with a list of potential candidates from the previous sentence: ["Barack Obama", "China"], replace "This" with ["He", "Grand Hyatt Beijing"].
Doing the final selection for the list is a task for the third lab.
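One possible sketch of this replacement, assuming the sentences have already been through the NER step so that recognized entities are dictionaries with a "word" key. The pronoun list and the function names are illustrative. Candidates are simply all entities and pronouns mentioned in the previous sentence, which reproduces the example above.

```python
PRONOUNS = {"he", "she", "it", "they", "this", "that"}  # assumed, incomplete list

def mentions(sentence):
    """Collect the entities and pronouns mentioned in an annotated sentence."""
    found = []
    for item in sentence:
        if isinstance(item, dict):
            found.append(item["word"])
        elif item.lower() in PRONOUNS:
            found.append(item)
    return found

def add_candidates(sentences):
    """Replace each pronoun with a dict listing candidates from the previous sentence."""
    result = []
    previous = []
    for sentence in sentences:
        new_sentence = []
        for item in sentence:
            if isinstance(item, str) and item.lower() in PRONOUNS and previous:
                new_sentence.append({"word": item, "candidates": list(previous)})
            else:
                new_sentence.append(item)
        result.append(new_sentence)
        # Candidates come from the original sentence, before replacement.
        previous = mentions(sentence)
    return result

sentences = [
    [{"word": "Barack Obama"}, "went", "to", {"word": "China"}, "yesterday"],
    ["He", "lives", "in", {"word": "Grand Hyatt Beijing"}],
    ["This", "is", "a", "superb", "hotel"],
]
for s in add_candidates(sentences):
    print(s)
```

A pronoun in the very first sentence has no previous sentence to draw from, so this sketch leaves it untouched.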
Convert the annotated sentence to RDF and/or logic
The first sentence could be transformed to RDF like this:
"http://en.wikipedia.org/wiki/Barack_Obama", "http://conceptnet5.media.mit.edu/web/c/en/go", "https://en.wikipedia.org/wiki/China".
or better yet, to
"http://en.wikipedia.org/wiki/Barack_Obama", "myid:action", "myid:10".
"myid:10","myid:type","myid:action".
"myid:10","myid:activity","http://conceptnet5.media.mit.edu/web/c/en/go".
"myid:10","myid:time","http://conceptnet5.media.mit.edu/web/c/en/yesterday".
"myid:10","myid:location", "https://en.wikipedia.org/wiki/China".
In case you have a list of potential candidate meanings for the word, just output the list of suitable id-s in its place, like
["http://en.wikipedia.org/wiki/Barack_Obama", "https://en.wikipedia.org/wiki/China"]
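The second, event-based form can be generated mechanically once the actor, activity, time and location of a sentence are known. The sketch below assumes those four values have already been extracted; `make_converter` and `format_triple` are hypothetical helper names, and the "myid:" identifiers follow the example above. A candidate list can be passed in place of any single id and is printed as a list.

```python
import json
from itertools import count

def make_converter(first_id=10):
    """Return a function that builds event triples with fresh myid:N identifiers."""
    ids = count(first_id)

    def convert(actor, activity, time=None, location=None):
        event = "myid:%d" % next(ids)
        triples = [
            (actor, "myid:action", event),
            (event, "myid:type", "myid:action"),
            (event, "myid:activity", activity),
        ]
        if time is not None:
            triples.append((event, "myid:time", time))
        if location is not None:
            triples.append((event, "myid:location", location))
        return triples

    return convert

def format_triple(triple):
    """Quote each part; candidate lists are rendered as JSON-style lists."""
    return ", ".join(json.dumps(part) for part in triple) + "."

convert = make_converter()
for t in convert("http://en.wikipedia.org/wiki/Barack_Obama",
                 "http://conceptnet5.media.mit.edu/web/c/en/go",
                 time="http://conceptnet5.media.mit.edu/web/c/en/yesterday",
                 location="https://en.wikipedia.org/wiki/China"):
    print(format_triple(t))
```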
What would the optional multi-arity representation look like? Like this:
"http://conceptnet5.media.mit.edu/web/c/en/go"("http://en.wikipedia.org/wiki/Barack_Obama","https://en.wikipedia.org/wiki/China","http://conceptnet5.media.mit.edu/web/c/en/yesterday")
where the predicate "http://conceptnet5.media.mit.edu/web/c/en/go" has actor, location and time arguments. Other predicates may have just one or, vice versa, a lot of arguments.
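If you choose the multi-arity form, you need to record somewhere which argument position means what. A small sketch, assuming a hand-written signature table: `ARG_ROLES`, `format_fact` and `named_args` are hypothetical names, and the role order (actor, location, time) is taken from the example above.

```python
import json

GO = "http://conceptnet5.media.mit.edu/web/c/en/go"

# Hypothetical signature table: the meaning of each argument position per predicate.
ARG_ROLES = {GO: ("actor", "location", "time")}

def format_fact(predicate, args):
    """Render a fact as "predicate"("arg1","arg2",...)."""
    return json.dumps(predicate) + "(" + ",".join(json.dumps(a) for a in args) + ")"

def named_args(predicate, args):
    """Pair each argument with its role name, checking the arity."""
    roles = ARG_ROLES[predicate]
    if len(roles) != len(args):
        raise ValueError("expected %d arguments, got %d" % (len(roles), len(args)))
    return dict(zip(roles, args))

args = ["http://en.wikipedia.org/wiki/Barack_Obama",
        "https://en.wikipedia.org/wiki/China",
        "http://conceptnet5.media.mit.edu/web/c/en/yesterday"]
print(format_fact(GO, args))
print(named_args(GO, args))
```

The named form makes it straightforward to convert a multi-arity fact into event-style triples later, one triple per role.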
What about sentences like "to China Barack Obama went yesterday"? It is a plus if you can parse and represent these properly as well! However, if you do not manage to do this and are only able to handle very simple sentences, you will also pass.
Useful links
Different popular toolkits for NLP:
- Spacy toolkit for Python
- Google SyntaxNet (see on GitHub)
- CoreNLP: the main Stanford NLP tool in the context of a larger set of Stanford NLP toolkits like stanford NER etc
- NLTK: the main Python toolkit, see also this tutorial and this NER tutorial
- Pattern toolkit for Python
- opennlp
- PyNLP for Python
- NER tutorial for Linux in the context of a larger practical tutorial
Web APIs:
- Google cloud natural language API
- opencalais (free registration required)
Important general ontologies:
Passing and grading
In case you manage to take a small variety of texts and give reasonable RDF-inspired output with some object id-s determined OK, you will pass.
The grade - i.e. the number of points - you get for the lab depends mostly on how wide a variety of objects and sentences you manage to handle. Hence:
- using conceptnet/wordnet/wikipedia is a plus
- using NER and POS tools is a plus
- parsing nontrivial sentences like "Yesterday to China Barack Obama went" is a plus
It is not OK, however, to simply run an existing tool and present the output as is: you have to replace phrases like "Barack Obama", "China" etc with usable ID-s with extra information attached and you have to be able to output the result as triplets.
Example code
The following is an extremely simplistic partial solution in Python 3, using no tools (bad):
- example NL extractor 1 does not create the rdf
- example NL extractor 2 is extended to create the rdf with options for pronouns (lists of possible noun values), but only in trivial cases and losing some information.
- example NL extractor 3 is extended to use adverbs and adjectives to create several triplets from one sentence. Single-element lists are also dropped, just using the single element inside.
Result of example code
The example NL extractor 3 above outputs these triplets:
http://en.wikipedia.org/wiki/Barack_Obama id:action id:local_1
id:local_1 id:isactivity http://conceptnet5.media.mit.edu/web/c/en/go
id:local_1 id:extrainfo http://conceptnet5.media.mit.edu/web/c/en/yesterday
['http://en.wikipedia.org/wiki/Barack_Obama', 'http://en.wikipedia.org/wiki/China'] http://conceptnet5.media.mit.edu/web/c/en/live https://en.wikipedia.org/wiki/Grand_Hyatt_Beijing
https://en.wikipedia.org/wiki/Grand_Hyatt_Beijing http://conceptnet5.media.mit.edu/web/c/en/type/v/identify_as_belonging_to_a_certain_type id:local_2
id:local_2 id:isobject http://conceptnet5.media.mit.edu/web/c/en/hotel
id:local_2 id:extrainfo http://conceptnet5.media.mit.edu/web/c/en/superb
where:
- every triplet has the form object-property-value, except where we have lists like ['http://en.wikipedia.org/wiki/Barack_Obama', 'http://en.wikipedia.org/wiki/China']: the list means that one of the options is correct, but we do not - yet - know which one.
- id-s containing "local" like id:local_2 are invented during parsing to identify objects without an external known id: these objects typically have several properties