### TDT 4310 - Intelligent Text Analysis Project
#### Sorting Japanese sentences by linguistic complexity
----
### Overview
1. Introduction and motivation
1. Background
1. Datasets
1. Methodology
1. Evaluation
1. Conclusion and further work
---
### Datasets
| JMDict | Tatoeba / Tanaka corpus | NHK Easy News | MeCab |
|--------|-------------------------|---------------|-------|
| Open source dictionary | Multilingual sentence pairs | Easy-to-read news articles | POS and morphological analyzer |
---
### Background
#### TF-IDF
Extract the most meaningful words from a document.
#### Sense disambiguation
Pinpoint which sense of the word is used, based on surrounding context and grammar.
----
### Japanese
#### Three writing systems
| hiragana | katakana | kanji |
|----------|----------|-------|
| ひらがな | カタカナ | 漢字 |
A single sentence freely mixes all three, plus Arabic numerals:

10ページの5行目をみなさい

Let's start from (the) fifth line on page 10
##### Multiple readings per kanji
形 - katachi, kata, gyou, kei
##### Furigana
振仮名 (furigana): small kana printed above or beside kanji to show their reading.
---
### Methodology
#### Data ingestion, preprocessing, and disambiguation
##### Tanaka Corpus
信用█為る(する){して}█と█彼(かれ)[01]█は|1█言う{言った}
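Each entry packs the dictionary headword together with an optional reading in parentheses, a sense index in brackets, and the inflected surface form in braces. Below is a minimal Python sketch of a parser for this format, assuming the `█` shown above marks the token separator and reading the field meanings off this one example; the real corpus conventions may differ in details.

```python
import re

# One indexed-word entry: headword, then optional (reading), [sense index],
# {surface form as it appears in the sentence}, and a trailing |n marker.
ENTRY = re.compile(
    r"(?P<headword>[^(\[{|]+)"
    r"(?:\((?P<reading>[^)]+)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]+)\})?"
    r"(?:\|(?P<marker>\d+))?"
)

def parse_b_line(line: str) -> list[dict]:
    """Split an indexed Tanaka-corpus line into per-word field dicts."""
    words = []
    for token in line.split("█"):  # separator as rendered on the slide
        match = ENTRY.fullmatch(token)
        if match:
            words.append(match.groupdict())
    return words

line = "信用█為る(する){して}█と█彼(かれ)[01]█は|1█言う{言った}"
for word in parse_b_line(line):
    print(word)
```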
##### NHK News Articles
Scrape -> Extract text -> MeCab + Furigana -> Try disambiguating with POS
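A minimal sketch of the MeCab step, using the mecab-python3 bindings and assuming an IPAdic-style dictionary whose feature string carries POS, base form, and a katakana reading (the field layout differs under UniDic). The POS tag can then be matched against the parts of speech listed on candidate JMDict entries.

```python
import MeCab  # mecab-python3

tagger = MeCab.Tagger()
tagger.parse("")  # works around a parseToNode surface-field bug in some builds

def katakana_to_hiragana(text: str) -> str:
    """Shift katakana codepoints onto hiragana, e.g. for furigana display."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch for ch in text
    )

def analyze(sentence: str):
    """Yield (surface, POS, base form, furigana) for each token."""
    node = tagger.parseToNode(sentence)
    while node:
        if node.surface:  # skip the BOS/EOS sentinel nodes
            fields = node.feature.split(",")
            pos = fields[0]                       # e.g. 動詞 (verb)
            base = fields[6] if len(fields) > 6 else node.surface
            reading = fields[7] if len(fields) > 7 else ""
            yield node.surface, pos, base, katakana_to_hiragana(reading)
        node = node.next

for token in analyze("10ページの5行目をみなさい"):
    print(token)
```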
Note:
Disambiguation here is not necessarily sense disambiguation, but rather disambiguating which dictionary entry a token maps to.
The English translation could be exploited to disambiguate all the way down to individual word senses.
----
#### TF-IDF?
$ \text{TF-IDF} = \frac{\text{count of term in doc}}{\text{total terms in doc}} \cdot \log \frac{\text{number of docs}}{1 + \text{number of docs containing term}} $
$ \text{TF-DF} = \frac{\text{avg}(\text{count of term in doc})}{\text{total terms in doc}} \cdot \frac{\text{number of docs containing term}}{\text{number of docs}} $
Note:
TF-IDF is usually used to find how meaningful a word is to one particular document. Here we want the opposite: a term should score higher when it is common across many documents.
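A toy sketch of both measures, under one reading of the formulas above (documents as token lists; averaging the document length in the TF-DF denominator is an assumption on my part):

```python
import math

# Placeholder documents; in the project these would be tokenized sentences.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs and cats".split(),
]

def tf_idf(term: str, doc: list[str]) -> float:
    """Classic TF-IDF: high for terms distinctive to one document."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)
    return tf * math.log(len(docs) / (1 + df))

def tf_df(term: str) -> float:
    """Inverted variant: high for terms common across many documents."""
    avg_count = sum(d.count(term) for d in docs) / len(docs)
    avg_len = sum(len(d) for d in docs) / len(docs)
    df = sum(term in d for d in docs)
    return (avg_count / avg_len) * (df / len(docs))

for term in ("the", "cat", "log"):
    print(term, round(tf_idf(term, docs[0]), 3), round(tf_df(term), 3))
```

With this reading, "the" gets a TF-IDF of zero in the first document yet the highest TF-DF, matching the inverted intent.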
----
#### Word difficulty
| Commonness | Dialects | Kanji | Katakana | NHK rating |
|------------|----------|-------|----------|------------|
| 25% | 10% | 25% | 15% | 25% |
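A sketch of the weighted sum, assuming each factor is first normalized to [0, 1]; the factor names are mine, the weights are the ones in the table above.

```python
# Weights from the table above; factor scores assumed normalized to [0, 1].
WORD_WEIGHTS = {
    "commonness": 0.25,  # rarer words score closer to 1
    "dialects":   0.10,
    "kanji":      0.25,
    "katakana":   0.15,
    "nhk_rating": 0.25,
}

def word_difficulty(factors: dict[str, float]) -> float:
    """Weighted sum of the per-word difficulty factors."""
    return sum(weight * factors.get(name, 0.0)
               for name, weight in WORD_WEIGHTS.items())

print(word_difficulty({"commonness": 0.8, "kanji": 0.6, "nhk_rating": 0.5}))
```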
----
#### Sentence difficulty
| Word difficulty sum | Hardest word | Sentence Length |
|------------|----------|-------|
| 50% | 20% | 30% |
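Correspondingly for sentences, a sketch under two assumptions of mine: the word-difficulty sum is averaged rather than taken raw (so it does not double-count length), and the length factor is capped at a hypothetical 40 tokens.

```python
SENTENCE_WEIGHTS = {"word_sum": 0.50, "hardest_word": 0.20, "length": 0.30}

def sentence_difficulty(word_scores: list[float], max_len: int = 40) -> float:
    """Combine per-word difficulty scores into a sentence score."""
    if not word_scores:
        return 0.0
    word_sum = sum(word_scores) / len(word_scores)  # averaged, see note above
    hardest = max(word_scores)
    length = min(len(word_scores) / max_len, 1.0)   # capped length factor
    return (SENTENCE_WEIGHTS["word_sum"] * word_sum
            + SENTENCE_WEIGHTS["hardest_word"] * hardest
            + SENTENCE_WEIGHTS["length"] * length)

print(round(sentence_difficulty([0.2, 0.7, 0.4]), 3))
```

Sentences can then be sorted ascending by this score to obtain the easy-to-hard ordering.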
---
### Conclusion and further work
Apart from some bugs, the system seems to work as intended.
The weighting factors should be more strongly grounded in linguistic research.
Alternatively, a dataset that makes it possible to evaluate the accuracy of the implementation would help.