<link rel="stylesheet" href="./static/main.css"/>

### TDT 4310 - Intelligent Text Analysis Project

#### Sorting Japanese sentences by linguistic complexity

----

### Overview

1. Introduction and motivation
1. Background
1. Datasets
1. Methodology
1. Evaluation
1. Conclusion and further work

---

<img src="./static/graphics/jst1.png" width="18%"/>
<img src="./static/graphics/jst2.png" width="18%"/>
<img src="./static/graphics/jst3.png" width="18%"/>

<br/>
<br/>

<footer>Motivation</footer>

---

<div style="font-size: 0.8em">

| JMdict | Tatoeba / Tanaka corpus | NHK Easy News | MeCab |
|--------|-------------------------|---------------|-------|
| Open-source dictionary | Multilingual sentence pairs | Easy-to-read news articles | POS and morphological analyzer |
| <img src="./static/graphics/jmdict.png" width="100%"/> | <img src="./static/graphics/tatoeba.png" width="100%"/> | <img src="./static/graphics/nhk.png" width="100%"/> | |

</div>

<br/>
<br/>

<footer>Datasets</footer>

---

#### TF-IDF

Extract the most meaningful words of a document

<br/>

#### Sense disambiguation

Pinpoint which sense of the word is used, based on surrounding context and grammar.

<footer>Background</footer>

----

### Japanese

<div class="grid">
<div class="col-9">

#### Three writing systems

| <span style="color: red;">hiragana</span> | <span style="color: green;">katakana</span> | <span style="color: blue;">kanji</span> |
|----------|----------|-------|
| <img src="./static/graphics/hiragana.png"/> | <img src="./static/graphics/katakana.png"/> | <img src="./static/graphics/kanji2.png"/> |

</div>
<div class="col-3">
<div class="row-2">

<p>
10
<span style="color: green;">ページ</span>
<span style="color: red;">の</span>
5
<span style="color: blue;">行目</span>
<span style="color: red;">をみなさい</span>
</p>

<p style="font-size: 0.8em;">
<span style="color: red;">Let's start from</span>
(the)
fifth
<span style="color: blue;">line</span>
<span style="color: red;">on</span>
<span style="color: green;">page</span>
10
</p>

##### Multiple readings per kanji

形 - katachi, kata, gyou, kei

</div>
<div class="row-1">

<br/>

##### Furigana

<ruby>
振 <rp>(</rp><rt>furi</rt><rp>)</rp>
仮 <rp>(</rp><rt>ga</rt><rp>)</rp>
名 <rp>(</rp><rt>na</rt><rp>)</rp>
</ruby>

</div>
</div>
</div>
<footer>Background</footer>

---

#### Data ingestion, preprocessing and disambiguation

<br/>

##### Tanaka Corpus

<p>
信用█為る(する){して}█と█彼(かれ)[01]█は|1█言う{言った}
</p>

<br/>

##### NHK News Articles

Scrape -> Extract text -> MeCab + Furigana -> Try disambiguating with POS

<footer>Methodology</footer>

Note:

Disambiguation here is not necessarily sense disambiguation, but rather disambiguation of the dictionary entry.
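
To make the idea of picking a dictionary entry concrete, the Tanaka Corpus index line on this slide can be split into per-word fields. The following is a minimal sketch written for illustration only, not the project's code: `parse_index_token`, `parse_index_line` and the field names are hypothetical, and reading `(…)` as the reading, `[01]` as a sense number, `{…}` as the form that appears in the sentence, and `|1` as an entry selector is an assumption based on the example line. The `█` characters are assumed to stand in for the spaces between tokens.

```python
import re
from typing import List, NamedTuple, Optional

# One token looks like: headword, then optional |N, (reading), [sense], {surface}, ~
# (field order and meaning are assumptions based on the example line on the slide).
TOKEN_RE = re.compile(
    r"^(?P<headword>[^(\[{|~]+)"
    r"(?:\|(?P<entry>\d+))?"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]*)\})?"
    r"(?P<checked>~)?$"
)


class IndexToken(NamedTuple):
    headword: str            # dictionary form used to look up the entry
    entry: Optional[int]     # assumed entry selector (the digit after "|")
    reading: Optional[str]   # kana reading, if given
    sense: Optional[int]     # assumed sense number, if given
    surface: str             # form as it appears in the sentence


def parse_index_token(token: str) -> IndexToken:
    """Split a single index token into its fields."""
    match = TOKEN_RE.match(token)
    if match is None:
        raise ValueError(f"unrecognised token: {token!r}")
    entry, sense = match["entry"], match["sense"]
    return IndexToken(
        headword=match["headword"],
        entry=int(entry) if entry else None,
        reading=match["reading"],
        sense=int(sense) if sense else None,
        # Without a {surface} field the headword itself appears in the sentence.
        surface=match["surface"] or match["headword"],
    )


def parse_index_line(line: str, separator: str = "█") -> List[IndexToken]:
    """Parse a whole index line; empty tokens are skipped."""
    return [parse_index_token(tok) for tok in line.split(separator) if tok]


tokens = parse_index_line("信用█為る(する){して}█と█彼(かれ)[01]█は|1█言う{言った}")
# Joining the surface forms rebuilds the sentence text.
sentence = "".join(t.surface for t in tokens)
```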

The English translation could be exploited to disambiguate all the way down to the word senses.

----

#### TF-IDF?

<br/>

<div>

$ \text{TF-IDF} = \frac{\text{Occurrences of term in doc}}{\text{Number of terms in doc}} \cdot \log \frac{\text{Number of docs}}{1 + \text{Number of docs containing term}} $

</div>
<br/>
<div class="fragment" data-fragment-index="0">

$ \text{TF-DF} = \frac{\text{AVG}(\text{Occurrences of term in doc})}{\text{Number of terms in doc}} \cdot \frac{\text{Number of docs containing term}}{\text{Number of docs}} $

</div>

<footer>Methodology</footer>

Note:
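The two scores on this slide can be compared on a toy corpus. The following is a minimal sketch, not the project's implementation: the function names and the three example documents are hypothetical, documents are assumed to be non-empty lists of already segmented words, and the `AVG` in TF-DF is interpreted here as the mean relative frequency of the term over all documents, which is one possible reading of the formula.

```python
import math
from collections import Counter
from typing import Sequence

Doc = Sequence[str]  # a document as a list of already segmented words


def term_frequency(term: str, doc: Doc) -> float:
    """Occurrences of the term divided by the number of terms in the document."""
    return Counter(doc)[term] / len(doc)


def document_frequency(term: str, docs: Sequence[Doc]) -> int:
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)


def tf_idf(term: str, doc: Doc, docs: Sequence[Doc]) -> float:
    """TF-IDF as written on the slide, with 1 added to the document count."""
    idf = math.log(len(docs) / (1 + document_frequency(term, docs)))
    return term_frequency(term, doc) * idf


def tf_df(term: str, docs: Sequence[Doc]) -> float:
    """TF-DF: average term frequency times the share of documents with the term."""
    avg_tf = sum(term_frequency(term, doc) for doc in docs) / len(docs)
    return avg_tf * document_frequency(term, docs) / len(docs)


# Hypothetical toy corpus: the particle "は" occurs everywhere, "猫" only once.
docs = [
    ["猫", "は", "魚", "を", "食べる"],
    ["犬", "は", "走る"],
    ["私", "は", "学生", "です"],
]

# TF-IDF favours the word that is specific to one document ...
specific_wins = tf_idf("猫", docs[0], docs) > tf_idf("は", docs[0], docs)
# ... while TF-DF favours the word that is common across documents.
common_wins = tf_df("は", docs) > tf_df("猫", docs)
```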
TF-IDF is usually used to measure how meaningful a word is to a single document. Here we want the opposite: a word should score higher the more common it is across documents.

----

#### Word difficulty

| Commonness | Dialects | Kanji | Katakana | NHK rating |
|------------|----------|-------|----------|------------|
| 25% | 10% | 25% | 15% | 25% |
| <img width="200px" src="./static/graphics/curves/common.png"> | <img width="200px" src="./static/graphics/curves/dialect.png"> | <img width="200px" src="./static/graphics/curves/kanji.png"> | <img width="200px" src="./static/graphics/curves/katakana.png"> | <img width="200px" src="./static/graphics/curves/nhk.png"> |
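
The weights in the table can be combined as a weighted sum. The sketch below assumes each factor has already been mapped to a difficulty score between 0 and 1, which is presumably what the curves in the table do; the curves themselves are not reproduced here. The names `WORD_WEIGHTS` and `word_difficulty` and the example scores are hypothetical.

```python
from typing import Mapping

# Weights taken from the table on this slide; they sum to 1.
WORD_WEIGHTS = {
    "commonness": 0.25,
    "dialects": 0.10,
    "kanji": 0.25,
    "katakana": 0.15,
    "nhk_rating": 0.25,
}


def word_difficulty(factor_scores: Mapping[str, float]) -> float:
    """Weighted sum of per-factor difficulty scores, each assumed to be in [0, 1]."""
    missing = WORD_WEIGHTS.keys() - factor_scores.keys()
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    return sum(weight * factor_scores[name] for name, weight in WORD_WEIGHTS.items())


# Made-up scores for a fairly rare word written in kanji.
example = word_difficulty(
    {"commonness": 0.8, "dialects": 0.0, "kanji": 0.6, "katakana": 0.0, "nhk_rating": 0.4}
)
```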
<footer>Methodology</footer>

----

#### Sentence difficulty

| Word difficulty sum | Hardest word | Sentence length |
|------------|----------|-------|
| 50% | 20% | 30% |
| <img width="200px" src="./static/graphics/curves/wordsum.png"> | | <img width="200px" src="./static/graphics/curves/sentence_length.png"> |
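
The sentence score can be sketched in the same way, followed by the sort that gives the project its title. This is a minimal sketch under stated assumptions: word difficulties are taken to lie between 0 and 1, and the curves for the word sum and the sentence length are replaced by a simple linear ramp capped at a hypothetical `max_words`, so the numbers will differ from the real system. The function name and the example data are made up.

```python
from typing import Sequence


def sentence_difficulty(word_scores: Sequence[float], max_words: int = 30) -> float:
    """Combine the three factors from the table with weights 50% / 20% / 30%."""
    if not word_scores:
        return 0.0
    # Linear stand-ins for the curves on the slide, capped at 1.
    word_sum = min(sum(word_scores) / max_words, 1.0)
    hardest_word = max(word_scores)
    length = min(len(word_scores) / max_words, 1.0)
    return 0.5 * word_sum + 0.2 * hardest_word + 0.3 * length


# Hypothetical sentences, each given as a list of word difficulty scores.
sentences = {
    "long and hard": [0.7, 0.9, 0.8, 0.6, 0.9],
    "short and easy": [0.1, 0.2],
}

# Sort from easiest to hardest.
ranked = sorted(sentences, key=lambda name: sentence_difficulty(sentences[name]))
```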
<footer>Methodology</footer>

---

<div class="columns">
<div>
<img width="80%" src="./static/graphics/examples/test1.png"/>
</div>
<div>
<img width="90%" src="./static/graphics/examples/test2.png"/>
</div>
</div>
<footer>Evaluation</footer>

----

<div class="columns">
<div>
<img width="90%" src="./static/graphics/examples/book1.png"/>
</div>
<div>
<img width="100%" src="./static/graphics/examples/book2.png"/>
</div>
</div>
<footer>Evaluation</footer>

----

<ul>
<div>
<li>Apart from some bugs, the system seems to be working as intended</li>
</div>
<div class="fragment" data-fragment-index="0">
<li>The factors should be more strongly grounded in linguistic research</li>
</div>
<div class="fragment" data-fragment-index="1">
<li>Alternatively, use a dataset that makes it possible to evaluate the accuracy of the implementation</li>
</div>
<div class="fragment" data-fragment-index="2">
<li>There is more data that has not been used yet</li>
</div>
</ul>

<footer>Conclusion and further work</footer>