<link rel="stylesheet" href="./static/main.css"/>
### TDT 4310 - Intelligent Text Analysis Project
#### Sorting Japanese sentences by linguistic complexity
----
### Overview
1. Introduction and motivation
1. Background
1. Datasets
1. Methodology
1. Evaluation
1. Conclusion and further work
---
<img src="./static/graphics/jst1.png" width="18%"/>
<img src="./static/graphics/jst2.png" width="18%"/>
<img src="./static/graphics/jst3.png" width="18%"/>
<br/>
<br/>
<footer>Motivation</footer>
---
<div style="font-size: 0.8em">
| JMDict | Tatoeba / Tanaka corpus | NHK Easy News | MeCab |
|--------|-------------------------|---------------|-------|
| Open-source dictionary | Multilingual sentence pairs | Easy-to-read news articles | POS and morphological analyzer |
| <img src="./static/graphics/jmdict.png" width=100%/> | <img src="./static/graphics/tatoeba.png" width=100%/> | <img src="./static/graphics/nhk.png" width=100%/> | |
</div>
<br/>
<br/>
<footer>Datasets</footer>
---
#### TF-IDF
Extract the most meaningful words of a document
<br/>
#### Sense disambiguation
Pinpoint which sense of the word is used, based on surrounding context and grammar.
<footer>Background</footer>
----
### Japanese
<div class="grid">
<div class="col-9">
#### Three writing systems
| <span style="color: red;">hiragana</span> | <span style="color: green;">katakana</span> | <span style="color: blue;">kanji</span> |
|----------|----------|-------|
| <img src="./static/graphics/hiragana.png"/> | <img src="./static/graphics/katakana.png"/> | <img src="./static/graphics/kanji2.png"/> |
</div>
<div class="col-3">
<div class="row-2">
<p>
<span style="color: green;">ページ</span>
<span style="color: red;"></span>
<span style="color: blue;">行目</span>
<span style="color: red;">をみなさい</span>
</p>
<p style="font-size: 0.8em;">
<span style="color: red;">Let's start from</span>
(the)
fifth
<span style="color: blue;">line</span>
<span style="color: red;">on</span>
<span style="color: green;">page</span>
10
</p>
##### Multiple readings per kanji
形 - katachi, kata, gyou, kei
</div>
<div class="row-1">
<br/>
##### Furigana
<ruby>
<rp>(</rp><rt>furi</rt><rp>)</rp>
<rp>(</rp><rt>ga</rt><rp>)</rp>
<rp>(</rp><rt>na</rt><rp>)</rp>
</ruby>
</div>
</div>
</div>
<footer>Background</footer>
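Note:
The three writing systems can be told apart from Unicode code points alone. The sketch below is only an illustration, not the project's code: the function names `script_of` and `script_counts` are hypothetical, and the kanji range covers only the main CJK Unified Ideographs block.

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character by its Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    # Assumption: only the main CJK Unified Ideographs block counts as kanji.
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def script_counts(text: str) -> Counter:
    """Count how many characters of each script a string contains."""
    return Counter(script_of(ch) for ch in text)
```

Counts like these could feed the kanji and katakana factors of a word's difficulty score.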
---
#### Data ingestion, preprocessing and disambiguation
<br/>
##### Tanaka Corpus
<p>
信用█為る(する){して}█と█彼(かれ)[01]█は|1█言う{言った}
</p>
<br/>
##### NHK News Articles
Scrape -> Extract text -> MeCab + Furigana -> Try disambiguating with POS
<footer>Methodology</footer>
Note:
Disambiguation here is not necessarily word sense disambiguation, but rather picking the right dictionary entry.
The English translation could be exploited to disambiguate all the way down to the word senses.
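The annotated Tanaka Corpus line on the slide can be split into tokens with a small regular expression. This is a minimal sketch, not the project's parser. The field meanings are assumptions inferred from the example: `(…)` is a reading, `[…]` a sense number, `{…}` the surface form used in the sentence, and `|n` an index that picks one of several dictionary entries. The names `Token` and `parse_line` are hypothetical, and the `█` separator is taken from the slide.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed token layout: headword, then optional |n, (reading), [sense], {surface}, ~
TOKEN_RE = re.compile(
    r"^(?P<headword>[^|(\[{~]+)"
    r"(?:\|(?P<entry>\d+))?"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]*)\})?"
    r"~?$"
)


@dataclass
class Token:
    headword: str            # dictionary form
    entry: Optional[int]     # index among entries sharing the headword
    reading: Optional[str]   # kana reading, if annotated
    sense: Optional[int]     # sense number, if annotated
    surface: str             # form that appears in the sentence


def parse_line(line: str, sep: str = "█") -> List[Token]:
    """Parse one annotated sentence into tokens."""
    tokens = []
    for raw in line.strip().split(sep):
        if not raw:
            continue
        m = TOKEN_RE.match(raw)
        if m is None:
            raise ValueError(f"unrecognised token: {raw!r}")
        tokens.append(Token(
            headword=m["headword"],
            entry=int(m["entry"]) if m["entry"] else None,
            reading=m["reading"],
            sense=int(m["sense"]) if m["sense"] else None,
            # Fall back to the headword when no surface form is given.
            surface=m["surface"] or m["headword"],
        ))
    return tokens
```

Joining the `surface` fields gives back the sentence text, while `headword`, `entry` and `reading` are what a dictionary lookup would use.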
----
#### TF-IDF?
<br/>
<div>
$ \text{TF-IDF} = \frac{\text{Amount of term in doc}}{\text{Amount of terms in doc}} \cdot \log \frac{\text{Amount of docs}}{1 + \text{Amount of docs containing term}} $
</div>
<br/>
<div class="fragment" data-fragment-index="0">
$ \text{TF-DF} = \frac{\text{AVG}(\text{Amount of term in doc})}{\text{Amount of terms in doc}} \cdot \frac{\text{Amount of docs containing term}}{\text{Amount of docs}} $
</div>
<footer>Methodology</footer>
Note:
TF-IDF is usually used to find out how meaningful a word is to a single document. Here, we want the opposite: a word should score higher the more documents it appears in.
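The TF-DF formula can be read in more than one way. The sketch below assumes one reading: the relative frequency of a term (its count divided by the document length) is averaged over all documents, then multiplied by the share of documents that contain the term. The function name `tf_df` and the list-of-token-lists input are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def tf_df(docs: List[List[str]]) -> Dict[str, float]:
    """Score terms so that widely used terms get HIGHER values."""
    n_docs = len(docs)
    if n_docs == 0:
        return {}
    tf_sum: Counter = Counter()  # sum of per-document relative frequencies
    df: Counter = Counter()      # number of documents containing the term
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf_sum[term] += count / len(doc)
            df[term] += 1
    # Assumption: the average runs over all documents, not only those
    # that contain the term.
    return {t: (tf_sum[t] / n_docs) * (df[t] / n_docs) for t in df}
```

With this reading, a term that appears in every document always outranks a term that appears equally often in only one of them.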
----
#### Word difficulty
| Commonness | Dialects | Kanji | Katakana | NHK rating |
|------------|----------|-------|----------|------------|
| 25% | 10% | 25% | 15% | 25% |
| <img width="200px" src="./static/graphics/curves/common.png"> | <img width="200px" src="./static/graphics/curves/dialect.png"> | <img width="200px" src="./static/graphics/curves/kanji.png"> | <img width="200px" src="./static/graphics/curves/katakana.png"> | <img width="200px" src="./static/graphics/curves/nhk.png"> |
<footer>Methodology</footer>
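Note:
The weights in the table combine into a single word score as a weighted sum. The sketch below assumes each factor has already been mapped to a value between 0 and 1 by its curve. The curves themselves are only shown as images and are not reproduced here. The names `WORD_WEIGHTS` and `word_difficulty` are hypothetical.

```python
from typing import Dict

# Weights taken from the slide's table; they sum to 100%.
WORD_WEIGHTS = {
    "commonness": 0.25,
    "dialects": 0.10,
    "kanji": 0.25,
    "katakana": 0.15,
    "nhk_rating": 0.25,
}


def word_difficulty(factors: Dict[str, float]) -> float:
    """Weighted sum of per-factor scores, each expected in [0, 1]."""
    missing = WORD_WEIGHTS.keys() - factors.keys()
    if missing:
        raise KeyError(f"missing factors: {sorted(missing)}")
    for name in WORD_WEIGHTS:
        if not 0.0 <= factors[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(weight * factors[name] for name, weight in WORD_WEIGHTS.items())
```

Because the weights sum to 1, the result also stays between 0 and 1.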
----
#### Sentence difficulty
| Word difficulty sum | Hardest word | Sentence Length |
|------------|----------|-------|
| 50% | 20% | 30% |
| <img width="200px" src="./static/graphics/curves/wordsum.png"> | | <img width="200px" src="./static/graphics/curves/sentence_length.png"> |
<footer>Methodology</footer>
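Note:
A sentence score can be sketched the same way from the three weights in the table. The curves on the slide are not specified, so `saturate` below is a placeholder curve chosen only for illustration. The half-point defaults and the choice to measure sentence length in words are also assumptions, and all names are hypothetical.

```python
from typing import List


def saturate(x: float, half_point: float) -> float:
    """Placeholder curve: maps x >= 0 into [0, 1), giving 0.5 at half_point."""
    return x / (x + half_point)


def sentence_difficulty(
    word_scores: List[float],
    sum_half: float = 5.0,      # assumed half-point for the word-difficulty sum
    length_half: float = 15.0,  # assumed half-point for the length in words
) -> float:
    """Combine word-level scores (each in [0, 1]) into one sentence score."""
    if not word_scores:
        return 0.0
    total = saturate(sum(word_scores), sum_half)
    hardest = max(word_scores)
    length = saturate(len(word_scores), length_half)
    # Weights from the slide: 50% sum, 20% hardest word, 30% length.
    return 0.5 * total + 0.2 * hardest + 0.3 * length
```

Sorting is then a matter of `sorted(sentences, key=sentence_difficulty)`, where each sentence is given as its list of word scores.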
---
<div class="columns">
<div>
<img width="80%" src="./static/graphics/examples/test1.png"/>
</div>
<div>
<img width="90%" src="./static/graphics/examples/test2.png"/>
</div>
</div>
<footer>Evaluation</footer>
----
<div class="columns">
<div>
<img width="90%" src="./static/graphics/examples/book1.png"/>
</div>
<div>
<img width="100%" src="./static/graphics/examples/book2.png"/>
</div>
</div>
<footer>Evaluation</footer>
----
<ul>
<div>
<li>Apart from some bugs, the system seems to be working as intended</li>
</div>
<div class="fragment" data-fragment-index="0">
<li>The factors should be more strongly grounded in linguistic research</li>
</div>
<div class="fragment" data-fragment-index="1">
<li>Alternatively, use a dataset that makes it possible to evaluate the accuracy of the implementation</li>
</div>
<div class="fragment" data-fragment-index="2">
<li>More data is still left unused</li>
</div>
</ul>
<footer>Conclusion and further work</footer>