\documentclass[a4paper, 12pt]{article}
\usepackage{ntnu-report}
\usepackage{amsmath}
\usepackage{xeCJK}
\setCJKmainfont{Noto Sans CJK JP}
\usepackage{booktabs}
\usepackage{array}
\usepackage{ruby}
\date{April 2023}
\title{TDT4310 - Text Analysis Project}
\addbibresource{references.bib}
\begin{document}
\include{titlepage}
\newpage
This article aims to explore the use of natural language processing to order Japanese sentences by their linguistic complexity.
In this paper, we provide an overview of the Japanese language and related work in the field, followed by a description of the architecture of our system.
We detail the datasets used, the methodology employed, and the evaluation of our system's performance.
\section{Introduction}
The problem we address in this article arose while developing a mobile dictionary app called Jisho-Study-Tool \citep{jst}. We faced a challenge when we needed to link example sentences to words in the dictionary and arrange them in order of difficulty. To overcome this challenge, we utilized techniques and algorithms from natural language processing. In this article, we present our approach to solving this problem.
\section{Background}
\subsection{Japanese Language}
Japanese is a language that is very different from English. It employs three writing systems: hiragana, katakana, and kanji. While hiragana and katakana represent the same set of syllables with two different scripts, kanji is a logographic system derived from Chinese characters. Hiragana is generally used for native Japanese words, grammatical particles, and verb endings, whereas katakana is used for loanwords from foreign languages, technical terms, and onomatopoeia. The common term for both of these is \textit{kana}. Kanji, on the other hand, tends to be used for words of Chinese origin, such as nouns, adjectives, and verbs. Although each word typically has a canonical way of being written, the language permits alternative combinations of the writing systems, sometimes for practical purposes and sometimes to convey nuance or handle exceptional cases.

Kanji can have multiple pronunciations, which are usually classified into onyomi and kunyomi. While the difference between these is irrelevant for this article, the fact that a single word can have several pronunciations brings some challenges. Additionally, the language has many homonyms, which are disambiguated through context when speaking and by using kanji when writing. These homonyms present both advantages and disadvantages for this project. On one hand, they pose challenges when trying to disambiguate words, since many words share the same pronunciation. On the other hand, they provide dimensions lacking from English that we can use to disambiguate words. For example, some datasets include small kana written above the kanji, called \textit{furigana}, which aid in reading them. We can use this in combination with a dictionary to further narrow down which sense of a word is being used.
\subsection{Word sense disambiguation}
Word sense disambiguation refers to the process of determining which specific meaning or usage of a word is being employed in a given context. This provides important semantic information that is useful in various natural language processing applications. In our case, it helps us gather statistics on the frequency of different word senses and identify common words. There are several algorithms for word sense disambiguation, but in this article we use a more traditional, dictionary- and rule-based approach.
\subsection{TF-IDF}
Term frequency-inverse document frequency, commonly known as TF-IDF, is a popular text vectorization technique that converts raw text into a usable vector. The method combines two concepts, term frequency (TF) and document frequency (DF), to produce a comprehensive representation of text data.

Term frequency refers to the number of times a specific term appears in a document, which helps determine the importance of that term within the document. By considering the term frequency of every word in a corpus, we can represent the text data as a matrix with one row per document and one column per distinct term found across all documents. Document frequency, on the other hand, measures how many documents contain a specific term, providing insight into how common a particular word is across the entire corpus. Finally, the inverse document frequency (IDF) is a weight assigned to each term, which reduces the importance of a term whose occurrences are spread across all documents. The IDF is calculated from the total number of documents and the number of documents containing the term.
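A commonly used formulation (one of several variants) of these quantities is
\begin{align*}
\text{IDF}(t) &= \log \frac{N}{\text{DF}(t)}, \\[1ex]
\text{TF-IDF}(t, d) &= \text{TF}(t, d) \cdot \text{IDF}(t),
\end{align*}
where $N$ is the total number of documents, $\text{DF}(t)$ is the number of documents containing the term $t$, and $\text{TF}(t, d)$ is the number of occurrences of $t$ in document $d$.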
\section{Related work}
\subsection{Automatic Text Difficulty Classifier}
This article describes a system designed to assess the complexity of Portuguese texts, intended to provide language learners with texts that correspond to their skill level. To accomplish this, the system extracts 52 features grouped into categories covering parts-of-speech (POS), syllables, words, chunks and phrases, averages and frequencies, and additional features. The system combines these features to calculate a value that represents the text's level of difficulty. This approach of combining several features of different kinds is similar to the one we take in this project. \citep{portuguese}
\subsection{Jisho.org}
Jisho is an online Japanese-English dictionary that offers a wide range of features for searching words, kanji, and example sentences. To accomplish this, Jisho integrates various data sources, including the Japanese-Multilingual Dictionary (JMDict) and the Tanaka Corpus, both of which are described further below. One of its useful features is providing example sentences that illustrate how a word is used in context, and Jisho's approach to data aggregation is similar to ours, although not exactly the same. Although their product is closed source, some of the tools used in the process are publicly available; during the development of this project, their kana-romaji translator \citep{ve} has proven to be a valuable tool. Unfortunately, Jisho usually provides only one or two sentences per sense, if any, so it is of limited use as a point of comparison.
\subsection{Surrounding Word Sense Model for Japanese All-words Word Sense Disambiguation}
This paper proposes a surrounding word sense model (SWSM) that uses the distribution of word senses appearing near ambiguous words for unsupervised all-words word sense disambiguation in Japanese. It is based on the idea that words with the same sense often appear with the same surrounding words. By utilizing dictionary data in addition to WORDNET-WALK, the authors have created an engine that is more accurate than existing supervised models. This could be combined with our project in the future to improve its accuracy.
\citep{swsm}
\section{Architecture}
\subsection{Datasets}
\subsubsection{JMDict}
JMDict is a publicly available Japanese-to-multilingual dictionary developed by Jim Breen and his associates at the Electronic Dictionary Research and Development Group (EDRDG). The dictionary contains various types of information, such as kanji, readings, and word senses. It also includes less common information, like newspaper frequency indices for the different word senses and the origins of loanwords. This resource is valuable to us since it provides a predetermined word list that we can use to link our example sentences. Additionally, JMDict can be used as a query tool to examine relationships between words and senses. \citep{jmdict}
\subsubsection{Tanaka corpus}
The Tanaka corpus is a compendium of Japanese sentences, most of which include an English translation. The compilation was put together by Yasuhito Tanaka, a professor at Hyogo University. Originally, the corpus was created by assigning Professor Tanaka's students the task of collecting 300 sentence pairs each; after several years, they had collected 212,000 sentence pairs. In 2002, the EDRDG started working on creating links from the sentences to the entries in JMDict, and in 2006 the maintenance of the corpus was incorporated into the Tatoeba project. The current version of the corpus released by the EDRDG comes preprocessed with lemmatizations, furigana, and other supplementary data.
\citep{tanaka-corpus}
\subsubsection{NHK Easy News}
JMDict contains a wealth of information on the frequency of words in the Japanese language. However, some of these statistics are derived from Japanese newspapers, which are renowned for being challenging even for advanced learners.

Fortunately, Japan's public broadcaster, NHK, publishes a news service written specifically for learners. This is a valuable resource, since we can be certain that every word in this corpus is suitable for learners. We therefore use this corpus to construct a new index that tells us whether a word is suitable for learners.
\subsection{Methodology}
\subsubsection{Data ingestion}
The first task was to ingest and preprocess the data from the different sources.
For this, we chose to use an SQL database, because it provides an easy way of storing temporary results and quickly retrieving entries with complex queries.
By reading the document type definition \citep{xmldtd} of the JMDict XML file, we were able to construct most of the schema of the database. Some parts of the schema were never used, so a small amount of information is lost in this process.
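As an illustration, a minimal sketch of this ingestion step is shown below, assuming Python with the standard \texttt{sqlite3} and \texttt{xml.etree} modules and a heavily simplified, hypothetical schema (the real schema contains many more tables and columns). The element names (\texttt{ent\_seq}, \texttt{keb}, \texttt{reb}, \texttt{sense}, \texttt{gloss}) are taken from the JMDict DTD:
\begin{verbatim}
import sqlite3
import xml.etree.ElementTree as ET

# Simplified, hypothetical subset of the schema derived from the DTD.
conn = sqlite3.connect("project.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS entry   (ent_seq INTEGER PRIMARY KEY);
CREATE TABLE IF NOT EXISTS kanji   (ent_seq INTEGER, keb TEXT);
CREATE TABLE IF NOT EXISTS reading (ent_seq INTEGER, reb TEXT);
CREATE TABLE IF NOT EXISTS gloss   (ent_seq INTEGER, sense_no INTEGER,
                                    gloss TEXT);
""")

tree = ET.parse("JMdict_e.xml")  # English-only distribution of JMDict
for entry in tree.getroot().iter("entry"):
    seq = int(entry.findtext("ent_seq"))
    conn.execute("INSERT INTO entry VALUES (?)", (seq,))
    for keb in entry.iter("keb"):        # kanji writings
        conn.execute("INSERT INTO kanji VALUES (?, ?)", (seq, keb.text))
    for reb in entry.iter("reb"):        # kana readings
        conn.execute("INSERT INTO reading VALUES (?, ?)", (seq, reb.text))
    for i, sense in enumerate(entry.iter("sense")):
        for gloss in sense.iter("gloss"):
            conn.execute("INSERT INTO gloss VALUES (?, ?, ?)",
                         (seq, i, gloss.text))
conn.commit()
\end{verbatim}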
NHK News publishes an official index of the articles from the past year at \url{http://www3.nhk.or.jp/news/easy/news-list.json}. We used this index to download the articles and then scraped their content with an HTML parser. The results were also stored in the SQL database.
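A sketch of this scraping step is shown below; the JSON field names (\texttt{news\_id}, \texttt{title}) and the article URL pattern are assumptions about the layout of the index, and \texttt{BeautifulSoup} stands in for whichever HTML parser is used:
\begin{verbatim}
import requests
import sqlite3
from bs4 import BeautifulSoup

INDEX_URL = "http://www3.nhk.or.jp/news/easy/news-list.json"
conn = sqlite3.connect("project.db")
conn.execute("""CREATE TABLE IF NOT EXISTS nhk_article
                (news_id TEXT PRIMARY KEY, title TEXT, body TEXT)""")

# The index groups article metadata by publication date.
index = requests.get(INDEX_URL).json()
for date_map in index:
    for date, articles in date_map.items():
        for article in articles:
            news_id = article["news_id"]      # assumed field name
            url = ("http://www3.nhk.or.jp/news/easy/"
                   f"{news_id}/{news_id}.html")  # assumed URL pattern
            html = requests.get(url).text
            soup = BeautifulSoup(html, "html.parser")
            body = soup.get_text(" ", strip=True)  # crude: all visible text
            conn.execute(
                "INSERT OR REPLACE INTO nhk_article VALUES (?, ?, ?)",
                (news_id, article.get("title", ""), body))
conn.commit()
\end{verbatim}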
The sentences from the Tanaka Corpus were ingested in a similar manner.
\subsubsection{Word Sense Disambiguation}
Both corpora contain elements that can facilitate the disambiguation process.
The Tatoeba sentences are already partially annotated with lemmatizations, furigana (which give the readings of the kanji), and at times even the JMDict identifier. However, these annotations only identify the dictionary entry, not which of its senses is used. Here, we could have used SWSM in an attempt to further disambiguate the word into the senses listed in the dictionary.

The NHK Easy News corpus does not have these kinds of annotations. To solve this problem, we use a combination of the furigana from the corpus, MeCab to analyze the words and obtain POS tags, and a prioritized list describing how to search for the correct meaning of the word. We created a mapping from the MeCab part-of-speech tags to the JMDict tags. The first entry that fits, based on its existing commonality data and on being the most likely match, is chosen. If no match is found, the word is not added to the list of connected entries.

This approach has a limitation: it can make some frequently used words appear even more frequent than they actually are. Conversely, words that are common, but less common than similar-looking candidates, could be wrongly classified as very rare because they never seem to appear in the NHK corpus.
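A simplified sketch of this matching step is shown below, assuming the \texttt{mecab-python3} bindings and two hypothetical helpers: \texttt{pos\_to\_jmdict}, which implements our MeCab-to-JMDict tag mapping, and \texttt{lookup\_entries}, which queries the database for candidate entries together with their POS tags and commonality data:
\begin{verbatim}
import MeCab

tagger = MeCab.Tagger()  # default dictionary, used only for POS tags

def disambiguate(word, furigana):
    """Return the JMDict entry id for a word, or None if nothing fits."""
    # With the default dictionary, the first feature field of a node
    # is the coarse part-of-speech tag.
    node = tagger.parseToNode(word)
    node = node.next                         # skip the BOS node
    mecab_pos = node.feature.split(",")[0]
    jmdict_pos = pos_to_jmdict(mecab_pos)    # hypothetical mapping

    # Candidates matching the writing and furigana, searched in a
    # prioritized order: entries with commonality data come first.
    candidates = lookup_entries(word, reading=furigana)  # hypothetical query
    candidates.sort(key=lambda e: e.commonality, reverse=True)
    for entry in candidates:
        if jmdict_pos in entry.pos_tags:
            return entry.ent_seq
    return None   # the word is not linked to any entry
\end{verbatim}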
\subsubsection{TF-IDF}
TF-IDF is often used as a tool to estimate how meaningful a word is for one document in a corpus.
However, here we want the opposite measure. We are not looking for the words that give the documents most of its meaning, but rather the words which are more common across several documents.
If a token has a high frequency in only one of the documents, then there is a high chance that the word is specific to the domain of that document.
Because of this, we change the formula to use the average term frequency multiplied by the document frequency.
\begin{align*}
\text{AVG}(TF) &= \text{AVG}\left( \frac{\text{Occurrences of term in document}}{\text{Number of terms in document}} \right) \\[2ex]
DF &= \frac{\text{Number of documents containing the term}}{\text{Total number of documents}} \\[2ex]
\text{TF-DF} &= \text{AVG}(TF) \cdot DF
\end{align*}
We then went over the NHK Easy News corpus and collected the ``TF-DF'' values from it. These were then normalized to lie in $[0, 1]$.
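A minimal sketch of this calculation, assuming each document has already been reduced to a list of JMDict entry ids and that the normalization is a simple division by the largest score, is shown below:
\begin{verbatim}
from collections import Counter

def tf_df(documents):
    """Average term frequency times document frequency, scaled to [0, 1].

    `documents` is a list of documents, each a list of entry ids.
    """
    n_docs = len(documents)
    tf_sums = Counter()   # sum of per-document term frequencies
    df = Counter()        # number of documents containing the term

    for doc in documents:
        for term, count in Counter(doc).items():
            tf_sums[term] += count / len(doc)
            df[term] += 1

    # Average TF over all documents (absent terms count as zero),
    # multiplied by the document frequency ratio.
    scores = {t: (tf_sums[t] / n_docs) * (df[t] / n_docs) for t in tf_sums}
    top = max(scores.values())
    return {t: s / top for t, s in scores.items()}
\end{verbatim}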
\subsubsection{Determining word and sentence difficulty}
At this point, there are many potential factors available to work with. To order sentences properly, we need to determine how hard the words and sentences are to understand by aggregating some of these factors. We have picked a few factors that we believe are useful for determining the difficulty values, but the chosen curves and weights are based only on trial and error and educated guesses.
Figure \ref{fig:wordfactors} shows how the different factors contribute to a word's difficulty.
The sentence factors are listed in Figure \ref{fig:sentencefactors}, and a short sketch of the aggregation is given after the figures.
\newcommand{\curveDiagramWidth}{0.15\linewidth}
\begin{figure}[H]
\begin{tabular}{ m{3cm} m{1cm} l m{7cm} }
\toprule
Factor & \% & Curve & Notes and reasoning \\
\midrule
$\frac{\sum \text{difficulty}(\text{word})}{\text{length of sentence}}$
& 50\%
& \includegraphics[width=\curveDiagramWidth]{graphics/curves/wordsum.png}
& This is the aggregated value based on the calculation in Figure \ref{fig:wordfactors}. As the values should be decently curved already, they are left unaffected. We also believe that this should have a lot more effect on the sentence than the other two factors. \\
\midrule
$\text{max}(\text{difficulty}(\text{word}))$
& 20\%
&
& The hardest word in the sentence can be the word that makes the whole sentence useless for a learner. Because of that, we make the hardest word in the sentence its own factor. \\
\midrule
Sentence length
& 30\%
& \includegraphics[width=\curveDiagramWidth]{graphics/curves/sentence_length.png}
& Until a sentence reaches around 12 words, it should be regarded as quite easy. But once it surpasses that, it becomes more difficult. \\
\bottomrule
\end{tabular}
\caption{Contributing factors to a sentence's difficulty}
\label{fig:sentencefactors}
\end{figure}
\begin{figure}[H]
\begin{tabular}{ m{3cm} m{1cm} c m{7cm} }
\toprule
Factor & \% & Curve & Notes and reasoning \\
\midrule
Common ratings
& 25\%
& \includegraphics[width=\curveDiagramWidth]{graphics/curves/common.png}
& The different existing ratings of the word are summed together and linearly scaled into $[0, 1]$. If the entry is included in more than one or two indices, it can be assumed that it is quite a common word, and it should be marked as very easy. \\
\midrule
Dialects
& 10\%
& \includegraphics[width=\curveDiagramWidth]{graphics/curves/dialect.png}
& This is the share of readings that are marked as dialectal. If more than roughly 30\% of a word's readings are dialectal, we assume that it is a very dialect-specific word, which should increase its difficulty. \\
\midrule
Most difficult kanji
& 25\%
& \includegraphics[width=\curveDiagramWidth]{graphics/curves/kanji.png}
& The input here is the elementary school grade in which the kanji is taught, where grade 7 covers the rest of the \ruby{常用}{jouyou}\ kanji \citep{jouyou} and grade 8 covers everything else. Grades 1--6 generally mean that the word is easy, grade 7 is intermediate to difficult, and grade 8 is extremely difficult. There is an edge case where a word has an associated set of very difficult kanji that are rarely used in practice; these come pre-tagged as such and are removed from the calculation. \\
\midrule
Katakana word
& 15\%
& \includegraphics[width=\curveDiagramWidth]{graphics/curves/katakana.png}
& If a word only contains katakana, there is a good chance that it is a loanword from English. This is usually a clear-cut case, but some words have alternative kanji spellings that are rarely used. Examples include \href{https://jisho.org/word/\%E9\%A0\%81}{\ruby{頁}{ページ} (page)} and \href{https://jisho.org/word/\%E3\%82\%B3\%E3\%83\%BC\%E3\%83\%92\%E3\%83\%BC}{\ruby{珈琲}{コーヒー} (coffee)}. Because of this, we use a hard cut-off at 50\% for the share of readings that are katakana only. \\
\midrule
NHK Easy News Frequency Rating
& 25\%
& \includegraphics[width=\curveDiagramWidth]{graphics/curves/nhk.png}
& In order to discount words that are document specific, the S-curve marks words with a low frequency rating as difficult, but their difficulty drops off quickly as the rating increases. \\
\bottomrule
\end{tabular}
\caption{Contributing factors to a word's difficulty}
\label{fig:wordfactors}
\end{figure}
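To make the aggregation concrete, the sketch below combines the sentence-level factors from Figure \ref{fig:sentencefactors}. The word difficulties are assumed to be computed according to Figure \ref{fig:wordfactors}, and the logistic curve for the sentence length is a stand-in for the curve we actually used:
\begin{verbatim}
import math

def sentence_difficulty(word_difficulties):
    """Weighted combination of the sentence factors; result is in [0, 1].

    `word_difficulties` holds one difficulty value per word, each in [0, 1].
    """
    avg_word = sum(word_difficulties) / len(word_difficulties)
    hardest_word = max(word_difficulties)
    # Sentences stay easy until roughly 12 words, then become harder.
    length = len(word_difficulties)
    length_factor = 1 / (1 + math.exp(-(length - 12) / 2))

    return 0.5 * avg_word + 0.2 * hardest_word + 0.3 * length_factor
\end{verbatim}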
\newpage
\section{Evaluation and conclusion}
\subsection{Evaluation}
Although we were unable to measure the accuracy of the results quantitatively, the first impression was quite good.
Here is an example of the sentences connected to the word テスト (test):
\begin{figure}[H]
\begin{minipage}{0.49\linewidth}
\includegraphics[width=\linewidth]{graphics/examples/test1.png}
\end{minipage}
\begin{minipage}{0.49\linewidth}
\includegraphics[width=\linewidth]{graphics/examples/test2.png}
\end{minipage}
\caption{Example sentences for the word ``test'', with the easiest and hardest difficulty levels}
\end{figure}
From this example, the system seems to work quite well, with one big exception: the particles. These are small suffixes that exist only to indicate the grammatical role of the word before them. Although they are probably among the most common elements in the Japanese language, they have been marked as very difficult. Apart from this, the easier words have been marked green, while the harder ones have been given an orange color.
Here is another example for \ruby{本}{ほん} (book), which has several senses. We have turned on debug information to show the contributing factors.
\begin{figure}[H]
\begin{minipage}{0.49\linewidth}
\includegraphics[width=\linewidth]{graphics/examples/book1.png}
\end{minipage}
\begin{minipage}{0.49\linewidth}
\includegraphics[width=\linewidth]{graphics/examples/book2.png}
\end{minipage}
\caption{Example sentences for the word ``book'', with the easiest and hardest difficulty levels}
\end{figure}
Here we can see the internal details of why the particles have been rated as so difficult: they are marked as maximally difficult on the kanji scale. This is a bug, since the kanji factor is supposed to filter out anything that is not a kanji. Unfortunately, despite spending quite a lot of time on this, we could not figure out why it behaves this way before the project deadline.
\subsection{Conclusion}
While the system performs well on our random samples, there are still some inaccuracies that should be researched further, as well as some bugs left to fix.
There are many other factors that we haven't explored yet which could be useful. For example, many sentences in the Tatoeba Corpus are already labeled with tags, some of which could indicate whether a sentence is difficult or not. We could also look to the automatic text difficulty classifier project for additional ideas on which factors to consider.
We also think more research is necessary to establish the correct weighting for different factors and which curves should be used. This requires examining which factors of a word are the most important for determining its level of difficulty. This is crucial for ensuring that the sorting system works correctly. Additionally, we need to investigate how to handle sentences with unfamiliar words to ensure they are sorted in a reasonable way.
\nocite{*}
\customphantomsection{Bibliography}
\printbibliography{}
\end{document}