jadb/docs/database.md
h7x4 6364457d9e
docs/database: add some notes about elementId embeddings
2026-04-08 19:07:48 +09:00

Database

Here are some of the choices that were made when designing the schema.

JMdict_{Reading,Kanji}Element.elementId and JMdict_Sense.senseId

The elementId/senseId field acts as a unique identifier for each individual element in these tables. It is a packed version of the (entryId, orderNum) pair, where entryId is given 7 decimal digits and orderNum is given 2 (the highest count found so far is 40). Since entryId is already a field in the table, it would technically have been fine to store orderNum as a separate field, but a single packed id makes it easier to refer to the elements from other tables without a composite foreign key.
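
As a minimal sketch of the packing described above, assuming the pair is combined as plain decimal arithmetic (entryId shifted left by two decimal digits, plus orderNum). The function names `pack_id` and `unpack_id` are hypothetical and are not taken from the jadb code base.

```python
ORDER_DIGITS = 2                 # orderNum is given 2 decimal digits
ORDER_BASE = 10 ** ORDER_DIGITS  # 100
MAX_ENTRY_ID = 9_999_999         # entryId is given 7 decimal digits


def pack_id(entry_id: int, order_num: int) -> int:
    """Pack (entryId, orderNum) into one integer id (hypothetical helper)."""
    if not 0 <= entry_id <= MAX_ENTRY_ID:
        raise ValueError(f"entryId does not fit in 7 digits: {entry_id}")
    if not 0 <= order_num < ORDER_BASE:
        raise ValueError(f"orderNum does not fit in 2 digits: {order_num}")
    return entry_id * ORDER_BASE + order_num


def unpack_id(packed: int) -> tuple[int, int]:
    """Recover (entryId, orderNum) from a packed id."""
    return divmod(packed, ORDER_BASE)


print(pack_id(1234567, 3))   # 123456703
print(unpack_id(123456703))  # (1234567, 3)
```

With 2 digits reserved for orderNum, the observed maximum of 40 leaves some headroom before the packing would have to change.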

(NOTE: entryId is now inferred from elementId within sqlite using a generated column, so saying it is "stored in a separate field" might be a stretch)
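
The generated-column approach mentioned in the note could look roughly like the sketch below. The table name `KanjiElementSketch`, the `value` column, and the expression `elementId / 100` are assumptions made for illustration; the real schema may differ. Generated columns require SQLite 3.31 or newer.

```python
import sqlite3


def derived_entry_id(element_id: int) -> int:
    """Insert one row and read back the entryId that SQLite derives from elementId."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE KanjiElementSketch (      -- hypothetical, simplified table
            elementId INTEGER PRIMARY KEY,
            -- entryId is computed on read rather than stored (assumed expression)
            entryId INTEGER GENERATED ALWAYS AS (elementId / 100) VIRTUAL,
            value TEXT NOT NULL
        )
        """
    )
    conn.execute(
        "INSERT INTO KanjiElementSketch (elementId, value) VALUES (?, ?)",
        (element_id, "example"),
    )
    (entry_id,) = conn.execute(
        "SELECT entryId FROM KanjiElementSketch WHERE elementId = ?",
        (element_id,),
    ).fetchone()
    conn.close()
    return entry_id


print(derived_entry_id(123456703))  # 1234567
```

Because both operands are integers, SQLite performs integer division here, which drops the two orderNum digits.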

In addition, 1000000000 is added to the reading element ids to keep them distinct from the kanji element ids. This reduces the amount of space needed for indices in some locations, because the two kinds can be separated with a simple > or < comparison.
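
A small sketch of why the offset works, assuming the 7+2 digit packing described earlier: the largest possible kanji element id is 999999999, so every id at or above 1000000000 must be a reading element. The helper names are hypothetical.

```python
READING_OFFSET = 1_000_000_000  # added to reading element ids, per the text above
ORDER_BASE = 100                # 2 decimal digits for orderNum


def kanji_element_id(entry_id: int, order_num: int) -> int:
    return entry_id * ORDER_BASE + order_num


def reading_element_id(entry_id: int, order_num: int) -> int:
    return READING_OFFSET + entry_id * ORDER_BASE + order_num


def is_reading_element(element_id: int) -> bool:
    # A single comparison is enough to tell the two kinds apart.
    return element_id >= READING_OFFSET


ids = [kanji_element_id(1000010, 1), reading_element_id(1000010, 1)]
print(ids)                                   # [100001001, 1100001001]
print([is_reading_element(i) for i in ids])  # [False, True]
```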

We used to generate the elementId separately from orderNum as a sequential id, but this led to all values shifting whenever the data was updated, producing very large diffs. Making it a unique composite of values coming from the source data itself means that the ids will be stable across updates.

Due to the way the data is structured, we can use the elementId as the ordering number as well.
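
To illustrate the ordering property, here is a short check, assuming the decimal packing `entryId * 100 + orderNum`: sorting by the packed id gives the same order as sorting by the (entryId, orderNum) pair. The sample values are made up.

```python
def packed(pair: tuple[int, int]) -> int:
    """Assumed packing: entryId shifted by two decimal digits, plus orderNum."""
    entry_id, order_num = pair
    return entry_id * 100 + order_num


# Made-up (entryId, orderNum) pairs in arbitrary order.
pairs = [(1000020, 2), (1000010, 10), (1000020, 1), (1000010, 2)]

by_pair = sorted(pairs)            # order by entryId, then orderNum
by_id = sorted(pairs, key=packed)  # order by the packed id alone

print(by_pair == by_id)  # True
```

This holds as long as orderNum stays below 100, since the entryId digits then always dominate the comparison.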

JMdict_EntryScore

The JMdict_EntryScore table is used to store the score of each entry, which is used for sorting search results. The score is calculated based on a number of variables.

The table is generated automatically from other tables via triggers, and should be treated as a materialized view.
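
The trigger-maintained pattern can be sketched as follows. Everything here is hypothetical: the table names `ElementSketch` and `ScoreSketch`, the `isCommon` column, and the placeholder formula `isCommon * 100` are invented for illustration and do not reflect how jadb actually computes scores.

```python
import sqlite3


def build_sketch() -> sqlite3.Connection:
    """Create a source table and a score table kept in sync by triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ElementSketch (
            elementId INTEGER PRIMARY KEY,
            isCommon  INTEGER NOT NULL      -- hypothetical scoring input
        );
        CREATE TABLE ScoreSketch (
            elementId INTEGER PRIMARY KEY,
            score     INTEGER NOT NULL
        );

        -- Placeholder score: isCommon * 100 (not the real formula).
        CREATE TRIGGER ElementSketch_ai AFTER INSERT ON ElementSketch
        BEGIN
            INSERT INTO ScoreSketch (elementId, score)
            VALUES (NEW.elementId, NEW.isCommon * 100);
        END;

        CREATE TRIGGER ElementSketch_au AFTER UPDATE OF isCommon ON ElementSketch
        BEGIN
            UPDATE ScoreSketch SET score = NEW.isCommon * 100
            WHERE elementId = NEW.elementId;
        END;

        CREATE TRIGGER ElementSketch_ad AFTER DELETE ON ElementSketch
        BEGIN
            DELETE FROM ScoreSketch WHERE elementId = OLD.elementId;
        END;
        """
    )
    return conn


def score_of(conn: sqlite3.Connection, element_id: int):
    row = conn.execute(
        "SELECT score FROM ScoreSketch WHERE elementId = ?", (element_id,)
    ).fetchone()
    return None if row is None else row[0]


conn = build_sketch()
conn.execute("INSERT INTO ElementSketch VALUES (100001001, 1)")
print(score_of(conn, 100001001))  # 100
```

Since the score table is only ever written by the triggers, it behaves like a materialized view: application code reads it for sorting but never writes to it directly.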

There is a score row for every single entry in both JMdict_KanjiElement and JMdict_ReadingElement, split by the type field.