docs: init

2026-04-01 16:48:40 +09:00
parent 2ad1e038f1
commit ede57a7a00
4 changed files with 62 additions and 23 deletions


@@ -18,26 +18,4 @@ Note that while the license for the code is MIT, the data has various licenses.
| **Tanos JLPT levels:** | https://www.tanos.co.uk/jlpt/ |
| **Kangxi Radicals:** | https://ctext.org/kangxi-zidian |
## Implementation details
### Word search
The word search procedure is currently split into 3 parts:
1. **Entry ID query**:
Uses a complex query with various scoring factors to get a list of
database IDs pointing at dictionary entries, sorted by how likely we think each
entry is the word that the caller is looking for. The output here is a `List<int>`.
2. **Data Query**:
Takes the entry ID list from the previous step and performs all the queries needed to retrieve
the dictionary data for those IDs. The result is a struct with a number of flattened lists
containing data for all the dictionary entries. These lists are sorted by the order in which
the IDs were provided.
3. **Regrouping**:
Takes the flattened data and regroups the items into structs with a more "hierarchical" structure.
All data tagged with the same ID ends up in the same struct. Returns a list of these structs.
See [docs/overview.md](./docs/overview.md) for notes and implementation details.

docs/lemmatizer.md Normal file

@@ -0,0 +1,13 @@
# Lemmatizer
The lemmatizer is still quite experimental, but will play a more important role in the project in the future.
It is a manual implementation of a [Finite State Transducer](https://en.wikipedia.org/wiki/Morphological_dictionary#Finite_State_Transducers) for morphological parsing. The FST is used to recursively remove affixes from a word until it (hopefully) deconjugates into its dictionary form. The resulting deconjugation tree is then combined with queries into the dictionary data to determine whether the deconjugation leads to a real, known word.
Each separate rule is a separate static object declared in `lib/util/lemmatizer/rules`.
There is a CLI subcommand for testing the tool interactively; you can run
```bash
dart run jadb lemmatize -w '食べさせられない'
```
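The recursive affix removal can be sketched roughly as follows. This is a minimal illustration only, with a made-up `Rule` class and a tiny, deliberately incomplete rule set; the actual rule objects live in `lib/util/lemmatizer/rules` and are richer than this.

```dart
// Minimal sketch of rule-driven deconjugation (hypothetical names,
// not the actual objects in lib/util/lemmatizer/rules).
class Rule {
  final String suffix;      // conjugated ending to strip
  final String replacement; // ending of the less-conjugated form
  const Rule(this.suffix, this.replacement);
}

// Tiny illustrative rule set (ichidan-style endings only).
const rules = [
  Rule('ない', 'る'),   // negative -> plain
  Rule('られる', 'る'), // passive/potential -> plain
  Rule('させる', 'る'), // causative -> plain
];

// Recursively apply rules, yielding every candidate lemma.
// Each rule strictly shortens the word, so this terminates.
Iterable<String> deconjugate(String word) sync* {
  yield word; // the input itself may already be a dictionary form
  for (final r in rules) {
    if (word.endsWith(r.suffix)) {
      final stem = word.substring(0, word.length - r.suffix.length);
      yield* deconjugate(stem + r.replacement);
    }
  }
}

void main() {
  // The candidate set includes the dictionary form 食べる.
  print(deconjugate('食べさせられない').toSet());
}
```

Each candidate produced this way would then be checked against the dictionary data, keeping only deconjugations that lead to real words.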

docs/overview.md Normal file

@@ -0,0 +1,27 @@
# Overview
This is the documentation for `jadb`. Since I'm currently the only one working on it, the documentation is more or less just notes to myself, to ensure I remember how and why I implemented certain features the way I did a few months down the road. It is not comprehensive, formal documentation for downstream use, neither for developers nor for end users.
- [Word Search](./word-search.md)
- [Lemmatizer](./lemmatizer.md)
## Project structure
- `lib/_data_ingestion` contains all the code for reading data sources, transforming them and compiling them into an SQLite database. This is for the most part isolated from the rest of the codebase, and should not be depended on by any code used for querying the database.
- `lib/cli` contains code for CLI tooling (e.g. argument parsing, subcommand handling, etc.)
- `lib/const_data` contains database data that is small enough to warrant being hardcoded as dart constants.
- `lib/models` contains all the code for representing the database schema as Dart classes, and for converting between those classes and the actual database.
- `lib/search` contains all the code for searching the database.
- `lib/util/lemmatizer` contains the code for lemmatization, which will be used by the search code in the future.
- `migrations` contains raw SQL files for creating the database schema.
## SQLite naming conventions
> [!WARNING]
> None of these conventions are enforced yet; this will be fixed at some point.
- Indices are prefixed with `IDX__`
- Crossref tables are prefixed with `XREF__`
- Trigger names are prefixed with `TRG__`
- Views are prefixed with `VW__`
- All data sources should have a `<datasource>_Version` table, which contains a single row with the version of the data source used to generate the database.
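As an illustration, a migration following these conventions might look like the fragment below. The table and column names are invented for this example and do not reflect the actual schema.

```sql
-- Hypothetical schema fragment illustrating the conventions above;
-- table and column names are invented for this example.
CREATE TABLE JMdict_Version (version TEXT NOT NULL);  -- <datasource>_Version

CREATE TABLE Entry (id INTEGER PRIMARY KEY, headword TEXT NOT NULL);
CREATE TABLE Tag   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Crossref table: XREF__ prefix
CREATE TABLE XREF__Entry_Tag (
  entry_id INTEGER NOT NULL REFERENCES Entry(id),
  tag_id   INTEGER NOT NULL REFERENCES Tag(id)
);

-- Index: IDX__ prefix
CREATE INDEX IDX__Entry_headword ON Entry(headword);

-- View: VW__ prefix
CREATE VIEW VW__Entry_Tags AS
SELECT e.headword, t.name
FROM Entry e
JOIN XREF__Entry_Tag x ON x.entry_id = e.id
JOIN Tag t ON t.id = x.tag_id;

-- Trigger: TRG__ prefix
CREATE TRIGGER TRG__Entry_delete AFTER DELETE ON Entry
BEGIN
  DELETE FROM XREF__Entry_Tag WHERE entry_id = OLD.id;
END;
```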

docs/word-search.md Normal file

@@ -0,0 +1,21 @@
# Word search
The word search procedure is currently split into 3 parts:
1. **Entry ID query**:
Uses a complex query with various scoring factors to get a list of
database IDs pointing at dictionary entries, sorted by how likely we think each
entry is the word that the caller is looking for. The output here is a `List<int>`.
2. **Data Query**:
Takes the entry ID list from the previous step and performs all the queries needed to retrieve
the dictionary data for those IDs. The result is a struct with a number of flattened lists
containing data for all the dictionary entries. These lists are sorted by the order in which
the IDs were provided.
3. **Regrouping**:
Takes the flattened data and regroups the items into structs with a more "hierarchical" structure.
All data tagged with the same ID ends up in the same struct. Returns a list of these structs.
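The regrouping stage in particular can be sketched as below. This is a self-contained illustration with hypothetical types and field names, not the actual search API; it only demonstrates the idea of bucketing flattened rows by entry ID while preserving the ranked order from stage 1.

```dart
// Illustrative sketch of stage 3 (regrouping); names are hypothetical.
class EntryResult {
  final int id;
  final List<String> kanji = [];
  final List<String> glosses = [];
  EntryResult(this.id);
}

/// Regroup flattened (entryId, value) rows into one struct per entry,
/// preserving the ranked order of [ids] produced by stage 1.
List<EntryResult> regroup(
  List<int> ids,
  List<(int, String)> kanjiRows,
  List<(int, String)> glossRows,
) {
  final byId = {for (final id in ids) id: EntryResult(id)};
  for (final (id, k) in kanjiRows) {
    byId[id]!.kanji.add(k);
  }
  for (final (id, g) in glossRows) {
    byId[id]!.glosses.add(g);
  }
  // Emit in the same order the IDs were provided (the ranking order).
  return [for (final id in ids) byId[id]!];
}

void main() {
  // IDs as ranked by stage 1; rows as fetched by stage 2.
  final ids = [42, 7];
  final kanji = [(42, '食べる'), (7, '水')];
  final glosses = [(42, 'to eat'), (7, 'water')];
  for (final e in regroup(ids, kanji, glosses)) {
    print('${e.id}: ${e.kanji} ${e.glosses}');
  }
}
```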