- [ ] Fix pulling data sources
- [X] JMDict
  - [X] Ingest data
- [ ] Tatoeba
  - [X] Ingest data
  - [ ] Disambiguate and connect to JMDict senses
- [ ] NHK News
  - [X] Ingest data
  - [ ] Disambiguate
    This should be done through a combination of MeCab and
    Levenshtein distance to the sense glossary (although, please
    mention in the report that dropping the ones that are still
    ambiguous might be bad, because there might be a pattern to
    them: single words might have lots and lots of similar glosses
    and be marked as very rare as a result).
- [ ] TF-IDF
- [ ] Test out weight combinations
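
A rough sketch of the Levenshtein half of the disambiguation step, to make the "still ambiguous" case concrete. Everything here is an assumption rather than the project's actual code: `pick_sense`, the `senses` mapping (sense id to gloss strings), and `context_words` (some English words to compare glosses against, since the note does not say what the glossary is compared to) are hypothetical, and MeCab lemmatization is assumed to have already happened upstream.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pick_sense(senses: dict[str, list[str]],
               context_words: list[str]) -> Optional[str]:
    """Return the sense id whose closest gloss is nearest to any context word.

    Returns None when there is no data or when several senses tie, so the
    caller can count the ambiguous cases instead of silently dropping them.
    """
    scores = {}
    for sense_id, glosses in senses.items():
        distances = [levenshtein(g.lower(), w.lower())
                     for g in glosses for w in context_words]
        if distances:
            scores[sense_id] = min(distances)
    if not scores:
        return None
    best = min(scores.values())
    winners = [s for s, d in scores.items() if d == best]
    return winners[0] if len(winners) == 1 else None
```

Returning `None` on ties keeps the ambiguous words visible, which matters for the report: a word with many near-identical glosses would tie often and could look rarer than it is.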
Some notes:
- Sentence length cost should probably increase exponentially.
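
One possible shape for such a cost, as a minimal sketch. The function name `length_cost` and the `rate` parameter are made up for illustration, and the value of `rate` is an arbitrary placeholder that would need tuning alongside the other weights.

```python
import math


def length_cost(n_tokens: int, rate: float = 0.2) -> float:
    """Cost that is 0 for an empty sentence and grows exponentially with length.

    Uses exp(rate * n) - 1, so each extra token multiplies (cost + 1) by exp(rate).
    """
    return math.expm1(rate * n_tokens)
```

With this form, going from 10 to 20 tokens costs far more than going from 0 to 10, which is the behaviour the note asks for.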