- [ ] Fix pulling data sources
- [X] JMDict
  - [X] Ingest data
- [ ] Tatoeba
  - [X] Ingest data
  - [ ] Disambiguate and connect to JMDict senses
- [ ] NHK News
  - [X] Ingest data
  - [ ] Disambiguate
    This should be done by combining MeCab with Levenshtein
    distance to the sense glosses. (Please mention in the report
    that dropping the cases that are still ambiguous might be bad,
    because there might be a pattern to them: single words might
    have lots of similar glosses and be marked as very rare as a
    result.)
- [ ] TF-IDF
- [ ] Test out weight combinations
  Some notes:
  - Sentence length cost should probably increase exponentially.
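
A minimal sketch of the Levenshtein part of the disambiguation step, not the project's actual implementation. It assumes the word has already been lemmatized (e.g. by MeCab, which is not called here), that an English translation of the sentence is available, and that each candidate sense has at least one gloss. The names `levenshtein`, `pick_sense`, and `margin` are hypothetical. Senses that cannot be separated by at least `margin` edits are dropped by returning `None`, which is exactly the case the note above says to flag in the report.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def pick_sense(translation: str, senses: dict[str, list[str]], margin: int = 1):
    """Return the id of the sense whose gloss is closest to a word in the
    translation, or None when the best two senses are too close to call.

    `senses` maps a (hypothetical) sense id to its non-empty list of glosses.
    Only single translation words are compared, so multi-word glosses
    will match poorly in this sketch.
    """
    words = translation.lower().split()
    if not words or not senses:
        return None
    scored = sorted(
        (min(levenshtein(g.lower(), w) for g in glosses for w in words), sense_id)
        for sense_id, glosses in senses.items()
    )
    # Drop the example if the runner-up is within `margin` edits of the best.
    if len(scored) > 1 and scored[1][0] - scored[0][0] < margin:
        return None
    return scored[0][1]
```

Counting how many examples return `None` per word would give a quick check on whether the dropped cases follow a pattern.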