- Fix pulling data sources
  - JMDict
    - Ingest data
  - Tatoeba
    - Ingest data
    - Disambiguate and connect to JMDict senses
  - NHK News
    - Ingest data
    - Disambiguate
      This should be done through a combination of MeCab and
      Levenshtein distance to the sense glosses. (Please mention
      in the report that dropping the cases that are still
      ambiguous might be bad, because there might be a pattern to
      them: single words might have lots of similar glosses and be
      marked as very rare as a result.)
- TF-IDF
- Test out weight combinations
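
A rough sketch of the Levenshtein half of the disambiguation step, to make the ambiguity concern concrete. Everything here is an assumption for illustration: `pick_sense`, the `margin` parameter, and the idea of comparing a candidate English string against the English glosses of each sense are hypothetical, and the MeCab tokenization step is left out entirely.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pick_sense(candidate: str, glosses: list[str], margin: int = 2):
    """Return the index of the closest gloss, or None if the case stays ambiguous.

    Hypothetical rule: the best gloss must beat the runner-up by at least
    `margin` edits; otherwise the case is treated as ambiguous.
    """
    if not glosses:
        return None
    ranked = sorted(
        (levenshtein(candidate.lower(), gloss.lower()), index)
        for index, gloss in enumerate(glosses)
    )
    if len(ranked) > 1 and ranked[1][0] - ranked[0][0] < margin:
        return None  # still ambiguous: count these rather than silently dropping them
    return ranked[0][1]
```

Words whose senses have near-identical glosses will return `None` more often under a rule like this, which is exactly the pattern worth counting and mentioning in the report.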
Some notes:
- Sentence length cost should probably increase exponentially.
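
One possible shape for an exponentially increasing sentence length cost, as a sketch only. The `target` length and the growth `base` are made-up placeholder values, and measuring length in tokens is an assumption; none of them come from the notes above.

```python
def length_cost(n_tokens: int, target: int = 12, base: float = 1.15) -> float:
    """Cost is 0 up to `target` tokens, then grows exponentially with the excess.

    `target` and `base` are hypothetical tuning knobs, not values from the notes.
    """
    excess = max(0, n_tokens - target)
    return base ** excess - 1.0
```

Subtracting 1 keeps the cost at exactly zero for sentences at or below the target length, so the term only matters once a sentence runs long.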