TDT4310-project-sorted-japa.../todo.md

- [ ] Fix pulling data sources
- [X] JMDict
  - [X] Ingest data
- [ ] Tatoeba
  - [X] Ingest data
  - [ ] Disambiguate and connect to JMDict senses
- [ ] NHK News
  - [X] Ingest data
  - [ ] Disambiguate
    This should be done through a combination of MeCab and
    Levenshtein distance to the sense glosses. Please mention in the
    report that dropping the cases that are still ambiguous might be
    bad, because there might be a pattern to them: single words
    might have lots and lots of similar glosses, and be marked as
    very rare as a result.
  - [ ] TF-IDF
- [ ] Test out weight combinations
  Some notes:
  - Sentence length cost should probably increase exponentially.
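
One possible shape for the sentence length cost, as a rough sketch only. The function name `length_cost`, the `target` length, and the `rate` constant are all assumptions made for illustration; none of them come from the project, and the real values would have to fall out of the weight-combination tests.

```python
import math


def length_cost(n_tokens: int, target: int = 12, rate: float = 0.15) -> float:
    """Hypothetical length penalty: zero up to `target` tokens, then exponential.

    `target` and `rate` are placeholder values, not tuned ones.
    """
    excess = max(0, n_tokens - target)
    # exp(0) - 1 == 0, so sentences at or below the target cost nothing.
    return math.exp(rate * excess) - 1.0


if __name__ == "__main__":
    for n in (8, 12, 16, 20, 30):
        print(n, round(length_cost(n), 3))
```

Subtracting 1 keeps short sentences at zero cost, so this term only starts competing with the other weights once a sentence runs past the target length.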
Initial commit 2024-04-26 00:46:44 +02:00
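
A minimal sketch of the Levenshtein half of the disambiguation item, under several assumptions. The to-do does not say what text the glosses are compared against, so `hint` here is a hypothetical reference string (for example a translation or a dictionary hint). The MeCab step is left out: the lemma is assumed to have been looked up already, and `senses` is a hypothetical mapping from sense ID to that lemma's glosses. The `margin` threshold is also invented for illustration.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def pick_sense(hint: str, senses: dict[str, list[str]], margin: int = 1) -> Optional[str]:
    """Return the sense whose closest gloss is nearest to `hint`.

    Returns None when the best two senses are within `margin` of each
    other, i.e. the case the to-do calls "still ambiguous".
    """
    scored = sorted(
        (min(levenshtein(hint.lower(), g.lower()) for g in glosses), sense_id)
        for sense_id, glosses in senses.items()
        if glosses
    )
    if not scored:
        return None
    if len(scored) > 1 and scored[1][0] - scored[0][0] < margin:
        return None  # ambiguous: the caller decides whether to drop it
    return scored[0][1]
```

Returning `None` instead of dropping the word inside the function keeps the ambiguous cases countable, which would make it easier to check for the pattern the note worries about before discarding anything.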