Just published:
Supplementing CEFR-graded vocabulary lists for language learners by leveraging information on dictionary views, corpus frequency, part-of-speech, and polysemy
A machine-learning method to suggest word candidates for CEFR-graded vocabulary lists.
https://doi.org/10.1057/s41599-025-05446-y
- We compare 4 machine-learning algorithms: regression trees, ordinal logistic regression, random forests, & naïve Bayes
- All four beat a random baseline (approx. double the accuracy)
- Of these, we use random forests (2k trees) to impute the #CEFR level of previously unlabeled words
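
For anyone curious what this kind of workflow looks like in practice, here is a minimal sketch using scikit-learn. It is not the paper's code: the data are synthetic, the feature names (`log_views`, `log_freq`, `pos`, `n_senses`) are hypothetical stand-ins for the four information sources in the title, and the six-level A1–C2 label set and uniform-random baseline are assumptions. Only the 2,000-tree forest size comes from the post.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
LEVELS = np.array(["A1", "A2", "B1", "B2", "C1", "C2"])  # assumed label set

# Synthetic toy data (NOT the paper's data): a hidden "difficulty" score
# drives all features, so easier words get more views, higher frequency,
# and more senses.
n = 600
difficulty = rng.uniform(0, 1, n)
log_views = 5 - 3 * difficulty + rng.normal(0, 0.5, n)   # dictionary views
log_freq = 4 - 3 * difficulty + rng.normal(0, 0.5, n)    # corpus frequency
pos = rng.integers(0, 4, n)                              # part of speech, integer-coded
n_senses = rng.poisson(1 + 4 * (1 - difficulty)) + 1     # polysemy
X = np.column_stack([log_views, log_freq, pos, n_senses])
y = LEVELS[np.minimum((difficulty * 6).astype(int), 5)]

# Treat the first 500 words as already graded and the last 100 as unlabeled.
X_lab, y_lab = X[:500], y[:500]
X_unlabeled = X[500:]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_lab, y_lab, test_size=0.2, random_state=0, stratify=y_lab
)

# Random baseline: guesses a level uniformly at random.
baseline = DummyClassifier(strategy="uniform", random_state=0).fit(X_tr, y_tr)
# Random forest with 2,000 trees, matching the "2k trees" in the post.
forest = RandomForestClassifier(n_estimators=2000, random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)

# Impute CEFR levels for the words that had no label.
imputed = forest.predict(X_unlabeled)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"forest accuracy:   {forest_acc:.2f}")
print("first imputed levels:", list(imputed[:5]))
```

The accuracies printed here only reflect the toy data and say nothing about the published results. With real inputs, the imputed levels would presumably be treated as candidates for review, not as final gradings.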