The readability index indicates the difficulty level of a text. It can be used in various fields, such as book recommendation, writing-ability evaluation, personalized recommendation, online bot detection, and even fake news analysis. Traditional readability models rely on simple linguistic features combined with simple regression models. In recent years, readability research using deep learning models has begun to appear. In Korean, however, readability research remains scarce: unlike English, there are no public datasets or automated baseline models. The existing Korean readability indexes were developed with simple regression models, evaluated on very small datasets, and in some cases not evaluated with standard metrics at all.
Therefore, we propose KRIT, a novel Korean readability index model that considers both grammatical structure and lexical meaning, built on a transformer encoder with a transformer-based pretrained language model (BERT) for Korean. For the dataset, we use 25,449 sentences from Korean textbooks written for ages 8-16, grouped into four grade-level classes. We compare KRIT against existing Korean and English readability models and demonstrate that it outperforms all baselines, achieving an accuracy of 0.746 and an MAE of 0.327. To our knowledge, this is the first attempt to apply deep learning NLP techniques (pretrained word embeddings and a transformer encoder architecture) to Korean readability assessment and to evaluate them on a sufficiently large dataset.
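As a minimal illustration of the evaluation setup described above, accuracy and MAE can both be computed over ordinal grade-level labels; the labels below are hypothetical examples, not the paper's data:

```python
# Sketch: accuracy and MAE for ordinal grade-level predictions.
# Grade classes are encoded as integers 0-3 (four levels).

def accuracy(y_true, y_pred):
    # Fraction of exact class matches.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # Mean absolute distance between predicted and true grade levels;
    # penalizes off-by-two errors more than off-by-one errors.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels for six sentences.
y_true = [0, 1, 2, 3, 2, 1]
y_pred = [0, 1, 3, 3, 2, 0]
print(accuracy(y_true, y_pred))  # 4/6 ≈ 0.667
print(mae(y_true, y_pred))       # 2/6 ≈ 0.333
```

Reporting MAE alongside accuracy is natural here because grade levels are ordered: predicting an adjacent grade is a smaller error than predicting a distant one.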