Wolf Garbe, CEO and co-founder of SeekStorm

Korean spelling correction: Symspell을 이용한 한글 맞춤법 교정

Photo by Tsuyuri Hara

Korean spelling correction and word segmentation: Symspell을 이용한 한글 맞춤법 교정

Introduction to Spelling Correction

Spelling correction is useful in every language: for correcting typing errors in a query, OCR errors, or speech-to-text errors. Traditional spelling correction methods generate edit candidates with deletes, transposes, replaces, and inserts. Replaces and inserts are expensive and language-dependent, because every character of the alphabet has to be tried at every position. That is already costly for the English alphabet with 26 letters, but it becomes infeasible for Chinese with 70,000 Unicode Han characters and Japanese with 50,000 Kanji.

Our Symmetric Delete spelling correction algorithm, SymSpell, reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. Unlike other algorithms, SymSpell requires only deletes; no transposes, replaces, or inserts are needed. Transposes, replaces, and inserts of the input term are transformed into deletes of the dictionary term. It is six orders of magnitude faster than the standard approach with deletes, transposes, replaces, and inserts, and it is language-independent.
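The idea can be illustrated with a minimal Python sketch. This is not the SymSpell implementation: the function names (deletes, build_index, candidates) and the toy word list are assumptions made up for illustration. Both the dictionary words and the input term are reduced to their delete variants, and any shared variant yields a candidate.

```python
from collections import defaultdict
from itertools import combinations


def deletes(term, max_distance):
    """Return the term itself plus every string made by deleting up to max_distance characters."""
    variants = {term}
    for k in range(1, min(max_distance, len(term)) + 1):
        for keep in combinations(range(len(term)), len(term) - k):
            variants.add("".join(term[i] for i in keep))
    return variants


def build_index(words, max_distance):
    """Pre-calculation: map each delete variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for word in words:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def candidates(term, index, max_distance):
    """Lookup: dictionary words that share at least one delete variant with the input term."""
    found = set()
    for variant in deletes(term, max_distance):
        found |= index.get(variant, set())
    return found


# Hypothetical toy dictionary
index = build_index(["hello", "help", "world"], max_distance=1)
print(sorted(candidates("helo", index, 1)))   # replace or insert error
print(sorted(candidates("wrold", index, 1)))  # transposition error
```

A shared delete variant only proposes a candidate. A complete spell checker would still verify each candidate with a real Damerau-Levenshtein distance and rank the survivors by word frequency.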

The speed comes from the inexpensive delete-only edit candidate generation and from the pre-calculation. An average 5-letter English word has about 3 million possible spelling errors within a maximum edit distance of 3, but SymSpell needs to generate only 25 deletes to cover them all, both at pre-calculation and at lookup time. Magic!
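The figure of 25 deletes follows from simple combinatorics: a 5-letter word has 5 ways to drop one character, 10 ways to drop two, and 10 ways to drop three. The short Python check below confirms this by enumeration. The example word and the function name are arbitrary choices for illustration, and a word with repeated letters would yield fewer distinct variants.

```python
from itertools import combinations
from math import comb


def delete_variants(word, max_distance):
    """All distinct strings formed by deleting 1..max_distance characters from word."""
    variants = set()
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


word = "match"  # arbitrary example word with five distinct letters
variants = delete_variants(word, 3)
expected = sum(comb(len(word), k) for k in range(1, 4))  # 5 + 10 + 10
print(len(variants), expected)  # 25 25
```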

SymSpell has been released as open source under the permissive MIT license and can be found in its GitHub repository.

Dictionary quality is paramount for correction quality.

For the English dictionary, two data sources were combined by intersection to achieve this: Google Books Ngram data, which provides representative word frequencies (but contains many entries with spelling errors), and SCOWL (Spell Checker Oriented Word Lists), which ensures genuine English vocabulary (but contains none of the word frequencies required for ranking suggestions within the same edit distance).
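The intersection step can be sketched in a few lines of Python. The function name, the in-memory data structures, and the toy counts below are assumptions for illustration only; they are not the actual build script or real corpus figures.

```python
def build_frequency_dictionary(ngram_counts, verified_words):
    """Keep only words present in both sources, carrying over the corpus frequency."""
    verified = {word.lower() for word in verified_words}
    return {
        word: count
        for word, count in ngram_counts.items()
        if word.lower() in verified
    }


# Made-up toy data: a frequency source that also contains a misspelling,
# and a verified word list that carries no frequencies.
ngram_counts = {"the": 1000, "teh": 3, "spelling": 40}
verified_words = ["the", "spelling", "zyzzyva"]

dictionary = build_frequency_dictionary(ngram_counts, verified_words)
print(dictionary)  # {'the': 1000, 'spelling': 40}
```

The misspelling "teh" is dropped because the verified list does not contain it, and "zyzzyva" is dropped because it has no frequency to rank it by.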

Korean spelling correction: Symspell을 이용한 한글 맞춤법 교정

The beauty of open source is that it is not a one-way street.

When we released SymSpell 9 years ago, we couldn’t imagine how popular it would become: more than 2000 stars on GitHub and ports to almost every programming language. Porting to other programming languages is one thing, but compiling dictionaries for other human languages takes a native speaker to ensure the best correction quality. So we were delighted to see a port of SymSpell for the Korean language by 김희규, together with a detailed description in two accompanying blog posts:

Symspellpy-ko

Symspell을 이용한 한글 맞춤법 교정

Symspell을 이용한 한글 맞춤법 교정 2 — 복합어와 띄어쓰기 교정

Symspellpy-ko also offers a solution for phoneme decomposition, a specific challenge that the Korean language poses for word correction.
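To see why decomposition matters, consider that a Hangul syllable block is composed of two or three phonemes (jamo), so a single mistyped jamo changes the whole syllable character. The sketch below uses Unicode normalization from the Python standard library to split syllables into jamo. It is only an illustration of the idea; Symspellpy-ko may decompose text differently, and the helper names and the example typo are assumptions.

```python
import unicodedata


def decompose(text):
    """Split precomposed Hangul syllables into conjoining jamo (Unicode NFD)."""
    return unicodedata.normalize("NFD", text)


def compose(text):
    """Recombine conjoining jamo into precomposed syllables (Unicode NFC)."""
    return unicodedata.normalize("NFC", text)


correct = decompose("한글")
typo = decompose("한굴")  # hypothetical typo: one vowel of the second syllable is wrong

# At the jamo level the two strings differ in a single position,
# which is the granularity at which edit distance becomes meaningful.
differences = sum(a != b for a, b in zip(correct, typo))
print(len(correct), len(typo), differences)  # 6 6 1
print(compose(correct) == "한글")  # True
```

Correction would then run on the decomposed strings, and the chosen suggestion would be recomposed before it is shown to the user.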

The Korean port brings SymSpell’s high-performance spelling correction to the Korean language and to Korea, the fourth-largest economy in Asia, and it will improve more than just our own service.

Word segmentation

Word segmentation inserts spaces between words where they are missing. In languages written in the Latin script that is a relatively rare task: inserting spaces into a query or chat message that was hastily typed without them, correcting OCR or speech-to-text errors, or processing URLs and compound words for NLP.

In Chinese and Japanese, however, word segmentation plays a very important role in natural language processing and information retrieval, because words are written without spaces between them. To provide indexing and search, we first have to segment the continuous text into words. While in Latin-script languages we usually only have to segment the queries to insert missing spaces, in Chinese and Japanese we have to segment all documents before indexing them, and then the queries at search time. When indexing large document repositories in Chinese and Japanese, the performance of word segmentation matters a lot.

That is not the case for modern Korean, which is written with spaces between words.
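As an illustration of what word segmentation involves, here is a minimal Python sketch based on dynamic programming over a unigram frequency dictionary. It is not SymSpell's word segmentation algorithm; the function name, the penalty for unknown pieces, and the toy word counts are assumptions chosen for the example.

```python
import math


def segment(text, word_counts, max_word_length=20):
    """Split text into the most probable word sequence under a unigram model."""
    total = sum(word_counts.values())
    # best[i] = (cost, start of last word) for the cheapest segmentation of text[:i]
    best = [(0.0, 0)] + [(math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_length), end):
            piece = text[start:end]
            if piece in word_counts:
                cost = -math.log(word_counts[piece] / total)
            else:
                # Unknown pieces get a steep penalty that grows with their length
                cost = 20.0 * len(piece)
            candidate = best[start][0] + cost
            if candidate < best[end][0]:
                best[end] = (candidate, start)
    # Walk the stored split points backwards to recover the words
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return " ".join(reversed(words))


# Made-up toy frequencies
word_counts = {"the": 500, "he": 100, "brown": 30, "quick": 20, "fox": 10}
print(segment("thequickbrownfox", word_counts))  # the quick brown fox
```

The nested loop examines every possible word ending at every position, which is why the speed of segmentation becomes a real concern when entire document collections have to be processed rather than short queries.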
