Sungjoo Ha

Segmenting Korean Online Text (온라인 문어체 한국어 띄어쓰기, "word spacing for written Korean online text")

Many Korean online texts are not properly segmented. One can train a structured prediction model to segment a given text automatically. Because many such texts contain slang and unconventional spelling, it is hard to find a corpus that is representative of this data and is also properly segmented.

For this project, I used data from Rigveda Wiki as the training corpus. The rationale was that this particular corpus would contain plenty of unconventional spelling and slang while still maintaining relatively proper segmentation. This intuition proved correct: the model trained on this corpus was surprisingly effective on such unconventional data.

Examples

The prediction algorithm does not make use of the current segmentation information. It assumes that every character is concatenated without any spaces in between.
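As a minimal sketch of what this means in practice, the snippet below discards the existing spaces from an input and, for training data, turns a properly segmented sentence into per-character labels. The function names and the two-label scheme (`B` for a character that begins a word, `I` for any other character) are assumptions made for illustration; the project's actual label set is not stated here.

```python
def strip_spaces(text: str) -> str:
    """Drop all whitespace so the input carries no segmentation hints."""
    return "".join(text.split())


def to_labels(segmented: str) -> list[tuple[str, str]]:
    """Pair each non-space character with 'B' (begins a word) or 'I' (inside a word).

    The B/I scheme is a hypothetical choice for this sketch.
    """
    pairs = []
    for word in segmented.split():
        for i, ch in enumerate(word):
            pairs.append((ch, "B" if i == 0 else "I"))
    return pairs


if __name__ == "__main__":
    # What the predictor would see: the same characters with every space removed.
    print(strip_spaces("나는 학교에 간다"))
    # What a training example would look like under the assumed labeling.
    print(to_labels("나는 간다"))
```

With this framing, segmentation becomes a sequence labeling problem: the model predicts one label per character, and a space is restored before each character labeled `B`.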

The original text is followed by the predicted segmentation. While not perfect, it is surprisingly good considering the rampant misspelling and use of slang.

It does fail miserably when it has to deal with wildly malformed data, though.

Methods

I used Wapiti's implementation of a linear-chain CRF. The model was trained on Korean Wikipedia and Rigveda Wiki.
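The sketch below shows one plausible way to prepare data for a CRF toolkit of this kind and to turn predicted labels back into text. It assumes a column format with one character per line followed by its label and a blank line between sentences, which is the usual convention for such tools; check Wapiti's manual for the exact format and for the feature pattern file it also needs. The `B`/`I` labels and both function names are hypothetical, not taken from the project.

```python
def to_crf_lines(sentences):
    """Render sentences as one 'char label' row per character.

    Sentences are separated by a blank line. 'B' marks a character that
    begins a word and 'I' any other character (an assumed scheme).
    """
    blocks = []
    for sentence in sentences:
        rows = []
        for word in sentence.split():
            for i, ch in enumerate(word):
                rows.append(f"{ch} {'B' if i == 0 else 'I'}")
        if rows:  # skip empty or whitespace-only sentences
            blocks.append("\n".join(rows))
    return "\n\n".join(blocks) + "\n"


def join_with_labels(chars, labels):
    """Rebuild segmented text by inserting a space before every 'B' except at position 0."""
    out = []
    for i, (ch, label) in enumerate(zip(chars, labels)):
        if label == "B" and i > 0:
            out.append(" ")
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Training side: properly segmented sentences become labeled rows.
    print(to_crf_lines(["나는 학교에 간다", "밥 먹었어"]))
    # Prediction side: labels from the model are applied to unspaced text.
    print(join_with_labels("나는간다", ["B", "I", "B", "I"]))
```

Keeping the two directions symmetric means that running `join_with_labels` on the characters and labels produced from a segmented sentence reproduces that sentence with single spaces between words.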

I have written up some brief thoughts on this project in Text Segmentation (Korean).

Acknowledgement

I thank RocketPunch for providing computational resources.