Segmenting Korean Online Text 온라인 문어체 한국어 띄어쓰기

Many Korean online texts are not properly segmented. One can train a structured prediction model to automatically segment a given text. Since many of such online texts contain slangs and unconventional spelling, it is not easy to come up with a good corpus that represents such data but is properly segmented.

For this project, I've used the data from Rigveda Wiki as the training corpus. The rationale was that this particular corpus would contain many unconventional spelling and slangs but still maintain relatively proper segmentation. This intuition proved correct and the model trained on this corpus was surprisingly effective when it comes to dealing with such unconventional data.

Examples

The prediction algorithm does not make use of the current segmentation information. It assumes that every character is concatenated without any spaces in between.

- 그앉아서먹는게 아예없서젓단 얘기가있어서
- 그 앉아서 먹는게 아예 없서 젓단 얘기가 있어서
- 마망! 티께팅성공하댜!
- 마망! 티께팅 성공하댜!
- 이와이즈미가핵멋잇고왕귀엽답ㄴ대
- 이와이즈미가 핵 멋잇고 왕귀엽답 ㄴ대
- 보드가박살날정도로 어디에 박은거야ㄷㄷ괜찮?
- 보드가 박살날 정도로 어디에 박은거야 ㄷㄷ 괜찮?

The original text is followed by the predicted segmentation. While not perfect, it is surprisingly good considering the rampant misspelling and use of slangs.

- 엘에이ㅠ.ㅠㅠㅠㅠㅠㅠㅠ솔딕히 정ㅁ말 이러케 힘든 콘서트는 달라스 일꺼라 샌각핵는데 더 힘든 엘에이가 이써꼭ㄱ 구ㅠㅜ근데ㅠㅝ찬열이 너무 이뻐ㅓ
- 엘에이ㅠ.ㅠㅠㅠㅠㅠㅠㅠ 솔딕히 정ㅁ 말이 러케 힘든 콘서트는 달라스 일꺼라 샌각 핵는데 더 힘든 엘에이가이 써꼭 ㄱ구ㅠㅜ 근데 ㅠㅝ 찬열이 너무 이뻐ㅓ

It does fail miserably when it has to deal with wildly malformed data, though.

Methods

I've used Wapiti's implementation of linear-chain CRF. The training was done using Korean Wikipedia and Rigveda Wiki as training corpus.

I have written a short thoughts about this project on Text Segmentation (Korean).

Acknowledgement

I thank RocketPunch for supporting computational resources.