Finding Synonyms in Korean Online Text 온라인 문어체 유의어 찾기

Many words have variant forms that mean the same thing but are written differently. People use such variants in online text, both intentionally and unintentionally. These variants can be found by building a semantic representation of words.

Analysis

I applied a semantic representation extraction algorithm to a corpus from ClubMix. ClubMix is a social network for nightlife that supports a timeline of short, tweet-like posts. The results have been edited to keep them relevant to the discussion at hand.

ㅠㅠ has similar words such as
- ㅜㅜ
- ㅠ
- ㅠㅡㅠ
- ㅠㅠㅠ
- ㅠㅜ

^^ has similar words such as
- ^_^
- ^^*
- ^.^
- ^-^

쩐다 has similar words such as
- 쩌러
- 개쩐다
- 쩐당
- 쩔어
- 개쩜

This method yields particularly interesting results when applied to club-related words.

엔비 has similar words such as
- 엔비투
- 엠비
- nb
- 홍대엔비
- 홍비

One peculiar expression, "홍비", appears in this list. Exploring it leads to further insights.

홍비 has similar words such as
- 홍렘
- 강비
- 홍쿤
- 강렘
- 수매스
- 화쿤
- 금강비

Notice that "홍-" probably refers to 홍대 (Hongdae). This can be inferred from the parallel prefix "강-", which probably refers to 강남 (Gangnam). It is interesting that people abbreviate a venue by combining a regional prefix with a shortened form of the venue's name; 홍비, for example, appears to combine 홍대 with 엔비.

Also note another type of prefix: "화-", "수-", and "금-". These refer to days of the week (화요일, 수요일, and 금요일, that is, Tuesday, Wednesday, and Friday). This tells us that people also attach a temporal prefix to the name of a venue.
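
The prefix readings above can be sketched as a small decomposition routine. This is only an illustration: the function name `split_prefixes` and both lookup tables are hypothetical, built from the guesses in this section rather than from any verified lexicon. It also assumes each token is stored as precomposed Hangul syllables, so that one character is one syllable.

```python
# Hypothetical prefix tables, taken from the readings guessed in this section.
REGION_PREFIXES = {"홍": "홍대", "강": "강남"}
DAY_PREFIXES = {"화": "Tuesday", "수": "Wednesday", "금": "Friday"}


def split_prefixes(token):
    """Peel an optional day prefix, then an optional region prefix, off a token.

    Assumes precomposed (NFC) Hangul, so token[0] is a whole syllable.
    A prefix is only removed when something is left over to act as the venue.
    """
    day = None
    region = None
    rest = token
    if len(rest) > 1 and rest[0] in DAY_PREFIXES:
        day = DAY_PREFIXES[rest[0]]
        rest = rest[1:]
    if len(rest) > 1 and rest[0] in REGION_PREFIXES:
        region = REGION_PREFIXES[rest[0]]
        rest = rest[1:]
    return {"day": day, "region": region, "venue": rest}


for token in ["홍비", "강렘", "수매스", "화쿤", "금강비"]:
    print(token, split_prefixes(token))
```

A rule this naive over-matches: any ordinary word that happens to start with one of these syllables (강남 itself, for instance) would be split as well, so in practice it would need to be restricted to tokens that the embedding already places near known venue names.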

Methods

I used word2vec to learn a semantic embedding of words. This embedding provides a way to measure the similarity between different word forms. Given two latent vector representations, you can compute their similarity as the inner product of the two vectors divided by the product of their lengths, which yields the cosine similarity.
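
The similarity lookup can be sketched in plain Python as follows. This is a minimal sketch, not the code used for the results above: the helper names `cosine_similarity` and `most_similar` are hypothetical, and the three-dimensional vectors are made up for illustration, whereas real word2vec embeddings are learned from the corpus and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero-length vectors
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, topn=5):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors for illustration only; they are not learned from any corpus.
toy_embeddings = {
    "ㅠㅠ": [0.9, 0.1, 0.0],
    "ㅜㅜ": [0.8, 0.2, 0.1],
    "^^": [0.0, 0.9, 0.3],
    "쩐다": [0.1, 0.2, 0.9],
}

neighbours = most_similar("ㅠㅠ", toy_embeddings, topn=2)
for other, score in neighbours:
    print(other, round(score, 3))
```

With a trained model, a library would normally do this ranking for you; in gensim, for example, a `Word2Vec` model exposes it as `model.wv.most_similar(word, topn=5)`.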

Acknowledgement

I thank RocketPunch for providing computational resources and the corpus data.