Sungjoo Ha

Finding Synonyms in Korean Online Text 온라인 문어체 유의어 찾기

A word often has many variants that mean the same thing but take different forms. People use such variants in online text, both intentionally and unintentionally. These variants can be found by building a semantic representation of words.

Analysis

I applied a semantic representation extraction algorithm to a corpus from ClubMix. ClubMix is a social network for nightlife that supports a timeline of short, tweet-like texts. The results have been edited to keep them relevant to the discussion at hand.

This method yields particularly interesting results when applied to club-related words.

A peculiar expression "홍비" appears. Exploring this particular word leads to other insights.

Notice that "홍-" probably refers to 홍대 (Hongdae). This can be inferred from another prefix, "강-", which probably refers to 강남 (Gangnam). It is interesting that people abbreviate a place by combining a regional prefix with its name.

Also note another type of prefix: "화-", "수-", and "금-". These refer to days of the week (화요일, Tuesday; 수요일, Wednesday; 금요일, Friday). This tells us that people also attach a temporal prefix to the name of a location.
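The prefix reading above can be sketched as a simple lookup. This is only an illustration: the tables contain just the prefixes mentioned in this section, the function name split_prefix is hypothetical, and matching on the first syllable alone is a naive assumption that would misfire on ordinary words that happen to start with the same syllable.

```python
# Hypothetical prefix tables, built only from the examples discussed in the text.
REGION_PREFIXES = {"홍": "홍대", "강": "강남"}
DAY_PREFIXES = {"화": "Tuesday", "수": "Wednesday", "금": "Friday"}


def split_prefix(token):
    """Return (kind, expansion, remainder) if the token starts with a known
    prefix syllable, or None otherwise. Purely a first-syllable heuristic."""
    if len(token) < 2:
        return None  # nothing left over after removing a one-syllable prefix
    head, rest = token[0], token[1:]
    if head in REGION_PREFIXES:
        return ("region", REGION_PREFIXES[head], rest)
    if head in DAY_PREFIXES:
        return ("day", DAY_PREFIXES[head], rest)
    return None


print(split_prefix("홍비"))
```

In practice such a lookup would only be a starting point for inspecting candidates; the embedding neighbours are what suggest which readings are plausible.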

Methods

I used word2vec to learn a semantic embedding of words. This embedding gives me a way to measure the similarity between different word forms. Given two latent vector representations, you can compute their similarity as the inner product of the two vectors divided by the product of their lengths, which yields the cosine similarity (for unit-length vectors, the inner product alone suffices).
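The similarity computation can be sketched in a few lines. This is a minimal illustration rather than the code used for the analysis: the function names and the toy vectors are made up, and real word2vec embeddings would have far more dimensions. The inner product is divided by the two vector lengths so that the score is a cosine similarity in the range [-1, 1].

```python
import math


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    q = embeddings[query]
    scored = [(word, cosine_similarity(q, vec))
              for word, vec in embeddings.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 3-dimensional vectors, invented for illustration only.
toy = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [0.9, 0.1, 0.0],
    "word_c": [0.0, 1.0, 0.0],
}

print(most_similar("word_a", toy, topn=2))
```

Candidate synonyms for a word are then simply its highest-ranked neighbours under this score.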

Acknowledgement

I thank RocketPunch for providing computational resources and the corpus data.