A word often has many variants that mean the same thing but take different forms. People use such variants, both intentionally and unintentionally, in online text. One way to find them is to build a semantic representation of words.
I applied a semantic representation extraction algorithm to a corpus from ClubMix, a nightlife social network that supports a tweet-like timeline of short texts. The results shown here have been edited to keep them relevant to the discussion at hand.
ㅠㅠ has similar words such as
^^ has similar words such as
쩐다 has similar words such as
This method yields particularly interesting results when applied to club-related words.
A peculiar expression "홍비" appears. Exploring this particular word leads to other insights.
Notice that "홍-" probably refers to 홍대 (Hongdae). This can be inferred from another prefix, "강-", which probably refers to 강남 (Gangnam). It is interesting that people abbreviate a location by combining a regional prefix with its name.
Also note another type of prefix: "화-", "수-", and "금-". These refer to days of the week (Tuesday, Wednesday, and Friday). This tells us that people also attach temporal prefixes to the name of a location.
I used word2vec to learn a semantic embedding of words. This embedding gives me a way to measure the similarity between different word forms. Given two latent vector representations, you can compute their similarity as the inner product of the two vectors divided by the product of their lengths, which is the cosine similarity (for unit-length vectors, the inner product alone suffices).
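As a minimal sketch of that similarity computation, the snippet below implements cosine similarity and a simple nearest-neighbor ranking in plain Python. The names cosine_similarity, most_similar, and toy_embeddings are hypothetical, and the three-dimensional vectors are made up for illustration; they are not taken from the ClubMix corpus or from the model used here.

```python
import math


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, topn=3):
    """Rank every word other than `query` by cosine similarity to it."""
    q = embeddings[query]
    scored = [
        (word, cosine_similarity(q, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors, invented for illustration only.
toy_embeddings = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [0.9, 0.1, 0.0],
    "word_c": [0.0, 1.0, 0.0],
}

for word, score in most_similar("word_a", toy_embeddings):
    print(word, round(score, 3))
```

With real embeddings, the dictionary would map each token in the corpus to its learned vector, and the same ranking would surface the variant forms of a query word.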
I thank RocketPunch for providing the computational resources and the corpus data.