Encoding in Python for Korean, utf-8-sig

2 min read

For many data analysis practitioners working with non-Western characters, encoding issues are a never-ending headache. The reason is that characters that look identical to the human viewer are not necessarily interpreted as the same thing by the machine.

Why the issue?

Due to the early de-facto standard set by Microsoft Windows and Office, euc-kr or cp949 has been the norm for Korean text. The two are technically different, but since cp949 is a superset of euc-kr, euc-kr can safely be used in many cases.
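The relationship can be checked directly in Python; the sketch below is a rough illustration, assuming only the standard library codecs, where Python's euc-kr codec follows the older KS X 1001 table.

```python
# A rough sketch of the superset relationship between euc-kr and cp949.
common = "한글"     # syllables present in both tables
rare = "똠"          # a syllable covered only by the extended cp949 table

# For the common range, cp949 produces the same bytes as euc-kr.
print(common.encode("euc-kr") == common.encode("cp949"))   # True

print(rare.encode("cp949"))          # fine under cp949
try:
    rare.encode("euc-kr")            # not representable in plain euc-kr
except UnicodeEncodeError:
    print("euc-kr cannot encode this syllable")
```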

However, this very de-facto standard confuses the machine, because these encodings are not part of the Unicode system.

The most widely used Unicode encoding, utf-8, cannot decode files written with them, and complains.
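A minimal illustration: the same two syllables survive a cp949 round trip, but a utf-8 decoder rejects the resulting bytes.

```python
# Bytes produced by the legacy Korean encoding are not valid utf-8.
text = "한글"
raw = text.encode("cp949")           # b'\xc7\xd1\xb1\xdb'

print(raw.decode("cp949"))           # works: 한글
try:
    raw.decode("utf-8")              # this is where utf-8 "complains"
except UnicodeDecodeError as err:
    print("utf-8 refused the cp949 bytes:", err)
```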

This is mainly a problem for cross-platform users, or for those who load Microsoft Excel-produced files (e.g., csv, xls, or xlsx) in Python, because files created on non-Windows systems are typically utf-8-encoded by default.
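In practice this usually surfaces when reading a csv saved on Korean Windows. A hedged sketch, assuming pandas is installed and a hypothetical file report.csv that came from Excel:

```python
import pandas as pd

# pd.read_csv defaults to utf-8 and would raise UnicodeDecodeError here;
# naming the legacy encoding explicitly makes the file readable.
df = pd.read_csv("report.csv", encoding="cp949")
print(df.head())
```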

How to solve it?

So the problem lies in how Windows and the Microsoft Office suite handle Korean character encodings. The solution then seems simple: can we just encode everything with utf-8 and use it on Windows? Not really. If you have ever opened a csv file produced by Python in Microsoft Excel on Windows, you probably could not recognize what the document said.

The solution from the Python side is the utf-8-sig encoding option, which prepends a byte-order mark (BOM) to the file. Microsoft Excel detects the BOM and reads the document correctly.
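A small sketch of the writing side, using only the standard library; menu.csv and its contents are made-up examples:

```python
import csv

rows = [["메뉴", "가격"], ["김치찌개", 9000], ["비빔밥", 10000]]

# utf-8-sig prepends the byte-order mark b'\xef\xbb\xbf'; Excel sees it
# and decodes the file as utf-8 instead of the legacy ANSI code page.
with open("menu.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)
```

With pandas, the same effect comes from passing encoding="utf-8-sig" to to_csv.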

Then, is there a solution in the opposite direction, that is, making documents created in Excel readily readable in Python?

Sure!

In Excel, save documents with the "CSV UTF-8" option rather than the usual csv option. Likewise, in Notepad, change the encoding from ANSI to UTF-8 (with BOM, where your version offers the choice), which corresponds to utf-8-sig in Python.
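Reading such a file back is symmetric; a sketch, reusing the hypothetical menu.csv from above:

```python
# Files saved with a BOM (Excel's "CSV UTF-8", Notepad's UTF-8 with BOM)
# start with '\ufeff' if decoded as plain utf-8; utf-8-sig strips it.
with open("menu.csv", encoding="utf-8-sig") as f:
    header = f.readline().strip().split(",")

print(header)   # ['메뉴', '가격'] with no stray '\ufeff' on the first name
```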

CC BY-NC 4.0 © min park