Convert UTF-16 to UTF-8 in Python

Working with UTF-16 is not as easy as working with UTF-8

dc

19 Jul 2020 • 1 min read

Search in a UTF-16 encoded file

I got a CSV file from a service and wanted to search it for a word, but reading the lines in Python failed:

f = open('the-file.csv')
lines = f.readlines()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Yet the file opens in Microsoft Excel and looks fine there.

Finally I found that the file starts with a signature, the two bytes FF FE. After some searching I learned that this is a BOM (byte order mark), similar to the UTF-8 BOM (EF BB BF). It means the file is encoded as UTF-16, and more precisely UTF-16-LE (little-endian).
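To check for the signature without a hex viewer, one option is to read the first few bytes in binary mode and compare them with the BOM constants from the standard codecs module. This is only a minimal sketch: the helper name detect_bom is hypothetical, and it checks the UTF-8 and UTF-16 BOMs only (a UTF-32-LE file also begins with FF FE, so it would be misreported here).

```python
import codecs


def detect_bom(raw: bytes):
    """Guess a codec name from a leading BOM; return None if no BOM is found.

    Hypothetical helper: only the UTF-8 and UTF-16 BOMs are checked.
    """
    if raw.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None


# The first bytes of a file like the one described above
print(detect_bom(b"\xff\xfeL\x00i\x00"))  # utf-16-le
```

For a real file, the bytes would come from something like open('the-file.csv', 'rb').read(4).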

There are a few ways to process this file, and they are all similar.

Open with an encoding

f = open('the-file.csv', encoding='utf-16-le')
lines = f.readlines()
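One detail worth knowing: the 'utf-16-le' codec keeps the BOM in the result as a leading '\ufeff' character, while the generic 'utf-16' codec uses the BOM to pick the byte order and then drops it. Below is a rough sketch of the original goal (searching the file for a word) using 'utf-16'. The function name find_lines is made up for illustration, and the sketch assumes the file really does start with a BOM.

```python
def find_lines(path, word, encoding="utf-16"):
    """Return (line_number, line) pairs for every line that contains word.

    Hypothetical helper. With encoding="utf-16" the BOM is consumed by the
    codec, so the first line does not start with '\\ufeff'.
    """
    matches = []
    with open(path, encoding=encoding) as f:
        for number, line in enumerate(f, start=1):
            if word in line:
                matches.append((number, line.rstrip("\n")))
    return matches
```

A call such as find_lines('the-file.csv', 'Basic') would then return the matching lines together with their line numbers.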

Decode it yourself

f = open('the-file.csv', 'rb')
data = f.read().decode('utf-16-le')
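Since the goal in the title is a UTF-8 version of the file, here is one possible way to write it out, as a sketch rather than the only approach. The function name convert_utf16_to_utf8 is hypothetical. It assumes the source starts with a BOM (so the 'utf-16' codec can detect the byte order) and passes newline='' on both sides so the original line endings are copied unchanged.

```python
def convert_utf16_to_utf8(src_path, dst_path):
    """Copy a BOM-prefixed UTF-16 text file to a new UTF-8 file.

    Hypothetical helper: the 'utf-16' codec reads and drops the BOM, and
    newline="" keeps line endings (for example CRLF) exactly as they were.
    """
    with open(src_path, encoding="utf-16", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

Reading line by line keeps memory use small for large exports. For a small file, reading everything at once and writing it back would work just as well.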

Remove zeros

After reading the file into memory, the content was still not what I expected, even after I encoded the Unicode string to UTF-8:

\ufeffL\x00i\x00v\x00e\x00 \x00B\x00a\x00s\x00i\x00c\x00 \x00D\x00a\x00t\x00a\x00\t\x00\n

This is a Unicode string, since we decoded it from UTF-16, and in Python 3 every str is Unicode.

When I print it, it looks fine.

Live Basic Data

I found an answer on Stack Overflow explaining that UTF-16 uses two bytes per code unit, so when the content is ASCII, every character is followed by an extra \x00 byte (in little-endian order).
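The effect is easy to reproduce in the interpreter. This small illustration uses a made-up sample string and compares its UTF-16-LE and UTF-8 encodings:

```python
text = "Live"  # sample ASCII text, chosen only for illustration

encoded_le = text.encode("utf-16-le")
encoded_utf8 = text.encode("utf-8")

print(encoded_le)    # b'L\x00i\x00v\x00e\x00'
print(encoded_utf8)  # b'Live'

# Decoding with the matching codec gives back the original text
print(encoded_le.decode("utf-16-le"))  # Live
```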

After I removed every \x00, the text looked fine.
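A minimal sketch of that cleanup, applied to the string shown earlier. The helper name strip_bom_and_nuls is hypothetical, and the approach assumes the real content never contains a legitimate \x00 character.

```python
def strip_bom_and_nuls(text: str) -> str:
    """Drop a leading BOM character and every NUL character from text.

    Hypothetical helper: assumes '\\x00' never appears as real content.
    """
    return text.lstrip("\ufeff").replace("\x00", "")


raw = ("\ufeffL\x00i\x00v\x00e\x00 \x00B\x00a\x00s\x00i\x00c\x00 "
       "\x00D\x00a\x00t\x00a\x00\t\x00\n")
print(repr(strip_bom_and_nuls(raw)))  # 'Live Basic Data\t\n'
```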