Convert UTF-16 to UTF-8 in Python
Working with UTF-16 is not as easy as working with UTF-8.
Search in a UTF-16 encoded file
I got a CSV file from a service and wanted to search it for some words, but reading the lines in Python failed:
f=open('the-file.csv')
lines=f.readlines()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
But I could open the file in Microsoft Excel, and it looked fine.
Eventually I found that the file starts with a two-byte signature, \xFF\xFE. After some googling, I learned that this is a BOM (byte order mark), similar to the UTF-8 BOM (EF BB BF). It means the file is encoded as UTF-16, specifically UTF-16-LE (little-endian).
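The BOM check described above can be sketched in a few lines. This is only an illustration: the helper name `sniff_bom` is made up for this post, and it deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs


def sniff_bom(first_bytes):
    """Return a codec name if the bytes start with a known BOM, else None.

    Hypothetical helper: it only knows the UTF-8 and UTF-16 BOMs.
    """
    if first_bytes.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8-sig"
    if first_bytes.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if first_bytes.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None
```

To use it on a file, read the first few bytes in binary mode, for example `open('the-file.csv', 'rb').read(4)`, and pass them to the helper.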
There are a few ways to process this file; they are all similar.
open with encoding
f = open('the-file.csv', encoding='utf-16-le')
lines = f.readlines()
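For the original goal of searching the file, a minimal sketch could look like the following. The function name `find_lines` is hypothetical. It assumes the file starts with a BOM, so it uses the generic `utf-16` codec, which reads the BOM to pick the byte order and drops it, whereas `utf-16-le` leaves the BOM in the text as `\ufeff`.

```python
def find_lines(path, word, encoding="utf-16"):
    """Return (line_number, line) pairs for the lines that contain `word`."""
    matches = []
    # newline="" keeps the original line endings so we can strip them ourselves.
    with open(path, encoding=encoding, newline="") as f:
        for number, line in enumerate(f, start=1):
            if word in line:
                matches.append((number, line.rstrip("\r\n")))
    return matches
```

Calling `find_lines('the-file.csv', 'Basic')` would then return the matching lines with their line numbers.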
decode it yourself
f = open('the-file.csv', 'rb')
data = f.read().decode('utf-16-le')
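Either approach can be extended into an actual UTF-16 to UTF-8 conversion, as in the rough sketch below. The function name `convert_utf16_to_utf8` and both path parameters are placeholders, and the sketch assumes the source file has a BOM so that the `utf-16` codec can detect the byte order and strip the BOM.

```python
def convert_utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8 (without a BOM)."""
    # newline="" on both sides leaves the original line endings untouched.
    with open(src_path, encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

Reading line by line avoids loading the whole file into memory, which can matter for large CSV exports.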
remove zeros
After reading the file into memory, I found the content was not what I expected, even after I encoded the Unicode string to UTF-8.
\ufeffL\x00i\x00v\x00e\x00 \x00B\x00a\x00s\x00i\x00c\x00 \x00D\x00a\x00t\x00a\x00\t\x00\n
This is a Unicode string, since we have decoded it from UTF-16, and in Python 3 all str objects are Unicode.
When I print it, it looks fine.
Live Basic Data
I found an answer on Stack Overflow explaining that UTF-16 uses at least two bytes to encode a character, so if the content is ASCII, each character is followed by an extra \x00 in little-endian order.
After I removed all the \x00 characters, the string looked fine.
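The cleanup step can be sketched as follows, using the sample string shown earlier. The helper name `strip_nulls` is made up for this post, and dropping the leading `\ufeff` is an extra step added here. Treat it as a workaround: NUL characters that survive decoding can be a sign that the data was decoded with the wrong codec.

```python
def strip_nulls(text):
    """Remove NUL characters and a leading BOM character from a decoded string."""
    return text.replace("\x00", "").lstrip("\ufeff")


sample = ("\ufeffL\x00i\x00v\x00e\x00 \x00B\x00a\x00s\x00i\x00c\x00"
          " \x00D\x00a\x00t\x00a\x00\t\x00\n")
cleaned = strip_nulls(sample)
print(repr(cleaned))  # 'Live Basic Data\t\n'
```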