unicode - Python encoding conversion

- July 15, 2010

I wrote a Python script that processes CSV files with non-ASCII characters, encoded in UTF-8. However, the encoding of the output is broken. So, from this in the input:

"d\xc4\x9bjin hornictv\xc3\xad"

I get this in the output:

"d\xe2\x99\xafjin hornictv\xc2\xa9\xc6\xaf"

Can you suggest where the encoding error might come from? Have you seen similar behaviour previously?

EDIT: I'm using the csv module from the standard library with the UnicodeWriter class featured in the docs. I use Python version 2.6.6.

EDIT 2: Code to reproduce the behaviour:

#!/usr/bin/env python
#-*- coding:utf-8 -*-

import csv
from pymarc import MARCReader # pymarc package available on PyPI: http://pypi.python.org/pypi/pymarc/2.71
from unicodewriter import UnicodeWriter # UnicodeWriter from: http://docs.python.org/library/csv.html

def getrow(tag, record):
  # Control fields carry a single value; data fields carry a list of subfields.
  if record[tag].is_control_field():
    row = [tag, record[tag].value()]
  else:
    row = [tag] + record[tag].subfields
  return row

inputfile = open("input.mrc", "r")
outputfile = open("output.csv", "wb")
reader = MARCReader(inputfile, to_unicode = True)
writer = UnicodeWriter(outputfile, delimiter = ",", quoting = csv.QUOTE_MINIMAL)

for record in reader:
  # Only process records that have a 001 field.
  if bool(record["001"]):
    tags = [field.tag for field in record.get_fields()]
    tags.sort()
    for tag in tags:
      writer.writerow(getrow(tag, record))

inputfile.close()
outputfile.close()

The input data is available here (large file).

It seems that adding the force_utf8 = True argument to the MARCReader constructor solved the problem:

reader = MARCReader(inputfile, to_unicode = True, force_utf8 = True)

According to an inspection of the source code (via inspect), with this argument the decoding seems to be done like:

string.decode("utf-8", "strict")
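To make the effect of that strict UTF-8 decode concrete, here is a minimal sketch using only the standard library and the input bytes quoted in the question. It is not pymarc's code. The latin-1 codec is just an assumed stand-in for "some wrong source encoding", chosen because it accepts every byte; the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Input bytes as quoted in the question (valid UTF-8).
raw = b"d\xc4\x9bjin hornictv\xc3\xad"

# Decoding with the correct codec recovers the intended characters,
# so re-encoding as UTF-8 reproduces the input bytes exactly.
text = raw.decode("utf-8", "strict")
roundtrip_ok = text.encode("utf-8")

# Assumption: latin-1 stands in for a wrong source encoding. Every byte
# becomes its own character, so re-encoding as UTF-8 yields different,
# longer byte sequences (the kind of damage seen in the output).
garbled = raw.decode("latin-1")
roundtrip_bad = garbled.encode("utf-8")

# "strict" also means invalid input fails loudly instead of being
# silently mangled: a lone 0xED byte is not valid UTF-8.
try:
    b"hornictv\xed".decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(roundtrip_ok))
print(repr(roundtrip_bad))
print(strict_failed)
```

Comparing the repr of the bytes before and after a decode/encode round trip like this is a quick way to see at which step the data gets damaged.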
