unicode - Python encoding conversion
I wrote a Python script that processes CSV files with non-ASCII characters, encoded in UTF-8. However, the encoding of the output is broken. So, from this in the input:
"d\xc4\x9bjin hornictv\xc3\xad"
I get this in the output:
"d\xe2\x99\xafjin hornictv\xc2\xa9\xc6\xaf"
Can you suggest where the encoding error might come from? Have you seen similar behaviour previously?
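To make the two byte strings above easier to compare, here is a small sketch that decodes both as UTF-8 and lists the non-ASCII code points. It is written for Python 3 (the question itself uses 2.6.6), and the helper name `code_points` is only illustrative. Both strings turn out to be valid UTF-8; they simply encode different characters, which suggests the damage happens when the input bytes are decoded, not when the output is written.

```python
# Byte strings quoted in the question.
expected = b"d\xc4\x9bjin hornictv\xc3\xad"
broken = b"d\xe2\x99\xafjin hornictv\xc2\xa9\xc6\xaf"

# Both sequences are well-formed UTF-8, so neither decode raises an error.
expected_text = expected.decode("utf-8")
broken_text = broken.decode("utf-8")


def code_points(text):
    """Return the non-ASCII characters of text as U+XXXX labels (illustrative helper)."""
    return ["U+%04X" % ord(ch) for ch in text if ord(ch) > 127]


print(code_points(expected_text))  # ['U+011B', 'U+00ED']
print(code_points(broken_text))    # ['U+266F', 'U+00A9', 'U+01AF']
```

The broken output contains U+266F (musical sharp), U+00A9 (copyright sign) and U+01AF (U with horn) where U+011B and U+00ED were expected. This pattern appears consistent with the UTF-8 lead bytes 0xC4, 0xC3 and 0xAD being read as single characters of another character set (possibly MARC-8) before being re-encoded, though that is an inference from the bytes and not something the question states.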
EDIT: I'm using the csv standard library with the UnicodeWriter class featured in the docs. I use Python version 2.6.6.
EDIT 2: The code to reproduce the behaviour:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv
from pymarc import MARCReader  # pymarc package available on PyPI: http://pypi.python.org/pypi/pymarc/2.71
from unicodewriter import UnicodeWriter  # UnicodeWriter from: http://docs.python.org/library/csv.html

def getrow(tag, record):
    # Control fields carry a single value; data fields carry a list of subfields.
    if record[tag].is_control_field():
        row = [tag, record[tag].value()]
    else:
        row = [tag] + record[tag].subfields
    return row

inputfile = open("input.mrc", "r")
outputfile = open("output.csv", "wb")
reader = MARCReader(inputfile, to_unicode = True)
writer = UnicodeWriter(outputfile, delimiter = ",", quoting = csv.QUOTE_MINIMAL)

for record in reader:
    if bool(record["001"]):
        tags = [field.tag for field in record.get_fields()]
        tags.sort()
        for tag in tags:
            writer.writerow(getrow(tag, record))

inputfile.close()
outputfile.close()
The input data is available here (large file).
It seems that adding the force_utf8 = True argument to the MARCReader constructor solved the problem:
reader = MARCReader(inputfile, to_unicode = True, force_utf8 = True)
According to an inspection of the source code (via inspect), the force_utf8 argument makes the reader do something like:
string.decode("utf-8", "strict")
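As a rough illustration of what a strict UTF-8 decode does, here is a minimal sketch in Python 3 syntax. It uses only the built-in `bytes.decode` and does not call pymarc; the sample byte strings and variable names are assumptions made for the example. With "strict", well-formed input decodes to the expected text, and malformed input raises `UnicodeDecodeError` instead of silently producing wrong characters.

```python
# Well-formed UTF-8, taken from the input shown in the question.
raw = b"d\xc4\x9bjin hornictv\xc3\xad"
decoded = raw.decode("utf-8", "strict")

# The text survives a round trip back to the same bytes.
assert decoded.encode("utf-8") == raw

# Malformed UTF-8 (illustrative): the last multi-byte sequence is cut short.
truncated = b"hornictv\xc3"
error_caught = False
try:
    truncated.decode("utf-8", "strict")
except UnicodeDecodeError as exc:
    # "strict" refuses to guess and reports the offending position instead.
    error_caught = True
    print("decode failed at byte offset %d" % exc.start)
```

Because the decode is strict, a file that is not really UTF-8 would fail loudly when read with force_utf8 = True, which is arguably preferable to the silent corruption shown in the question.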