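Chinese text encoding is a pitfall in every programming language, and Python is no exception.
A recent project required using Python to compare certain fields of two JSON files and check them for differences.
This involves loading each JSON file with json.load().
However, if the file contains garbled Chinese characters, json.load fails, so these illegal characters have to be filtered out first.
The approach I chose: delete the garbled characters, or convert them into readable characters.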
{"result":[{"dadualseganalysis":null,"info":null,"normal":{"danormalanalysis":{"ckdykzquery":null,"diyuquery":null,"dummy":0,"ecquery":null,"gssqa":null,"ip2region":null,"newsrssequery":null,"qaquery":null,"qtquery":null,"quality":null,"rare":null,"reqextension":null,"rqtquery":null,"rssequery":null,"synquery":null,"tiebatips":null,"time_new":null,"timequery":null,"zhidaotips":null},"disp_data_query_ex":null,"diyuqeryanalysis":null,"domainanalysis":null,"dummy":0,"guanlianqueryanaysisy":null,"jiucuoqueryanaysisy":null,"omitquery":null,"queries":null,"zhidaqueryanalysis":null},"orig_query":"\u7a7f ^O?"}]}
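The garbled text is at the end of the orig_query value (highlighted in red in the original post). The most annoying part is that the preceding character is a valid Chinese character, while the one right after it is garbled.
json.load raises an error on this input, so the illegal characters have to be handled first.
PS: \u7a7f is the Unicode escape form of the Chinese character 穿.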
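To make the failure concrete, here is a minimal Python 3 sketch (the script in this post targets Python 2). The sample string is a hypothetical stand-in for the orig_query value, assuming the garbled part is a raw control character such as 0x0F; the actual bytes in the file may differ.

```python
import json

# Hypothetical stand-in for the problematic record: a valid \u escape
# followed by a raw control character (0x0F) inside a JSON string.
raw = '{"orig_query": "\\u7a7f \x0f?"}'


def try_load(text):
    """Return the parsed object, or None if the JSON text is rejected."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:  # a subclass of ValueError
        return None


# Default (strict) parsing rejects raw control characters inside strings.
strict_result = try_load(raw)

# strict=False tolerates them, but the control character stays in the data.
lenient_result = json.loads(raw, strict=False)

# The \u7a7f escape decodes to the single character U+7A7F.
decoded = json.loads('"\\u7a7f"')
```

Note that strict=False only covers control characters and leaves them in the parsed value, so filtering the text before loading may still be the better choice when the result is compared field by field.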
The suggested way to handle different encodings in Python (the script below is Python 2) is:
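#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 only: relies on reload(sys), sys.setdefaultencoding and str.decode.
import os, sys
import json
import codecs

reload(sys)
sys.setdefaultencoding("utf-8")

filename = "123.txt"
desname = "123.utf8"

ff = open(filename, 'r')
data = ff.read()
ff.close()

# Convert from gb18030 to unicode: the source file is gb18030-encoded,
# and decode() turns an encoded byte string into a unicode object.
data = data.decode("gb18030", 'ignore')

# Delete illegal characters. In the repr of a unicode string, Chinese
# characters always appear as '\u' escapes, never as '\x'.
# Note: only the '\x' prefix is removed, so the two hex digits that
# followed it remain in the text as ordinary characters.
delstr = repr(data).replace('\\x', '')

# Rebuild the Python object from the filtered repr
filterstr = eval(delstr)

# Convert from unicode to utf-8
data = filterstr.encode('utf-8')

# Write to the destination file
desf = open(desname, "w")
desf.write(data)
desf.close()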
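For readers on Python 3, where reload(sys), sys.setdefaultencoding and backtick repr no longer exist, the same idea can be sketched as follows. This is a sketch under assumptions rather than a port of the script above: the helper names clean_text, decode_and_clean and convert_file are hypothetical, and instead of the repr/eval trick it drops control characters directly, keeping tab, newline and carriage return.

```python
import unicodedata


def clean_text(text):
    """Drop control characters (Unicode category Cc) except common whitespace."""
    keep = {"\t", "\n", "\r"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


def decode_and_clean(raw_bytes, src_encoding="gb18030"):
    """Decode bytes from src_encoding and strip control characters."""
    # errors="ignore" silently skips byte sequences invalid in src_encoding
    text = raw_bytes.decode(src_encoding, errors="ignore")
    return clean_text(text)


def convert_file(src_path, dst_path, src_encoding="gb18030"):
    """Read src_path, clean its text, and write the result as UTF-8."""
    with open(src_path, "rb") as src:
        cleaned = decode_and_clean(src.read(), src_encoding)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(cleaned)
    return cleaned
```

With the file names from the script above, the call would be convert_file("123.txt", "123.utf8"), and the returned text can then be passed to json.loads. Unlike the repr/eval version, this removes the offending character entirely instead of leaving its hex digits behind, and it never evaluates file content as code.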