最近看到一个很奇怪的case。我们从http://graph.facebook.com/163892916997124拿到了下面的数据:
"category": "Local business",
"category_list": [
"id": "2500",
"name": "Local Business"
"checkins": 63,
"is_community_page": true,
"is_published": true,
"location": {
"street": "",
"city": "Tokyo",
"state": "Tokyo",
"country": "Japan",
"zip": "",
"latitude": 35.691766967305,
"longitude": 139.69812396398
"talking_about_count": 2,
"were_here_count": 131,
"id": "163892916997124",
"name": "\u65b0\u5bbf\u533a\udbb8\udc51\udbba\udd68Takano Fruits parlour",
"link": "http://www.facebook.com/pages/\u0025E6\u002596\u0025B0\u0025E5\u0025AE\u0025BF\u0025E5\u00258C\u0025BATakano-Fruits-parlour/163892916997124",
"likes": 5
其中的"\u65b0\u5bbf\u533a\udbb8\udc51\udbba\udd68Takano Fruits parlour"JSON字符串,将其转化成真正的字符串(处理转义),然后保存到MySQL中,这时却出现了很奇怪的问题:
Exception in thread "main" java.sql.SQLException: Incorrect string value: '\xF3\xBE\x81\x91\xF3\xBE...' for column 'name' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1078)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4120)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4052)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2503)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2664)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2815)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1379)
at json.test.MySQLTest.main(MySQLTest.java:55)
下面是示例代码(为简单起见,就不处理异常了):
String json = IOUtils.toString(new URL(
"http://graph.facebook.com/163892916997124").openStream(),
"UTF-8");
Page page = new Gson().fromJson(json, Page.class);
byte[] bytes = page.name.getBytes(UTF8);
System.out.println(DatatypeConverter.printHexBinary(bytes)); // E696B0E5AEBFE58CBAF3BE8191F3BEA5A854616B616E6F20467275697473207061726C6F7572
Class.forName(DRIVER);
Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
PreparedStatement statement = conn
.prepareStatement("insert into test(name) values (?)");
statement.setString(1, page.name);
statement.execute();
后面,我又做了一个实验,我把"\u65b0\u5bbf\u533a\udbb8\udc51\udbba\udd68Takano Fruits parlour"对应的字符串保存到文件test.csv中(文件编码是UTF8),然后使用"load data local infile "test.csv" into table test character set utf8;"来试试看能不能导入。结果如下:
mysql> load data local infile "/home/mstr/workspace/jsontest/test.csv" into table test character set utf8;
Query OK, 1 row affected, 1 warning (0.00 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
mysql> show warnings;
+---------+------+----------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+----------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF3\xBE\x81\x91\xF3\xBE...' for column 'name' at row 1 |
+---------+------+----------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select * from test;
+-----------+
| name |
+-----------+
| 新宿区 |
+-----------+
1 row in set (0.00 sec)
虽然数据导入了,但是有一个warning,跟上文Java的异常一致。而且 最后只导入了前3个字符"新宿区"(\u65b0\u5bbf\u533a)。
导致出现这个问题的原因在于"\u65b0\u5bbf\u533a\udbb8\udc51\udbba\udd68Takano Fruits parlour"使用了BMP之外的字符。"
\u65b0
\u5bbf
\u533a
\udbb8\udc51
\udbba\udd68
"包含了5个字符,3个BMP字符,2个非BMP字符(不同字符用颜色来区别)。对于非BMP字符,JSON使用的是UTF-16编码(
RFC 4627 2.5. Strings
),因此我们可以算出,这五个字符的Unicode编码分别是"U+65b0 U+5bbf U+533a U+FE051 U+FE968"。后两个non-BMP字符的UTF8编码分别需要4个字节。这个字符串对应的UTF8编码是 "
E696B0 E5AEBF E58CBA F3BE8191 F3BEA5A8 "。
由于MySQL的utf8最大只能使用3个字节,而上述要保存的数据存在4个字节的UTF8字符。因此数据就无法插入了。不过,
load data可以把数据导入,只是会把不合法的字符串和之后的所有字符串都忽略。也就是说,输入是"\u65b0\u5bbf\u533a\udbb8\udc51\udbba\udd68Takano Fruits parlour"对应的UTF8编码,但是 最后只导入了前3个字符"新宿区"(
\u65b0\u5bbf\u533a
在MySQL中,对于这样的数据,我们要使用的类型应该是 utf8mb4 (要求MySQL 5.5)。对于5.5之前版本的,根据MySQL的文档 来看,是不支持non-BMP字符的。。。
在此感谢Yu同学发现和一起讨论了这个问题。