
Unicode, ISO 639, and ISO 3166

Useful links:

  1. http://www.fileformat.info/info/unicode/category/index.htm  
  2. http://www.unicode.org/  
  3. http://www.unicode.org/charts/PDF/U4E00.pdf  
  4. http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_symbol_characters_web_page  
  5. http://www.unicode.org/charts/symbols.html#CombiningDiacriticalMarks 

 

Countries where Hindi is spoken: mainly India, with speaker communities in Nepal, Fiji, and Mauritius.
http://zh.wikipedia.org/zh-cn/%E4%BB%A5%E4%BA%BA%E5%8F%A3%E6%8E%92%E5%88%97%E7%9A%84%E8%AF%AD%E8%A8%80%E5%88%97%E8%A1%A8

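Countries and regions where Portuguese is used: Portugal, Brazil, Angola, Macau (China), Mozambique, and East Timor, plus some border areas of Spain.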

http://www.iso.org/iso/country_codes/iso_3166_code_lists/english_country_names_and_code_elements.htm
http://www.loc.gov/standards/iso639-2/langhome.html

http://msdn.microsoft.com/en-us/library/ms533052(VS.85,loband).aspx

Briefly, language codes consist of a primary code and a possibly empty series of subcodes:

        language-code = primary-code ( "-" subcode )*

Here are some sample language codes:

"en": English.
"en-US": the U.S. version of English.
"en-cockney": the Cockney version of English.
"i-navajo": the Navajo language spoken by some Native Americans.
"x-klingon": the primary tag "x" indicates an experimental language tag.

Two-letter primary codes are reserved for [ISO639] language abbreviations. Two-letter codes include fr (French), de (German), it (Italian), nl (Dutch), el (Greek), es (Spanish), pt (Portuguese), ar (Arabic), he (Hebrew), ru (Russian), zh (Chinese), ja (Japanese), hi (Hindi), ur (Urdu), and sa (Sanskrit).

Any two-letter subcode is understood to be an [ISO3166] country code.
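
The grammar above is simple enough to try out. Here is a minimal Python sketch of it, not a full language-tag parser. The function name parse_language_code is made up for illustration, and the sketch assumes that codes and subcodes consist only of ASCII letters and digits.

```python
import re

# Grammar from the text: language-code = primary-code ( "-" subcode )*
# Assumption: every code and subcode is a run of ASCII letters or digits.
_LANGUAGE_CODE = re.compile(r"^[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*$")


def parse_language_code(code):
    """Return (primary, subcodes, country) for a language code string."""
    if not _LANGUAGE_CODE.match(code):
        raise ValueError("not a valid language code: %r" % code)
    primary, *subcodes = code.split("-")
    # Any two-letter subcode is read as an ISO 3166 country code;
    # this sketch reports the first one it finds, upper-cased.
    country = next(
        (s.upper() for s in subcodes if len(s) == 2 and s.isalpha()), None
    )
    return primary.lower(), subcodes, country


for sample in ("en", "en-US", "en-cockney", "i-navajo", "x-klingon"):
    print(sample, parse_language_code(sample))
```

Only "en-US" yields a country code here; the other samples have either no subcode or a subcode that is not two letters long.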
 

Favorite URLs:

http://blog.strutta.com/blog/six-degrees-of-youtube

http://acko.net/

http://www.strutta.com/

http://www.ogre3d.org/download/source
http://acko.net/blog/making-worlds-part-1-of-spheres-and-cubes

 

Regular expressions for multiple languages: a summary

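1. GBK (GB2312/GB18030)
\x00-\xff full byte range of GBK's double-byte encoding
\x20-\x7f ASCII (printable)
\xa1-\xfe Chinese, GB2312 (both bytes)
\x81-\xfe Chinese, GBK (lead byte; the trail byte is \x40-\xfe except \x7f)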
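
To see these byte ranges on real data, here is a small Python sketch using the standard gbk and gb2312 codecs. The helper names are made up for illustration. The checks assume the usual layout: GB2312 uses 0xA1-0xFE for both bytes, and GBK uses a lead byte of 0x81-0xFE with a trail byte of 0x40-0xFE except 0x7F.

```python
def is_gb2312_double_byte(raw):
    """True if raw is two bytes that both fall in the GB2312 range."""
    return len(raw) == 2 and all(0xA1 <= b <= 0xFE for b in raw)


def is_gbk_double_byte(raw):
    """True if raw is a lead byte plus trail byte in the GBK ranges."""
    return (
        len(raw) == 2
        and 0x81 <= raw[0] <= 0xFE
        and 0x40 <= raw[1] <= 0xFE
        and raw[1] != 0x7F
    )


# "中" exists in GB2312; the traditional form "語" exists only in GBK.
for ch in ("A", "中", "語"):
    raw = ch.encode("gbk")
    print(ch, raw.hex(), is_gb2312_double_byte(raw), is_gbk_double_byte(raw))
```

ASCII stays a single byte in GBK, so both checks fail for "A". A character that GBK added on top of GB2312 passes only the GBK check.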

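2. UTF-8 (Unicode)

These ranges are Unicode code points, not UTF-8 byte values:

\u4e00-\u9fa5 (Chinese)
\u3130-\u318f (Korean, Hangul compatibility jamo)
\uac00-\ud7a3 (Korean, Hangul syllables)
\u3040-\u30ff (Japanese, Hiragana and Katakana)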
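
A rough Python sketch of how code-point ranges like these can be used to detect scripts in a string. The names SCRIPT_PATTERNS and scripts_in are made up for illustration. The sketch assumes the kana blocks U+3040-U+30FF for Japanese, and kanji will show up as "chinese" because they share the same ideograph range.

```python
import re

# One character class per script, written as Unicode code-point ranges.
SCRIPT_PATTERNS = {
    "chinese": re.compile(r"[\u4e00-\u9fa5]"),
    "korean": re.compile(r"[\u3130-\u318f\uac00-\ud7a3]"),
    "japanese_kana": re.compile(r"[\u3040-\u30ff]"),
}


def scripts_in(text):
    """Return the set of script names with at least one match in text."""
    return {name for name, pat in SCRIPT_PATTERNS.items() if pat.search(text)}


for sample in ("hello", "中文", "한국어", "ひらがな", "日本語のテキスト"):
    print(sample, sorted(scripts_in(sample)))
```

A range check like this is only a first guess: it says which blocks the characters come from, not which language the text is written in.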

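Here are the main character ranges for several non-English scripts:

2E80~33FFh: CJK symbols area. Holds the Kangxi Dictionary radicals, the CJK supplementary radicals, Bopomofo (Zhuyin) symbols, Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized letters and numbers, months, and Japanese kana combinations, units, era names, months, days, and hours.

3400~4DFFh: CJK Unified Ideographs Extension A area, holding 6,582 CJK ideographs in total (3400~4DB5h).

4E00~9FFFh: CJK Unified Ideographs area, holding 20,902 CJK ideographs in total (4E00~9FA5h).

A000~A4FFh: Yi script area, holding the syllables and radicals of the Yi script of southern China.

AC00~D7FFh: Hangul syllables area, holding the 11,172 syllables composed from Korean jamo (AC00~D7A3h).

F900~FAFFh: CJK Compatibility Ideographs area, holding 302 CJK ideographs in total.

FB00~FFFDh: presentation forms area, holding Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, halfwidth forms, fullwidth forms, and so on.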
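
The table above can be turned into a small lookup. This is a Python sketch under the assumption that the areas are exactly the ranges listed here, which are coarser than the official Unicode block list. The names BLOCKS and block_of are made up for illustration.

```python
from bisect import bisect_right

# (start, end, name) for each area in the table, sorted by start.
BLOCKS = [
    (0x2E80, 0x33FF, "CJK symbols area"),
    (0x3400, 0x4DFF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xA000, 0xA4FF, "Yi"),
    (0xAC00, 0xD7FF, "Hangul syllables"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0xFB00, 0xFFFD, "Presentation forms"),
]
_STARTS = [start for start, _, _ in BLOCKS]


def block_of(ch):
    """Return the area name for a single character, or None."""
    cp = ord(ch)
    # Find the last area whose start is <= cp, then check its end.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        start, end, name = BLOCKS[i]
        if cp <= end:
            return name
    return None


for ch in ("A", "中", "한", "あ", "Ａ"):
    print(ch, hex(ord(ch)), block_of(ch))
```

Because the list is sorted by start value, a binary search finds the candidate area in one step, and the end check handles the gaps between areas.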
 

Korean Hangul syllables (AC00~D7A3h) all lie above \u9fa5.

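// Strip every non-ASCII byte (byte mode, no /u modifier):
$str = preg_replace('/[\x80-\xff]/', '', $str);
// Strip Chinese characters from a UTF-8 string (PCRE needs \x{...} and the /u modifier):
$str = preg_replace('/[\x{4e00}-\x{9fa5}]/u', '', $str);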

In UTF-8, Chinese and Korean characters take 3 bytes each, Russian (Cyrillic) letters take 2 bytes, and ASCII letters take 1 byte.
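
The byte counts are easy to check. Here is a quick Python sketch that encodes one sample character per script and counts the bytes. The helper name utf8_length is made up for illustration.

```python
def utf8_length(ch):
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample per script: Latin, Cyrillic, Chinese, Korean (Hangul syllable).
for ch in ("A", "я", "中", "한"):
    print(ch, hex(ord(ch)), utf8_length(ch))
```

The length depends only on the code point: up to U+007F is 1 byte, up to U+07FF is 2 bytes, up to U+FFFF is 3 bytes, and anything above that is 4 bytes.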

 

// One pattern per encoding; each matches a single character.
$re['utf-8'] = "/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf4][\x80-\xbf]{3}/";
$re['gb2312'] = "/[\x01-\x7f]|[\xb0-\xf7][\xa0-\xfe]/";
$re['gbk'] = "/[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]/";
$re['big5'] = "/[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])/";
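
Patterns of this kind are typically used to cut a byte string into characters, for example before truncating it. Below is a Python sketch of that idea using bytes patterns. The names CHAR_PATTERNS and split_chars are made up for illustration. The patterns are written here with a four-byte UTF-8 lead range of \xf0-\xf4 and a fully bracketed Big5 trail-byte class.

```python
import re

# One bytes pattern per encoding; each alternative matches one character.
CHAR_PATTERNS = {
    "utf-8": re.compile(
        rb"[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]"
        rb"|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf4][\x80-\xbf]{3}"
    ),
    "gb2312": re.compile(rb"[\x01-\x7f]|[\xb0-\xf7][\xa0-\xfe]"),
    "gbk": re.compile(rb"[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]"),
    "big5": re.compile(rb"[\x01-\x7f]|[\x81-\xfe](?:[\x40-\x7e]|[\xa1-\xfe])"),
}


def split_chars(raw, encoding):
    """Split encoded bytes into a list of per-character byte strings."""
    return CHAR_PATTERNS[encoding].findall(raw)


for enc in ("utf-8", "gb2312", "gbk", "big5"):
    raw = "a中文".encode(enc)
    print(enc, [piece.hex() for piece in split_chars(raw, enc)])
```

Bytes that match none of the alternatives are skipped by findall, so this is a tolerant splitter and should not be used to validate input.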
