Tag Archive: xml

VTD-XML: A Powerful Tool for Processing XML Files

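A recent project required parsing and processing very large XML files. In PHP, XML can be parsed with regular expressions or with the built-in XML DOM library; in Java, it can be parsed with DOM or with the more efficient SAX. However, I found something better: VTD-XML, which fully supports XPath queries.
The VTD-XML project home page calls it the future of XML processing and describes it as well suited to handling large volumes of XML in SOA and cloud computing.
The advantages of VTD-XML, as claimed by the project:
The world's most memory-efficient XML parser
The world's fastest XML parser, 5 to 12 times faster than DOM
The world's fastest XPath evaluation
The world's only XML parser that supports incremental updates
Can run XPath queries over XML files up to 256 GB in size
Supports Java, C, C++, and C#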

An example of an XPath query with VTD-XML:

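import com.ximpleware.*;
import java.io.*;

VTDGen vg = new VTDGen();
// The second argument turns on namespace awareness; parseFile returns true on success
if (vg.parseFile("blog.eood.cn.xml", true)) {
	VTDNav vn = vg.getNav();
	AutoPilot ap = new AutoPilot(vn);
	XMLModifier xm = new XMLModifier(vn);
	// Test whether /a has a child element that itself has a child <a> with
	// content ACONTENT and a child <b> with content BCONTENT
	ap.selectXPath("/a/*[child::a[.='ACONTENT'] and child::b[.='BCONTENT']]");

	if (ap.evalXPath() != -1) {
		System.out.println("================ Existed.");
	}
	// Open the output file only after the query has run, so that a failure
	// above does not leave a truncated file behind
	File fo = new File("blog.eood.cn.xml");
	FileOutputStream fos = new FileOutputStream(fo);
	xm.output(fos);
	fos.close();
}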
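
To try the XPath expression itself without installing VTD-XML, here is a minimal sketch that runs the same query with the JDK's built-in DOM and javax.xml.xpath classes. This is a swapped-in technique, not VTD-XML, and it loads the whole document into memory, so it only suits small inputs. The class name XPathExistenceCheck, the helper countMatches, and the sample document are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExistenceCheck {
    // Same expression as in the VTD-XML example above.
    static final String QUERY =
        "/a/*[child::a[.='ACONTENT'] and child::b[.='BCONTENT']]";

    /** Returns how many nodes in the given XML text match the XPath expression. */
    static int countMatches(String xml, String xpath) {
        try {
            // Build an in-memory DOM tree from the XML string.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Evaluate the expression and collect all matching nodes.
            NodeList hits = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XML or XPath", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document: only the first <item> satisfies both conditions.
        String xml = "<a>"
            + "<item><a>ACONTENT</a><b>BCONTENT</b></item>"
            + "<item><a>ACONTENT</a><b>OTHER</b></item>"
            + "</a>";
        System.out.println(countMatches(xml, QUERY) > 0 ? "Existed." : "Not found.");
    }
}
```

The predicate requires both children on the same element, so an element with a matching a but a different b does not count.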

The issue occurs with code like the following:

$p = xml_parser_create();
xml_parse_into_struct($p, $xml, $vals, $index);
xml_parser_free($p);

The angle brackets (< and >) go missing from the parsed data. This is a bug in how PHP works with libxml2, and it appears when using PHP <= 5.2.6 with libxml2 >= 2.6.32.

This is due to an intentional change in the behaviour of libxml2 after version 2.6.32. Some sites suggest reverting to libxml2-2.6.30 – while this works as a temporary solution, it is no longer necessary or advisable.

PHP 5.2.7 or higher works with the new behaviour of libxml2 ( see: http://bugs.php.net/bug.php?id=45996 ). Simply upgrading PHP corrects the problem.

If upgrading is not an option, replace the named entities for these special characters with their numeric character references, which libxml2 handles correctly.

Doing the following replacement before parsing the XML resolves the problem:

$xml = str_replace("&lt;", "&#60;", $xml);
$xml = str_replace("&gt;", "&#62;", $xml);
$xml = str_replace("&amp;", "&#38;", $xml);

How to Parse XML File in PHP

Actually there are four relatively simple ways to read an XML file:

My personal favorites are:

  • SimpleXML when parsing relatively small XML files without the need to modify them
  • XMLReader when parsing large XML files