Parsing Office Files

Source: http://www.langye.com/a/2013417/237.shtml

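【Digression】

This is the last article in the series. So that nothing feels missing, and to make the series feel complete, I'll cover OOXML as well. To be honest, OOXML is very easy to parse, and there is a great deal of documentation and many mature open-source libraries for it, so I'll only go over it briefly. The article links to quite a few references for readers who want to dig deeper.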

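【Series Index】

The Secrets of Office Files: Parsing Word, PowerPoint, and Other Files on .NET Without Office (Part 1)
Reading the DocumentSummaryInformation and SummaryInformation of Office binary documents
The Secrets of Office Files: Parsing Word, PowerPoint, and Other Files on .NET Without Office (Part 2)
Reading the text of Word binary documents (.doc), including body text, headers, footers, comments, and so on
The Secrets of Office Files: Parsing Word, PowerPoint, and Other Files on .NET Without Office (Part 3)
A detailed look at the storage structure of Office binary documents, and reading the text of PowerPoint binary documents (.ppt)
The Secrets of Office Files: Parsing Word, PowerPoint, and Other Files on .NET Without Office (Final)
How to parse Office Open XML documents (.docx, .pptx), and common open-source libraries for parsing Office files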

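【Article Index】

A First Look at Office Open XML (OOXML)
Parsing OOXML Document Properties
Parsing Word 2007 Files
Parsing PowerPoint 2007 Files
Open-Source Libraries for Common Office Documents (Word, PowerPoint, Excel)
Related Links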

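【1. A First Look at Office Open XML (OOXML)】

Microsoft's official description of Office Open XML can be found at http://office.microsoft.com/zh-cn/support/HA010205815.aspx?CTT=3.

Unlike Windows compound documents, OOXML was designed to be open, and because it is built on zip plus XML it is much easier to read. If the goal is only to extract text, we don't even need to read any of the document's parameters!

If you haven't worked with OOXML before, take a docx, pptx, or xlsx file you have at hand, change its extension to zip, and open it with an archive tool.

Opened this way, docx, pptx, and xlsx files all show a clearly laid out directory structure, so all we need is a zip library to read the archive and an XML parser for the files inside. On .NET Framework 3.0 and later you can use the built-in Package class (System.IO.Packaging, in WindowsBase.dll) to unpack the file. In my experience, if you only need the streams or contents of the files inside the zip, the Package class in WindowsBase works very well. On .NET CF, or on CLR 2.0 and earlier, you can use SharpZipLib (supports CLR 1.1, 2.0, and 4.0; official site http://www.icsharpcode.net/) or DotNetZip (supports CLR 2.0; official site http://dotnetzip.codeplex.com/). I find the latter's license friendlier.

For example, here is how to open an OOXML file with the built-in Package class: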

 
    #region Fields
    protected FileStream m_stream;
    protected Package m_package;
    #endregion

    #region Constructor
    /// <summary>
    /// Initializes an OfficeOpenXMLFile.
    /// </summary>
    /// <param name="filePath">Path of the file</param>
    public OfficeOpenXMLFile(String filePath)
    {
        try
        {
            this.m_stream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
            this.m_package = Package.Open(this.m_stream);

            this.ReadProperties();
            this.ReadCoreProperties();
            this.ReadContent();
        }
        finally
        {
            // Close the package first, then the underlying file stream.
            if (this.m_package != null)
            {
                this.m_package.Close();
            }

            if (this.m_stream != null)
            {
                this.m_stream.Close();
            }
        }
    }
    #endregion
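
To see the directory structure mentioned above without renaming the file to zip, the parts of a package can also be listed from code. The following is only a minimal sketch: the class and method names (PackageLister, ListParts) are made up for illustration and are not part of the code in this series. It uses only Package.Open and Package.GetParts from System.IO.Packaging.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

// Hypothetical helper, not part of the original article's code.
public static class PackageLister
{
    // Returns one "uri (content type)" string for every part in the package.
    public static List<string> ListParts(string filePath)
    {
        List<string> result = new List<string>();

        // Open read-only; the using block closes the package when done.
        using (Package package = Package.Open(filePath, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                result.Add(part.Uri + " (" + part.ContentType + ")");
            }
        }

        return result;
    }
}
```

For a .docx file the returned list would be expected to contain entries such as /word/document.xml and /docProps/core.xml, each with its content type.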

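【2. Parsing OOXML Document Properties】

The document properties of an OOXML file live in the docProps directory. Three files matter most:

app.xml: stores the document's properties, similar to the earlier DocumentSummaryInformation.
core.xml: stores the document's core properties, such as creation time and last-modified time, similar to the earlier SummaryInformation.
thumbnail.*: the document's thumbnail. Different file types store it in different formats, for example emf for Word, wmf for Excel, and jpeg for PowerPoint.

We only need to walk through all the child nodes of each XML file to read every property. For consistency, the property names below reuse the names from Windows compound files: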

 
    #region Constants
    private const String PropertiesNameSpace = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";
    private const String CorePropertiesNameSpace = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
    #endregion

    #region Fields
    protected Dictionary<String, String> m_properties;
    protected Dictionary<String, String> m_coreProperties;
    #endregion

    #region Properties
    /// <summary>
    /// Gets the DocumentSummaryInformation.
    /// </summary>
    public override Dictionary<String, String> DocumentSummaryInformation
    {
        get { return this.m_properties; }
    }

    /// <summary>
    /// Gets the SummaryInformation.
    /// </summary>
    public override Dictionary<String, String> SummaryInformation
    {
        get { return this.m_coreProperties; }
    }
    #endregion

    #region Read Properties
    private void ReadProperties()
    {
        if (this.m_package == null)
        {
            return;
        }

        // GetPart throws when the part is missing, so check with PartExists first.
        Uri partUri = new Uri("/docProps/app.xml", UriKind.Relative);
        if (!this.m_package.PartExists(partUri))
        {
            return;
        }

        PackagePart part = this.m_package.GetPart(partUri);
        XmlDocument doc = new XmlDocument();
        using (Stream stream = part.GetStream())
        {
            doc.Load(stream);
        }

        XmlNodeList nodes = doc.GetElementsByTagName("Properties", PropertiesNameSpace);
        if (nodes.Count < 1)
        {
            return;
        }

        // Every child element of the root becomes one name/value pair.
        this.m_properties = new Dictionary<String, String>();
        foreach (XmlNode child in nodes[0].ChildNodes)
        {
            if (child.NodeType == XmlNodeType.Element)
            {
                this.m_properties[child.LocalName] = child.InnerText;
            }
        }
    }
    #endregion

    #region Read CoreProperties
    private void ReadCoreProperties()
    {
        if (this.m_package == null)
        {
            return;
        }

        // GetPart throws when the part is missing, so check with PartExists first.
        Uri partUri = new Uri("/docProps/core.xml", UriKind.Relative);
        if (!this.m_package.PartExists(partUri))
        {
            return;
        }

        PackagePart part = this.m_package.GetPart(partUri);
        XmlDocument doc = new XmlDocument();
        using (Stream stream = part.GetStream())
        {
            doc.Load(stream);
        }

        XmlNodeList nodes = doc.GetElementsByTagName("coreProperties", CorePropertiesNameSpace);
        if (nodes.Count < 1)
        {
            return;
        }

        // Every child element of the root becomes one name/value pair.
        this.m_coreProperties = new Dictionary<String, String>();
        foreach (XmlNode child in nodes[0].ChildNodes)
        {
            if (child.NodeType == XmlNodeType.Element)
            {
                this.m_coreProperties[child.LocalName] = child.InnerText;
            }
        }
    }
    #endregion
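
As an aside, when only the core properties are needed, System.IO.Packaging already exposes them through Package.PackageProperties, so core.xml does not have to be parsed by hand. The sketch below is an alternative to the approach in this article, not the code the series uses. The class name CorePropertyReader and the choice of which properties to copy are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

// Hypothetical helper, not part of the original article's code.
public static class CorePropertyReader
{
    // Copies a few core properties into a dictionary; missing values become "".
    public static Dictionary<string, string> Read(string filePath)
    {
        Dictionary<string, string> result = new Dictionary<string, string>();

        using (Package package = Package.Open(filePath, FileMode.Open, FileAccess.Read))
        {
            PackageProperties props = package.PackageProperties;

            result["Title"] = props.Title ?? String.Empty;
            result["Creator"] = props.Creator ?? String.Empty;
            result["LastModifiedBy"] = props.LastModifiedBy ?? String.Empty;

            // Created and Modified are Nullable<DateTime>.
            result["Created"] = props.Created.HasValue
                ? props.Created.Value.ToString("o")
                : String.Empty;
            result["Modified"] = props.Modified.HasValue
                ? props.Modified.Value.ToString("o")
                : String.Empty;
        }

        return result;
    }
}
```

The extended properties in app.xml have no such built-in accessor, so for those the XML still has to be read directly as shown above.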

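【3. Parsing Word 2007 Files】

Most of the content of a Word file (.docx) lives in the word directory. The most important parts are:

document.xml: the body of the Word document
footer*.xml: the footers of the Word document
header*.xml: the headers of the Word document
comments.xml: the comments of the Word document
endnotes.xml: the endnotes of the Word document

Here we only read the body of the document. OOXML stores text in a nested structure. In Word, for example, <w:p></w:p> encloses a paragraph, and nested inside the paragraph are <w:t></w:t> elements, which hold the actual text. In addition, <w:tab/> is a tab character, <w:br w:type="page"/> is a page break, and so on. So we need a method that handles these tags recursively: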

 
    /// <summary>
    /// Extracts the text inside a node.
    /// </summary>
    /// <param name="node">XmlNode</param>
    /// <returns>The text inside the node</returns>
    public static String ReadNode(XmlNode node)
    {
        // Nothing to read if the node is null or is not an element.
        if ((node == null) || (node.NodeType != XmlNodeType.Element))
        {
            return String.Empty;
        }

        StringBuilder nodeContent = new StringBuilder();
        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element)
            {
                continue;
            }

            switch (child.LocalName)
            {
                case "t": // text run
                    // Append the text as stored. Runs that carry leading or trailing
                    // spaces are marked xml:space="preserve", and XmlDocument keeps
                    // that whitespace, so no trimming or padding is needed.
                    nodeContent.Append(child.InnerText);
                    break;

                case "cr": // carriage return
                case "br": // break (line break, or page break with w:type="page")
                    nodeContent.Append(Environment.NewLine);
                    break;

                case "tab": // tab
                    nodeContent.Append("\t");
                    break;

                case "p": // paragraph
                    nodeContent.Append(ReadNode(child));
                    nodeContent.Append(Environment.NewLine);
                    break;

                default: // anything else: recurse into the children
                    nodeContent.Append(ReadNode(child));
                    break;
            }
        }

        return nodeContent.ToString();
    }

Then we simply start reading from the root tag:

 
    #region Constants
    private const String WordNameSpace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    #endregion

    #region Fields
    private String m_paragraphText;
    #endregion

    #region Properties
    /// <summary>
    /// Gets the body text of the document.
    /// </summary>
    public String ParagraphText
    {
        get { return this.m_paragraphText; }
    }
    #endregion

    #region Read Content
    protected override void ReadContent()
    {
        if (this.m_package == null)
        {
            return;
        }

        // GetPart throws when the part is missing, so check with PartExists first.
        Uri partUri = new Uri("/word/document.xml", UriKind.Relative);
        if (!this.m_package.PartExists(partUri))
        {
            return;
        }

        PackagePart part = this.m_package.GetPart(partUri);
        XmlDocument doc = new XmlDocument();
        using (Stream stream = part.GetStream())
        {
            doc.Load(stream);
        }

        XmlNamespaceManager nsManager = new XmlNamespaceManager(doc.NameTable);
        nsManager.AddNamespace("w", WordNameSpace);

        // The body element is the root of everything we want to extract.
        XmlNode node = doc.SelectSingleNode("/w:document/w:body", nsManager);
        if (node == null)
        {
            return;
        }

        this.m_paragraphText = NodeHelper.ReadNode(node);
    }
    #endregion
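
To make the nesting of <w:p> and <w:t> concrete, here is a small standalone sketch that pulls the paragraphs out of a document.xml string with XPath instead of recursion. It is a simplified variant for illustration, not a replacement for ReadNode: the class name WordTextSketch is made up, tabs and breaks are ignored, and a paragraph nested inside another paragraph (for example in a text box) would be reported twice.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the original article's code.
public static class WordTextSketch
{
    private const string WordNameSpace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every w:p element, one list entry per paragraph.
    public static List<string> ReadParagraphs(string documentXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(documentXml);

        XmlNamespaceManager nsManager = new XmlNamespaceManager(doc.NameTable);
        nsManager.AddNamespace("w", WordNameSpace);

        List<string> paragraphs = new List<string>();
        foreach (XmlNode paragraph in doc.SelectNodes("//w:p", nsManager))
        {
            // Concatenate all w:t descendants of this paragraph in document order.
            StringBuilder text = new StringBuilder();
            foreach (XmlNode run in paragraph.SelectNodes(".//w:t", nsManager))
            {
                text.Append(run.InnerText);
            }
            paragraphs.Add(text.ToString());
        }

        return paragraphs;
    }
}
```

Passing in the contents of word/document.xml yields one string per paragraph, which can be handy when the caller wants paragraphs separately rather than as one block of text.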

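【4. Parsing PowerPoint 2007 Files】

The main content of a PowerPoint file (.pptx) lives in the ppt directory, and the slides themselves are in the slides subdirectory, stored under names of the form slide + page number + .xml. We can read them one after another. Be aware, though, that string comparison puts "slide10.xml" before "slide2.xml", so reading the parts in name order can scramble the pages. We therefore sort by the numeric page index first and then read each slide from its root tag. (Strictly speaking, the authoritative slide order is the sldIdLst in ppt/presentation.xml, which this article does not read; sorting by file number is the shortcut used here.)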

 
    #region Constants
    private const String PowerPointNameSpace = "http://schemas.openxmlformats.org/presentationml/2006/main";
    #endregion

    #region Fields
    private StringBuilder m_allText;
    #endregion

    #region Properties
    /// <summary>
    /// Gets all the text in the PowerPoint slides.
    /// </summary>
    public String AllText
    {
        get { return this.m_allText.ToString(); }
    }
    #endregion

    #region Constructor
    /// <summary>
    /// Initializes a PptxFile.
    /// </summary>
    /// <param name="filePath">Path of the file</param>
    public PptxFile(String filePath) :
        base(filePath) { }
    #endregion

    #region Read Content
    protected override void ReadContent()
    {
        if (this.m_package == null)
        {
            return;
        }

        this.m_allText = new StringBuilder();
        SortedList<Int32, XmlDocument> list = new SortedList<Int32, XmlDocument>();

        foreach (PackagePart part in this.m_package.GetParts())
        {
            String uri = part.Uri.ToString();
            if (uri.IndexOf("ppt/slides/slide", StringComparison.OrdinalIgnoreCase) < 0)
            {
                continue;
            }

            // Extract N from "/ppt/slides/slide{N}.xml"; skip names that do not
            // parse or that repeat an index, since SortedList.Add would throw.
            String pageName = uri.Replace("/ppt/slides/slide", "").Replace(".xml", "");
            Int32 index;
            if (!Int32.TryParse(pageName, out index) || list.ContainsKey(index))
            {
                continue;
            }

            XmlDocument doc = new XmlDocument();
            using (Stream stream = part.GetStream())
            {
                doc.Load(stream);
            }

            list.Add(index, doc);
        }

        // SortedList enumerates in ascending key order, i.e. by page number.
        foreach (KeyValuePair<Int32, XmlDocument> pair in list)
        {
            // Build the namespace manager from the document being queried.
            XmlNamespaceManager nsManager = new XmlNamespaceManager(pair.Value.NameTable);
            nsManager.AddNamespace("p", PowerPointNameSpace);

            XmlNode node = pair.Value.SelectSingleNode("/p:sld", nsManager);
            if (node == null)
            {
                continue;
            }

            this.m_allText.Append(NodeHelper.ReadNode(node));
        }
    }
    #endregion
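
The ordering pitfall described in this section is easy to reproduce in isolation. The sketch below sorts slide part names by their numeric suffix. The names SlideOrderSketch, SlideNumber, and SortSlides are hypothetical, and using a regular expression instead of String.Replace is my own choice for the example rather than what the code above does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article's code.
public static class SlideOrderSketch
{
    private static readonly Regex SlidePattern =
        new Regex(@"^/ppt/slides/slide(\d+)\.xml$", RegexOptions.IgnoreCase);

    // Returns the slide number, or -1 when the URI is not a slide part.
    public static int SlideNumber(string partUri)
    {
        Match match = SlidePattern.Match(partUri);
        int number;
        if (match.Success && Int32.TryParse(match.Groups[1].Value, out number))
        {
            return number;
        }
        return -1;
    }

    // Keeps only slide parts and orders them by slide number.
    public static List<string> SortSlides(IEnumerable<string> partUris)
    {
        List<string> slides = new List<string>();
        foreach (string uri in partUris)
        {
            if (SlideNumber(uri) >= 0)
            {
                slides.Add(uri);
            }
        }

        slides.Sort(delegate(string a, string b)
        {
            return SlideNumber(a).CompareTo(SlideNumber(b));
        });
        return slides;
    }
}
```

With a plain ordinal string sort, /ppt/slides/slide10.xml would come before /ppt/slides/slide2.xml because '1' sorts before '2'. Comparing the parsed numbers avoids that, and non-slide parts such as the .rels files are dropped.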

 