Converting Word files to plain text with IronPython

29 May 2010 4:39

It’s been a while since I last needed to script Word, because these days I work mostly on Linux and Mac.

My first attempt was to use pywin32 and COM.

The problem was that the documents I was trying to convert were in Russian, so I wanted the output to be in UTF-8.

While the SaveAs method in Word has an Encoding parameter, it wasn’t clear to me how to specify it from pywin32.

So my next attempt was to use IronPython, since it has a native .NET interface to Office. The biggest advantage of this approach is that you can call dir() on any object in the IronPython shell to see its methods and properties.
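For example, the output of dir() can be filtered to hunt for members with a particular word in their name. This is only a minimal sketch: the helper name find_members is made up for illustration and is not part of IronPython or the Office interop, and the sample call uses a plain string so that it runs in any Python shell.

```python
def find_members(obj, needle):
    """Return the names from dir(obj) that contain needle, ignoring case."""
    needle = needle.lower()
    return [name for name in dir(obj) if needle in name.lower()]

# Works on any object; a string is used here so the example runs anywhere.
print(find_members("", "split"))
```

In the IronPython shell, the same call on an open Word document, for instance find_members(document, "save"), should narrow the listing down to the save-related members worth trying.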

After some googling on encodings and on Word scripting with IronPython, here is the script I came up with:

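import sys
import clr
from System.IO import DirectoryInfo, Path

clr.AddReference("Microsoft.Office.Interop.Word")
import Microsoft.Office.Interop.Word as Word

UTF8_CODEPAGE = 65001  # Windows code page number for UTF-8

def convert_files(doc_path):
    """Convert every .doc file in doc_path to a UTF-8 .txt file next to it."""
    directory = DirectoryInfo(doc_path)
    # Start Word once and reuse it for all files.
    wa = Word.ApplicationClass()
    try:
        for file_info in directory.GetFiles("*.doc"):
            # FullName is absolute, so Word does not resolve the path
            # against its own working directory.
            doc_to_text(wa, file_info.FullName)
    finally:
        # Without Quit() the Word process keeps running in the background.
        wa.Quit()

def doc_to_text(wa, doc_file):
    """Save one document as plain text using a running Word instance."""
    # ChangeExtension copes with names that contain more than one dot.
    txt_file = Path.ChangeExtension(doc_file, ".txt")
    document = wa.Documents.Open(doc_file, ReadOnly=True)
    try:
        document.SaveAs(txt_file,
                        FileFormat=Word.WdSaveFormat.wdFormatDOSText,
                        Encoding=UTF8_CODEPAGE)
    finally:
        document.Close()

if __name__ == "__main__":
    if len(sys.argv) == 2:
        convert_files(sys.argv[1])
    else:
        print "Requires folder name as an argument"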
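Deriving the name of the output file is the one part of the script that can be tried without Word installed. Here is a rough equivalent in plain Python, assuming the standard os.path module; the helper name txt_name_for is only for illustration.

```python
import os.path

def txt_name_for(doc_name):
    """Swap the last extension of doc_name for .txt, keeping any earlier dots."""
    base, _ext = os.path.splitext(doc_name)
    return base + ".txt"

print(txt_name_for("report.doc"))
print(txt_name_for("report.v2.doc"))
```

Only the final extension is replaced, so a name with several dots keeps everything before the last one.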

According to MSDN, code page 65001 corresponds to UTF-8.
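A quick way to confirm that the output really is UTF-8 is to decode one of the generated files. This is a minimal sketch in plain Python, not part of the conversion script. The helper name read_converted is hypothetical, and since Word may or may not write a byte order mark, it uses the utf-8-sig codec, which accepts files with or without one.

```python
import io

def read_converted(path):
    """Read a converted file as UTF-8 text, skipping a leading BOM if present."""
    # 'utf-8-sig' decodes plain UTF-8 and also strips an optional byte order mark.
    with io.open(path, encoding="utf-8-sig") as handle:
        return handle.read()
```

If the file is not valid UTF-8, for example because it was saved in a legacy Cyrillic code page, the read raises UnicodeDecodeError instead of silently returning garbage.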
