Word Document to AsciiDoc Conversion

Posted by: Paul Rayner on February 14, 2013

I had content in Word documents that I needed to convert to AsciiDoc for our book. Here are the steps I found to work best:

  1. Save Word doc as HTML
  2. Encode as UTF-8
  3. Use pandoc to convert from HTML to AsciiDoc
  4. Use Sublime Text 2 search and replace (using some regular expressions) to strip out crazy things
  5. Use Sublime Text 2 to perform any remaining AsciiDoc formatting

Save Word doc as HTML

Open the document in Word and save it as a web page. Select the “Save only Display Information into HTML” option when saving. Exit from Word (and wave it goodbye as you do!).

Encode as UTF-8

Open the HTML file in Sublime Text 2. Avert your eyes at the horror that is Word-formatted HTML. Reopen the file with UTF-8 encoding (File > Reopen with Encoding) and save it:

"Sublime Text 2 Reopen with Encoding"

If I don’t re-encode the file as UTF-8, the next step fails with this error:

pandoc: Cannot decode byte '\x6f': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream
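
For anyone who would rather script the re-encoding than do it in the editor, here is a minimal Python sketch. It assumes the HTML that Word saved is Windows-1252 (cp1252) encoded; the post does not say which encoding Word used, so check your own file first. The function names are hypothetical.

```python
from pathlib import Path


def reencode_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from source_encoding and return them encoded as UTF-8.

    The cp1252 default is an assumption about Word's output, not a given.
    """
    return data.decode(source_encoding).encode("utf-8")


def reencode_file(path: str, source_encoding: str = "cp1252") -> None:
    """Rewrite the file at path in place as UTF-8."""
    p = Path(path)
    p.write_bytes(reencode_to_utf8(p.read_bytes(), source_encoding))
```

Calling reencode_file("ConventionSheet.htm") would rewrite that file in place, so keep a copy of the original until the pandoc step succeeds.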

Use pandoc to convert from HTML to AsciiDoc

Run pandoc. For example, the following command takes ConventionSheet.htm and converts it to the AsciiDoc file file.asc:

pandoc -f html -t asciidoc -o file.asc ConventionSheet.htm
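
If there are many files to convert, the same command can be driven from a script. The following Python sketch only wraps the command shown above; the helper names are hypothetical, and it assumes pandoc is installed and on your PATH.

```python
import subprocess
from typing import List


def pandoc_command(source: str, target: str) -> List[str]:
    """Build the argument list for the HTML-to-AsciiDoc conversion."""
    return ["pandoc", "-f", "html", "-t", "asciidoc", "-o", target, source]


def convert(source: str, target: str) -> None:
    """Run pandoc and raise CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(source, target), check=True)
```

For example, convert("ConventionSheet.htm", "file.asc") runs the same conversion as the command line above.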

Use Sublime Text 2 search and replace (using some regular expressions) to strip out crazy things

The weird single quotes need to go. Replace each backtick with a plain single quote:

"Sublime Text 2 Replace backtick with single quote"

If you had reviewing (Track Changes) turned on in Word, reviewer comments and changes will likely be present in the HTML. Remove them with a search and replace, using the following regex in the search field:

\[line-through\]\*(.+)\*

When a match spans line breaks, use the single-line option (?s) in your regex so that . also matches newlines:

(?s)\[line-through\]\*.(.*?)\*
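
The same cleanup can be sketched in Python using the multi-line regex above. This version replaces each match with an empty string, which is an assumption, since the post does not say what goes in the replace field. The function name is hypothetical.

```python
import re

# Same pattern as the Sublime Text search above; (?s) lets . match newlines.
LINE_THROUGH = re.compile(r"(?s)\[line-through\]\*.(.*?)\*")


def strip_line_through(text: str) -> str:
    """Delete every [line-through]*...* span, including multi-line ones.

    Assumption: struck-through text is a tracked deletion and should be
    removed entirely rather than kept without its markup.
    """
    return LINE_THROUGH.sub("", text)
```

Because the match is non-greedy, it stops at the first asterisk after the marker, so struck-through text that itself contains an asterisk would be only partly removed and needs a manual check.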

Use Sublime Text 2 to perform any remaining AsciiDoc formatting

Monospace any regexes or other special characters in the document; otherwise they will cause problems for the AsciiDoc parser.

Edit the AsciiDoc document as you wish! Note that GitHub now natively displays AsciiDoc files (using Asciidoctor behind the scenes), just as it does for Markdown.

About Paul Rayner

Paul is a seasoned design coach and leadership mentor, helping teams ignite their design skills via Domain-Driven Design (DDD) and Behavior-Driven Development (BDD). He gets teams unstuck through intensive coaching workshops and hands-on pair programming, combined with focused one-on-one leadership mentoring. His company, Virtual Genius, provides training and coaching in collaborative design for agile teams. Paul actively serves the community: teaching classes in BDD and DDD, contributing to OSS, and co-leading the DDD Denver Meetup group.

Look for him speaking at user groups and at local and international conferences. Paul is from Perth, Australia, but chooses to live, work and play with his wife and two children in Denver, Colorado. He tweets with an Australian accent at @ThePaulRayner and blogs at thepaulrayner.com.