Convert Long Word Documents into Markdown

For use in your local Obsidian wiki where you store all the collected knowledge about your literary universe. You do have a personal wiki, don't you?

As you are well aware, I write novels in an ongoing anthology series that takes place in a shared universe where technology has progressed unchecked. The books share characters, organizations, places, and more between them, which means that anytime I use something from a previous book, I have to cross-reference and make sure I’m not creating a continuity error.

Consider a problem like: Have I used the name Terrance Bottomfeather before? To find out, I’d have to open eight large Word documents and run a search, which is tedious. I’d much rather have the text from all eight novels inside Obsidian and, therefore, searchable all at once.

I’m sure this is a common problem.

Convert Word Documents to Markdown

The first step is taking your .docx file and turning it into a .md file. You can do this in Ubuntu with pandoc. Once you have it installed, you just have to feed it the right arguments:

pandoc -f docx -t markdown vise-manor.docx -o vise-manor.md

You can read up on the rest of the options, but basically: -f is the format you’re coming from (docx), -t is the format you’re going to (markdown), and -o is the file to write. That’s the official scientific explanation.
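
With eight novels to convert, typing that command by hand gets old fast. Here is a minimal sketch of a batch version, assuming all the .docx files sit in one directory. The function name convert_all is made up for illustration; the pandoc flags are the same ones used in the single-file command.

```shell
# Hypothetical helper: convert every .docx in a directory to Markdown.
# Assumes pandoc is installed and on the PATH.
convert_all() {
  local dir="${1:-.}"   # directory to scan; defaults to the current one
  local doc
  for doc in "$dir"/*.docx; do
    [ -e "$doc" ] || continue   # skip the literal pattern if nothing matched
    # Same flags as the single-file command; the .md lands next to the .docx
    pandoc -f docx -t markdown "$doc" -o "${doc%.docx}.md"
  done
}
```

Something like convert_all ~/tmp would then leave one .md file beside each .docx.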

You’ll get a huge file as a result:

dverastiqui@Ginsberg:~/tmp$ ll
total 1436
drwxr-xr-x 1 dverastiqui dverastiqui   4096 Sep 29 10:01 ./
drwxr-xr-x 1 dverastiqui dverastiqui   4096 Sep 29 09:59 ../
-rw-r--r-- 1 dverastiqui dverastiqui 615787 Jan 20  2022 vise-manor.docx
-rw-r--r-- 1 dverastiqui dverastiqui 815643 Sep 29 10:01 vise-manor.md

You can use less or head to verify the conversion.

dverastiqui@Ginsberg:~/tmp$ head vise-manor.md
ALSO BY DANIEL VERASTIQUI

*Xronixle*

*Veneer*

*Perion Synthetics*

*Por Vida*

Now, you could drop this file directly into Obsidian, but that’s a bit unwieldy. Since the content is already arranged in chapters, let’s just go ahead and chop it up.
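
Before chopping, it may be worth confirming that the chapter titles actually came through as top-level headings, since pandoc splits an EPUB at level-1 headings by default. A small sketch, assuming pandoc wrote ATX-style headings (lines starting with "# "), which is the default in recent versions; list_chapters is a hypothetical helper name.

```shell
# Hypothetical helper: show where each chapter starts in a Markdown file.
list_chapters() {
  # Print "line-number:heading" for every level-1 ATX heading in the file
  grep -n '^# ' "$1"
}
```

Running list_chapters vise-manor.md should print one line per chapter. If it prints nothing, the Word document probably did not use a heading style for its chapter titles, and the chapter split will not work as expected.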

Cut Markdown Files into Chapters

This part calls again for pandoc and some minimal CLI scripting. Essentially, we’re going to convert the markdown file into an EPUB archive and then unpack the EPUB to convert its XHTML chapters back into markdown. It sounds roundabout, but it works because pandoc’s EPUB writer starts a new file at every top-level heading by default, which does the chapter splitting for us.

You may get some warnings, but you can ignore them, especially this one:

dverastiqui@Ginsberg:~/tmp$ pandoc -f markdown -t epub -o vise-manor.epub vise-manor.md
[WARNING] This document format requires a nonempty <title> element.
  Please specify either 'title' or 'pagetitle' in the metadata,
  e.g. by using --metadata pagetitle="..." on the command line.
  Falling back to 'vise-manor'
[WARNING] This document format requires a nonempty <title> element.
  Please specify either 'title' or 'pagetitle' in the metadata,
  e.g. by using --metadata pagetitle="..." on the command line.
  Falling back to 'vise-manor'

dverastiqui@Ginsberg:~/tmp$ ll
total 1840
drwxr-xr-x 1 dverastiqui dverastiqui   4096 Sep 29 10:09 ./
drwxr-xr-x 1 dverastiqui dverastiqui   4096 Sep 29 09:59 ../
-rw-r--r-- 1 dverastiqui dverastiqui 615787 Jan 20  2022 vise-manor.docx
-rw-r--r-- 1 dverastiqui dverastiqui 412195 Sep 29 10:09 vise-manor.epub
-rw-r--r-- 1 dverastiqui dverastiqui 815643 Sep 29 10:01 vise-manor.md
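
If the warning bothers you, it can most likely be silenced by handing pandoc a title, as the message itself suggests. A sketch, assuming the book’s title is “Vise Manor” (a guess from the filename); make_epub is a hypothetical wrapper, not part of pandoc.

```shell
# Hypothetical wrapper: build an EPUB from a Markdown file with a title set.
make_epub() {
  local md="$1"      # input Markdown file
  local title="$2"   # title to embed in the EPUB metadata
  # --metadata title=... supplies the title the warning is asking for
  pandoc -f markdown -t epub --metadata title="$title" -o "${md%.md}.epub" "$md"
}
```

Usage would look like make_epub vise-manor.md "Vise Manor". Since the EPUB is only a stepping stone here, skipping this and ignoring the warning is just as valid.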

Now unzip the EPUB archive.

dverastiqui@Ginsberg:~/tmp$ unzip vise-manor.epub
Archive:  vise-manor.epub
 extracting: mimetype
  inflating: META-INF/container.xml
  inflating: META-INF/com.apple.ibooks.display-options.xml
  inflating: EPUB/content.opf
  inflating: EPUB/toc.ncx
  inflating: EPUB/nav.xhtml
  inflating: EPUB/text/title_page.xhtml
  inflating: EPUB/styles/stylesheet1.css
  inflating: EPUB/text/ch001.xhtml
  inflating: EPUB/text/ch002.xhtml
  inflating: EPUB/text/ch003.xhtml
  inflating: EPUB/text/ch004.xhtml
  inflating: EPUB/text/ch005.xhtml
  inflating: EPUB/text/ch006.xhtml
  inflating: EPUB/text/ch007.xhtml
--OUTPUT REMOVED--

Did you see that? Converting to EPUB automatically separated everything into chapters. They’re in XHTML format right now, but we can fix that.

dverastiqui@Ginsberg:~/tmp$ for chapter in EPUB/text/*.xhtml; do pandoc -f html -t markdown -o "${chapter/html/md}" "${chapter}"; done

dverastiqui@Ginsberg:~/tmp$ ll EPUB/text/
total 2364
drwxr-xr-x 1 dverastiqui dverastiqui  4096 Sep 29 10:12 ./
drwxr-xr-x 1 dverastiqui dverastiqui  4096 Sep 29 10:10 ../
-rw-r--r-- 1 dverastiqui dverastiqui  2229 Sep 29  2023 ch001.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui  1181 Sep 29 10:12 ch001.xmd
-rw-r--r-- 1 dverastiqui dverastiqui  1036 Sep 29  2023 ch002.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui   305 Sep 29 10:12 ch002.xmd
-rw-r--r-- 1 dverastiqui dverastiqui 14026 Sep 29  2023 ch003.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui 12360 Sep 29 10:12 ch003.xmd
-rw-r--r-- 1 dverastiqui dverastiqui 14818 Sep 29  2023 ch004.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui 13056 Sep 29 10:12 ch004.xmd
--OUTPUT REMOVED--

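Unzipping the EPUB got us chapters in XHTML format. Now we have markdown copies of them, but with an .xmd extension, because the substitution in the loop only swapped the “html” in “.xhtml” for “md”. The next step is to rename all those files.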

dverastiqui@Ginsberg:~/tmp$ for i in EPUB/text/*.xmd; do mv -- "$i" "${i%.xmd}.md"; done

dverastiqui@Ginsberg:~/tmp$ ll EPUB/text/
total 2364
drwxr-xr-x 1 dverastiqui dverastiqui  4096 Sep 29 10:15 ./
drwxr-xr-x 1 dverastiqui dverastiqui  4096 Sep 29 10:10 ../
-rw-r--r-- 1 dverastiqui dverastiqui  1181 Sep 29 10:12 ch001.md
-rw-r--r-- 1 dverastiqui dverastiqui  2229 Sep 29  2023 ch001.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui   305 Sep 29 10:12 ch002.md
-rw-r--r-- 1 dverastiqui dverastiqui  1036 Sep 29  2023 ch002.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui 12360 Sep 29 10:12 ch003.md
-rw-r--r-- 1 dverastiqui dverastiqui 14026 Sep 29  2023 ch003.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui 13056 Sep 29 10:12 ch004.md
-rw-r--r-- 1 dverastiqui dverastiqui 14818 Sep 29  2023 ch004.xhtml
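
The convert-then-rename dance could probably be collapsed into a single pass by stripping the .xhtml suffix instead of substituting inside it. A sketch under that assumption; chapters_to_md is a hypothetical name, and the quotes guard against paths with spaces.

```shell
# Hypothetical helper: convert unzipped XHTML chapters straight to .md files.
chapters_to_md() {
  local dir="${1:-EPUB/text}"   # where the unzipped XHTML chapters live
  local chapter
  for chapter in "$dir"/*.xhtml; do
    [ -e "$chapter" ] || continue   # nothing to do if no chapters were found
    # ${chapter%.xhtml} drops the extension, so the output is chNNN.md directly
    pandoc -f html -t markdown -o "${chapter%.xhtml}.md" "$chapter"
  done
}
```

Called with no arguments from the directory where the EPUB was unzipped, this should leave ch001.md, ch002.md, and so on next to the XHTML files, with no .xmd step in between.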

Those MD files are now ready to move into Obsidian. Just make sure you know where you want them and use the mv command.

dverastiqui@Ginsberg:~/tmp$ mv EPUB/text/*.md "/od/writing/Obsidian Wiki/Vinestead Universe/zFull Text/vise-manor/"

dverastiqui@Ginsberg:/od/writing/Obsidian Wiki/Vinestead Universe/zFull Text$ ll vise-manor/
total 940
drwxrwxrwx 1 dverastiqui dverastiqui   512 Apr 19 13:40 ./
drwxrwxrwx 1 dverastiqui dverastiqui   512 Sep 21 19:37 ../
-rwxrwxrwx 1 dverastiqui dverastiqui  1181 Apr 19 13:40 ch001.md*
-rwxrwxrwx 1 dverastiqui dverastiqui   305 Apr 19 13:40 ch002.md*
-rwxrwxrwx 1 dverastiqui dverastiqui 12360 Apr 19 13:40 ch003.md*
-rwxrwxrwx 1 dverastiqui dverastiqui 13056 Apr 19 13:40 ch004.md*
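
Because the vault path contains spaces, quoting matters here, and the destination folder has to exist before mv will cooperate. A small sketch that handles both; move_chapters is a hypothetical helper, and the paths you pass in are whatever matches your own vault.

```shell
# Hypothetical helper: move chapter files into a (possibly new) vault folder.
move_chapters() {
  local src="$1"    # directory holding the chNNN.md files
  local dest="$2"   # target folder inside the Obsidian vault
  mkdir -p -- "$dest"          # create the folder if it is not there yet
  mv -- "$src"/*.md "$dest"/   # quotes keep paths with spaces intact
}
```

For this book that might be move_chapters EPUB/text "/od/writing/Obsidian Wiki/Vinestead Universe/zFull Text/vise-manor".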

Now you should be able to open Obsidian and see the new content.
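
For a quick sanity check from the terminal, the original question (has this name shown up before?) can also be asked with grep. A sketch, assuming the chapter files live somewhere under a single folder; find_name is a hypothetical helper, and Obsidian’s own search remains the intended tool.

```shell
# Hypothetical helper: list the Markdown files that mention a given name.
find_name() {
  local name="$1"   # text to look for, e.g. a character name
  local root="$2"   # folder holding the Markdown files
  # -r: recurse, -l: list matching files only, -i: ignore case, -F: literal text
  grep -rliF --include='*.md' -- "$name" "$root"
}
```

One caveat: pandoc wraps markdown lines by default, so a two-word name can straddle a line break and slip past grep. Adding --wrap=none to the conversion commands should prevent that.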

Congrats! You did it! Enjoy searching your previous novels with ease!
