How to Convert Long Word Documents into Markdown for Import into Obsidian

I love Obsidian. It’s a desktop wiki with local storage, meaning the pages in the app are just markdown files on your computer. If you’re running Linux or Ubuntu for Windows, you can do a lot of quick file manipulation by dropping into a command line interface, including creating new files. Have a dozen markdown files about your favorite episodes of One Piece? You can create a folder under the wiki root, drop those files in there, and boom, they’re in your wiki, all ready to be automatically indexed and cross-referenced.

Using this feature, we can quickly pull in the 120K+ words from each of your novels, split them into chapters, and make them searchable and interlinkable pages in Obsidian. And it’s not that hard, either. It barely took me several tries to get it right!


Why Do This?

I write novels in an ongoing anthology series that takes place in a shared universe where technology has progressed unchecked. The books share characters, organizations, places, and more between them, which means that anytime I use something from a previous book, I have to cross-reference and make sure I’m not creating a continuity error.

Consider a problem like: Have I used the name Terrance Bottomfeather before? To find out, I’d have to open eight large Word documents and run a search in each one, which is tedious. I’d much rather have the text from all eight novels inside Obsidian and therefore searchable all at once.

I’m sure this is a common problem.

Convert Word Documents to Markdown

The first step is taking your .docx file and turning it into a .md file. You can do this in Ubuntu with pandoc. Once you have it installed, you just have to feed it the right arguments:

pandoc -f docx -t markdown vise-manor.docx -o vise-manor.md

You can read up on what the options mean, but basically -f and -t name the input and output formats, and -o names the file to write. In other words: take this docx file and output it to an md file. That’s the official scientific explanation.
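
Since the goal here is eight novels rather than one, the same command can be wrapped in a loop. This is a minimal sketch, assuming all of the .docx files sit in a single folder; the function name convert_docx_dir is made up for illustration and is not part of pandoc.

```shell
# Convert every .docx file in a folder to a markdown file with the same base name.
# convert_docx_dir is a hypothetical helper name, not a pandoc feature.
convert_docx_dir() {
  local dir="${1:-.}" doc
  for doc in "$dir"/*.docx; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$doc" ] || continue
    # ${doc%.docx} strips the extension so vise-manor.docx becomes vise-manor.md.
    pandoc -f docx -t markdown "$doc" -o "${doc%.docx}.md"
  done
}
```

Running convert_docx_dir ~/tmp should leave a vise-manor.md next to vise-manor.docx, and do the same for any other .docx in that folder.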

You’ll get a huge file as a result:

dverastiqui@Ginsberg:~/tmp$ ll
total 1436
drwxr-xr-x 1 dverastiqui dverastiqui   4096 Sep 29 10:01 ./
drwxr-xr-x 1 dverastiqui dverastiqui   4096 Sep 29 09:59 ../
-rw-r--r-- 1 dverastiqui dverastiqui 615787 Jan 20  2022 vise-manor.docx
-rw-r--r-- 1 dverastiqui dverastiqui 815643 Sep 29 10:01 vise-manor.md

You can use less or head to verify the conversion.

dverastiqui@Ginsberg:~/tmp$ head vise-manor.md
ALSO BY DANIEL VERASTIQUI

*Xronixle*

*Veneer*

*Perion Synthetics*

*Por Vida*
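
Beyond eyeballing the first few lines, a word count gives a rough sanity check that the whole novel made it across. This is a small sketch; word_count is a hypothetical helper name, and the number will only approximate what Word reports because markdown markup gets counted too.

```shell
# Print the number of whitespace-separated words in a file.
# word_count is a hypothetical helper name.
word_count() {
  wc -w < "$1"
}
```

Something like word_count vise-manor.md should land in the neighborhood of the novel’s length; a result far below that suggests part of the document went missing.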

Now, you could drop this file directly into Obsidian, but that’s a bit unwieldy. Since the content is already arranged in chapters, let’s just go ahead and chop it up.

Cut Markdown Files into Chapters

This part calls again for pandoc and some minimal CLI scripting. Essentially, we’re going to be converting the markdown file into an EPUB archive and then unpacking the EPUB to convert the xhtml chapters into markdown. Look, it doesn’t have to make sense. Just know that it works.

You may get some warnings, but you can ignore them, especially this one:

dverastiqui@Ginsberg:~/tmp$ pandoc -f markdown -t epub -o vise-manor.epub vise-manor.md
[WARNING] This document format requires a nonempty <title> element.
  Please specify either 'title' or 'pagetitle' in the metadata,
  e.g. by using --metadata pagetitle="..." on the command line.
  Falling back to 'vise-manor'
[WARNING] This document format requires a nonempty <title> element.
  Please specify either 'title' or 'pagetitle' in the metadata,
  e.g. by using --metadata pagetitle="..." on the command line.
  Falling back to 'vise-manor'

dverastiqui@Ginsberg:~/tmp$ ll
total 1840
drwxr-xr-x 1 dverastiqui dverastiqui   4096 Sep 29 10:09 ./
drwxr-xr-x 1 dverastiqui dverastiqui   4096 Sep 29 09:59 ../
-rw-r--r-- 1 dverastiqui dverastiqui 615787 Jan 20  2022 vise-manor.docx
-rw-r--r-- 1 dverastiqui dverastiqui 412195 Sep 29 10:09 vise-manor.epub
-rw-r--r-- 1 dverastiqui dverastiqui 815643 Sep 29 10:01 vise-manor.md
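
If the title warning bothers you, the message itself points at the fix: pass a title through --metadata. This is a sketch under assumptions; md_to_epub is a hypothetical helper name and "Vise Manor" in the usage note is a placeholder title.

```shell
# Build an EPUB from a markdown file, supplying a title so pandoc
# has no reason to fall back to the file name.
# md_to_epub is a hypothetical helper name.
md_to_epub() {
  local src="$1" title="$2"
  # The output name reuses the input name with .md swapped for .epub.
  pandoc -f markdown -t epub --metadata title="$title" -o "${src%.md}.epub" "$src"
}
```

Calling md_to_epub vise-manor.md "Vise Manor" should produce the same vise-manor.epub without the fallback warning. Since the EPUB is only a stepping stone here, skipping this is fine too.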

Now unzip the EPUB archive.

dverastiqui@Ginsberg:~/tmp$ unzip vise-manor.epub
Archive:  vise-manor.epub
 extracting: mimetype
  inflating: META-INF/container.xml
  inflating: META-INF/com.apple.ibooks.display-options.xml
  inflating: EPUB/content.opf
  inflating: EPUB/toc.ncx
  inflating: EPUB/nav.xhtml
  inflating: EPUB/text/title_page.xhtml
  inflating: EPUB/styles/stylesheet1.css
  inflating: EPUB/text/ch001.xhtml
  inflating: EPUB/text/ch002.xhtml
  inflating: EPUB/text/ch003.xhtml
  inflating: EPUB/text/ch004.xhtml
  inflating: EPUB/text/ch005.xhtml
  inflating: EPUB/text/ch006.xhtml
  inflating: EPUB/text/ch007.xhtml
--OUTPUT REMOVED--

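Did you see that? Converting to EPUB automatically separated everything into chapters. By default, pandoc starts a new chapter file at every top-level heading, so this works as long as your chapter titles came through the Word conversion as headings. They’re in XHTML format right now, but we can fix that.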

dverastiqui@Ginsberg:~/tmp$ for chapter in EPUB/text/*.xhtml; do pandoc -f html -t markdown -o "${chapter/html/md}" "${chapter}"; done

dverastiqui@Ginsberg:~/tmp$ ll EPUB/text/
total 2364
drwxr-xr-x 1 dverastiqui dverastiqui  4096 Sep 29 10:12 ./
drwxr-xr-x 1 dverastiqui dverastiqui  4096 Sep 29 10:10 ../
-rw-r--r-- 1 dverastiqui dverastiqui  2229 Sep 29  2023 ch001.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui  1181 Sep 29 10:12 ch001.xmd
-rw-r--r-- 1 dverastiqui dverastiqui  1036 Sep 29  2023 ch002.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui   305 Sep 29 10:12 ch002.xmd
-rw-r--r-- 1 dverastiqui dverastiqui 14026 Sep 29  2023 ch003.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui 12360 Sep 29 10:12 ch003.xmd
-rw-r--r-- 1 dverastiqui dverastiqui 14818 Sep 29  2023 ch004.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui 13056 Sep 29 10:12 ch004.xmd
--OUTPUT REMOVED--

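Unzipping the EPUB got us chapters in XHTML format. Now we have a markdown copy of each one, though with an odd .xmd extension, because the loop swapped the "html" in ".xhtml" for "md" and left the "x" behind. The next step is to rename all those files.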

dverastiqui@Ginsberg:~/tmp$ for i in EPUB/text/*.xmd; do mv -- "$i" "${i%.xmd}.md"; done

dverastiqui@Ginsberg:~/tmp$ ll EPUB/text/
total 2364
drwxr-xr-x 1 dverastiqui dverastiqui  4096 Sep 29 10:15 ./
drwxr-xr-x 1 dverastiqui dverastiqui  4096 Sep 29 10:10 ../
-rw-r--r-- 1 dverastiqui dverastiqui  1181 Sep 29 10:12 ch001.md
-rw-r--r-- 1 dverastiqui dverastiqui  2229 Sep 29  2023 ch001.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui   305 Sep 29 10:12 ch002.md
-rw-r--r-- 1 dverastiqui dverastiqui  1036 Sep 29  2023 ch002.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui 12360 Sep 29 10:12 ch003.md
-rw-r--r-- 1 dverastiqui dverastiqui 14026 Sep 29  2023 ch003.xhtml
-rw-r--r-- 1 dverastiqui dverastiqui 13056 Sep 29 10:12 ch004.md
-rw-r--r-- 1 dverastiqui dverastiqui 14818 Sep 29  2023 ch004.xhtml
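
The two loops above (convert, then rename) can probably be collapsed into one by stripping the whole .xhtml extension instead of substituting part of it. This is a minimal sketch, not what I originally ran; chapters_to_md is a hypothetical helper name.

```shell
# Convert each XHTML chapter straight to a .md file, skipping the .xmd detour.
# chapters_to_md is a hypothetical helper name.
chapters_to_md() {
  local dir="${1:-EPUB/text}" chapter
  for chapter in "$dir"/*.xhtml; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$chapter" ] || continue
    # ${chapter%.xhtml} strips the extension, so ch001.xhtml becomes ch001.md.
    pandoc -f html -t markdown -o "${chapter%.xhtml}.md" "$chapter"
  done
}
```

Running chapters_to_md with no arguments works on EPUB/text, matching the layout shown in the unzip output. Note that it will also convert title_page.xhtml, just as the original loop does.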

Those MD files are now ready to move into Obsidian. Just make sure you know where you want them and use the mv command.

dverastiqui@Ginsberg:~/tmp$ mv EPUB/text/*.md "/od/writing/Obsidian Wiki/Vinestead Universe/zFull Text/vise-manor/"

dverastiqui@Ginsberg:/od/writing/Obsidian Wiki/Vinestead Universe/zFull Text$ ll vise-manor/
total 940
drwxrwxrwx 1 dverastiqui dverastiqui   512 Apr 19 13:40 ./
drwxrwxrwx 1 dverastiqui dverastiqui   512 Sep 21 19:37 ../
-rwxrwxrwx 1 dverastiqui dverastiqui  1181 Apr 19 13:40 ch001.md*
-rwxrwxrwx 1 dverastiqui dverastiqui   305 Apr 19 13:40 ch002.md*
-rwxrwxrwx 1 dverastiqui dverastiqui 12360 Apr 19 13:40 ch003.md*
-rwxrwxrwx 1 dverastiqui dverastiqui 13056 Apr 19 13:40 ch004.md*
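
Two things can trip up that move: the destination folder has to exist already, and a path with spaces in it has to be quoted. This sketch handles both; import_chapters is a hypothetical helper name, and the paths you pass in are your own.

```shell
# Move converted chapters into a vault folder, creating the folder if needed.
# import_chapters is a hypothetical helper name.
import_chapters() {
  local src="$1" dest="$2"
  mkdir -p -- "$dest"
  # Quoting keeps paths with spaces (like "Obsidian Wiki") in one piece.
  mv -- "$src"/*.md "$dest"/
}
```

A call would look like import_chapters EPUB/text "/od/writing/Obsidian Wiki/Vinestead Universe/zFull Text/vise-manor". Only the .md files move; the .xhtml originals stay behind in EPUB/text.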

Now you should be able to open Obsidian and see the new content.
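
Because the pages are just files, the Terrance Bottomfeather question can also be answered without leaving the command line. This is a rough sketch; find_mentions is a hypothetical helper name. One caveat: grep works line by line and pandoc wraps long lines in its markdown output by default, so a name that happens to straddle a line break would be missed.

```shell
# List the files under a folder that mention a name, ignoring case.
# find_mentions is a hypothetical helper name.
find_mentions() {
  local name="$1" dir="$2"
  # -r recurse, -l list file names only, -i ignore case, -F treat the name literally.
  grep -rliF -- "$name" "$dir"
}
```

For example, find_mentions "Terrance Bottomfeather" "/od/writing/Obsidian Wiki/Vinestead Universe/zFull Text" prints every chapter file containing the name, or nothing at all if the name is still up for grabs.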

Congrats! You did it! Enjoy searching your previous novels with ease!

References

  • https://medium.com/geekculture/how-to-easily-convert-word-to-markdown-with-pandoc-4d60878ccc64
  • https://linuxconfig.org/how-to-rename-multiple-files-on-linux
  • https://stackoverflow.com/questions/33889814/how-do-i-split-a-markdown-file-into-separate-files-at-the-heading