Saturday, March 10, 2012

Converting from Markdown

I have a ZIP file containing a framework for Ema that I'll be using to redesign and refactor my current novel. It's a blank, with some templates/examples, but otherwise empty and ready to be filled in. This takes the idea of a Wiki-fied beat sheet referencing scenes and converts it into a single Markdown file that can be processed with pandoc into something you can tweak later -- or directly into an ebook.

Now, it is possible to convert from your beat sheet -- with each scene marked with a clear identifier -- to a single Markdown document listing all of your scenes in order. This is why the wiki beat sheet uses WikiWords for all links except links to scenes. I also recommended you include chapter titles or scene breaks at the beginning of each of the scene files.

(The pandoc documentation generates an ebook using a chapter per file -- this is required to get pandoc's table of contents working -- but because scenes may be reordered in the beat sheet before the chapters are finalized, we can't just start out with this technique. Now, pandoc will still work to convert to other formats, but if you want to convert directly to an ebook you'll need to split the file into chapters after we stitch the scenes together.)

The magic happens by way of "regular expressions". These allow you to perform replacement operations leveraging parts of your earlier text.

Using Notepad++ and CMD.exe

(I used Notepad++ in my example here primarily because when it installs, it has an easy context-menu item to launch it. (And it is free and open-source.) WriteMonkey is a Windows-native Markdown editor that has regular expression support, so you should be able to use it instead of Notepad++ relatively easily. Unfortunately, MarkdownPad -- for all its flash -- totally lacks any replace support, let alone regular expression support.)

The editors bundled with Windows do not support regular expressions. Notepad++ is a free multi-file editor with regular expression support. It's also open-source (GPL license). Similar techniques should work with any number of editors.

First save a copy of your "BeatSheet.txt" to "MakeWhole.bat". Open "MakeWhole.bat" in Notepad++. You will be making changes to the file, and you do not want to lose your beat sheet.

Next you need to perform a Search -> "Replace...", then set the "search mode" to "regular expression".
In the "Find What" field: ^.*([{](.*)[}]).*$
In the "Replace To" field: type "\2.txt" >> WholeDocument.txt
Unfortunately, I didn't see a way to remove the non-title lines, so they will need to be removed by hand. Delete every line that doesn't begin with "type".

These "TYPE" commands will result in appending the files to the "WholeDocument.txt" file.

If you think you may (accidentally or not) run the script more than once, you should change the ">>" on the first line to ">". This will cause the script to overwrite the file when it starts, then append to it. (Without this change, or a "del WholeDocument.txt" added to the beginning of the file, it will keep appending the files resulting in a WholeDocument.txt file which is less than usable.)

We're not quite done, though. Ema creates files with spaces turned into underscores, so right now none of the files will be found.

This is also easy to fix using regular expressions.
In the "Find What" field: "(.*) (.*)"
In the "Replace To" field: "\1_\2"
Note that each time you perform this replace, it only removes a single space from the filenames, so you may need to run it more than once.
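The one-space-per-pass behaviour is easy to see outside the editor as well. This is a minimal sketch using sed rather than Notepad++, with a hypothetical line of the kind MakeWhole.bat would contain; the pattern is the same one as above, written in sed's syntax.

```shell
#!/bin/sh
# Hypothetical line from MakeWhole.bat, before the filename is fixed up.
LINE='type "My Scene Title.txt" >> WholeDocument.txt'

# The first group is greedy, so each pass replaces only the LAST space
# found between the two double quotes.
PASS1=$(printf '%s\n' "$LINE" | sed -e 's/"\(.*\) \(.*\)"/"\1_\2"/')
PASS2=$(printf '%s\n' "$PASS1" | sed -e 's/"\(.*\) \(.*\)"/"\1_\2"/')

printf '%s\n%s\n' "$PASS1" "$PASS2"
```

After the first pass only "Scene Title" has been joined; the second pass joins "My" as well. Spaces outside the quotes are never touched, because the pattern has to end on a closing quote.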

This should leave you with a "MakeWhole.bat" file which will work to create a WholeDocument.txt file.

One idiosyncrasy of this method is that if you do not end your files with a blank line, the first octothorpe/hash (#) in a file will butt up against the last paragraph of the previous file. If that is an issue for you, put a line reading echo.>> WholeDocument.txt between each of the scene lines (in CMD.exe, "echo." with the trailing period prints an empty line). Alternatively, you can create a file with a single blank line, then put type empty.txt >> WholeDocument.txt between each of the scene lines.

There isn't an easy way to split the chapters out of WholeDocument.txt using just an editor and CMD.exe. Hey, it is Windows, you have to expect some pain. Or you can install CoreUtils for Windows and use csplit, as mentioned in the "Using GnuWin packages" section.

Using GnuWin packages

The GnuWin packages are free/libre software packages to provide some tools which are standard on every operating system except Windows.

We will need to install CoreUtils for Windows and sed for Windows. We do not install a shell, so these will run from CMD.exe, and we create a BAT file to assist.
sed -n -e 'y/ /_/' -e 's/^.*\({\(.*\)}\).*$/cat "\2.txt"\nc:\\path\\to\\echo.exe/p' < BeatSheet.txt > MakeWhole.bat
MakeWhole.bat > WholeDocument.txt
This automatically adds a newline between files, and gives you one file called "WholeDocument.txt".

Now, I've not tested it; however, I know that you will need the absolute path to the CoreUtils "echo.exe" command before it will work. Without an absolute path it will use the CMD.exe internal "echo" command (even if you specify echo.exe -- it will actually echo "exe"). Unfortunately, I do not know the path the GnuWin packages install to, so I cannot provide it, and I have substituted \\path\\to\\echo.exe instead. You need to replace it with the path to the "echo.exe" included in the CoreUtils for Windows package, with the backslashes doubled.

If you want to split the WholeDocument.txt file into chapters, it is as simple as:
csplit -z -b %02d.txt -f Chapter WholeDocument.txt '/^# [A-Za-z0-9]/' '{*}'
This will split the file into chapters and keep the files with a ".txt" extension so Windows can open them.

Using GNU tools (Linux, Mac OS X, or Cygwin)
If using standard GNU-style tools from Linux, Mac OS X or Cygwin you would use (these should be two commands on two lines -- they wrap in this display):
sed -n -e 'y/ /_/' -e 's/^.*\({\(.*\)}\).*$/\2.txt/p' < BeatSheet.txt | (while read FILE ; do cat "$FILE" ; echo ; done) > WholeDocument.txt
csplit -z -f Chapter WholeDocument.txt '/^# [A-Za-z0-9]/' '{*}'
This automatically adds a newline between files. This gives you one file called "WholeDocument.txt" and it splits the chapters out in files called ChapterXX (no extension -- with XX replaced with numbers starting with 00.)
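To try the two commands without risking a real manuscript, they can be run against throwaway data. This is a minimal sketch: the beat sheet contents, scene names and chapter text are all made up for illustration, and it assumes GNU sed and GNU csplit are on the path.

```shell
#!/bin/sh
set -e

# Work in a scratch directory so nothing real is overwritten.
WORK=$(mktemp -d)
cd "$WORK"

# Hypothetical beat sheet: scene links in {curly braces}, the rest WikiWords.
cat > BeatSheet.txt <<'EOF'
Beat Sheet
1. {Opening Image} | ChSteven wakes up late
2. {Catalyst} | ChSteven meets ChAnna
EOF

# Ema stores the scene "Opening Image" as Opening_Image.txt.
printf '# Chapter One\n\nSteven overslept.\n' > Opening_Image.txt
printf '# Chapter Two\n\nAnna arrived.\n' > Catalyst.txt

# Stitch the scenes together in beat-sheet order, one blank line after each.
sed -n -e 'y/ /_/' -e 's/^.*\({\(.*\)}\).*$/\2.txt/p' < BeatSheet.txt |
  (while read FILE ; do cat "$FILE" ; echo ; done) > WholeDocument.txt

# Split at every level-one heading; csplit's byte counts are discarded.
csplit -z -f Chapter WholeDocument.txt '/^# [A-Za-z0-9]/' '{*}' > /dev/null

ls Chapter*
```

With the sample data this should leave one Chapter file per "# " heading. The "Beat Sheet" title line produces nothing because sed only prints lines where the substitution matched.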

Using the pandoc technique to create an ebook should now give you the table of contents, as expected.

If using ikiwiki instead of Ema

I am a big fan of ikiwiki, but the syntax is different from Ema. Since we normally use WikiWords to link documents, most of the time the differences will be invisible.

(It is expected that if you're using ikiwiki, you're using Linux, Mac OS X, Cygwin or another environment which is similar. -- ikiwiki doesn't want to run in Windows.)

First, you'll want to convert the draft/empty Ema-formatted files to ikiwiki standards. To start off, we need to rename them. I like to use mmv, which would work like so:
mmv '*.txt' '#1.mdwn'
The difference from Ema is that ikiwiki uses double square brackets for links to other wiki pages [[like so]] and allows spaces in filenames. Ema uses {curly braces} and converts spaces to underscores. Since we normally expect WikiWords to work, the only place this shows up as an issue is the beat sheet.

It is left to your discretion as to whether you want to rename my "Scene_Title.txt" (now .mdwn) file using 'mv' or 'mmv'. It is just one file, and there's just one underscore/space, so it is trivial one way or the other.

Converting the Ema-style scene titles to ikiwiki is done through a simple "sed" command:
sed -i -e 's/{\([^}]*\)}/[[\1]]/g' *.mdwn
That performs an in-place change from the Ema-style wiki-link to the ikiwiki-style wiki-link.
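The rewrite can be checked on a single line before touching any files. This sketch matches the link text with [^}]* so that two links on one line are converted separately; the sample line is hypothetical.

```shell
#!/bin/sh
# Hypothetical line containing two Ema-style links.
EMA_LINE='See {Opening Image} and {Catalyst}'

# Rewrite each {link} as [[link]]; no -i here, so no file is modified.
IKI_LINE=$(printf '%s\n' "$EMA_LINE" | sed -e 's/{\([^}]*\)}/[[\1]]/g')

printf '%s\n' "$IKI_LINE"
```

Once the output looks right, the same expression can be run with -i against the .mdwn files.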

Now for the differences in the compilation stage:
sed -n -e 's/^.*\(\[\[\(.*\)\]\]\).*$/\2.mdwn/p' < BeatSheet.mdwn | (while read FILE ; do cat "$FILE" ; echo ; done) > WholeDocument.mdwn
csplit -z -f Chapter WholeDocument.mdwn '/^# [A-Za-z0-9]/' '{*}'
The two differences are, again, that the spaces are left unmolested and that the extension is "mdwn" instead of "txt".
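Since the filenames now keep their spaces, it is worth confirming that the loop copes with them. A minimal sketch with made-up data (the scene name and text are hypothetical), assuming GNU sed:

```shell
#!/bin/sh
set -e

# Scratch directory with one hypothetical scene whose name contains a space.
WORK=$(mktemp -d)
cd "$WORK"
printf '1. [[Opening Image]] | the hero wakes up\n' > BeatSheet.mdwn
printf '# Chapter One\n\nSteven overslept.\n' > 'Opening Image.mdwn'

# "read FILE" puts the whole line in one variable, and the quoted "$FILE"
# keeps the space intact when it is handed to cat.
sed -n -e 's/^.*\(\[\[\(.*\)\]\]\).*$/\2.mdwn/p' < BeatSheet.mdwn |
  (while read FILE ; do cat "$FILE" ; echo ; done) > WholeDocument.mdwn

cat WholeDocument.mdwn
```

The quotes around "$FILE" are what make this work; without them the shell would split the name at the space and cat would look for two files.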


  1. So, in practice the WikiWord usage isn't nearly as nice as just explicitly marking things to link to. The problem comes with referring to characters by their first name, then needing to artificially make a WikiWord. This leads to common prefixes, and ultimately things like ChSteven instead of {Steven} or ((Steven)). ('Ch' for 'character'.)

    This means the BeatSheet doesn't cleanly denote filenames without changing our regular expressions.

    We need to be explicit about the important lines starting with a number, and we make sure we only care about things before the first '|' character.

    For sed:

    sed -n -e 'y/ /_/' -e 's/^[0-9]\+[.][^|]*{\([^|]*\)}[^|]*[|].*$/\1.txt /p' < BeatSheet.txt

    For editors with regular expression support:

    In the "Find What" field:

    ^[0-9]+[.][^|]*[{]([^|]*)[}][^|]*[|].*$

    In the "Replace To" field:

    type "\1.txt" >> WholeDocument.txt

  2. Hmm... It looks like the '#' marker for a scene break won't work as expected if converting to an ePUB from Pandoc. You'll get a more expected result if you do:

    sed -i -e 's/^[#]$/* * */' WholeDocument.txt

    Or in an editor supporting regular expressions, load WholeDocument.txt and...

    In the "Find What" field:

    ^[#]$

    In the "Replace To" field:

    * * *

    It is possible that the '#' as a scene separator would have an issue with your Markdown processor. This solution would work for that, as well.
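Returning to the first note: the stricter expression can be tried against a small beat sheet before trusting it with a real one. This is a minimal sketch with hypothetical beat sheet lines, using the sed expression from that note minus the trailing space in its replacement text, and assuming GNU sed (for the \+ operator).

```shell
#!/bin/sh
# Hypothetical beat sheet using {braces} for characters as well as scenes.
BEATS='Notes about {Steven} that are not a scene
1. {Opening Image} | {Steven} wakes up late
2. {Catalyst} | {Steven} meets {Anna}'

# Only numbered lines count, and only the link before the first "|" is used.
SCENES=$(printf '%s\n' "$BEATS" |
  sed -n -e 'y/ /_/' -e 's/^[0-9]\+[.][^|]*{\([^|]*\)}[^|]*[|].*$/\1.txt/p')

printf '%s\n' "$SCENES"
```

The unnumbered notes line and the character links after the "|" are ignored, so only the two scene filenames come out.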