Panflute: Part 1
Panflute Filter: Translate¶
To translate Markdown text to Russian, we want to look for Para (paragraph) components, and extract the text from each paragraph as a string.
cat shepherd.md | pandoc -t json -f gfm -s | filters/translate.py
The translate.py
panflute filter determines what text to translate
and calls the Google Cloud Translate API to do the translating.
We use the Google Cloud Translate API using a Python client library
that they provide, rather than fussing with assembling our own
payloads and using the requests
library.
filters/translate.py
:
from panflute import * from google.cloud import translate def translate_document(elem, doc): ... def main(doc=None): return run_filter(translate_document, doc=doc) if __name__ == "__main__": main()
When we run this, it passes each document element through
translate_document()
.
Two things we wnat to do up front:
-
Extract links (these are problematic for translation, we deal with them by removing the link from the inline text and including a list at the bottom of the document of the (translated) original link text, plus the link itself.
-
Search for paragraphs of text, or headings, and replace text with a translation.
To do the first, we define a function called strip_links()
,
which will strip out all the links and add them to the bottom
of the document (more on this in a minute).
We pass this function to elem.walk()
to apply it to every
element underneath the current element. This means we can
write the strip_links()
function in a similar way to the
translate_document()
function: look for specific types of
elements, and modify them accordingly.
def translate_document(elem, doc): # Walk it and look for links first elem.walk(strip_links) ...
Another thing we want to do is translate paragraph text from Russian to English. We use stringify to turn a list of Str and Space objects into a regular string:
def translate_document(elem, doc): # Walk it and look for links first elem.walk(strip_links) if type(elem)==Para: english = stringify(doc) translation = english_to_X(english) ...
The english_to_X()
function (more on this shortly) will use
the Google Cloud Translate API to translate English to our language of choice.
This requires an API key with Google Cloud (see Setup).
The translate function will return unicode text. This must be turned back into structured text that pandoc understands, so we have to wrap words in Str() and replace spaces with Space objects.
To do this easily with a string in panflute, use convert_text()
:
english = stringify(elem) rooskie = english_to_russian(english) new_elems = convert_text(rooskie,input_format='markdown') return new_elems
To get Google Cloud Translate API working properly,
you must export an environment variable $GOOGLE_APPLICATION_CREDENTIALS
that points to a JSON file with your API keys
EACH TIME YOU RUN THE SCRIPT!!!
############################################ ############################################ ############## ############## ############## NOTE ############## ############## ############## ## ## This step is required for the translate ## API calls to work. Don't leave this out. ## ############################################ ############################################ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/project-key-00000.json