People's Archive of Rural India (PARI) has a rich database of translated content, created by a superb team of translators that is led by Smita Khator. Talented linguists and writers such as Qamar Tabrez and Medha Khale translate PARI's English content into major Indian languages.

PARI's translation report (Google Data Studio)
The translation rate has increased rapidly over the years, and the team is getting bigger and more diverse. (Google Data Studio)

Can we build a translation memory to make it easier for new translators to use the same terms and abide by the idiom set by previous translators?

Special thanks to the linguist Dr. Qamar Tabrez, PARI's prolific Hindi and Urdu translator, for this superb idea.

First, let's cache the paragraphs for each language on local storage. The query is quite heavy, and we don't want to hit the database frequently during our experiments. For kicks and giggles I thought it'd be a good idea to do this in Java. It isn't. Don't try this at home. Do it in Python. Much simpler.
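
As a rough illustration of the caching idea in Python, here is a minimal sketch. It assumes the heavy database query has already produced rows of (language code, article id, list of paragraphs); the function name `cache_paragraphs`, the row shape, and the `<language>/<article_id>.txt` layout are all hypothetical and are not taken from the PARI codebase.

```python
from pathlib import Path


def cache_paragraphs(rows, cache_dir):
    """Write each row to cache_dir/<language>/<article_id>.txt.

    `rows` is assumed to be an iterable of (language, article_id, paragraphs)
    tuples produced by the (expensive) database query. Paragraphs are stored
    separated by a blank line. Files that already exist are left untouched.
    Returns the list of files written on this call.
    """
    cache_dir = Path(cache_dir)
    written = []
    for language, article_id, paragraphs in rows:
        lang_dir = cache_dir / language
        lang_dir.mkdir(parents=True, exist_ok=True)
        path = lang_dir / f"{article_id}.txt"
        if path.exists():
            continue  # already cached on an earlier run
        path.write_text("\n\n".join(paragraphs), encoding="utf-8")
        written.append(path)
    return written
```

With the paragraphs on disk, later experiments can read the text files directly instead of querying the database again.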

siddadel/PARI-NLP-Java
A folder for each of PARI's 14 languages. Inside each folder are text paragraphs in that language.

Now we have a directory of text files, each containing paragraphs of text.

Next, create a CSV file from these files that juxtaposes paragraphs that are translations of each other.

Please see the entire code here:

siddadel/PARI-Data-Science
Python code for juxtaposing translated paragraphs
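
For readers who want a feel for the juxtaposition step without opening the repository, here is a minimal sketch. It is not the code from the repository; it assumes a hypothetical layout of `<language>/<article_id>.txt` files with paragraphs separated by blank lines, and it assumes that the n-th paragraph of a translation corresponds to the n-th paragraph of the source. Articles whose paragraph counts differ are skipped, because positional alignment cannot be trusted there.

```python
import csv
from pathlib import Path


def read_paragraphs(path):
    """Return the non-empty, blank-line-separated paragraphs of a text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def juxtapose(cache_dir, source_lang, target_lang, out_csv):
    """Write one CSV row per aligned paragraph pair; return the pair count."""
    cache_dir = Path(cache_dir)
    pairs = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["article_id", "paragraph_index", source_lang, target_lang])
        for source_file in sorted((cache_dir / source_lang).glob("*.txt")):
            target_file = cache_dir / target_lang / source_file.name
            if not target_file.exists():
                continue  # this article has no translation in target_lang
            source_paras = read_paragraphs(source_file)
            target_paras = read_paragraphs(target_file)
            if len(source_paras) != len(target_paras):
                continue  # positional alignment would be unreliable
            for i, (src, tgt) in enumerate(zip(source_paras, target_paras)):
                writer.writerow([source_file.stem, i, src, tgt])
                pairs += 1
    return pairs
```

A call such as `juxtapose("cache", "hi", "ur", "hi_ur.csv")` would then produce a Hindi/Urdu table, assuming both language folders exist under `cache`.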

Now all that remains is to create a Wagtail plugin, or even a Chrome plugin, that lets a web user find the relevant paragraphs in which the perfect translation of a word appears.
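
Whatever form the plugin takes, the lookup behind it could be as simple as the sketch below. It assumes a CSV of juxtaposed paragraphs with one column per language code (for example `en` and `hi`); the function name `find_translations` and its parameters are hypothetical, and plain substring matching is used here only as a stand-in for a proper search.

```python
import csv


def find_translations(csv_path, term, source_col, target_col, limit=5):
    """Return up to `limit` (source, target) paragraph pairs whose source
    paragraph contains `term`, compared case-insensitively."""
    needle = term.casefold()
    matches = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if needle in row[source_col].casefold():
                matches.append((row[source_col], row[target_col]))
                if len(matches) >= limit:
                    break
    return matches
```

A translator looking for how "landless" has been rendered before would call `find_translations("en_hi.csv", "landless", "en", "hi")` and read the paired Hindi paragraphs to see the established wording.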

Some questions that I wish to answer are:

  • Are 4550 Hindi/Urdu paragraph pairs a large enough dataset to train a transformer network for transliterating/translating between Urdu and Hindi? They are, after all, two registers of the same language.

  • Can this dataset be used to compute a kind of distance between languages that can corroborate the lexical and syntactic similarities between languages of the same linguistic group?