- PARI's articles, faces, albums and images are from rural India. We want to see how much of rural India PARI has covered.
- Specifically, which districts of the country have NOT been covered?
- Is there a geographical pattern to our coverage? Is there a geographical pattern to our misses?
- Is there a geographical pattern to the Water Crisis, covered extensively by PARI?
Please go to this GitHub repository to read the entire notebook.
To do this we need to plot our coverage on a map of India. But we have a problem. The names of locations in our database and the names of locations in the map from the lovely folks at Datameet have spelling differences. There is therefore a key mismatch in many cases, and a lot of important locations are dropped.
The Transliteration Problem
At PARI we have articles from thousands of locations. Location names in India are a peculiar phenomenon. Here are a few issues that one sees with a location database:
Indian names are transliterated in diverse ways. This is particularly true for Roman transliteration. Although there are international standards for transliteration, most folks apply the Sir Roger Dowler technique -- Siraj ud Daulah -- that is, they transliterate by ear. The problem is that a number of PhD theses pervade the space between a Marathi ear and a Punjabi tongue. I mean, each Indian linguistic group has its own perception of what letters in the English alphabet sound like.
The Odia tongue says Buldhana, the Punjabi ear hears Buldana. The Malayali tongue says Pathanamtitta, the Malayali ear hears Pattanamtitta.
Yours truly has transliterated both with his right Marathi and left Konkani ear. So I don't exactly know what they said. Rest assured Buldhana is in Maharashtra and what I say matters. But then people from Buldhana might disagree. They do.
And then of course, a number of Bengali tongues scream and shout about whether it is Nilgiris or The Nilgiris (but then that is not a transliteration problem. Or is it? Let's not get into a debate on that. Not with the Bengalis at any rate).
In addition to PhD theses, between the Punjabi tongue and the Marathi ear also stands the ubiquitous and all-powerful Indian bureaucracy. Bureaucrats, as has been shown by paleontologists, are also anatomically modern Homo sapiens. And hence must have their own cultural-linguistic backgrounds. Every major institution also has its own cupboards, look-up tables, lexicons, ready reckoners, dialects, jargon and argots. Joke.
Suffice to say that the Census of India calls Darjeeling what the Election Commission calls Darjiling. The Meteorological Department calls Purulia what the Ministry of Human Resource Development calls Puruliya. It is another matter that the HRD ministry might soon be called the Department of Education. I digress.
Allahabad becomes Prayagraj. Gurgaon becomes Gurugram.
- And then there is the problem of taxonomy. Places migrate from districts to other districts. They graduate from a mere district to a full-blown division. Gerrymandering. Redistricting. All that good stuff.
- Besides, millions of small hamlets where Dalits and Adivasis live are not recorded at all. They are dismissed as settlements of ostracized communities. How does one spell their names?
Fusion of Tables
If the schwa-conscious Survey of India calls a place Chamarajanagar in the district map of India but the entomologically inclined and etymologically insensitive Department of Horticulture calls it Chamrajnagar, then the district, which was once a part of the larger Mysore district and was formerly called "Sri Arikottara", has no chance of appearing on the map of gardens in India. Especially if the map is drafted by a machine or a careless cartographer.
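To see how a district falls off the map, here is a minimal sketch of an exact-key join in plain Python. The two tables and the garden counts are made up for illustration; they are not the PARI or Datameet data, and `exact_join` is a hypothetical helper, not something from the notebook.

```python
# Hypothetical sample data, invented for illustration only.
map_districts = {
    "Chamarajanagar": "Karnataka",
    "Darjiling": "West Bengal",
    "Rewa": "Madhya Pradesh",
}
garden_counts = {
    "Chamrajnagar": 12,
    "Darjeeling": 30,
    "Rewa": 7,
}


def exact_join(left, right):
    """Join two dicts on identical keys and report the left keys that were dropped."""
    joined = {name: (left[name], right[name]) for name in left if name in right}
    dropped = sorted(set(left) - set(right))
    return joined, dropped


joined, dropped = exact_join(map_districts, garden_counts)
print("joined:", joined)
print("dropped:", dropped)
```

Only Rewa survives the join. Chamarajanagar and Darjiling are silently dropped, even though both have data sitting in the other table under a slightly different spelling.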
Levenshtein Distance
One automated way of matching a string like "Darjeeling" with "Darjiling" is to use an edit-distance algorithm called Levenshtein distance, normalized to a similarity score between 0 and 1, where a higher score means a closer match. A threshold of 0.8 seems to work okay.
| | District 1 | District 2 | State 1 | State 2 | Distance | Decision |
|---|---|---|---|---|---|---|
| 0 | Kalimpong | | West Bengal | | 0.000000 | no match |
| 1 | Nainital | | Uttarakhand | | 0.000000 | no match |
| 2 | Kanchipuram | Kancheepuram | Tamil Nadu | Tamil Nadu | 0.869565 | match |
| 3 | Bardhaman | Barddhaman | West Bengal | West Bengal | 0.947368 | match |
| 4 | Howrah | Haora | West Bengal | West Bengal | 0.666667 | match |
| ... | ... | ... | ... | ... | ... | ... |
| 161 | Ashok Nagar | Ashoknagar | Madhya Pradesh | Madhya Pradesh | 0.952381 | match |
| 162 | Reva | Rewa | Madhya Pradesh | Madhya Pradesh | 0.750000 | match |
| 163 | Mahisagar | Mahesana | Gujarat | Gujarat | 0.705882 | match |
| 164 | Kondagaon | | Chhattisgarh | | 0.000000 | no match |
| 165 | Ramanagara | Chamrajnagar | Karnataka | Karnataka | 0.727273 | match |
166 rows × 6 columns
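For the curious, here is a rough sketch of how such a score can be computed without any library. It is an assumption on my part, not the notebook's code: it uses the ratio variant of Levenshtein in which a substitution costs 2 (one delete plus one insert), and it lowercases names before comparing. That choice reproduces several of the scores in the table, for example 0.869565 for Kanchipuram and Kancheepuram, but not all of them. The names `indel_distance`, `similarity` and `best_match` are hypothetical.

```python
def indel_distance(a: str, b: str) -> int:
    """Edit distance using only inserts and deletes, so a substitution costs 2."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1])
            else:
                current.append(1 + min(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score between 0 and 1; 1.0 means identical after lowercasing and trimming."""
    a, b = a.strip().lower(), b.strip().lower()
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - indel_distance(a, b)) / total


def best_match(name, candidates, threshold=0.8):
    """Return (closest candidate or None, score); None when the score is under the threshold."""
    if not candidates:
        return None, 0.0
    score, candidate = max((similarity(name, c), c) for c in candidates)
    return (candidate, score) if score >= threshold else (None, score)


# Spellings taken from the table above; the split into two lists is illustrative.
map_names = ["Darjiling", "Kancheepuram", "Barddhaman", "Rewa", "Mahesana"]
for name in ["Darjeeling", "Kanchipuram", "Bardhaman", "Reva", "Mahisagar"]:
    print(name, best_match(name, map_names))
```

With a cutoff of 0.8 this sketch rejects Reva and Rewa, which score 0.75, while the table marks that pair as a match, so the cutoff actually applied to the table may have been lower. A lower cutoff has its own cost: Mahisagar and Mahesana are two different districts of Gujarat, and Ramanagara and Chamrajnagar are two different districts of Karnataka, so matches in that range deserve a look by a human eye.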