Foreword

In the previous post,


I proposed that we frame our news item grouping and similarity problem as a binary classification problem, where pairs of news items are classified using various techniques of Semantic Textual Similarity (STS).
I proposed two broad approaches:

Supervised machine learning approach

Here we are required to create a large, binary-labelled training dataset. This dataset establishes the correlation between various features of a pair of news items and the final label assigned to the pair: similar (1) or dissimilar (0).
We are also required to make feature engineering choices. That is to say, we train the machine by feeding it features such as:
Cosine distance (and other comparative distances) between the document vectors of the two news items' content
Cosine distance (and other comparative distances) between the sentence embeddings of the two titles
The intersection set, or some weighted score, of the named entities (such as orgs, people, etc.) recognized by an NLP library such as spaCy
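The features listed above can be sketched in a few lines of plain Python. This is only an illustration, assuming the document vectors, title embeddings and named entities have already been computed elsewhere; the dictionary keys (`doc_vector`, `title_vector`, `entities`) and the function names are hypothetical and not taken from any library.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def entity_overlap(entities_a, entities_b):
    """Return the named entities shared by both items (case-insensitive)."""
    return {e.lower() for e in entities_a} & {e.lower() for e in entities_b}

def pair_features(item_a, item_b):
    """Build one feature row for a pair of news items."""
    shared = entity_overlap(item_a["entities"], item_b["entities"])
    return {
        "content_distance": cosine_distance(item_a["doc_vector"], item_b["doc_vector"]),
        "title_distance": cosine_distance(item_a["title_vector"], item_b["title_vector"]),
        "size_intersection": len(shared),
    }
```

Each pair of news items then becomes one row of numbers, which is the form both approaches in this post consume.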

Heuristic “Jugaad” approach

Here we don’t employ machine learning. Why? Because we don’t have enough labelled training data. Instead, we take the features identified above (cosine distance between doc vectors, a weighted score of the common named entities, and perhaps other metrics such as the Levenshtein edit distance between the titles) and we engineer a jugaad equation and threshold, or logical function, out of all these metrics, such that the threshold comparison or function returns 0 (dissimilar) or 1 (similar).

if (['content_similarity'] > 0.95) & (['size_intersection'] > 2) & (['weighted_score'] > 100) then return 1 # SIMILAR else return 0 # DISSIMILAR
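As a runnable version of the rule above, here is a minimal Python sketch. It assumes the three metrics for a pair have already been computed and collected in a dictionary; the function name `is_similar` and its parameter names are hypothetical, and the default thresholds are simply the ones from the rule above.

```python
def is_similar(metrics, content_threshold=0.95, min_shared_entities=2,
               min_weighted_score=100):
    """Return 1 (similar) if every metric clears its threshold, else 0 (dissimilar)."""
    if (
        metrics["content_similarity"] > content_threshold
        and metrics["size_intersection"] > min_shared_entities
        and metrics["weighted_score"] > min_weighted_score
    ):
        return 1  # SIMILAR
    return 0  # DISSIMILAR
```

Keeping the thresholds as parameters makes the trial-and-error tuning explicit: the coder changes the numbers, reruns on the data, and looks at the results.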
There are two main differences between the two approaches:

In the second approach, our coder jugaads a heuristic function that returns a boolean for similarity/dissimilarity. The human coder peruses the data available to her and designs such a heuristic through trial and error. In the first approach, the machine essentially automates the human jugaad and constructs this function itself.
For the machine to beat the human being at perusing the data, it needs training data with input/label rows, as shown in fig. 1. Approach 1 hence needs a huge training dataset, with correct labels (“Similar” or “Dissimilar”) assigned to pairs of news content. These labels need to be human-supervised and accurate. This is a mammoth undertaking that can only be done with incentives for the human data entry participants.
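To make "the machine automates the human jugaad" concrete, here is a deliberately tiny sketch: instead of a person guessing a threshold, the code tries every candidate threshold on one feature and keeps the one that agrees best with the labels. This is a toy stand-in for a real classifier, not the method proposed in this post; the function name `fit_threshold` and the four training rows are made up purely for illustration.

```python
def fit_threshold(rows, feature):
    """Pick the threshold on one feature that maximises accuracy on labelled rows."""
    candidates = sorted({row[feature] for row in rows})
    best_threshold, best_accuracy = None, -1.0
    for threshold in candidates:
        # Predict 1 (similar) when the feature is at or above the threshold.
        correct = sum(
            1 for row in rows
            if (1 if row[feature] >= threshold else 0) == row["label"]
        )
        accuracy = correct / len(rows)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy

# Illustrative, hand-written rows: one input feature plus a human-assigned label.
training_rows = [
    {"content_similarity": 0.99, "label": 1},
    {"content_similarity": 0.96, "label": 1},
    {"content_similarity": 0.80, "label": 0},
    {"content_similarity": 0.40, "label": 0},
]
threshold, accuracy = fit_threshold(training_rows, "content_similarity")
```

With only four rows the search is trivial. The point is that the quality of the learned threshold depends entirely on how many correctly labelled rows are available, which is why approach 1 needs the large dataset.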

The Data Wants to Know! - vagaries of journalism.

In this post, I propose that what I formulated as a binary classification (labels = 0 or 1) is in fact a multi-class classification problem, with more than two possible labels for a pair.
In modern newsrooms, a large number of reports come from news agencies and syndicated content. This is why we will often see the same report from PTI or Reuters (both news agencies) carried by Moneycontrol and Times of India.
There is, however, a catch: the title, or “headline”, of a news item is the mandate of the editor of the newspaper. Thus it is possible that identical reports with nearly the same content will have totally different titles.
Secondly, it is well within the purview of the newspaper, and they often exercise this prerogative, to change only a few phrases in the news agency copy before carrying it. Thus the same report will appear as distinct strings to the computer. (This is where the Levenshtein distance metric comes in handy.)
Thirdly, we might not want to show various versions of the same Reuters/PTI report to the user.
Lastly, on the same day two newspapers might refer to the same event, say the Fed raising interest rates, with very different headlines: “Fed raises interest rates, causing the market to crash” and “Copper rises after interest rates hiked”.
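Since lightly edited agency copy differs from the original by only a few characters or phrases, edit distance is a natural way to spot it. Below is a minimal sketch of the standard dynamic-programming Levenshtein distance, plus a normalised score. The `normalized_similarity` helper and its scaling by the longer string's length are assumptions made for illustration; a dedicated library would be faster on full article bodies.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def normalized_similarity(a, b):
    """Scale edit distance into [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

A high normalised score between two article bodies suggests the same agency report with minor edits. Two headlines about the same event, like the Fed examples above, would score low, so edit distance alone cannot catch that case.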
I propose that we introduce nuance into the similarity level of articles. This way, the news app UI will be able to make better user experience decisions based on the exact kind of similarity. That is, the news app UI may choose to treat copies of the same news agency article differently from different takes on the same event.
When I fetch news data from the news API, I see mainly four kinds of relationships between article pairs.

UNRELATED

Joe Biden's victory will endanger nation and destroy American greatness: Donald Trump

TikTok booms in Southeast Asia as it picks path through political minefields

RELATED

There is typically a chronology here.

NEET, JEE 2020 debate: Centre risking lives of students by holding Sept exams, says Mamata. 10 updates - Livemint	Jharkhand CM Hemant Soren writes to Ramesh Pokhriyal, seeks postponement of NEET, JEE exams
Will make America world's manufacturing superpower; tax credits to bring back jobs from China: Donald Trump	Republicans rally behind Trump, say Biden no longer has any principles

PARAPHRASES

At inauguration of Covid care centres: Ajit Pawar, Fadnavis share dais for the first time since failed govt formation attempt	Ajit Pawar, Devendra Fadnavis together at inauguration of Covid-19 hospital
Amazon orders 1,800 Mercedes electric vans for Europe deliveries	Amazon orders 1,800 Mercedes-Benz electric vans for European deliveries - Reuters
Six Opposition-ruled states seek postponement, move SC over NEET, JEE	6 oppn-ruled states file review petition in SC against order to hold NEET, JEE exams - Hindustan Times

IDENTICAL

J&J's Janssen to begin Phase II COVID-19 vaccine trials next week in Spain - Reuters	Johnson & Johnson's Janssen to begin Phase II COVID-19 vaccine trials next week in Spain
Gold steady as economic worries counter stronger Treasury yields - Reuters India	Gold steady as economic worries counter stronger Treasury yields
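The four relationship kinds above can be written down as an enumeration, so that a classifier's output and the app UI share one vocabulary. This is only a sketch: the class name `PairRelation` and, in particular, the UI actions in the mapping are hypothetical placeholders, since this post does not specify what the UI should do for each kind.

```python
from enum import Enum

class PairRelation(Enum):
    """The four kinds of relationship observed between article pairs."""
    UNRELATED = 0
    RELATED = 1
    PARAPHRASES = 2
    IDENTICAL = 3

# Hypothetical UI policies, one per relationship kind (assumptions, not requirements).
UI_ACTIONS = {
    PairRelation.UNRELATED: "show_separately",
    PairRelation.RELATED: "link_as_timeline",
    PairRelation.PARAPHRASES: "group_under_one_story",
    PairRelation.IDENTICAL: "hide_duplicate",
}

def ui_action(relation):
    """Look up what the app should do for a pair with the given relationship."""
    return UI_ACTIONS[relation]
```

With a single 0/1 label, the last three kinds would all collapse into "similar", and the UI could not tell a duplicate from a follow-up story.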

Another scheme of classification

Another scheme would be to classify a pair of news items by the real-life journalistic phenomenon that relates them. For example:

Unrelated
Same real-life event, different outlets
Same agency report in different outlets
Identical reports in the same outlet
Chronologically related articles
Topically related articles

This scheme might make it easier for the app UI to make decisions.