@kaustubhhiware
Created September 16, 2017 16:21

All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media

  • Why am I reading this paper?

It is relevant to my Natural Language Processing term project.

  • What is it about?

Methods to identify code-borrowing using Online Social Media (OSM).

  • What is code-borrowing?

Code-borrowing (in the scope of NLP) refers to using a word from another language in the native language, either because the native language lacks a proper word for it or because the native alternative is rarely used. Example: 'Class jayega?' [Will you go to class?] Here, Hindi (the native language) does have an alternative for class, kaksha, but it is used less frequently.

On the other hand, code-mixing (again, in this scope) is when someone mixes two languages even though alternatives exist in both and are used frequently enough. How is it different from code-borrowing? Ask a native speaker who speaks only the native language whether a particular word (say, class) can be used in native conversations.

  • Excerpts and Findings

  • It is possible for code-mixed words to gradually become code-borrowed over the course of years.

  • The Spearman's rank correlation obtained was 0.62, more than double the most competitive baseline (0.26).

  • Hindi is considered the native language (L1), and English the foreign language (L2).

  • Existing baseline metric:

The value of

log(F_L2 / F_L1)

is considered as the baseline metric. F_L2 denotes the frequency of the L1-transliterated form of the word w in a standard L1 newspaper corpus; F_L1 denotes the frequency of the L1 translation of the word w in the same newspaper corpus.

The more positive the value of this metric for a word w, the higher the likeliness of it being borrowed.

Ranking – Based on the values obtained from the above metric for a set of target words, we rank these words; words with high positive values feature at the top of the rank list and words with high negative values feature at the bottom of the list.
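
A minimal sketch of this baseline scoring and ranking, assuming hypothetical frequency dictionaries (`freq_translit`, `freq_translation`) built from the newspaper corpus; the names and toy counts below are mine, not the paper's:

```python
import math

def baseline_score(word, freq_translit, freq_translation):
    """log(F_L2 / F_L1): frequency of the L1-transliterated form of the word
    over the frequency of its L1 translation in the newspaper corpus."""
    f_l2 = freq_translit[word]     # F_L2
    f_l1 = freq_translation[word]  # F_L1
    return math.log(f_l2 / f_l1)

def rank_by_baseline(words, freq_translit, freq_translation):
    """Most positive scores (most likely borrowed) come first."""
    return sorted(words,
                  key=lambda w: baseline_score(w, freq_translit, freq_translation),
                  reverse=True)

# toy usage with made-up corpus counts
freq_translit = {"class": 120, "university": 15}
freq_translation = {"class": 40, "university": 300}
print(rank_by_baseline(["class", "university"], freq_translit, freq_translation))
# ['class', 'university']
```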

  • Proposed metric:

All the words must first be tagged. The different tags that a word can have are: L1, L2, NE (Named Entity) and Others.

A tweet-level tag is then created:

1. L1: Almost every word (> 90%) in the tweet is tagged as L1.
2. L2: Almost every word (> 90%) in the tweet is tagged as L2.
3. CML1: Code-mixed tweet, but the majority (i.e., > 50%) of the words are tagged as L1.
4. CML2: Code-mixed tweet, but the majority (i.e., > 50%) of the words are tagged as L2.
5. CMEQ: Code-mixed tweet having a very similar number of words tagged as L1 and L2 respectively.
6. Code Switched: There is a trail of L1 words followed by a trail of L2 words, or vice versa.
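
A rough sketch of how such a tweet-level tagger might look, given per-word tags. The thresholds follow the list above, while the precedence of the code-switched check and the treatment of NE/Others words are my own assumptions:

```python
def tweet_level_tag(word_tags, near_all=0.9):
    """Assign a tweet-level tag from per-word tags ('L1', 'L2', 'NE', 'Others').
    NE/Others words are ignored when computing language fractions (assumption)."""
    lang_tags = [t for t in word_tags if t in ("L1", "L2")]
    if not lang_tags:
        return "Others"
    frac_l1 = lang_tags.count("L1") / len(lang_tags)
    frac_l2 = 1.0 - frac_l1
    if frac_l1 > near_all:
        return "L1"
    if frac_l2 > near_all:
        return "L2"
    # Code switched: one trail of L1 followed by one trail of L2, or vice versa,
    # i.e. exactly one switch point between the two languages.
    switches = sum(1 for a, b in zip(lang_tags, lang_tags[1:]) if a != b)
    if switches == 1:
        return "Code Switched"
    if frac_l1 > 0.5:
        return "CML1"
    if frac_l2 > 0.5:
        return "CML2"
    return "CMEQ"

print(tweet_level_tag(["L1", "L1", "L2", "L1", "NE"]))  # CML1
```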

Unique User Ratio:

U_L1 (U_L2, U_CML1) is the number of unique users who have used the word w in an L1 (L2, CML1) tweet at least once.

Unique Tweet Ratio:

T_L1 (T_L2, T_CML1) is the total number of L1 (L2, CML1) tweets which contain the word w.

Unique Phrase Ratio:

P_L1 / P_L2

P_L1 (P_L2) is the number of L1 (L2) phrases which contain the word w.

Ranking – We prepare a separate rank list of the target words based on each of the three proposed metrics – UUR, UTR and UPR. The higher the value of each of these metrics, the higher the likeliness of the word w being borrowed and the higher up it is in the rank list.
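
A sketch of the three proposed metrics as I read them. Only UPR's form (P_L1 / P_L2) is stated explicitly above; for UUR and UTR I am assuming that the L1 and CML1 counts are pooled in the numerator against the L2 count, which is an interpretation of the definitions rather than a formula quoted from the paper:

```python
def uur(u_l1, u_cml1, u_l2):
    """Unique User Ratio (assumed form: (U_L1 + U_CML1) / U_L2)."""
    return (u_l1 + u_cml1) / u_l2

def utr(t_l1, t_cml1, t_l2):
    """Unique Tweet Ratio (same assumed structure over tweet counts)."""
    return (t_l1 + t_cml1) / t_l2

def upr(p_l1, p_l2):
    """Unique Phrase Ratio: P_L1 / P_L2, as given above."""
    return p_l1 / p_l2

def rank_words(scores):
    """Higher metric value -> more likely borrowed -> nearer the top."""
    return sorted(scores, key=scores.get, reverse=True)

# toy usage with made-up counts for two target words
scores = {"class": uur(u_l1=50, u_cml1=30, u_l2=10),
          "university": uur(u_l1=5, u_cml1=8, u_l2=40)}
print(rank_words(scores))  # ['class', 'university']
```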

  • Dataset:

  • Hashtag-specific tweets were crawled.

  • Timelines of users who used code-mixing were crawled.

  • Evaluation criteria:

(i) How well the UUR-, UTR- and UPR-based rankings of the hlws set, the mws set and the full set correlate with the ground-truth ranking, in comparison to the rank given by the baseline metric.
(ii) How well the different rank ranges obtained from our metric align with the ground truth, as compared to the baseline metric.
(iii) Whether there are some systematic effects of the age group of the survey participants on the rank correspondence.
(iv) How the metrics, if computed from the tweets of users who (a) rarely mix languages, (b) almost always mix languages and (c) are in between (a) and (b), align with the ground truth.
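
For criterion (i), agreement between a metric-based rank list and the ground-truth rank list can be measured with Spearman's rank correlation (the 0.62 figure quoted earlier). A sketch using scipy; the word lists here are placeholders:

```python
from scipy.stats import spearmanr

def rank_positions(ordered_words):
    """Map each word to its position in the rank list (1 = top)."""
    return {w: i + 1 for i, w in enumerate(ordered_words)}

def rank_correlation(metric_ranking, ground_truth_ranking):
    """Spearman's rho between two rank lists over the same set of words."""
    mr = rank_positions(metric_ranking)
    gt = rank_positions(ground_truth_ranking)
    words = list(gt)
    rho, _ = spearmanr([mr[w] for w in words], [gt[w] for w in words])
    return rho

# placeholder rank lists
print(rank_correlation(["class", "exam", "kaksha"],
                       ["class", "kaksha", "exam"]))  # 0.5
```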

Rank ranges: We split each of the three rank lists (UUR, ground truth and baseline) into five equal-sized ranges:
(i) surely borrowed (SB): the top 20% of words from each list,
(ii) likely borrowed (LB): the next 20% of words from each list,
(iii) borderline (BL): the subsequent 20% of words from each list,
(iv) likely mixed (LM): the next 20% of words from each list, and
(v) surely mixed (SM): the last 20% of words from each rank list.
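
A small sketch of this quintile split, assuming for simplicity that the list length is divisible by five:

```python
def rank_ranges(ranked_words):
    """Split a rank list into five equal-sized ranges, from
    surely borrowed (SB) at the top to surely mixed (SM) at the bottom."""
    labels = ["SB", "LB", "BL", "LM", "SM"]
    step = len(ranked_words) // 5
    return {label: ranked_words[i * step:(i + 1) * step]
            for i, label in enumerate(labels)}

ranked = [f"w{i}" for i in range(10)]  # placeholder rank list
print(rank_ranges(ranked))  # {'SB': ['w0', 'w1'], 'LB': ['w2', 'w3'], ...}
```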

  • Language preference factor

Online survey

The multiple-choice question had the following three options, and the participants were asked to select the one they preferred most and found more natural – (i) a Hindi sentence with the target word as the only English word, (ii) the same Hindi sentence as in (i) but with the target word replaced by its Hindi translation, and (iii) none of the above two options.

For each target word, a language preference factor (LPF) is computed, defined as (Count_En − Count_Hi), where Count_En refers to the number of survey participants who preferred the sentence containing the target word itself, and Count_Hi refers to the number who preferred the sentence containing its Hindi translation.
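
A trivial sketch of the LPF computation for one target word, with made-up survey responses:

```python
def language_preference_factor(responses):
    """LPF = Count_En - Count_Hi, where each response is 'en' (preferred the
    sentence with the English target word), 'hi' (preferred the Hindi
    translation) or 'none' (neither option)."""
    return responses.count("en") - responses.count("hi")

# made-up responses from ten survey participants for one target word
print(language_preference_factor(["en", "en", "hi", "none", "en",
                                  "en", "hi", "en", "none", "en"]))  # 4
```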

Linguists broadly define three forms of borrowing: (i) cultural, (ii) core, and (iii) therapeutic borrowing.

In cultural borrowing, a foreign word is borrowed into the native language to fill a lexical gap: there is no equivalent native-language word to express the foreign concept. For instance, the English word ‘computer’ has been borrowed into many Indian languages since it has no corresponding term in those languages.

Core borrowing, in contrast, is when a foreign word is borrowed even though the native language already has a usable equivalent (as with class/kaksha above). Therapeutic borrowing refers to borrowing words to avoid taboo and homonymy in the native language.

It would be useful to study and classify various other linguistic phenomena closely related to core borrowing, such as: (i) loanwords, where the form of a foreign word and its meaning, or one component of its meaning, gets borrowed, (ii) calques, where a foreign word or idiom is translated into existing words of the native language, and (iii) semantic loans, where the word already exists in the native language but an additional meaning is borrowed from another language and added to its existing meaning.

That is all.
