29 August 2017

Be simpler!

"De-Jargonizer" will make scientific texts clearer for a wide audience

Elizaveta Ivtushok, N+1

Israeli scientists have presented a program that automatically assesses how intelligible a scientific text is. Their De-Jargonizer algorithm sorts words into three frequency groups and uses a simple formula to estimate how well a wide audience can understand the text. After testing the program on articles from PLoS journals, the scientists found that the abstracts of some articles may contain up to 27 percent rarely encountered scientific vocabulary. The article describing how the program works is available in PLoS One, and the algorithm can be tried out on its website.

The curse of knowledge is a cognitive bias in which a knowledgeable person struggles to explain something to a less informed listener because they cannot put themselves in the listener's place and imagine not knowing it. Scientists who publish in peer-reviewed journals and give lectures may run into this bias: specialized vocabulary can leave a topic incomprehensible to listeners and readers. Research shows that to understand a text, the reader must be familiar with 98 percent of the words in it, while natural science texts and computer-related literature may consist of about a quarter specialized scientific vocabulary.

The authors of the new work presented De-Jargonizer, a program that processes a scientific text and tells the author what percentage of it consists of specialized vocabulary and rare words, along with a score indicating whether a wide audience can understand the text. To do this, the researchers compiled a large corpus (500 thousand unique words) of scientific articles. The words in this corpus were divided into three groups: high-frequency (the 2,000 most common English words and their derived forms), rare (words of lower frequency) and jargon (scientific vocabulary).


An example of the algorithm at work on an abstract (I) and a lay summary (II) of an article from a PLOS journal. Rare words are highlighted in yellow and scientific vocabulary in red.

The algorithm is fully operational, has a user-friendly interface and is available to the general public. De-Jargonizer uses the corpus to determine the frequency of each word in the text, assigns it to one of the three groups (high-frequency, rare or jargon) and tells the author what percentage of the text falls into each group. From these percentages, the algorithm then rates the accessibility of the text to a wide audience with a score from 0 to 100.
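
The classify-then-score procedure can be illustrated with a short Python sketch. This is not the authors' code. The two word lists are tiny hypothetical stand-ins for the corpus-derived groups, the rule that any unlisted word counts as jargon is an assumption, and the weights in the scoring formula (half a point per percent of rare words, one point per percent of jargon) are illustrative choices rather than values taken from this article.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the corpus-derived word groups.
HIGH_FREQUENCY = {"we", "show", "that", "the", "is", "by", "a", "of", "in", "and"}
RARE = {"membrane", "protein", "regulated"}


def classify(word):
    """Assign a lower-cased word to one of three groups."""
    if word in HIGH_FREQUENCY:
        return "high"
    if word in RARE:
        return "rare"
    # Assumption: anything absent from both lists is treated as jargon.
    return "jargon"


def analyze(text, rare_weight=0.5, jargon_weight=1.0):
    """Return the percentage of words per group and a 0-100 score.

    The weights are illustrative assumptions, not published values.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(classify(w) for w in words)
    shares = {g: 100.0 * counts[g] / len(words) for g in ("high", "rare", "jargon")}
    penalty = rare_weight * shares["rare"] + jargon_weight * shares["jargon"]
    return shares, max(0.0, 100.0 - penalty)


shares, score = analyze("We show that the membrane protein is regulated by autophagy")
print(shares, score)
```

In this toy sentence six of the ten words are high-frequency, three are rare and one is jargon, so the sketch reports 60, 30 and 10 percent and a score of 75. A real implementation would look each word up in a full frequency corpus rather than in hand-written sets.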

The authors tested De-Jargonizer on 500 articles from various journals published by PLOS, which cover a range of scientific fields. For each article, the researchers took the abstract and the short summary written for a wide audience (lay summary). The results showed that abstracts of biology texts contain up to 10 percent specialized vocabulary, while summaries for a wide audience contain about eight percent. So although text written for a wide audience contains less scientific jargon, it is still far from understandable: to be understood, a text may contain no more than two percent unfamiliar vocabulary.
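
To make the gap between eight percent and two percent concrete, here is a small Python sketch of the arithmetic. The function names and the 250-word summary length are hypothetical, chosen only for illustration. The 98 percent threshold is the one cited above.

```python
def unfamiliar_budget(total_words, required_percent=98):
    """Largest number of unfamiliar words that still meets the threshold."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_words * (100 - required_percent) // 100


def meets_threshold(unfamiliar_words, total_words, required_percent=98):
    """True if the share of familiar words reaches the required percentage."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    familiar = total_words - unfamiliar_words
    return familiar * 100 >= required_percent * total_words


# Hypothetical 250-word lay summary with 8 percent jargon (20 words).
total = 250
jargon_words = total * 8 // 100
print(unfamiliar_budget(total))              # words the reader may not know
print(meets_threshold(jargon_words, total))  # does the summary pass?
```

Under these assumptions the summary could afford only 5 unfamiliar words, but it contains 20, so it falls well short of the threshold.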

The curse of knowledge is one of the most common shortcomings of academic writing. Automatically detecting such flaws can help scientists avoid misunderstandings when communicating with a wide audience, including a scientific one. The authors plan to update the corpus used by the algorithm periodically and to add other languages to it.

Portal "Eternal youth" http://vechnayamolodost.ru  29.08.2017

