Shemetev A. A.
(Saint Petersburg, Russian Federation)


TECHNICAL ANALYSIS OF THE GERMAN LANGUAGE AND ITS ECONOMIC SIGNIFICANCE BY METHODS OF DATA ANALYSIS TO IMPROVE THE QUALITY OF TEACHING

German is one of the important languages of our time. Mathematical linguistics can address questions such as which words are in general use and what the context of public discourse was over a given period. This article analyzes German-language news, Twitter posts, and blogs using methods of mathematical linguistics. The German text data are provided by Johns Hopkins University and are available online [1].

The first step was to merge all the German blog, Twitter, and news texts into a single .txt file.
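The merging step can be illustrated with a minimal Python sketch. It is not the author's code (the article refers to R); the function name `merge_text_files` is hypothetical, and the file names mentioned in the note after the block are assumptions about how the archive [1] is laid out once unpacked.

```python
from pathlib import Path


def merge_text_files(sources, target):
    """Concatenate UTF-8 text files into one file, one source after another."""
    target = Path(target)
    with target.open("w", encoding="utf-8") as out:
        for source in sources:
            # errors="replace" keeps the merge going if a file has stray bytes.
            with Path(source).open("r", encoding="utf-8", errors="replace") as src:
                for line in src:
                    # Make sure every line ends with exactly one newline, so the
                    # last line of one file never fuses with the first of the next.
                    out.write(line.rstrip("\n") + "\n")
    return target
```

Under the assumed layout, a call such as `merge_text_files(["de_DE.blogs.txt", "de_DE.news.txt", "de_DE.twitter.txt"], "de_DE.all.txt")` would produce the single corpus file used in the later steps.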

From the standpoint of mathematical linguistics, any text is a sequence of characters that form words, and the words in turn form unigrams, bigrams, trigrams, four-grams, five-grams, and n-grams of higher order.
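As a rough illustration of how words and n-grams are obtained from raw text, here is a short Python sketch. The tokenizer (lowercasing, letters only) is a simplifying assumption rather than the procedure used in the article, and the sample sentence is invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (runs of letters, incl. ä, ö, ü, ß)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "Der Hund sieht die Katze und die Katze sieht den Hund"
tokens = tokenize(sample)
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
```

With `n = 1` the function yields unigrams, with `n = 2` bigrams, and so on; counting the resulting tuples gives the frequency tables on which the rest of the analysis is built.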

To understand roughly what the news, blogs, and Twitter posts were about, one has to omit the most common words that carry no particular meaning on their own (stop words). The German words that are usually worth omitting are listed in the vocabulary of a built-in R function.
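The filtering idea can be sketched in Python as follows. The stop-word set here is a tiny hand-picked sample chosen for illustration only; it is not the R vocabulary mentioned above, and a real analysis would need a full list.

```python
# Illustrative sample only: a real German stop-word list is far longer.
GERMAN_STOP_WORDS = {
    "der", "die", "das", "dem", "den", "und", "ist",
    "in", "ein", "eine", "zu", "mit", "von",
}


def remove_stop_words(tokens, stop_words=GERMAN_STOP_WORDS):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "Die Regierung ist mit dem Haushalt in Berlin zufrieden".split()
content_words = remove_stop_words(tokens)
```

In this invented example only the words that carry the topic of the sentence survive the filter, which is the effect the step is meant to achieve on the full corpus.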

What remains are the meaningful keywords, which represent the content of everything said in the news, blogs, and Twitter over a specific period (profanity and some other common words without meaning were additionally removed beforehand by the author).
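One possible way to rank the remaining keywords is by frequency, as in the Python sketch below. The function name `top_keywords`, the parameter `extra_excluded` (standing in for the author's additional exclusion list), and the sample tokens are all assumptions made for the example.

```python
from collections import Counter


def top_keywords(tokens, stop_words, extra_excluded=frozenset(), k=10):
    """Return the k most frequent tokens found in neither exclusion set."""
    excluded = set(stop_words) | set(extra_excluded)
    counts = Counter(t.lower() for t in tokens if t.lower() not in excluded)
    return counts.most_common(k)


tokens = "wahl wahl wahl berlin berlin und und und und mist".split()
keywords = top_keywords(tokens, stop_words={"und"}, extra_excluded={"mist"}, k=2)
```

Keeping the extra exclusion list separate from the stop-word list lets the analyst adjust it by hand without touching the standard vocabulary.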

The analysis shows that about 2,000 words cover half of the German language as represented in this corpus; fewer than 16,000 words cover about three quarters of it; and 125,000 words cover 100%.
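Coverage figures of this kind can be computed by sorting words from most to least frequent and accumulating their counts until the desired share of all word occurrences is reached. The Python sketch below shows the idea on a toy token list; the function name is hypothetical and the numbers it produces have nothing to do with the corpus results reported above.

```python
from collections import Counter


def words_needed_for_coverage(tokens, share):
    """Smallest number of distinct words, taken from most to least frequent,
    whose occurrences make up at least `share` (between 0 and 1) of all tokens."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    covered = 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        if covered >= share * total:
            return rank
    return len(counts)


# Toy corpus: 10 tokens, 4 distinct words.
tokens = ["a"] * 5 + ["b"] * 3 + ["c"] + ["d"]
half = words_needed_for_coverage(tokens, 0.5)
three_quarters = words_needed_for_coverage(tokens, 0.75)
everything = words_needed_for_coverage(tokens, 1.0)
```

Because word frequencies are strongly skewed, the number of words needed grows slowly at first and then very quickly as the target share approaches 100%, which matches the pattern in the figures above.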

Learners of German may therefore wisely concentrate on the most frequently used words and set aside the words that are rarely used, in order to master the language more efficiently. Economically, this requires fewer person-hours of study and so saves time for each learner.


REFERENCES

1.  The German language news, Twitter and blogs database [electronic resource]: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip (date of access: 26.09.2020).

2.  Shemetev A. Word Recognition. NY: RPubs [electronic resource]: https://rpubs.com/alexshemetev/Word_Recognition_Midterm_Report (date of access: 26.09.2020).

 
