PuntuEUS Observatory

THE PUNTUEUS OBSERVATORY

is an initiative of the PuntuEUS Foundation, and its aim is to measure the situation of the Basque language on the Internet. The Observatory conducts its analysis once a year, and the results are published on this website and in the annual report.

ANALYSES

the presence of the Basque language on the Internet and the presence of the Internet across the Basque Country, quantitatively and qualitatively, by measuring three spheres: the situation of the .EUS domain, the Internet in the Basque Country, and the situation of the Basque language on the Internet.

THE MAIN GOAL

of the PuntuEUS Observatory is to offer Basque society as a whole a useful tool to help define the strategy and policies needed to strengthen the presence of the Basque language on the Internet.

SITUATION OF THE .EUS DOMAIN

Quantitative and qualitative analysis of the domain: number of domain names and their distribution in terms of territory, type of organisation and level of domain penetration.

THE INTERNET IN THE BASQUE COUNTRY

THE BASQUE LANGUAGE ON THE INTERNET IN THE BASQUE COUNTRY

The presence of Basque on the Internet, in the main TLDs and on social networks.

DOCUMENTS

METHODOLOGY

This study analyses not only how far the .EUS domain and the main Internet domains have penetrated the Basque Country, but also the presence of Basque and the other dominant languages.

To do this, the entire content of the websites corresponding to the domains was analysed and classified by language. This makes it possible to know how much of the content in each of the Basque Country's domains is in Basque, Spanish, English, French and other languages. Two strategies were used to conduct this domain analysis:

- Domain-level crawling: First, the HTML content of the website corresponding to the domain is downloaded automatically using crawling techniques, including content generated with JavaScript. The crawling process handles any redirects within a domain intelligently, and parking pages are detected and blocked. After crawling, the text is extracted from the gathered HTML and its languages are identified automatically by means of statistical language models. The language model used can identify every language present in a multilingual text. This strategy generates a lot of traffic when large websites are processed, which is why it is only used for websites with little content.

- Domain-level web searches: The idea behind this strategy is to use search engines (Google, Bing, etc.) to measure how much presence a language has on a website. By running a search that is restricted to a specific domain and made up of the most significant words in a language (language filter words), we can estimate the number of content items in that language without having to download the website content. That is why we use this strategy for large websites. We do not apply it to small websites because many websites with little content are not fully indexed by the search engines. To confirm that the language filter words have worked correctly, we classify the first results returned by the search engines by language, using statistical models; this checks that the page counts returned really correspond to the language filter.
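The two strategies above can be illustrated with a minimal sketch. It is not the Observatory's system: a simple count of frequent function words stands in for the statistical language model, the word lists in FILTER_WORDS are short hypothetical examples, and the names identify_language, language_shares and site_query are invented for illustration. The only external convention assumed is the "site:" operator, which Google and Bing accept for restricting a search to one domain.

```python
import re
from collections import Counter

# Hypothetical language filter words: a few very frequent function words per
# language. A real system would use larger, tuned lists or an n-gram model.
FILTER_WORDS = {
    "eu": ("eta", "da", "ez", "bat", "du", "dira", "ere", "baina", "izan", "hau"),
    "es": ("el", "la", "los", "las", "y", "es", "una", "que", "del", "por"),
    "en": ("the", "and", "is", "of", "to", "in", "that", "for", "with", "this"),
    "fr": ("le", "les", "et", "est", "une", "des", "du", "que", "dans", "pour"),
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[^\W\d_]+", text.lower())


def identify_language(segment):
    """Return the language whose filter words occur most often, or 'other'."""
    tokens = tokenize(segment)
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in FILTER_WORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "other"


def language_shares(text):
    """Classify each paragraph separately so multilingual pages are split,
    then return the share of words attributed to each language."""
    counts = Counter()
    for segment in re.split(r"\n\s*\n", text):
        tokens = tokenize(segment)
        if tokens:
            counts[identify_language(segment)] += len(tokens)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def site_query(domain, lang, n_words=3):
    """Build a domain-restricted search query from the language filter words."""
    return f"site:{domain} " + " ".join(FILTER_WORDS[lang][:n_words])


page = (
    "Kaixo, hau da gure webgunea eta ez da handia.\n\n"
    "This is the home page of the association."
)
print(language_shares(page))
print(site_query("example.eus", "eu"))
```

Classifying paragraph by paragraph is the simplest way to handle a page that mixes languages; the same classifier could be run over the first results returned for a site query to check that the filter words selected the intended language.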

This measuring process is very complex, so even though the precision of the two strategies is very high, there is a margin of error. The process comprises a number of steps, each with a small error rate, and these errors accumulate along the whole chain. According to our calculations, the precision of the measurement results is between 70% and 80%.
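To see how small per-step errors add up, here is a rough sketch. It assumes the steps fail independently, so the overall precision is the product of the per-step precisions. The figure of five steps at 95% each is a hypothetical illustration, not a value reported by the Observatory.

```python
import math


def chain_precision(step_precisions):
    """Overall precision of a pipeline whose steps err independently."""
    return math.prod(step_precisions)


# Hypothetical pipeline: five steps (for example crawling, redirect handling,
# parking-page detection, text extraction, language identification), each
# assumed to be 95% precise.
steps = [0.95] * 5
print(f"{chain_precision(steps):.1%}")
```

Under these assumptions the chain ends up at roughly 77%, which shows how individually precise steps can still yield an overall figure in the 70% to 80% range.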

We incorporate improvements into the system with each yearly measurement to reduce this error rate, and these improvements are reflected in the results. In the 2017 analysis, for example, more redirects were taken into consideration in the crawling process, and many domain parking pages were blocked automatically. We also used a new statistical language model to identify passages in Basque within multilingual texts, and we took websites with very short texts into consideration.

In addition to these automatic strategies, the presence of Basque in the case of the .EUS domains was measured manually, thus achieving a higher level of precision.

With respect to domain distribution, the following domains were analysed:

- gTLDs or generic Top Level Domains: .EUS, .COM, .NET, .INFO, .ORG and .BIZ

- ccTLDs or country code Top Level Domains: .ES and .FR. In the case of the .EUS domain, some of the data needed for the language analyses are not public, and this is indicated throughout the corresponding analysis.
