PuntuEUS Observatory

THE PUNTUEUS OBSERVATORY

is an initiative of the PuntuEUS Foundation that aims to measure the situation of the Basque language on the Internet. The Observatory conducts its analysis once a year, and the results are published on this website and in the annual report.

ANALYSES

the presence of the Basque language on the Internet and the presence of the Internet across the Basque Country, both quantitatively and qualitatively, by measuring three spheres: the situation of the .EUS domain, the Internet in the Basque Country, and the situation of the Basque language on the Internet.

THE MAIN GOAL

of the PuntuEUS Observatory is to offer Basque society as a whole a useful tool for defining the strategy and policies needed to strengthen the presence of the Basque language on the Internet.

SITUATION OF THE .EUS DOMAIN

Quantitative and qualitative analysis of the domain: number of domain names and their distribution in terms of territory, type of organisation and level of domain penetration.

THE INTERNET IN THE BASQUE COUNTRY

THE BASQUE LANGUAGE ON THE INTERNET IN THE BASQUE COUNTRY

The presence of Basque on the Internet, in the main TLDs and on social networks.

DOCUMENTS

METHODOLOGY

This study analyses not only how far the .EUS domain and the main Internet domains have penetrated the Basque Country, but also the presence of Basque and the other dominant languages.

To do this, the entire content of the websites corresponding to the domains was analysed and classified by language. This makes it possible to know how much of the content in each of the Basque Country’s domains is in Basque, Spanish, English, French and other languages. Two strategies were used to conduct this domain analysis:

- Domain-level crawling: First, the HTML content of the website corresponding to the domain is downloaded automatically, including content generated by JavaScript. The text is then extracted from the HTML and its language is identified automatically using language model statistics. This strategy achieves an accuracy of 0.77. Even though it runs in parallel with the domain processing, it can generate a lot of traffic when large websites are processed. For that reason, it is used only for low-content websites.

- Domain-level web searches: This strategy, based on Internet search engines (e.g. Google, Bing), is used to process large websites because it generates much less traffic than crawling. However, many low-content websites are not fully indexed by the search engines. For each domain, a search is run with the most significant words of each language (the language filter words), and the number of pages containing those words, as reported by the search engine, is used to calculate the amount of content in that language. The first hits returned by the search engines are automatically classified by language to check that the filter words are working properly. This strategy achieves an accuracy of 0.82.
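As a rough illustration of the first strategy (crawling), the text extraction and language identification steps could look like the sketch below. It is not the Observatory's implementation. Downloading and JavaScript rendering are left out, the sample sentences in `SAMPLES` are tiny illustrative stand-ins for real training corpora, and the character-trigram overlap score is a crude substitute for the statistical language models mentioned above. All names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


def trigrams(text):
    """Character trigrams of the lower-cased, whitespace-normalised text."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


# Assumption: one toy sentence per language. A real system would build
# its profiles from large corpora.
SAMPLES = {
    "eu": "euskara da gure hizkuntza eta interneten ere erabili nahi dugu",
    "es": "el idioma vasco es nuestra lengua y queremos usarla también en internet",
    "en": "the basque language is our language and we also want to use it on the internet",
    "fr": "la langue basque est notre langue et nous voulons aussi l'utiliser sur internet",
}
PROFILES = {lang: set(trigrams(sample)) for lang, sample in SAMPLES.items()}


def identify_language(text):
    """Pick the language whose profile covers the largest share of trigrams."""
    grams = trigrams(text)
    if not grams:
        return None

    def score(lang):
        return sum(g in PROFILES[lang] for g in grams) / len(grams)

    return max(PROFILES, key=score)


page = ("<html><head><script>var x = 1;</script></head>"
        "<body><p>Euskara da gure hizkuntza</p></body></html>")
print(identify_language(extract_text(page)))  # eu
```

With profiles this small the classifier is only reliable on text close to the samples. The point is the shape of the pipeline: HTML in, visible text out, one language label per page.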

In addition to these automatic strategies, the presence of Basque in the case of the .EUS domains was measured manually, thus achieving a higher level of precision.
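For the web-search strategy described above, the arithmetic behind the per-language figures can be sketched as follows. This is a minimal sketch under stated assumptions, not the Observatory's code: no search engine is actually queried, the hit counts and filter words are made-up examples, and the function names are hypothetical. The only external convention relied on is the `site:` operator, which Google and Bing both accept for restricting a search to one domain.

```python
def build_query(domain, filter_words):
    """Build a domain-restricted query from a language's filter words."""
    return f"site:{domain} " + " ".join(filter_words)


def language_shares(hit_counts):
    """Turn per-language hit counts for one domain into content shares."""
    total = sum(hit_counts.values())
    if total == 0:
        return {lang: 0.0 for lang in hit_counts}
    return {lang: count / total for lang, count in hit_counts.items()}


def filter_precision(expected_lang, detected_langs):
    """Share of the first hits whose detected language matches the filter.

    `detected_langs` holds the language assigned to each of the first hits
    by an automatic classifier. A low value suggests that the filter words
    for `expected_lang` are matching pages in other languages.
    """
    if not detected_langs:
        return None
    matches = sum(1 for lang in detected_langs if lang == expected_lang)
    return matches / len(detected_langs)


# Illustrative values only: hypothetical filter words and hit counts.
print(build_query("example.eus", ["eta", "da"]))          # site:example.eus eta da
print(language_shares({"eu": 30, "es": 60, "fr": 10}))    # {'eu': 0.3, 'es': 0.6, 'fr': 0.1}
print(filter_precision("eu", ["eu", "eu", "es", "eu"]))   # 0.75
```

Keeping the precision check separate from the share calculation mirrors the description: the hit counts give the estimate, and the classified first hits serve only to confirm that the filter words behave as intended.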

With respect to domain distribution, the following domains were analysed:

- gTLDs or generic Top Level Domains: .EUS, .COM, .NET, .INFO, .ORG and .BIZ

- ccTLDs or country code Top Level Domains: .ES and .FR. In the case of the .EUS domain, some of the data needed for the language analyses are not public, and this is indicated throughout the corresponding analysis.
