The Garante for the protection of personal data (the Italian data protection authority) has published a series of indications that website operators, as data controllers, should follow to prevent so-called web scraping of their pages by companies developing generative artificial intelligence models. The advice also applies to operators of sites that publish information to fulfill specific legal obligations, for example transparency requirements under administrative law.
Measures to prevent data collection
The Garante had launched a fact-finding investigation on November 22, 2023, into the adoption of security measures by public and private websites to prevent the massive collection of personal data. On December 21, 2023, it asked interested parties to submit observations, comments, and suggestions on the measures that website operators can take.
Taking the contributions received into account, the Garante has published a set of indications on the measures that website operators, as data controllers, can take to prevent or deter web scraping. This is a technique that makes it possible to "comb" through content on the internet and build datasets used to train generative AI models. Companies use bots similar to those employed by Google or Microsoft to index websites.
The Garante recommends four measures. The first involves creating a reserved area, accessible only after registration, in which the data are made available to users; in this way the data are kept out of reach of bots. The second is inserting anti-scraping clauses into the terms of service. This does not prevent web scraping by itself, but it acts as a deterrent, since site operators can then bring a claim for breach of contract.
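Purely as an illustration of the first measure, here is a minimal sketch of a reserved area built with a small Flask application; the route names, the hard-coded user store, and the placeholder secret are hypothetical and not part of the Garante's guidance.

```python
# Minimal sketch: data served only inside a registered area (illustrative example).
from flask import Flask, request, session, abort

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder secret needed for session cookies

# Stand-in for a real user database; purely illustrative.
REGISTERED_USERS = {"alice": "s3cret"}

@app.route("/login", methods=["POST"])
def login():
    username = request.form.get("username", "")
    password = request.form.get("password", "")
    if REGISTERED_USERS.get(username) == password:
        session["user"] = username
        return "logged in"
    abort(401)

@app.route("/reserved-data")
def reserved_data():
    # Anonymous crawlers never reach this content: a logged-in session is required.
    if "user" not in session:
        abort(403)
    return "data visible only to registered users"

if __name__ == "__main__":
    app.run()
```

The point of the sketch is simply that the personal data are no longer exposed on publicly crawlable pages; any real implementation would of course use proper credential storage and hashing.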
The Garante also suggests monitoring network traffic (HTTP requests) to detect abnormal inbound and outbound data flows. It is likewise possible to block traffic coming from specific IP addresses (in some cases, web scraping activity resembles a DDoS attack).
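As a rough sketch of this kind of traffic monitoring (not taken from the Garante's guidance), the following standard-library Python snippet counts requests per IP address over a sliding time window and flags addresses that exceed a threshold; the window and threshold values are arbitrary placeholders.

```python
# Sketch of per-IP request monitoring with a sliding time window (illustrative values).
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60            # look at the last minute of traffic
MAX_REQUESTS_PER_WINDOW = 300  # arbitrary threshold for "abnormal" volume

_recent_requests = defaultdict(deque)  # client IP -> timestamps of recent requests

def is_suspicious(ip: str) -> bool:
    """Record one request from `ip` and report whether its recent volume looks abnormal."""
    now = time.time()
    window = _recent_requests[ip]
    window.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW

# A server hook would call is_suspicious(client_ip) on every incoming request
# and block or throttle the address when it returns True.
```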
Finally, site operators can limit bot access by adding CAPTCHAs, modifying the HTML markup, embedding text inside images, blocking specific user agents, and editing the robots.txt file. The latter should stop bots, but the solution is not very effective, since only a few companies disclose the names of their crawlers (for example, GPTBot used by OpenAI for GPT, or Google-Extended used by Google for Gemini).
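For the crawlers whose names are public, the corresponding robots.txt entries look like the following sketch, which disallows the two bots named above from the whole site; note that compliance with robots.txt is voluntary on the crawler's side.

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```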