In this blog, I will briefly address data scraping practices and the general privacy concerns associated with them.
It is worth mentioning that those who automatically collect data from other sites will separately need to consider associated non-privacy concerns, including the possible impact of website/data source terms, copyright and other intellectual property issues.
What is data scraping?
Data scraping is a general term that describes a plethora of Internet-based data retrieval methodologies used without the permission of the data owner. Data scraping can be manual or automatic; where it is conducted automatically, it relies on machine-to-machine interaction.
Data scraping practices vary from general extraction of data, to extraction in a specific way, and include:
Screen-scraping: retrieving targeted data from a web-page and removing irrelevant data
Web-scraping: retrieving all underlying data from a website, including website scripts
Web-crawling (also known as web-harvesting, web-spiders, web robots, search bots): as the name implies, this method crawls the pages of websites and indexes available words and content in a domain.
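To make the first of these concrete, here is a minimal sketch of screen-scraping in Python, using only the standard library. It keeps the targeted data (the text of elements marked with a "price" class) and discards everything else on the page. The sample page, the class name and the function names are all hypothetical and chosen purely for illustration; a real scraper would fetch the page over HTTP and would need more robust parsing.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collects the text of every element whose class list contains 'price'."""

    def __init__(self):
        super().__init__()
        self._inside_target = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._inside_target = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the targeted element.
        self._inside_target = False

    def handle_data(self, data):
        if self._inside_target and data.strip():
            self.prices.append(data.strip())


# Hypothetical page content standing in for a fetched web page.
SAMPLE_PAGE = """
<html><body>
  <h1>Widgets</h1>
  <p class="blurb">Our best-selling range.</p>
  <span class="price">10.00</span>
  <span class="price sale">7.50</span>
</body></html>
"""


def scrape_prices(html):
    """Return only the targeted values, ignoring the rest of the page."""
    parser = PriceScraper()
    parser.feed(html)
    return parser.prices


print(scrape_prices(SAMPLE_PAGE))
```

The heading and the blurb are read by the parser but never stored, which is the "removing irrelevant data" part of the definition above.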
Who scrapes data?
Data scraping is prevalent in general business practices. It may not seem obvious, but recruitment drives, trend identification, marketing campaigns, sales and lead generation, credit card and customer risk assessments, and intelligence gathering typically involve scraping data to enhance databases, information repositories and internal functions. Insurance companies often rely on data scraped from open sources (that is, data that is publicly available) when defending exaggerated claims at trial or for fraud prevention. In practice, this means that the insurer (or a third party acting on its behalf) might, for example, retrieve photographs and status updates that refute an alleged claim.
Purpose and lawful basis for data scraping
The GDPR requires controllers to have a purpose for processing. In data scraping terms, businesses that cannot justify or establish a legitimate purpose should not engage in the practice. Naturally, a careful and considered documented analysis of the purpose is recommended, bearing in mind that individuals should reasonably expect their data to be processed for the purpose identified.
Let us scrape everything!
Data scraping practices allow for the extraction of vast quantities of data from websites. Often, businesses aim to capture as much data as possible on the off chance that the data serves a future use or purpose. This, however, risks going against two of the GDPR's key principles: purpose limitation and data minimisation.
Purpose limitation means that businesses should only collect and process personal data to achieve specified, explicit and legitimate purposes and not engage in further processing unless it is compatible with the original purpose for which the data was scraped.
Data minimisation means that businesses must only collect and process personal data that is relevant, necessary and adequate to accomplish the purpose for which the data was scraped. Data minimisation aims to reduce data collected to the lowest possible level for realising the processing purposes.
Practically speaking, businesses should apply tests of necessity and proportionality to data scraping practices. Businesses should therefore ask themselves whether all of the data scraped is necessary for and directly relevant to achieving the intended purpose, and then consider how much data they should scrape.
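One way to put data minimisation into practice is to decide, before scraping, which fields the purpose actually requires and to discard everything else before it is stored. The sketch below illustrates this with an allow-list in Python. The purpose, the field names and the sample record are all hypothetical, and which fields are genuinely necessary is a judgement that the necessity and proportionality tests must answer, not the code.

```python
# Hypothetical purpose: checking whether a business listing is still active.
# Only the fields needed for that purpose appear on the allow-list.
ALLOWED_FIELDS = {"business_name", "listing_url", "last_updated"}


def minimise(record, allowed_fields=ALLOWED_FIELDS):
    """Return a copy of the record containing only allow-listed fields."""
    return {key: value for key, value in record.items() if key in allowed_fields}


# Hypothetical scraped record containing more than the purpose requires.
scraped = {
    "business_name": "Example Ltd",
    "listing_url": "https://example.com/listing/1",
    "last_updated": "2019-03-01",
    "owner_home_address": "1 Example Street",
    "owner_date_of_birth": "1980-01-01",
}

stored = minimise(scraped)
print(sorted(stored))
```

An allow-list is used here instead of a block-list so that any new field appearing on the source website is dropped by default until someone has decided it is needed.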
Legitimate Interest as a lawful basis
Where personal data is involved, organisations must have a lawful basis to conduct data scraping. There are six lawful bases available under the GDPR: consent; contract with the data subject; compliance with a legal obligation; vital interest; public interest; and legitimate interest. Of these, the only potentially fitting lawful ground is legitimate interest. (Consent can quickly be discounted on the basis that most individuals will not consent to having their data scraped.)
Legitimate interest allows processing to be undertaken if it is necessary for the purposes of business (or other) interests, except where such interests are overridden by the interests or fundamental rights and freedoms of individuals. Processing will be lawful if, as a result of a balancing of interests, the legitimate interest of the business prevails over the interests of the individuals whose data will be scraped. This balancing test needs to be well documented for accountability purposes, and many controllers relying on this basis will implement a formal "legitimate interest assessment".
Even with an identified lawful basis, not all personal data is fair game for scraping. Where special categories of personal data are scraped (that is, personal data requiring extra levels of protection under the GDPR, such as race, religion, health data and political opinions), the explicit consent of the individual is required unless an exemption applies under local law.
Individuals don't need to know about it… or do they?
The GDPR also requires controllers to be transparent. Data scraping, by its nature, is a practice that is often difficult to be fair and transparent about.
Where businesses engage data scraping service providers, the business is responsible for providing the individuals with a privacy notice. The privacy notice must contain specific information, set out in Article 14 GDPR, which includes data subject rights and how to exercise them, and it must be provided to the individuals within one month of scraping their data. There are some exceptions to this rule, for instance, if the provision of the information proves impossible or would involve a disproportionate effort. If those circumstances apply, businesses can take alternative measures to protect individuals' rights, freedoms and legitimate interests, including making the privacy notice publicly available. Note, however, that disproportionate effort may be interpreted narrowly in some jurisdictions. For example, in a recent decision (March 2019), the Polish Data Protection Authority (Polish DPA) fined a data scraping company €220k for failing to provide privacy notices to 5.7 million individuals whose data was scraped from a public register. The Polish DPA rejected the argument that placing a privacy notice on the data scraping business's website was enough to notify individuals, particularly where individuals were not aware that their data had been scraped and was being processed.
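For businesses that track scraped records programmatically, the one-month limit mentioned above can be recorded alongside each record as a latest notification date. The Python sketch below shows one way of doing this. The function name is hypothetical, and the counting rule is my own assumption: one calendar month is added to the collection date, and where the following month has no matching day (for example, 31 January), the last day of that month is used. The date should be treated as an outer limit, not as a target.

```python
import calendar
from datetime import date


def notice_deadline(collected_on):
    """Return the date one calendar month after collected_on.

    Assumption: if the following month has no matching day, the
    deadline falls on the last day of that month.
    """
    # Roll over into the next year when the collection month is December.
    year = collected_on.year + (collected_on.month // 12)
    month = collected_on.month % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(collected_on.day, last_day))


print(notice_deadline(date(2019, 3, 15)))
print(notice_deadline(date(2019, 1, 31)))
```

Under these assumptions, data scraped on 15 March 2019 gives a latest notification date of 15 April 2019, and data scraped on 31 January 2019 gives 28 February 2019.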
Data Protection Impact Assessment (DPIA)
If it is not possible to adhere to Article 14, and as the data is not collected directly from the individuals, data scraping is considered "invisible processing". Some regulators (for example, the Information Commissioner's Office (ICO)) consider this to be "high risk" processing for which a DPIA is required. In addition to the national lists of what is high risk, the European Data Protection Board (EDPB) gives "gathering of public social media for generating profiles" as an example of processing that requires a DPIA. This is because such processing can meet several of the relevant criteria: evaluating or scoring, processing data on a large scale, matching and combining datasets, and processing sensitive data or data of a highly personal nature.
Other privacy concerns
If a controller engages a service provider to undertake scraping on its behalf, it must ensure that it has in place a Data Processing Agreement with the service provider, incorporating the GDPR's processor obligations.
Data scraping does not come without health warnings. Controllers are responsible for the data scraped (even if third parties are engaged to carry out the service). Businesses need to understand the privacy risks associated with the practice, particularly when establishing a lawful basis to conduct data scraping. Businesses should also ensure that a clear purpose for data scraping is established and that only data necessary for the purpose at hand is scraped. Businesses should avoid scraping special categories of personal data (unless one of the narrow exceptions is available), provide transparency notices to individuals and have appropriate contractual terms in place with their data scraping service providers.