MUSA Hosts Webinar on Harmful Web Scraping Research

The Mitigating Unauthorized Scraping Alliance hosted a webinar on June 22, 2023, focused on a new research paper by Timothy Edgar, “Talking Past Each Other: The Legal and Technical Challenges of Harmful Web Scraping”. The webinar featured David Patariu, Attorney at Venable LLP, in conversation with Timothy Edgar, Professor of the Practice of Computer Science at Brown University, Senior Fellow at the Watson Institute for International Studies, and Public Affairs Lecturer on Law at Harvard Law School.

The discussion opened with an overview of web scraping and how it differs from web crawling. Timothy Edgar defined scraping as the automated collection of data from websites based on predefined patterns, and described it as more invasive than web crawling, which involves identifying and indexing content on web pages. The conversation highlighted the important role of robots.txt files in telling web crawlers and scrapers which portions of a website they may visit, but emphasized that the norm of scrapers and crawlers following these instructions has recently broken down. David Patariu attributed this in part to the rise of generative AI: web scraping provides large amounts of real-world data for training models, and commercial pressures to stay competitive have contributed to a growing disregard for instructions like robots.txt and Terms of Service.
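
To make the robots.txt mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler might consult those instructions before fetching a page. It was not shown in the webinar. The robots.txt content, the bot names (`ExampleBot`, `PoliteCrawler`), the `example.com` URLs, and the helper `is_allowed` are all hypothetical, chosen only for illustration; the parsing relies on the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is barred from the whole site,
# and every other client is asked to stay out of /profiles/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /profiles/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse from memory so the sketch needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("PoliteCrawler", "https://example.com/about"),
        ("PoliteCrawler", "https://example.com/profiles/alice"),
        ("ExampleBot", "https://example.com/about"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Note that the check runs entirely on the client side. Nothing in robots.txt technically prevents a scraper from skipping this step, which is why the breakdown of the norm described above matters.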

Timothy Edgar also discussed what his paper calls “unwanted” or unauthorized scraping, defined as the automated collection of data in violation of a website’s terms of service. He indicated that some scraping, such as scraping for scientific research, can be considered more innocuous than other kinds, though he acknowledged that all scraping poses certain risks. This led to an exploration of the potential harms of web scraping, particularly the automated collection of personal information for commercial or criminal exploitation. Speakers noted that the misuse of personal information shared for a specific context, such as dating profiles, poses significant privacy risks and can even harm users. Timothy Edgar cited real-world examples, such as Clearview AI’s scraping of billions of photographs for facial recognition, to illustrate the possible privacy violations associated with web scraping. The conversation further revealed that web scraping is often the “bread and butter” reconnaissance for more malicious activity: collected data such as email addresses can serve as a gateway to phishing and to access to a public site’s more private sections.

The webinar also highlighted the importance of authentication, authorization, and access control (AAA), concepts that are well understood in the field of cybersecurity and explored in the paper. These technical concepts play a crucial role in achieving cybersecurity goals: verifying identity, granting access, and setting limitations. The speakers discussed a common misunderstanding among lawyers about these terms, clarifying that authentication and authorization, while related, are distinct processes that can happen in any order and are often erroneously conflated with a login process that lets a computer limit access to a particular user with an account. Speakers discussed how the Computer Fraud and Abuse Act (CFAA), the main anti-hacking law, written in the 1980s, uses the terms “without authorization” and “exceeding authorized access” without mentioning authentication, which has created problems for courts that have struggled to interpret these terms accurately. Timothy Edgar proposed that one solution would be to update the CFAA to provide a technically sound definition of authorization and clarify the role of authentication. He emphasized, however, that amending the CFAA would be only a partial remedy, since the CFAA addresses the rights of owners but not the rights of users whose privacy is violated, and indicated that the US needs a comprehensive privacy law.
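
The distinction between authentication and authorization can be illustrated with a minimal Python sketch. This is not taken from the paper or the webinar. The user table, resource paths, and the functions `authenticate` and `authorize` are hypothetical, and the unsalted hash is for illustration only, not a recommendation for real password storage.

```python
import hashlib
import hmac
from typing import Optional


def _hash(password: str) -> str:
    # Illustration only: real systems should use a salted, slow password hash.
    return hashlib.sha256(password.encode()).hexdigest()


# Hypothetical data: who exists, what is public, and who may see what.
USERS = {"alice": _hash("correct horse")}
PUBLIC_RESOURCES = {"/about"}
ACCESS_LIST = {"/account/alice": {"alice"}}


def authenticate(username: str, password: str) -> Optional[str]:
    """Authentication: verify identity. Returns the verified name or None."""
    stored = USERS.get(username)
    if stored is not None and hmac.compare_digest(stored, _hash(password)):
        return username
    return None


def authorize(identity: Optional[str], resource: str) -> bool:
    """Authorization: decide whether this identity may access the resource."""
    if resource in PUBLIC_RESOURCES:
        # An authorization decision made without any authentication.
        return True
    return identity is not None and identity in ACCESS_LIST.get(resource, set())


if __name__ == "__main__":
    visitor = None  # anonymous, never authenticated
    print(authorize(visitor, "/about"))          # public page
    print(authorize(visitor, "/account/alice"))  # private page
    user = authenticate("alice", "correct horse")
    print(authorize(user, "/account/alice"))
```

In this sketch the two functions never call each other: a public page is authorized for a visitor who was never authenticated, while an authenticated user is still refused resources outside her access list. That separation is the point the speakers made about the two terms being distinct from a login process.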

In conclusion, the webinar provided valuable insights into the complex world of web scraping, raising important questions about privacy, security, and ethics. Speakers emphasized the need for regulators and policymakers to be aware of the real potential for harm that unauthorized web scraping poses and of the privacy problems that can result from the misuse of publicly available user data. Balancing the benefits of web scraping with the protection of personal information remains a challenge. Finding effective solutions will require collaboration between legal and technical experts and public-private partnerships much like the ones the Mitigating Unauthorized Scraping Alliance is driving.