MUSA Hosts Webinar on Generative AI & Unauthorized Scraping

The Mitigating Unauthorized Scraping Alliance (MUSA) hosted a webinar on May 18, 2023, titled “The Rise of Generative AI & Unauthorized Scraping: Exploring the Ethical and Legal Considerations in the Age of Big Data.” The webinar featured David Patariu, Attorney at Venable LLP; Daniel Gervais, Milton R. Underwood Chair in Law and Director of the Intellectual Property Program at Vanderbilt University; and Brandi Guerkink, Senior Policy Fellow at Mozilla Foundation. Venable LLP partner A.J. Zottola moderated the discussion.

The panelists examined the benefits, risks, and challenges associated with Generative AI (GenAI) and its relationship with unauthorized scraping.

The conversation began with an introduction to artificial intelligence and its various applications, focusing on current generative models. Speakers underscored that generative models are trained on terabytes of data and have positive uses such as advancing health and medicine, combating climate change, and accelerating software engineering. However, speakers also acknowledged the potential downsides of the technology, including false or harmful generated information and the need for accountability for large language model (LLM) outputs. The discussion addressed the biases that can result from limited, unfiltered, or homogeneous data sets, emphasizing the need for supervised machine learning, careful data labeling, and curated data collection efforts. Panelists also noted that GenAI models primarily rely on internet scraping for training data, highlighting the need for algorithmic transparency and for appropriate consent options and disclosures.

The webinar’s focus then shifted towards examining the ethical implications of GenAI technology and the importance of compliance. Guerkink noted the need for compliance with data protection laws and emphasized the significance of conducting thorough risk assessments. This discussion also raised concerns about the scraping of personal and sensitive information. Patariu cited a recent example of an artist who discovered her private medical record photos had been made public, scraped, and added to the LAION-5B data set without her consent. Speakers agreed that greater alignment regarding the boundaries of publicly available information and its usage is needed, emphasizing that once data is disseminated it is impossible to control or alter. Setting limits to data access can minimize the privacy risks around potential misinformation, toxicity, and bias in outputs. However, completely banning access to data can hinder critical public service projects and research that rely on public data.

The conversation concluded with an examination of the legal implications of GenAI technology and strategies required for its development and regulation. Gervais emphasized that the assumption made by scraping companies, that if data is publicly available it is fair game, is flawed and easily challenged. Unauthorized scraping used to train AI models can lead to a multitude of legal issues related to privacy, trade secrets, publicity, and copyright protections. In addition, legal outcomes can be influenced by factors such as the type of scraping, commercialization, and application of the generated outputs. Speakers agreed that guidance around GenAI will largely be determined in the courts and that questions around infringement for temporary copies and derivative works produced by machines remain unresolved. Speakers also discussed the need for output accountability from people responsible for machines, as well as the possible extension to machine-generated content of protections such as Section 230, which shields platforms from liability for third-party user-generated content. Patariu highlighted that regulators can enforce data protection laws through algorithmic disgorgement and algorithmic deletion requirements, though much work remains on the legislative and regulatory side to ensure privacy protections. Speakers expressed hope for the development of private ordering and licensing solutions for scraping that can help with compensation for reuse of existing material and place limits on what can and cannot be done with data. However, they acknowledged that this will not address LLM production, use, and ownership issues. How regulators respond will continue to vary by country, with provisions already being made for ‘high risk’ and ‘high impact’ AI. Panelists highlighted that any limitations on scraping will need to be balanced against the risks associated with synthetic data training, and proposed an expansion of authorized commercial and research channels for data collection.

Given the difficulty of completely preventing unauthorized scraping, it is crucial for users to exercise caution when sharing information publicly. Companies also need to carefully evaluate their user data policies to ensure they have proper safeguards for intellectual property and privacy protection.

As scraping is the precursor to much of GenAI activity, participating in discussions like this one and engaging with organizations like MUSA is more important now than ever to ensure that technological innovation and data protection grow hand in hand.

Webinar: The Rise of Generative AI & Unauthorized Scraping: Exploring the Ethical and Legal Considerations in the Age of Big Data

Event Date/Time:

Thursday, May 18, 9:00 AM PST / 12:00 PM EST (45 min talk + Q&A)

Generative AI presents a significant opportunity for technology, innovation, and connection, yet its dramatic growth has raised questions about how to use this innovation in ethical and legal ways, questions that require careful consideration by industry, policymakers, and regulators. This MUSA webinar explored whether the growing demand for datasets to train generative AI models has contributed to the rise of unauthorized scraping, and whether this raises questions around privacy risks and the need for transparency. Experts from civil society, academia, and industry came together to discuss the benefits, risks, and challenges of generative AI and its relationship to unauthorized scraping.

The Mitigating Unauthorized Scraping Alliance (MUSA) brings together industry members and experts to address challenges and establish a unified front against unauthorized scraping and data misuse. It is working with member companies and experts to publish industry-aligned practices to mitigate unauthorized scraping across member platforms, reduce the attack vector for unauthorized scraping threat actors, and serve as a resource for media and policymaker engagement.

Speakers and Moderator:

Daniel Gervais, Professor, Vanderbilt University

David Patariu, Associate, Venable LLP; Privacy Law Specialist (PLS); International Association of Privacy Professionals Fellow of Information Privacy, CIPP/US, CIPP/E, CIPM; ISC² CCSP, CISSP

Brandi Guerkink, Senior Policy Fellow, Mozilla Foundation

A.J. Zottola, Moderator, Partner, Venable LLP

MUSA Partners with IAPP on Global Privacy Summit Panel

On April 5, 2023, the Mitigating Unauthorized Scraping Alliance (MUSA) partnered with the International Association of Privacy Professionals (IAPP) to organize the panel “Web Scraping: Understanding Compliance Risks and Hidden Costs” for the IAPP’s Global Privacy Summit 2023. The panel featured perspectives from Eric Null, the Director of the Privacy & Data Project at the Center for Democracy and Technology (CDT); Chelsea Reckell, an attorney at Venable LLP; and Lindsay Vogel, the Lead U.S. Counsel for Privacy at Bumble. It was moderated by Cobun Zweifel-Keegan, the Managing Director for IAPP D.C.

The discussion opened with an analysis of the complex legal and ethical environment around unauthorized data scraping. Speakers outlined the challenges around combating the issue given that web scraping currently is not clearly addressed or defined under the law. The panel highlighted some of the avenues of redress that have been tested, particularly the Computer Fraud and Abuse Act, the principal federal anti-hacking statute, and their limitations. Speakers identified the need for a civil remedy and legal framework to create distinctions around the liability of unauthorized scraping, particularly for public data. Current U.S. privacy laws have publicly available data exceptions that leave no recourse for platforms to pursue threat actors. Thus, speakers continued to highlight the need for defining the limits of how “publicly available” data is used as a necessary first step in improving current privacy legislation.

The panel also highlighted distinctions between authorized and unauthorized scraping and explored the privacy risks to users, particularly for sensitive and personally identifiable information. Speakers highlighted Clearview AI and Weight Watchers’ Kurbo as notable examples of personal images and information being collected without authorization and the difficulty around removing information once it is scraped and packaged in datasets. Enforcement options like cease and desist letters and litigation can be challenging to pursue if actors cannot be identified or unauthorized data is already being used to train an algorithm or widely dispersed through other mechanisms. Thus, the panel emphasized the importance of coordinated action around unauthorized scraping prevention, whether as a technical solution like modifying APIs, or contractual, such as updating terms of service. Panelists urged attendees to turn to the Industry Practices to Mitigate Unauthorized Scraping document as a useful resource.

Finally, the panel emphasized the need to start conversations around unauthorized scraping and the importance of industry collaboration. Panelists highlighted that scraping affects many companies and that having regulators in conversation with industry members on a global scale is crucial for generating meaningful action. The panel raised a number of unresolved questions around the boundaries of permissible scraping for First Amendment-protected activities like journalism or research, and around the absence of legal distinctions between authorization and authentication in current law, suggesting that there is still much work to be done. What is clear, however, is that continued dialogue around unauthorized scraping is necessary to tackle this complex issue.

MUSA Hosts Webinar on Practices to Combat Unauthorized Scraping

On March 30, 2023, the Mitigating Unauthorized Scraping Alliance (MUSA) hosted a webinar, “Practices for Combating Unauthorized Scraping,” in conjunction with the publication of its Industry Practices to Mitigate Unauthorized Scraping. The event featured William Glazier, Director of Threat Research at Cequence Security, in conversation with Hemu Nigam, a partner at Venable LLP, and examined the technical enforcement mechanisms available to industry to combat unauthorized scraping.

The conversation kicked off with an overview of the recent rise of unauthorized scraping and the multifaceted nature of this problem. Both speakers underscored that data has become increasingly valuable and technology has advanced, reducing barriers to entry. For example, Glazier highlighted the prevalence of residential proxies, which can mask a scraper’s identity and be used by threat actors. As more companies rely on APIs to power the digital economy, bots will find ways to continue to exploit their vulnerabilities. Glazier emphasized that while unauthorized scraping can feel like a benign issue, it has real world impacts for consumers and users. 

The discussion also examined the technical measures available to combat unauthorized scraping detailed in MUSA’s practices document. Glazier lauded the document’s holistic approach to addressing unauthorized scraping and focused on the importance of building institutional awareness. Executive level sponsorship of efforts to combat unauthorized scraping is needed to develop strong systems that mitigate risks, such as sensitive data exposure, at every step of the product development process. In addition, Glazier noted that detection and mitigation provisions in the practices document, like implementing CAPTCHAs and rate limits, are tools available to companies to “introduce a speed bump” on a user interaction. These checkpoints can help identify and stop threat actors to a degree, but are not a comprehensive strategy. Glazier shared that developing even more advanced prevention strategies that employ techniques like behavioral analytics to detect bot behavior can further help mitigate against platform abuse.
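To make the “speed bump” idea concrete, the following is a minimal sketch of per-client rate limiting using a token bucket. It is an illustration only and is not taken from MUSA’s practices document or the webinar: the class names, the `client_id` key, and the capacity and refill values are all hypothetical, and a real deployment would typically combine a limit like this with other signals.

```python
from typing import Dict, Optional


class TokenBucket:
    """Hypothetical token bucket: each request spends one token."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity          # start with a full bucket
        self.last_seen: Optional[float] = None

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time since the previous request.
        if self.last_seen is not None:
            elapsed = max(0.0, now - self.last_seen)
            self.tokens = min(
                self.capacity, self.tokens + elapsed * self.refill_per_second
            )
        self.last_seen = now

        # Spend a token if one is available; otherwise reject the request.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Keeps one bucket per client identifier (an assumed key, e.g. an API key)."""

    def __init__(self, capacity: float = 5, refill_per_second: float = 1.0) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str, now: float) -> bool:
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_second)
            self._buckets[client_id] = bucket
        return bucket.allow(now)
```

The caller passes the current time (for example from `time.monotonic()`) so the behavior is easy to test. As Glazier’s remarks suggest, a limit keyed on a single identifier only slows a scraper down: traffic spread across many addresses, such as through residential proxies, would land in separate buckets, which is one reason behavioral analytics are discussed as a complement.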

Both Nigam and Glazier stressed the necessity of industry collaboration to combat unauthorized scraping. There will always be gaps in institutional knowledge, and companies can learn from each other on strategies to mitigate unauthorized scraping. Through initiatives like threat intelligence sharing, institutions can better understand threat actors and trends affecting industry in order to develop more informed prevention mechanisms. Collaboration through institutions like MUSA is essential to create industry alignment around processes to combat unauthorized scraping.

As the conversation concluded, Glazier noted that public education around unauthorized scraping is essential. Greater understanding around unauthorized scraping and its potential impacts helps consumers understand how to protect their own data and informs regulators on how best to enforce against unauthorized scraping. 

Building awareness around the impact of unauthorized scraping and fostering public-private collaboration helps ensure that there is an expectation of consequence for threat actors. Through its advocacy work and recently launched Industry Practices to Mitigate Unauthorized Scraping, MUSA is leading this exact effort. 

MUSA Complimentary Webinar: Practices for Combating Unauthorized Scraping

The Mitigating Unauthorized Scraping Alliance (MUSA) hosted a webinar discussion to examine the technical and legal enforcement mechanisms available to industry to combat unauthorized scraping. This was the first event in a series of webinars examining topics related to unauthorized scraping.

Currently, there are no industry standards for combating unauthorized scraping, and a singular approach to addressing the practice does not exist. However, there are various tools that companies utilize to mitigate unauthorized scraping. This event discussed the need for industry collaboration and highlighted the mechanisms available to companies to combat unauthorized scraping.

MUSA brings together industry members to address challenges and establish a unified front against unauthorized scraping and data misuse. It is working with member companies and experts to publish industry-aligned practices to mitigate unauthorized scraping across member platforms, reduce the attack vector for unauthorized scraping threat actors, and serve as a resource for media and policymaker engagement.

Alliance Releases First Industry Practices to Mitigate Unauthorized Data Scraping

Effort Aims to Raise Awareness and Adoption of Effective Approaches Across Industries

WASHINGTON, D.C., March 30, 2023 – With unauthorized data scraping incidents by threat actors on the rise, the Mitigating Unauthorized Scraping Alliance (MUSA) today released the first non-binding and voluntary industry practices that promote means of detecting, preventing, mitigating, and enforcing against unauthorized data scraping.

These practices were compiled through extensive conversations with industry members and experts on measures to mitigate the risk of unauthorized data scraping. They also draw from industry research conducted by the research firm NewtonX in its study of 1,300 professionals to better understand data extraction prevention. 

“Individual company practices to protect against unauthorized scraping have significantly evolved over recent years, despite the absence of industrywide standards,” said Hemu Nigam, partner at Venable LLP and coordinator of the Mitigating Unauthorized Scraping Alliance. “This publication is an important first step to raise awareness and broaden adoption of helpful practices to combat unauthorized data scraping.” 

The practices in the publication are divided into institutional, prevention, detection/mitigation, and enforcement categories that highlight measures against unauthorized data scraping which can be maintained and updated effectively over time to serve the needs of companies. 

The publication does not claim to be a fully comprehensive list of every unauthorized data scraping mitigation practice that companies may take, nor does it identify which measures will be appropriate for any given platform; however, it offers useful guidance for potential mitigation. In addition, it is necessary to acknowledge that due to the continuously evolving nature of scraping technologies and functional need for public-facing data, even comprehensive detection, mitigation, prevention, and enforcement practices can only reduce the incidence of unauthorized data scraping; they cannot prevent it altogether.

MUSA will host a complimentary webinar discussion on Thursday, March 30, 2023 at 2 p.m. EDT to examine the need for industry collaboration and highlight the mitigation and enforcement mechanisms available to industry to combat unauthorized scraping.

The Mitigating Unauthorized Scraping Alliance brings together leading companies committed to protecting data from unauthorized scraping and misuse. In collaboration with industry members, policymakers, and the public, MUSA is generating a global dialogue around unauthorized data scraping focused on protecting user data through education, advocacy, public-private partnerships, and the sharing of reasonable practices to mitigate unauthorized scraping.