MUSA Hosts Webinar on Generative AI & Unauthorized Scraping

The Mitigating Unauthorized Scraping Alliance (MUSA) hosted a webinar on May 18, 2023, titled “The Rise of Generative AI & Unauthorized Scraping: Exploring the Ethical and Legal Considerations in the Age of Big Data.” The webinar featured David Patariu, Attorney at Venable LLP; Daniel Gervais, Milton R. Underwood Chair in Law and Director of the Intellectual Property Program at Vanderbilt University; and Brandi Geurkink, Senior Policy Fellow at the Mozilla Foundation. Venable LLP partner A.J. Zottola moderated the discussion.

The panelists examined the benefits, risks, and challenges associated with Generative AI (GenAI) and its relationship with unauthorized scraping.

The conversation began with an introduction to artificial intelligence and its various applications, focusing on current generative models. Speakers underscored that generative models are trained on terabytes of data and have positive uses such as advancing health and medicine, combating climate change, and accelerating software engineering. However, speakers also acknowledged the potential downsides of the technology, including false or harmful generated information and the need for accountability for large language model (LLM) outputs. The discussion addressed the biases that can result from limited, unfiltered, or homogenous data sets, emphasizing the need for supervised machine learning, careful data labeling, and curated data collection efforts. Panelists also noted that GenAI models rely primarily on internet scraping for training data, highlighting the need for algorithmic transparency and for ensuring appropriate consent options and disclosures are in place.
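One concrete consent mechanism available to site owners today is the robots.txt exclusion protocol, which some AI crawlers honor as an opt-out signal. The sketch below, using only Python's standard library, shows how a crawler could check such a policy before scraping; the robots.txt content and URLs here are invented for illustration (GPTBot is a real crawler user agent, but this policy is hypothetical).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler entirely and
# keeps a private section off-limits to all other agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_scrape(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt policy permits this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The blocked AI crawler is denied everywhere; other agents are only
# denied under /private/.
print(may_scrape("GPTBot", "https://example.com/articles/"))       # False
print(may_scrape("SomeBot", "https://example.com/articles/"))      # True
print(may_scrape("SomeBot", "https://example.com/private/data"))   # False
```

Note that robots.txt is a voluntary convention, not an enforcement mechanism, which is part of why the panelists stressed the need for stronger consent and disclosure frameworks.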

The webinar’s focus then shifted towards the ethical implications of GenAI technology and the importance of compliance. Geurkink noted the need for compliance with data protection laws and emphasized the significance of conducting thorough risk assessments. The discussion also raised concerns about the scraping of personal and sensitive information. Patariu cited a recent example of an artist who discovered her private medical record photos had been made public, scraped, and added to the LAION-5B data set without her consent. Speakers agreed that greater alignment on the boundaries of publicly available information and its usage is needed, emphasizing that once data is disseminated it is impossible to control or alter. Setting limits on data access can reduce privacy risks and mitigate potential misinformation, toxicity, and bias in outputs. However, completely banning access to data can hinder critical public service projects and research that rely on public data.

The conversation concluded with an examination of the legal implications of GenAI technology and the strategies required for its development and regulation. Gervais emphasized that the assumption made by scraping companies, that publicly available data is fair game, is flawed and easily challenged. Unauthorized scraping used to train AI models can raise a multitude of legal issues related to privacy, trade secrets, publicity, and copyright protections. Legal outcomes can also be influenced by factors such as the type of scraping, the degree of commercialization, and the application of the generated outputs. Speakers agreed that guidance around GenAI will largely be determined in the courts, and that questions around infringement for temporary copies and machine-produced derivative works remain unresolved. They also discussed the need to hold the people responsible for machines accountable for their outputs, as well as whether protections such as Section 230, which shields platforms from liability for third-party user-generated content, should extend to machine-generated content. Patariu highlighted that regulators can enforce data protection laws through algorithmic disgorgement and algorithmic deletion requirements, though much work remains on the legislative and regulatory side to ensure privacy protections. Speakers expressed hope for the development of private ordering and licensing solutions for scraping, which could provide compensation for the reuse of existing material and place limits on what can and cannot be done with data, while acknowledging that such solutions will not address LLM production, use, and ownership issues. Regulatory responses will continue to vary by country, with provisions already being made for ‘high risk’ and ‘high impact’ AI. Panelists also noted that any limitations on scraping will need to be balanced against the risks associated with training on synthetic data, and they proposed expanding authorized commercial and research channels for data collection.

Given the difficulty of completely preventing unauthorized scraping, it is crucial for users to exercise caution when sharing information publicly. Companies also need to carefully evaluate their user data policies to ensure they have proper safeguards for intellectual property and privacy protection.

As scraping is the precursor to much of GenAI activity, participating in discussions like this one and engaging with organizations like MUSA is more important than ever to ensure that technological innovation and data protection grow hand in hand.