Generative Artificial Intelligence (AI) burst into the public consciousness in November 2022 with the launch of ChatGPT by OpenAI. The chatbot amazed users with its ability to answer a broad range of text-based questions with detailed and generally appropriate responses. The application became the fastest in history to reach 100 million active users, outpacing TikTok and Instagram by a wide margin. Other companies have since released AI models that can generate images and videos.
With rapid investment and progress in the AI space, policymakers must quickly develop and implement early regulatory frameworks. One area where Congress must act is the protection of private data. AI models are developed using large corpora of data in a variety of formats, including text, images, and video. The sourcing and commercialization of such data should be explicitly covered by law.
AI applications rely on mathematical models that are developed and refined using example data. These example data sets are known as training data, and the best AI models typically require large volumes of high-quality data. It is this underlying training data that differentiates the quality of output between AI applications, and access to unique, high-quality training data will therefore be one of the most critical determinants of long-run commercial success in AI.
Protecting private data from unauthorized monetization in commercial AI models would preserve the incentive to innovate. For example, if an AI chatbot company were able to train its model on all public Yelp reviews and then seamlessly recommend restaurants in response to user input, it would greatly reduce the utility of Yelp’s website and application. This would decrease traffic to Yelp and reduce its ability to monetize its platform. Allowing this type of behavior would invite AI applications to target any online platform that reaches a certain scale, disrupting its business model by training on the data that service has collected. Without compensation, the original online service’s financial upside would be capped. The incentive to create the next great digital product would be greatly reduced if much of the gains eventually accrued to AI applications.
There is an urgency to act because the technology sector, and in particular large technology companies, has a history of ignoring externalities in favor of profit maximization. Facebook did not enforce its own policies against violence and hate speech in Myanmar, contributing to the violent targeting of the Rohingya Muslim community in 2016 and 2017. Whistleblower Frances Haugen revealed that Instagram had internal research dating back to 2019 showing that its usage exacerbated feelings of depression among teenage girls. Internal documents show that Uber intentionally ignored local taxi regulations around the world. Despite knowing the risks, large technology platforms have repeatedly ignored them in pursuit of greater profits. History suggests these companies will have few qualms about monetizing data that does not belong to them via AI applications, whatever the downside to the innovation economy.
The timing of initial legislation is particularly important given the costly nature of AI model development. AI talent remains relatively scarce and concentrated in larger technology companies, even though the total number of U.S.-based AI companies has roughly doubled since 2017. The cost of training and running AI models can also be very high: training a model like ChatGPT is estimated to cost around $4 million, not including the ongoing cost of answering queries. This talent concentration and high cost-to-serve make it highly likely that today’s large technology platforms will remain key AI players, whether by developing their own applications or by providing computational power to partners. Large technology companies continue to outspend all other industries in lobbying, setting a new record of $70 million in 2022, well above both the pharmaceutical and oil and gas industries. The longer Congress waits to act on AI, the harder it will become to counteract sophisticated lobbying aimed at blocking any legislation that could affect profitability.
Some argue that the courts should resolve this issue, as they have with copyright. However, significant ambiguity remains because substantial publicly available data on the internet is not technically copyrighted, such as the Yelp reviews discussed earlier, yet would be very valuable in high volumes for training AI models. The users who wrote those Yelp reviews did not intend for their writing to be used by an AI company, but the AI companies also have no agreements with those users directly. OpenAI, the developer of ChatGPT, is currently being sued over its use of data scraped from the web without permission. Legislation could resolve such ambiguity, as courts do not always operate predictably in the absence of a clear legal framework. Finally, speed must again be emphasized. These challenges can take years to resolve in court, and AI is developing rapidly. Congressional legislation would resolve this fundamental issue far more quickly and clearly than judicial review.
Congress must transition from exploring the basics of AI to legislating. While policymakers need to tread thoughtfully to ensure continued American leadership in such a rapidly evolving and high-potential field, they should begin formalizing basic protections, starting with the explicit protection of private data from unauthorized use in commercial AI models.