To make AI safe, governments must regulate data collection

Canadian Prime Minister Justin Trudeau. File photo by Alex Tétreault / Canada’s National Observer. Rochelle Baker, Local Journalism Initiative Reporter, Canada’s National Observer

Wendy H. Wong, University of British Columbia

May 3, 2024

Canadian Prime Minister Justin Trudeau recently announced a $2.4-billion investment in artificial intelligence. Part of the funding will create an AI Safety Institute. But what is AI safety?

Many countries, including Canada, the United States and those in the European Union, have pushed to curtail AI’s harms. Most of these efforts focus on the deployment and effects of AI. Because AI systems are so pervasive and diverse, governments should approach safety by breaking down AI into its components: algorithms, data and computing resources, known simply as “compute.”

Innovation in compute and algorithms is lightning-quick; governance is not. Therefore, governments should consider playing to their existing strengths in data to make AI safe.

Data collection

Governments are experts at collecting data. Entire bureaus are set up to collect data on everything from business vitality to citizen health and the flow of traffic.

Part of collecting data, whether analog or digital, is making decisions about precisely what information to collect and organizing it so it’s useful and usable. Data collection entails decisions that make some categories “real” while ignoring others.

To take a non-digital example, the recent U.S. decision to change race and ethnicity classifications in its census will recognize new groups. These new groups will affect the counts in other categories, which in turn affects government functions, such as how public programs are distributed and how election districts are drawn.

Governments are skilled at managing access to data. In Canada and the U.S., individual census responses and other data can be accessed only through research data centres at certain universities. Governments limit access to sensitive data to protect individuals.

At the same time, we often think more data improves societies, especially in democracies. The Organisation for Economic Co-operation and Development has information about how accessible government data is and promotes the idea of “open government.” The EU’s Data Act facilitates data-sharing among private and public entities to promote a “data economy.”

But data, like most things, is not an unqualified good, even as it plays a significant role in making AI safe. By understanding biases in the data, we can anticipate problems in an AI system’s outputs.

So why not start thinking about what kinds of data are too risky to allow private companies to collect and analyze? Why not use considerations of human dignity or autonomy when deciding if certain kinds of data should even exist?

Regulating data

Governments are focusing on AI applications and uses, as with the EU’s AI Act and Canada’s Artificial Intelligence and Data Act.

A U.S. executive order on AI issued in October 2023 seeks “safe, secure, and trustworthy” AI. Importantly, it acknowledges that data are part of AI systems and outlines basic provisions to mitigate potential harms. But the order fails to go far enough in articulating just how much data about human activities are being fed into AI systems.

These efforts aren’t wrong. They’re just incomplete.

Given the urgency to regulate AI, governments need to see data as an equally important area of regulation. Data explicitly pertaining to living, breathing, rights-bearing human beings must be regulated differently.

Data about people gets fed into AI systems that are powered by algorithms. We need to regulate the area of algorithm innovation, but we’re neglecting the data that algorithms need to function.

The “prohibited” list of the EU’s AI Act reads like a list of humanity’s worst nightmares. Real-time biometric systems in public spaces, discrimination against vulnerable groups and AI-based prediction of criminality are capabilities that exist today but are subject to scrutiny or bans under the act. But prohibiting the creation and sale of such systems doesn’t change the fact that the data can be, and have been, collected.

Rather than giving companies unfettered ability to collect data on legions of users globally, why not restrict the incoming data? Why not create a domestic or even global registry system for companies seeking to collect potentially sensitive data about people and make them explain why they need the information?

If organizations make the data widely accessible, they need to explain why and put appropriate safeguards in place. A registry could review, allow or deny applications for data use for limited times or purposes. Such a registry would then allow regulators to sniff out unauthorized data collection and use. Companies that violate the rules could be penalized.

An onerous registration process would force companies to consider whether it’s worth collecting certain types of data. In some cases, maybe it’s not worth the paperwork.

Less data-intensive models

A more robust, globalized mechanism for enforcing existing data minimization policies seems like a better and more appropriate framework to follow.

Where regulation is ambiguous, global human rights can serve as a legitimate justification for disallowing data collection.

Furthermore, governments can encourage innovation in AI models that are less data-intensive. AI researchers are experimenting with “less is more” — smaller models that demonstrate that you don’t need as much data as ChatGPT does to generate good-quality outputs.

Emerging research has found that machines can learn by replicating babies’ abilities to generalize from relatively few experiences. Where modern large language models such as ChatGPT use millions if not trillions of words, young children learn from far fewer.

Perhaps “intelligence” can be replicated in machines by changing the method used to train machine learning models, rather than the approach today of gobbling up data or adding more computing resources.

It might be tempting to roll one’s eyes at the idea of governments ever gaining control over the constantly evolving landscape of AI.

But maybe that’s because governments haven’t focused on their strengths. Governments have a lot of experience managing data on people, and AI currently needs lots of data to work. Policymakers must reject the hype and recognize data’s importance in making AI both safe and functional.