Proper data sharing essential for language models

By Madeline Carr | China Daily | Updated: 2025-02-11 07:17


The potential for artificial intelligence to improve lives has captured the attention of governments across the world. Strained budgets, growing inefficiencies, and the rising costs of healthcare, housing, and other social services mean that the promise of AI-driven systems is becoming increasingly attractive. Realizing that promise involves a range of challenges, including sharing sensitive or proprietary data sets, ensuring the outcomes truly benefit human beings, and designing policies that can make all of this possible.

One lesson that we've taken from the past is that the country that develops or leads in emerging technologies inevitably does so through its own vision of what is "good", "preferable" and "beneficial" — particularly for its own political, commercial and civil benefits. Views on what "good" looks like can, and usually do, vary quite widely but the decisions that dominant states or private actors take on technologies have a huge impact in terms of how those technologies are used by others.

Sharing data sets for training AI large language models (LLMs) is a particularly powerful and yet sensitive issue. Imagine the potential for medical researchers if they had unrestricted access to immediate and dynamically updated data on diabetes through medical implants. These data could cover a range of information, including geography, activity levels, diet, environmental factors, medical treatment and more, providing an incredibly comprehensive overview of a disease that affects more than half a billion people worldwide. AI analysis of those data sets could bring benefits in a fraction of the time otherwise required.

But such data are increasingly locked within commercial arrangements focused on extracting profits. This raises alarm bells for governments that feel their own indigenous data are at risk from predatory or monopolistic AI companies based elsewhere. Calls for sovereign control over data are growing, particularly among those with "low token" languages (those not widely spoken).

There is a very live discussion underway in Latin America, for example, on the absorption of indigenous languages into foreign owned and operated LLMs. The European Union cloud ecosystem, critical to the increased computer processing required by advanced AI systems, is still dominated by US monopolies. Consideration needs to be given to how a small number of (often monopolistic) companies can and should be governed globally through a system within which they are able to influence the industrial, trade and even foreign policies of state actors.

Focusing on profit generation and efficiency when it comes to technology innovation has only taken us so far. One could argue (and many do) that technology motivated by these twin forces has delivered huge benefits to society over the last century. But we have also observed that there is a limit to how effectively those benefits trickle down if they are not carefully governed.

Indeed, one of the harsh truths that we have confronted in many places is that there are no market drivers for many of the outcomes that we have hoped would eventuate from emerging technologies. Cybersecurity is an apt example. We've seen the growth of two symbiotic sectors, both hugely profitable.

The first sector releases insecure software and hardware into the market with insufficient investment in security. And the second sector comes along later, finding vulnerabilities and problems and reporting them. Both of these sectors are hugely profitable and the product of market forces. Both are reliant on the other not changing. And neither delivers security to the level that we need it or when we need it. We should take note of this and make sure AI systems and services do not replicate this model.

Markets have not and will not deliver human-focused outcomes or public goods by themselves. To ensure they do so, we require policy initiatives, planning, regulation, and healthy discussions on what is and what is not desirable for human beings. Technological innovation develops to fulfill the wishes and needs of those in a position to direct it. That's why it is so important that there is a broad range of inputs into that problem definition process. Mark Zuckerberg's recent announcement that Meta will dismantle its DEI program is a retrograde step away from ensuring that we have a diversity of perspectives in these powerful organizations.

Despite the incredibly exciting, dynamic period of technological innovation that we are in the midst of, one thing that has lagged behind in many places is any kind of innovation in the processes and practices needed to translate technological innovations into positive outcomes. Indeed, policymaking is generally carried out today in much the same way that it was 100 years ago. Regulating technologies to extract benefits while minimizing the negative consequences of technologies is a practice in its infancy.

Furthermore, it's not a field in which we've particularly been able to accommodate failure. Experimentation in policymaking on technology remains challenging for most governments, and when something is attempted but found to be ineffective, winding that policy back or reversing course is often perceived as a "policy failure". This is in stark contrast to the "fail fast" culture that dominates those tech companies we are attempting to govern. Human rights and societal benefits have too frequently been neglected out of fear that "regulation will stifle innovation", but this has set us up for decades of very poor protection for any element of society apart from the tech sector itself.

China perhaps has been the most innovative in this field with very dynamic and flexible approaches to tech policy. The international data port established at Lingang Special Area in Shanghai is an excellent example of thinking creatively and constructively about the challenges of cross-border data flows. Good policy is an integral element of the successful uptake of AI and other emerging technologies. And that gets forgotten far too often at our peril.

Ultimately, AI, if properly governed, offers more than technological solutions to societal problems. It is a well-established principle of international relations that the more economically integrated states are, and the more they trade, the less likely they are to descend into primitive, kinetic conflicts. And it's quite possible that the imperative and incentives to share global data sets could have a similar effect on global affairs.

If governments remain focused on using AI to address human-centric goals, the significant benefits of shared data sets could not only set us up for technological innovation, but also sufficiently bind us together in ways that make continued international cooperation the bedrock of that success.

The author is a professor of Global Politics and Cybersecurity at University College London. The views don't necessarily reflect those of China Daily.

