Lawsuit targets how AI is built

At the end of June, Microsoft launched a new type of artificial intelligence technology capable of generating its own computer code.

Called Copilot, the tool was designed to speed up the work of professional programmers. As they typed on their laptops, it suggested ready-made blocks of computer code that they could instantly add to their own.

Many programmers loved the new tool or were at least intrigued by it. But Matthew Butterick, a programmer, designer, writer and lawyer in Los Angeles, was not one of them. This month, he and a team of other attorneys filed a lawsuit seeking class-action status against Microsoft and the other top companies that designed and deployed Copilot.

Like many cutting-edge AI technologies, Copilot developed its skills by analyzing large amounts of data. In this case, it relied on billions of lines of computer code published on the Internet. Mr. Butterick, 52, likens this process to piracy, as the system fails to acknowledge its debt to existing work. His lawsuit claims that Microsoft and its collaborators violated the legal rights of millions of programmers who spent years writing the original code.

The lawsuit is believed to be the first legal attack on a design technique called “AI training”, which is a way to build artificial intelligence that is set to remake the tech industry. In recent years, many artists, writers, experts and privacy activists have complained that companies are training their AI systems using data that does not belong to them.

The lawsuit has echoes in the last few decades of the technology industry. In the 1990s and 2000s, Microsoft fought the rise of open source software, seeing it as an existential threat to the future of its business. As the importance of open source grew, Microsoft embraced it and even acquired GitHub, a home for open source programmers and a place where they built and stored their code.

Almost all new generations of technologies, even online search engines, have faced similar legal challenges. Often, “there’s no law or case law that covers it,” said Bradley J. Hulbert, an intellectual property lawyer who specializes in this increasingly important area of law.

The lawsuit is part of a wave of concern over artificial intelligence. Artists, writers, composers and other creators are increasingly concerned that companies and researchers will use their work to create new technologies without their consent and without compensation. Companies train a wide variety of systems this way, including art generators, voice recognition systems like Siri and Alexa, and even driverless cars.

Copilot is based on technology developed by OpenAI, an artificial intelligence lab in San Francisco backed by billion-dollar funding from Microsoft. OpenAI is at the forefront of the increasingly widespread effort to train artificial intelligence technologies using digital data.

After Microsoft and GitHub released Copilot, GitHub’s chief executive, Nat Friedman, tweeted that using existing code to train the system was “fair use” of the material under copyright law, an argument often used by the companies and researchers who built these systems. But no court case has yet tested this argument.

“Microsoft and OpenAI’s ambitions go far beyond GitHub and Copilot,” Butterick said in an interview. “They want to train on any data anywhere, for free, without consent, forever.”

In 2020, OpenAI unveiled a system called GPT-3. The researchers trained the system using massive amounts of digital text, including thousands of books, Wikipedia articles, chat logs and other data published on the Internet.

By identifying patterns in all that text, this system learned to predict the next word in a sequence. When someone typed a few words into this “large language model”, it could complete the thought with entire paragraphs of text. In this way, the system could write its own Twitter posts, speeches, poems and news articles.

Much to the surprise of the researchers who built the system, it could even write computer programs, having apparently learned from countless programs published on the Internet.

So OpenAI went a step further by training a new system, Codex, on a new collection of data made up specifically of code. At least some of that code, the lab said later in a research paper detailing the technology, came from GitHub, a popular programming service owned and operated by Microsoft.

This new system became the underlying technology for Copilot, which Microsoft distributed to programmers via GitHub. After being tested with a relatively small number of programmers for about a year, Copilot was rolled out to all coders on GitHub in July.

For now, the code produced by Copilot is simple. It can be useful as part of a larger project, but it needs to be massaged, augmented and verified, said many programmers who have used the technology. Some programmers find it useful only if they are learning to code or trying to master a new language.

Yet Mr. Butterick feared that Copilot would end up destroying the global community of programmers who built the code at the heart of most modern technology. A few days after the system was released, he published a blog post titled “This Copilot Is Stupid and Wants to Kill Me.”

Mr. Butterick identifies himself as an open source programmer, part of the community of programmers who openly share their code with the world. Over the past 30 years, open source software has helped power most of the technologies consumers use every day, including web browsers, smartphones, and mobile apps.

Although open source software is designed to be shared freely among coders and businesses, this sharing is governed by licenses designed to ensure that it is used in a way that benefits the wider community of programmers. Butterick believes Copilot violated those licenses and, as it improves, will make open-source coders obsolete.

After complaining publicly about the issue for several months, he took his complaint to a handful of other attorneys. The lawsuit is still in its early stages and has not yet been granted class action status by the court.

To the surprise of many legal experts, Mr. Butterick’s lawsuit does not accuse Microsoft, GitHub and OpenAI of copyright infringement. It takes a different tack, arguing that the companies violated GitHub’s terms of service and privacy policies while also violating a federal law that requires companies to display copyright information when using material.

Mr. Butterick and another lawyer behind the lawsuit, Joe Saveri, said the lawsuit could eventually address the copyright issue.

When asked if the company could discuss the lawsuit, a GitHub spokesperson declined, saying in an emailed statement that the company is “committed to responsible innovation with Copilot from the start, and will continue to evolve the product to better serve developers around the world.” Microsoft and OpenAI declined to comment on the lawsuit.

Under existing laws, most experts believe that training an AI system on copyrighted material is not necessarily illegal. But it could be if the system ends up creating material substantially similar to the data it was trained on.

Some Copilot users have said it generates code that looks identical – or nearly identical – to existing programs, an observation that could become the central part of the case for Mr. Butterick and others.

Pam Samuelson, a professor at the University of California, Berkeley who specializes in intellectual property and its role in modern technology, said legal thinkers and regulators briefly explored these legal issues in the 1980s, before the technology existed. Now, she says, a legal assessment is needed.

“It’s not a toy problem anymore,” Dr. Samuelson said.
