The first wave of major generative AI tools was trained primarily on “publicly available” data – whatever could be gleaned from the internet. Now, sources of training data are increasingly limiting access and demanding licensing agreements. As the search for additional data intensifies, new licensing startups are emerging to keep source material flowing.
The Dataset Provider Alliance, an industry group that formed this summer, wants to make the AI industry more standardized and fair. To that end, it released a position paper outlining its stance on key AI-related issues. The alliance is made up of seven AI licensing companies (with at least five new members expected to be announced in the fall), including music copyright management company Rightsify, Japanese stock photo marketplace Pixta, and generative AI copyright licensing startup Calliope Networks.
The DPA encourages an opt-in system, meaning data will be used only with explicit consent from creators and rights holders. This is a significant departure from how most large AI companies operate: Some have developed their own opt-out systems, placing the burden on data owners to withdraw their work on a case-by-case basis, and some offer no opt-out at all.
The DPA wants its members to adhere to the opt-in rule, which it sees as the far more ethical approach. “Artists and creators should be on board,” says Alex Bestall, CEO of Rightsify and the music data licensing company Global Copyright Exchange, which is spearheading the effort. Bestall sees opt-in as both a moral and a pragmatic approach: “Selling publicly available data sets is one of those things that gets you sued and discredited.”
Ed Newton-Rex, a former AI executive who now runs the ethical AI nonprofit Fairly Trained, says opt-outs are “fundamentally unfair to creators,” adding that some may not even know when an opt-out is offered. “It’s especially good that the DPA is asking for opt-in,” he says.
Shayne Longpre, leader of the Data Provenance Initiative, a volunteer group that audits AI datasets, thinks the DPA should be commended for its efforts to source data ethically, but worries that implementing an opt-in standard would be difficult, given the vast amounts of data most modern AI models require. “In this regime, you’re either short on data or you’re paying too much,” he says. “Only a handful of companies, probably the big tech companies, will be able to afford to license all that data.”
In the paper, the DPA argues against government-mandated licensing, favoring instead a “free market” approach in which data creators and AI companies negotiate directly. Other guidelines are more detailed: the coalition proposes five potential compensation structures to ensure creators and rights holders are paid appropriately for their data, including subscription-based models, “usage-based licenses” (a fee paid per use), and “outcome-based” licenses (royalties tied to profits). “These could be applied to everything from music to images to film to TV to books,” Bestall says.