The checks being cut to ‘owners’ of training data are creating a huge barrier to entry for challengers. If Google, OpenAI, and other large tech companies can establish a high enough cost, they implicitly prevent future competition. Not very Open.
Model efficacy is roughly [technical IP/approach] * [training data] * [training frequency/feedback loop]. Right now I’m comfortable betting on innovation from small teams in the ‘approach,’ but if experimentation is gated by nine figures’ worth of licensing deals, we are doing a disservice to innovation.
These business deals are a stand-in for clear copyright and usage laws. Companies like the New York Times are willing to litigate this issue (at least as a negotiation strategy). It’s likely that our regulations need to update ‘fair use.’ I need to think more about where I land on this, but companies that exploit or overweight a data source that wasn’t made available to them for commercial purposes do owe the rights owner. Rights owners should be able to automatically set some sort of protections for at least a period of time (similar to Creative Commons or robots.txt). I don’t believe ‘if it can be scraped, it’s yours to use,’ and I also don’t believe that once you create something you lose all rights to how it can be commercialized.
What I do believe is that we need to move quickly to create a ‘safe harbor’ for AI startups to experiment without fear of legal repercussions so long as they meet certain conditions. As I wrote in April 2023,
“What would an AI Safe Harbor look like? Start with something like, “For the next 12 months any developer of AI models would be protected from legal liability so long as they abide by certain evolving standards.” For example, model owners must:
- Transparency: for a given publicly available URL or submitted piece of media, to query whether the top level domain is included in the training set of the model. Simply visibility is the first step — all the ‘do not train on my data’ (aka robots.txt for AI) is going to take more thinking and tradeoffs from a regulatory perspective.
- Prompt Logs for Research: Providing some amount of statistically significant prompt/input logs (no information on the originator of the prompt, just the prompt itself) on a regular basis for researchers to understand, analyze, etc. So long as you’re not knowingly, willfully and exclusively targeting and exploiting particular copyrighted sources, you will have infringement safe harbor.
- Responsibility: Documented Trust and Safety protocols to allow for escalation around violations of your Terms of Service. And some sort of transparency statistics on these issues in aggregate.
- Observability: Auditable, but not public, frameworks for measuring ‘quality’ of results.
In order to prevent a burden that means only the largest, well-funded companies are able to comply, AI Safe Harbor would also exempt all startups and researchers who have not released public base models yet and/or have fewer than, for example, 100,000 queries/prompts per day. Those folks are just plain ‘safe’ so long as they are acting in good faith.”
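To make the Transparency condition and the small-player exemption a bit more concrete, here is a minimal Python sketch of what those two checks might look like. Everything in it is an assumption for illustration: the host index, the function names, reading ‘top level domain’ as the site’s hostname, and treating the ‘and/or’ in the exemption as a plain ‘or.’ The 100,000 figure is the example threshold from the quote, not a settled number.

```python
from urllib.parse import urlsplit

# Hypothetical index of hostnames present in a model's training set.
TRAINING_SET_HOSTS = frozenset({"example.com", "news.example.org"})

# Example threshold from the quoted proposal; a real value would be set by policy.
EXEMPT_PROMPTS_PER_DAY = 100_000


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, without a leading 'www.'.

    URLs with no scheme (e.g. 'example.com/page') have no parseable host
    and yield an empty string.
    """
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def in_training_set(url: str, hosts: frozenset = TRAINING_SET_HOSTS) -> bool:
    """Transparency check: is this URL's host part of the training set?"""
    return host_of(url) in hosts


def is_exempt(released_public_base_model: bool, prompts_per_day: int) -> bool:
    """Exemption check, reading the proposal's 'and/or' as 'or' (an assumption)."""
    return (not released_public_base_model) or prompts_per_day < EXEMPT_PROMPTS_PER_DAY
```

A rights owner would call something like `in_training_set("https://www.example.com/article/1")` and get a yes or no. The point of the sketch is that the visibility step is cheap to offer; the harder ‘do not train on my data’ questions sit on top of it.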
Simultaneously, our government could make massive amounts of data available to US startups. Incorporate here, pay taxes, create jobs? Here’s access to troves of medical, financial, and legislative data.
In the last year we’ve seen billions of dollars invested in AI companies. Now is the time to act if we don’t want the New Bosses to look like the Old Bosses (or in most cases, be the exact same Bosses).