Research-based Defensibility
…is a myth. As an ML researcher and engineer of 15+ years, I can tell you the pickle: to patent something (other than a process), you need to be able to explain how it works. One of the many fields of science I've had to study to get better at ML is neurology, which still lacks a viable candidate for a theory of consciousness. We've been reverse-engineering biological processes without understanding the inner workings of the underlying biological substrate to begin with. A poor layman's analogy is that number matrices are digital representations of neurons. For most readers this is only relevant in the sense that they represent weights, biases, and activation functions. Anyways:
Lack of a theory in neurology = lack of ML explainability = lack of ML patentability
While process patents are viable, the ML community faces a dilemma: publish vague papers to hinder reproducibility or risk losing competitive advantage in the act of publishing the paper itself. I hope these misaligned incentives get resolved soon.
Data-based Defensibility
Two of my startups are in this category, which is built through a capable professional and personal network. This is viable, but you should be in a very specific, targeted market segment whose market cap is not yet large enough for a Fortune 500 company to enter. One of my startups - and this is a good way of extending a first-mover advantage - is in a market segment that didn't exist before it. We created it. This touches on the next category of defensibility and relates to what I call "the spirit of invention". I'll skip the N+2 and N+3 leap types since they are not relevant here. N+1-type inventions (one logic gap away from what is currently known and exists) are simple:
You combine two previously seemingly unrelated concepts in an atypical way to provide a previously unthought-of solution - this is my personal take on it.
Which comes down to how I view successful businesses:
solving problems in a creative, defensible, and profitable way.
This combinatorial approach applies to data-based defensibility (combining data sets that appear unrelated but reveal patterns that stand up to fundamental rigor - causality vs. correlation checks, etc.) and to implementation-based defensibility.
The closest thing to a silver bullet here is private, self-gathered data sets. It's very difficult to replicate the performance of an ML system, even with nearly unlimited resources, if you don't know what underlying data it was built on.
Implementation-based Defensibility
This was pretty much explained above, except that instead of data we have processes (which may be patentable if certain specific conditions are met, so there is some hope here as well). Once again, we combine two previously seemingly unrelated processes in an atypical way to provide a previously unthought-of solution. This is the category my third and current primary-focus startup is in, via a hybrid of data-based and implementation-based defensibility. We're also in a market that didn't exist before. Less than two months ago, I made a joke of a meme project for a friend of mine due to my aversion to Photoshop. We wanted to record a demo, so I built the MVP itself in a few days. We now have 2.3M+ in signed AR, which also leads into 1-3 year maintenance contracts totaling around 400k ARR so far.
For Adults: Model & Talent Defensibility
If you are capable of creating foundational models, no matter how specialized, that stand up to the Big 3 (Google, Microsoft/OpenAI, Anthropic) on any of the known and standard benchmarks, you’re in a good place. This often requires a hefty R&D budget for compute, access to large + high quality data sets, and some notable ML talent.
Speaking of compute - things are going to get weird fast. As companies get better at putting hard ROI numbers to the data, personnel, and compute cost of a model vs the GMV of the end product, expect to see some wild stuff. As hardware advances and the market approaches saturation, we will most likely see model training costs in the billions.
Probably tens or hundreds of billions USD before 2026. A great napkin-math formula for evaluating a model's "power" is: square root of (Parameters × Tokens) / 300.
Claude 3 Opus scores about a 29.8 on that. Gemini 1.5 Pro and Gemini 1.0 Ultra both come out to 22.4. Model parameters will scale as fast as hardware and cost allow, given their relationship to a model's "intelligence". The main counterweight to this is model response time, which the market is still very much figuring out.
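To make the napkin math concrete, here's a minimal sketch of the formula in Python. The parameter and token counts in the example are assumptions chosen to illustrate the scale of the score, not confirmed specs for any named model, and the formula assumes both counts are in billions.

```python
# Napkin-math "power" score: sqrt(parameters x tokens) / 300.
# Assumption: both counts are expressed in billions.
from math import sqrt

def napkin_power(params_billions: float, tokens_billions: float) -> float:
    return sqrt(params_billions * tokens_billions) / 300

# Hypothetical example: a 2,000B-parameter model trained on 40,000B tokens.
print(round(napkin_power(2_000, 40_000), 1))  # ~29.8
```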
An Actionable Note For The Future
Later this year, and from here on out, it’s likely that SLMs (Small Language Models) and MLMs (Medium Language Models) will start reigning supreme for most tasks now done by LLMs because people will finally catch on that:
The quality of training data is MASSIVELY misunderstood and underrated.
Preprocessing training data at a token level yields disproportionate results (a minimal sketch follows this list).
Tokenizers (and more complex ingress transformations) are underrated.
Most parameters in LLMs are worse than useless (this is very important).
You can run these economically, at scale, with sane response times.
Techniques like RLAIF and knowledge distillation will thrive.
Techniques like selective fine-tuning will be more popular.
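As a concrete illustration of the token-level preprocessing point above, here's a minimal sketch that filters and deduplicates training samples by their token statistics. The tokenizer choice, thresholds, and the unique-token-ratio heuristic are illustrative assumptions, not a prescription.

```python
# Token-level preprocessing sketch: filter and deduplicate training samples
# by token statistics rather than by raw characters.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer choice

def keep_sample(text: str, min_tokens: int = 32, max_tokens: int = 2048) -> bool:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if not (min_tokens <= len(ids) <= max_tokens):
        return False
    # Low unique-token ratio is a cheap proxy for boilerplate, scraped menus, logs, etc.
    return len(set(ids)) / len(ids) > 0.3

corpus = ["...raw documents go here..."]
cleaned, seen = [], set()
for doc in corpus:
    prefix = tuple(tokenizer(doc, add_special_tokens=False)["input_ids"][:64])
    if keep_sample(doc) and prefix not in seen:  # crude token-prefix dedup
        seen.add(prefix)
        cleaned.append(doc)
```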
System accuracy, consistency, and reliability skyrocket when you leverage many SLMs/MLMs, each specialized, in a narrow context, and with specialized tooling. Especially so when one shifts focus from training to inference compute scaling.
Orchestrator models that define steps and then, for each step, select a specialized executor model with narrow tooling and context (this is how my startup currently not only competes with but outperforms massive-budget R&D teams) are magical.
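For readers who want the shape of that pattern in code, here's a bare-bones sketch of an orchestrator routing each planned step to a specialized executor. The model names, the specialist registry, and the call_model stub are hypothetical stand-ins, not my startup's actual system.

```python
# Orchestrator/executor sketch: a planner model breaks a task into named steps,
# and each step is routed to a small specialist with its own narrow context and tools.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Specialist:
    model_id: str       # a small model fine-tuned for one job
    system_prompt: str  # narrow context, nothing else
    tools: dict[str, Callable] = field(default_factory=dict)  # only what this step may touch

SPECIALISTS = {
    "extract":   Specialist("slm-extractor-v1", "Extract the fields named by the user."),
    "sql":       Specialist("slm-sql-v2", "Write one SQL query for the given schema."),
    "summarize": Specialist("slm-summarizer-v1", "Summarize the input in three sentences."),
}

def call_model(model_id: str, system: str, prompt: str) -> str:
    raise NotImplementedError("plug your inference backend in here")

def orchestrate(task: str) -> str:
    # The planner returns one step name per line; unknown steps are skipped, not improvised.
    plan = call_model("slm-planner-v1", "Return one step name per line.", task).splitlines()
    result = task
    for step in plan:
        spec = SPECIALISTS.get(step.strip())
        if spec is not None:
            result = call_model(spec.model_id, spec.system_prompt, result)
    return result
```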
As a result (this is a personal favorite of mine and possibly an original approach at this time - I haven't seen it used anywhere besides my startup for foundational model development, so I'm letting the cat out of the bag; you're welcome):
Alternating reduction techniques with refinement techniques absolutely BTFOs every other approach I’ve seen used for creating performant language models at this time:
Example:
1. Start with your own or an open-source model. These are fairly well trained, so you either reduce from here or specialize the model first and proceed to a reduction step next.
2. Apply reduction techniques like Quantization, Pruning, Distillation, Low-Rank Factorization, Weight Sharing, Parameter Clustering, or Tensor Decomposition. FP2/4/8/16/32, TF32, and 64-bit types do NOT work the way people think they do.
3. Apply fine-tuning techniques like Parameter-Efficient Fine-Tuning, Layer-Wise Freezing, Task-Specific Head Training, Domain Adaptation Fine-Tuning, Gradual Unfreezing, and Multi-Task or Quantization-Aware Fine-Tuning.
4. Repeat steps 2 and 3 in that sequence without reducing too much at a time. A combination of RLAIF with an Arbiter Pattern (2/3 or 4/5) is extremely effective. A minimal sketch of one such cycle follows this list.
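Here's a minimal sketch of one reduce-then-refine iteration, assuming magnitude pruning for the reduction step and LoRA for the refinement step. The base model, sparsity amount, and LoRA hyperparameters are illustrative stand-ins, not my actual pipeline.

```python
# One reduce-then-refine iteration: prune a little (reduction), then recover
# quality with parameter-efficient fine-tuning (refinement), and repeat.
# Model name, sparsity, and LoRA settings are illustrative only.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # or your own checkpoint

def reduce_step(model, amount: float = 0.1):
    # Zero out the lowest-magnitude weights in each Linear layer; keeping `amount`
    # small is the "don't reduce too much at a time" part.
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the zeros into the weight tensor
    return model

def refine_step(model):
    # Parameter-Efficient Fine-Tuning (LoRA) to recover what the reduction step lost.
    config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                        task_type="CAUSAL_LM")
    peft_model = get_peft_model(model, config)
    # ... run your usual fine-tuning loop here (Trainer, custom loop, RLAIF, etc.) ...
    return peft_model.merge_and_unload()  # fold the adapters back into the base weights

for _ in range(3):  # alternate: reduce a little, refine, repeat
    model = reduce_step(model)
    model = refine_step(model)
```

In practice you gate each cycle on evals inside that fine-tuning loop; in spirit, that's where the arbiter vote (2-of-3 or 4-of-5 judge models agreeing) decides whether a refined checkpoint is accepted before the next reduction.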
You can train a competitive model with a budget of $1,000 to $15,000. You just need an extremely high-quality dataset, mediocre knowledge of what you’re doing, experimentation with the reduction/fine-tuning cycle, and creativity - in that order.
Many highly specialized vector DBs that are dynamically selected as part of the proper context and tooling will always outperform generic and bloated embeddings. Having hundreds of automatically updated and cached embedding sets, each for a specific combination of context and tooling, helps both maintainability and performance. This way, when you update an underlying data source or some tooling, you simply initiate a process that updates all embeddings containing either the data or the tooling. Final pro-tip: you can use tokens or embeddings instead of English as step input or output.
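Here's a minimal sketch of that idea, assuming an in-memory index per (context, tool) pair with invalidation when an underlying source or tool changes. The embed() stub, the key scheme, and the cosine-similarity search are illustrative stand-ins; swap in your vector DB of choice.

```python
# Many small, per-(context, tool) embedding sets, selected per step and rebuilt
# only when their underlying data source or tooling changes.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("plug your embedding model in here")

class EmbeddingSet:
    def __init__(self, docs: list[str]):
        self.docs = docs
        self.vectors = embed(docs)  # cached until the set is invalidated

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed([query])[0]
        sims = self.vectors @ q / (np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q))
        return [self.docs[i] for i in np.argsort(-sims)[:k]]

INDEXES: dict[tuple[str, str], EmbeddingSet] = {}  # one small index per (context, tool)

def index_for(context: str, tool: str, docs: list[str]) -> EmbeddingSet:
    key = (context, tool)
    if key not in INDEXES:
        INDEXES[key] = EmbeddingSet(docs)
    return INDEXES[key]

def invalidate(source_or_tool: str) -> None:
    # When a data source or tool is updated, rebuild only the sets that touch it.
    for key in [k for k in INDEXES if source_or_tool in k]:
        del INDEXES[key]
```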
There will also be an “agentive AI” craze during which proper system architecture will be undervalued. Unspecialized models that are far too large will be used, along with poorly selected tooling and context. People will then realize that unless such a system is composed of small, atomic operations, it’s almost impossible to update or maintain.
The above is a minuscule (at best) part of my startup's competitive advantages, and it's why I don't talk to VCs that "want to find out more about my startup" - neither should you.
As usual, by the time this catches on, startups that know what they’re doing will be quarters or years ahead of the game. Compute budgets are overrated. In ML, and especially LMs, making things specialized and performant also makes them reliable.
What To Expect From This Blog
Look forward to multiple posts per week covering the latest and greatest in AI now that I've finally gotten around to starting this thing, which was inspired by
Topics and post frequency you can expect to see in the future: 3-5 RPW (rants per week, gotta get them benchmarks) about business, startups, behavioral economics, neuroscience and cognitive science, psychology, statistics, decision and experimental game theory, process architecture, venture capital, latest ML/AI news, and ofc memes: