Notes from Kai Wu on Flirting With Models

From Flirting With Models host Corey Hoffstein:

My guest in this episode is Kai Wu, CEO and founder of Sparkline Capital. Kai is a pioneer in the measurement of intangible value.  Using machine learning, he tackles unstructured data sources like patent filings, earnings transcripts, LinkedIn network connections, and GitHub code repositories to try to measure value across the four key pillars of Brand, Intellectual Property, Network, and Human Capital.

We discuss why intangibles are important, how they differ from the traditional factor zoo, the opportunities and risks of unstructured data, and how even big data can have small data problems within it.

Finally, we discuss Kai’s most recent applications of his research to the world of crypto.


A word on my notes:

These are just interesting bits that stood out to me, not a comprehensive summary. Kai and Corey pick over many nuanced questions related to unstructured data, meta-problems in data analysis, distinctions between Kai’s “4 pillars”, and techniques. I encourage you to listen to the whole episode to appreciate the depth that both of them are able to bring to the discussion. Kai has thought about these problems deeply and Corey, despite being an outsider, asks extremely pointed questions reflecting his own deep appreciation for the pitfalls of number-crunching.

Challenge of Machine Learning In Investing:

To truly take advantage of machine learning requires rather large investments. Alternative data and the infrastructure required to support it can be very expensive. Even worse, the prohibitive item here really is getting the right people to run it. Machine learning is complicated and has many pitfalls. It’s also a relatively new field, so the pool of experienced folks is pretty small.

I actually wrote a paper in May 2019 called Investment Management in the Machine Learning Age. In this paper, I outlined three ways to apply machine learning to the industry:

  1. Use of machine learning to bring unstructured data into the investment process
  2. Data mining. This is the idea of taking hundreds if not thousands of features or signals and allocating capital across them, deciding which ones you want to invest in and which you want to ignore.
  3. Risk models.
  • In the quant world, we’ve seen the most effort applied to the second use case, in other words, trying to figure out how to allocate capital across these thousands of features. This has actually had significant success, but mostly at higher frequencies. For capacity reasons, most capital is managed at lower frequencies, so it doesn’t matter as much for the average investor. The problem is that at lower frequencies we have a small data problem. For example, every decade there are only 10 annual filings, and these are often serially correlated, so the true dimensionality is actually quite a bit smaller.
  • I think we haven’t seen as much innovation on the risk model front. This is an underappreciated dimension. Quants use risk factor models such as Barra in US equities. Barra works by identifying industry factors like “tech” and “consumer discretionary” and a few dozen style factors: value, growth, etc. The Barra model has been largely unchanged since becoming the industry standard several decades ago. I think the biggest weakness of the model is actually its reliance on the GICS industry classifications. These are binary definitions across 11 different sectors, so firms like Tesla can’t be both tech and auto. They’re also very static. If a company like Amazon starts investing in a new business like AWS, that doesn’t get incorporated into the risk model. We’ve actually shown that natural language processing models can be used to create superior text-based industry definitions that capture more of the richness and nuance of the business landscape. In this framework, for example, Tesla will be considered similar to both GM and Ford, and also to Apple.
  • The final area, which has yet to be fully realized but has the most potential, is this idea of unstructured data. The best way to define unstructured data is by opposition to structured data. Structured data is the information you find in Excel spreadsheets and SQL databases. It’s price, volume, and financial ratios like P/E ratios. Unstructured data, on the other hand, is everything else. It’s text, images, audio, video, and any other visual information. Unstructured data is 80% of outstanding data, and it’s growing exponentially. It’s doubling every one to two years. Importantly, it’s also being created faster than it can be structured, meaning that the 80% figure is an underestimate because as we move forward through time, it’s only set to increase. And of course, it’s not just quantity. Unstructured data can also contain a lot of valuable information about companies. At Sparkline, we look at LinkedIn to measure human capital, Glassdoor to measure culture, patents for innovation, and Twitter for brand. For the most part, investors are not using this data, at least in a systematic way. We’ve seen some unstructured data, such as news sentiment, become popular, but I think it’s really only scratching the surface of what this data can offer.
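To make the idea of text-based industry definitions a bit more concrete, here is a minimal sketch of one way to score how similar two businesses are from their descriptions. It is not Sparkline’s method, which the episode does not spell out. It uses plain word counts and cosine similarity, and the firm labels, descriptions, and stopword list are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"and", "the", "of", "a"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical one-line business descriptions, not real filings.
descriptions = {
    "ev_maker": "designs electric cars batteries and self driving software",
    "legacy_auto": "manufactures cars trucks and engines",
    "device_maker": "designs phones computers and software",
    "grocer": "sells food and household goods",
}

# Unlike a single sector label, the score can be non-zero against several peers.
similarities = {
    name: cosine_similarity(descriptions["ev_maker"], text)
    for name, text in descriptions.items()
    if name != "ev_maker"
}
print(similarities)
```

The point of the sketch is only that similarity becomes a continuous score rather than a single sector bucket, so one firm can sit close to both an automaker and a device maker at once.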

“Value is not dead, it just needs to be reformed”

The father of value investing Ben Graham wrote Security Analysis in the 1930s. The world was very different. The big companies were railroads and industrial firms. Buying stocks below book value was a reliable way to make money. Fast forward to today. We have Google and Apple which don’t use tangible capital to generate earnings. They rely on intangibles. We have these four pillars at Sparkline:

  1. intellectual property
  2. brand
  3. human capital
  4. network effects

These are the pillars most firms rely on today. Our research has shown that intangible capital has grown from basically 0% to 60% to 80% of the capital stock of the S&P 500. Meanwhile, the efficacy of traditional value metrics like trailing earnings or book value has declined. Baruch Lev and Feng Gu, in their excellent book The End of Accounting, show that the R-squared of using book value and earnings to explain market caps cross-sectionally used to be 90% in 1950 and had fallen to around 50% by 2010, and this was 10 years ago. So I’m not the first person to argue that value investors need to incorporate intangible assets into their assessment of corporate value. But as far as I can tell, we are the first firm to use machine learning and unstructured data to measure this value. For example, we use LinkedIn data to track the flow of human capital from company to company, or Twitter to measure the brand perception of firms. These datasets require using machine learning to take the unstructured data and form it into factors, which we can then use to trade each of these four pillars. So basically, we have two big insights at the firm.

  1. The economy is becoming increasingly intangible, but investors and accountants are failing to adapt.
  2. Unstructured data is exploding and it contains valuable insights on the intangible economy that can be unlocked using machine learning.

By combining these two insights, we hope to help investors access the opportunities in these undervalued intangible assets.

The state of research around intangible assets

There are a dozen or so researchers who have written about how to incorporate intangibles into measures of book value. While they each have slightly different approaches, the common theme is that they all rely on accounting data to measure intangible assets. To be more specific, they focus on two particular line items in the accounting statements.

  1. R&D
  2. SG&A (selling, general, administrative expenses.  SG&A is kind of a catch-all idea that captures many things including sales and marketing expenses.)

The idea is that R&D and SG&A are expensed rather than capitalized. For example, if I were to spend $10 million building a factory to manufacture a new drug that I developed, that capex is capitalized and goes on my balance sheet. On the other hand, $10 million of R&D to develop the drug that will then be manufactured is considered a cost that comes out of income. This inconsistency means that investments in intangible capital are treated not as an asset but as an expense. So, led by Baruch Lev, who we mentioned just a second ago, a lot of different researchers have now decided to treat intangible investments the same way they do tangible investments, in other words, to build balance sheet assets for intellectual property and brand.
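As a rough illustration of what “building a balance sheet asset” from expensed R&D can look like, here is a minimal sketch that accumulates each year’s R&D spend into a stock and depreciates the prior balance. The 20% depreciation rate, the function names, and all the dollar figures are assumptions for illustration only. The episode does not say which parameters any particular researcher uses.

```python
def capitalize_rnd(rnd_by_year, depreciation_rate=0.2):
    """Turn a list of annual R&D expenses into a running capitalized stock.

    Each year the prior stock is depreciated and the new spend is added.
    The depreciation rate is an assumed parameter, not a sourced value.
    """
    stock = 0.0
    stocks = []
    for spend in rnd_by_year:
        stock = stock * (1.0 - depreciation_rate) + spend
        stocks.append(stock)
    return stocks

def adjusted_price_to_book(market_cap, book_value, intangible_stock):
    """Price-to-book where book value includes the capitalized intangible stock."""
    return market_cap / (book_value + intangible_stock)

# Hypothetical firm: $10 million of R&D per year for three years.
rnd_stock = capitalize_rnd([10.0, 10.0, 10.0])

# Hypothetical market cap and reported book value, in $ millions.
plain_pb = 200.0 / 50.0
adjusted_pb = adjusted_price_to_book(200.0, 50.0, rnd_stock[-1])
print(rnd_stock, plain_pb, adjusted_pb)
```

A firm that spends heavily on R&D looks cheaper on the adjusted ratio than on reported book value alone, which is the whole motivation for the adjustment.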

If you take price-to-book plus capitalized R&D, you end up with this slightly more comprehensive version of a value factor. This adds somewhere between one and four points of excess returns each year to performance. The problem for us, though, is that value is still in a deep drawdown notwithstanding. So, while these are very sensible adjustments, they’re not a panacea. I think the limitations are twofold.

  1. There is a pretty weak relationship between the input costs and the output value for any intangible investment. The goal of accounting is to capture historic costs, but the ex post value of an intangible investment is very uncertain. The $10 million we spent on this new cancer drug could be worth a billion dollars or could be worth zero. Once it comes to market, this new drug could go viral or flop. So that’s the first problem.
  2. Accounting statements basically ignore the other two intangible pillars. All CEOs claim that their people are their greatest assets, but the only disclosure they put into the 10-Ks is headcount, which of course makes no distinction between the quality of employees or what functions they’re hired to do. And then finally, network effects. So when all is said and done, this means that we are forced to go beyond accounting data. And we believe that by using unstructured data, we can actually measure the output as opposed to the inputs of the R&D investment and the quality of human capital and network effects and brand. And this allows us to transcend some of these limitations.

Corey asks a brilliant question addressing the predator/prey dynamic of competitive markets

Corey: As more and more firms adopt NLP tools to rapidly trade news releases and earnings transcripts, how do you outrun the adversarial issue where CEOs may now get coached against using specific words and phrases, or coached to use specific words and phrases?

Kai’s answer confirms just how much of an arms-race market communication can be!

I love this question. Look, investing is like poker. It’s a game theoretic endeavor. One of my favorite papers is actually called How To Talk When A Machine Is Listening, and it has a really interesting finding. There’s this dictionary called the Loughran and McDonald dictionary. It consists of a bunch of lists of words, like positive and negative keywords. The key is that it’s adapted to the finance industry. It was created by two finance professors solely to classify financial jargon. It was published in 2011 and quickly became widely used in natural language processing. The paper How To Talk When A Machine Is Listening found that companies started to avoid using the negative Loughran and McDonald words in their 10-Ks and 10-Qs after this dictionary was published. So yeah, this is a very real thing. As investors watch and try to make sense of unstructured data and deceit in general, CEOs will try to manipulate the narrative to their advantage.
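To see why a word list is so easy to game, here is a minimal sketch of dictionary-based tone scoring. The word list below is a tiny hypothetical stand-in, not the actual Loughran and McDonald list, and both sentences are made up.

```python
import re

# Hypothetical stand-in word list for illustration only.
# The real finance dictionary is far longer and is not reproduced here.
NEGATIVE_WORDS = {"loss", "impairment", "litigation", "decline", "adverse"}

def negative_share(text, negative_words=NEGATIVE_WORDS):
    """Fraction of words in the text that appear in the negative word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in negative_words)
    return hits / len(tokens)

# Two made-up sentences describing the same bad news.
blunt = "We recorded a loss and an impairment after a decline in sales"
coached = "We recorded a shortfall and a writedown after softer sales"

blunt_score = negative_share(blunt)
coached_score = negative_share(coached)
print(blunt_score, coached_score)
```

The coached sentence scores as less negative even though the underlying news is identical, which is the kind of behavior the paper documents once the word list became public.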

How Kai’s team zeros in on actions not words to defend against CEOs that learn the right things to say

The way we deal with this is we define three buckets of data with varying levels of susceptibility to such a manipulation.

  1. Company communications. This is your 10-K, earnings calls, press releases, and anything coming directly from the mouthpiece of the company.
  2. Third-party information. Media, blogs, sell-side research, and company reviews (e.g., Glassdoor)
  3. Ground truth. I would classify human capital and patents in this category. A good example is to go back to our culture work. We wrote a paper called Measuring Culture, where we started off with the famous slide about Enron: its leaders went to jail for fraud, yet they proudly displayed the value of integrity in their office lobby. Most CEOs invariably love talking about how great their culture is, but this has no correlation with the true culture of a company. So to get around this problem, we don’t look at CEO interviews. Instead, we look to the opinions of rank-and-file employees. These are the opinions that on a day-to-day basis constitute the culture of a firm. Again, we use Glassdoor. The website allows current or former employees to review their employers. We find this data is a much more reliable source. In particular, we find that it’s not the number of stars that matters, but the information contained in the freeform text associated with each of these reviews. It gives us interesting clues about the facets of each company’s culture. A similar example would be that all CEOs just love talking about how they’re embracing innovation and digital transformation. But talk is cheap. So instead, we look at job postings and LinkedIn to see if companies are truly hiring talent in these areas. It’s easy to say you’re investing in innovation, but do you actually go out and spend the extra money to hire top computer vision PhDs from Carnegie Mellon? Do your employees actually have skill sets such as TensorFlow and PyTorch on their resumes? Are you really investing in AI?

Because crypto’s value is entirely intangible, it’s fertile ground for Sparkline’s methods

Porting our model into crypto was actually pretty seamless. Brand and human capital matter just as much for Web3 as for Web2 organizations. So we were really able to just apply the framework wholesale with no modifications. The big difference in crypto is that the data sources are different. But because Web3 is being built in the open, in many ways, crypto is actually an even more attractive area to apply this framework. So we focused on three different data sets.

  1. Blockchain data. By definition, we can see the history of a blockchain all the way back through time. It is publicly available and immutable. This allows us to form metrics for the adoption of a protocol. For example, we can calculate the number of daily active users or the dollar volume of transactions over any arbitrary time period. And this, of course, maps back to our pillar of network effects.
  2. GitHub. The really cool thing about crypto is that it’s all built on open-source principles, which of course is key for us. We see the source code of thousands of crypto projects today, as well as yesterday and at each point back in time to inception. This allows us to form metrics for human capital and intellectual property. For example, we can see the number of repo changes as a proxy for iteration over a period of time, or we can look at the growth of the developer community over the years.
  3. Social media data. While social media is of course important for all firms, it is especially important for Web3 projects, which are digitally native and involve the coordination of online communities across the globe. We can look at datasets such as Twitter, Reddit, Telegram, and Discord to track the growth of these online communities and brands. So now with these measures of fundamental value in place, we then compare them to the price you pay.
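As a small illustration of the adoption metrics mentioned in the blockchain item above, here is a sketch that computes daily active addresses and daily dollar volume from a list of transfers. The record layout, the addresses, and the amounts are all hypothetical. Real chain data would come from a node or an indexer and would need much more cleaning, and an address is only a rough proxy for a user.

```python
from collections import defaultdict

# Hypothetical transfers: (date, sender, receiver, usd_amount).
transactions = [
    ("2022-01-01", "addr_a", "addr_b", 100.0),
    ("2022-01-01", "addr_a", "addr_c", 50.0),
    ("2022-01-02", "addr_b", "addr_c", 25.0),
]

def daily_metrics(txs):
    """Return per-day active address counts and total dollar volume."""
    active = defaultdict(set)
    volume = defaultdict(float)
    for date, sender, receiver, usd_amount in txs:
        # Count both sides of a transfer as active on that day.
        active[date].update((sender, receiver))
        volume[date] += usd_amount
    return {
        date: {"active_addresses": len(active[date]), "usd_volume": volume[date]}
        for date in sorted(active)
    }

metrics = daily_metrics(transactions)
print(metrics)
```

The same grouping pattern would apply to the other two data sets, for example counting repo changes per month or new community members per week.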

I think what makes us confident in this strategy and gets us all excited about it is this is an inefficient frontier asset class, and very few other investors, if any, are approaching it with systematic valuations. So it just stands to reason that there might be some alpha here.

Does the 4 pillars of intangibles approach apply to assets that might never spit off a cash flow?

You’re right that in general, the token economics are a bit different from that of equities… Many of these projects are using tokens as a method of financing their growth, but they want to avoid technically calling them equity securities from a regulatory standpoint. That doesn’t diminish the actual value in these tokens. Let’s take the example of Ethereum. ETH is a utility token. It is required if you want to use the Ethereum network. Therefore, the value of ETH is a function of the demand for the Ethereum network. This logic applies to any other token, whether it’s a video game, a decentralized exchange, or a blockchain. The value of tokens will be a function of demand for the underlying project. So our framework attempts to establish what is the fundamental attraction of these underlying projects. In this way, we’re actually much more similar to venture capitalists. We think about these projects as early stage startups. They may not have monetized their projects or their users yet, but if they have a lot of users, a robust development community, and a strong brand, it certainly does bode well for their ability to flourish ultimately, which of course would somehow filter down to the token investors profiting.

[Kris: This response sparked a thought for me. A casino requires its chips to be able to play. But the chips themselves never increase in value even though they provide utility in the form of “access to entertainment and gambling”. And for poker pros, the chips are literally an on-ramp to their professional “business”. And still the chips do not increase in value. The analogy is weak since the casino can always produce more chips but it’s just a reminder that the value of a “token” is not just a function of its user base, but its supply, the incentive to increase its supply, and the alternatives. If a user can just cash out because there’s a quality competing casino or blockchain it acts as a limitation on any token’s value]

Identifying the 4 pillars in crypto

  1. Brand

    Dogecoin, which is a joke, has its main value on its brand. A lot of people think it’s funny, they like it. It’s kind of fun to play with. So its primary pillar is brand.

  2. IP

    On the infrastructure side, you have things like Filecoin for decentralized storage.

  3. Network Effects

    Decentralized exchanges. Similar to how the NYSE and CME derive value from the fact that you have many buyers and sellers who want to aggregate liquidity on their platform. Same thing for Uniswap and Sushi.

So yeah, very much the same concept here. What we’re trying to look for, same as with equities, are firms where you have a bit of everything. What we’ve discovered is that simply having one pillar is generally insufficient for success. I always give the example of Wozniak & Jobs. You have technology and IP, but you really need marketing as well. So what we’re looking for in crypto organizations, stocks, whatever (the asset class doesn’t matter) is strength on all of the advantages, or as many as possible.
