
SteveZ

48 karma · Redmond, WA, USA

Comments (5)

I'm curious: is the FTX stake in Anthropic now valuable enough to plausibly bail out FTX? Or at least to put a dent in the amount owed to the customers who were scammed?

I've lost track of the gap between assets and liabilities at FTX, but according to news reports this is a $4B investment for a minority stake, which implies Anthropic has a post-money valuation of at least $8B. Anthropic was worth $4.6B in June according to this article. So the $500M stake reportedly held by FTX ~~should~~ might be worth around double whatever it was worth in June, and possibly quite a bit more.

Edit: this article suggests the FTX asset/liability gap was about $2B as of June. So the rise in valuation of the Anthropic stake is certainly a decent fraction of that, though I'd be surprised if it's now valuable enough to cover the entire gap.

Edit 2: the math is not quite as simple as I made it seem above, and I've struck out the word "should" to reflect that. Anyway, I think the question is still the size of the minority share that Amazon bought (which has not been made public AFAICT), as that determines Anthropic's implied valuation.
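To spell out why the size of Amazon's share matters so much, here is a rough back-of-the-envelope, taking the reported figures at face value and assuming Amazon bought newly issued (primary) shares; f is FTX's pre-round fraction and x is the post-money fraction Amazon bought:

```latex
% Illustrative only: ignores other rounds, secondary sales, liquidation preferences, etc.
f \approx \frac{\$0.5\text{B}}{\$4.6\text{B}} \approx 11\%, \qquad
V_{\text{post}} = \frac{\$4\text{B}}{x}, \qquad
\text{FTX stake} \approx f\,\bigl(V_{\text{post}} - \$4\text{B}\bigr)
  = 0.11\left(\frac{\$4\text{B}}{x} - \$4\text{B}\right)
```

So if Amazon bought, say, 40%, the stake works out to roughly $0.7B, while a 15% purchase would imply roughly $2.5B; the undisclosed x really is the crux.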

Hi, thanks for writing this up. I agree the macro trends of hardware, software, and algorithms are unlikely to hold indefinitely. That said, I mostly disagree with this line of thinking. More precisely, I find it unconvincing because there just isn't much empirical evidence for or against these macro trends (e.g. natural limits to the growth of knowledge), so I don't see how you can use them to rule out certain endpoints as possibilities. And when an industry exec makes a statement about Moore's Law, I generally assume it is meant to reassure investors that the company is on the right path this quarter rather than to make a profound forward-looking statement about the future of computing. For example, since that 2015 quote, Intel lost the mobile market, fell far behind on GPUs, and is presently losing the datacenter market.

There are a number of well-funded AI hardware startups right now, and a lot of money and potential improvement on hardware roadmaps, including but not limited to: exotic materials, 3D stacking, high-bandwidth interconnects, new memory architectures, and dataflow architectures. On the AI side, techniques like distillation and pruning seem to be effective at allowing much smaller models to perform nearly as well. Altogether, I don't know if this will be enough to keep Moore's Law (and whatever you'd call the superlinear trend of AI models) going for another few decades, but I don't think I'd bet against it, either.

Machine learning involves repetitive operations which can be processed simultaneously (parallelization) 

I agree, but of course Amdahl's Law remains in effect.
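For reference, a minimal statement of Amdahl's Law, with p the parallelizable fraction of the work and N the number of parallel units:

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

Even with p = 0.95, no amount of parallel hardware gets you past a 20x speedup, which is why the serial parts of an ML pipeline (data movement and other non-parallel overheads) end up mattering.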

The goal of hardware optimization is often parallization (sic)

Generally, when designing hardware, the main goals are increased throughput or reduced latency (for some representative set of workloads). Parallelization is one particular technique that can help achieve those goals, but there are many ideas/techniques/optimizations that one can apply.
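A toy illustration of the distinction, with made-up numbers: a 4-stage pipelined multiply unit clocked at 1 GHz has

```latex
\text{latency} = \frac{4\ \text{stages}}{1\ \text{GHz}} = 4\ \text{ns per result}, \qquad
\text{throughput} \approx 10^{9}\ \text{results/s once the pipeline is full.}
```

Duplicating the unit (parallelization) doubles throughput but leaves latency untouched; a faster clock or a shorter pipeline is what improves latency, and caching, memory bandwidth, and smarter scheduling are other levers that aren't parallelization at all.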

The widespread development of machine learning hardware started in mid-early 2010s and a significant advance in investment and progress occurred in the late 2010s 

Sure... I mean deep learning wasn't even a thing until 2012. I think the important concept here is that hardware designs have a long time horizon (generally 2-3 years), because it takes that long to do a clean-sheet design, and also because if you're spending millions of dollars to design, tape out, and manufacture a new chip, you need to be convinced that the workload is real and that people will still be using it years from now when you're trying to sell your new chip.

CUDA optimization, or optimization of low-level instruction sets for machine learning operations (kernels), generated significant improvements but has exhausted its low-hanging fruit 

Like the other commenter, I think this could be true, but I'm not sure what the argument for it is. And again, it depends on the workload. My recollection is that even early versions of cuDNN (circa 2015) were good enough that you got >90% of peak floating-point performance on at least some of the CNN workloads common at that time (of course transformers hadn't been invented yet).
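To make "percent of peak floating-point performance" concrete (the numbers below are purely illustrative, not measurements of any particular GPU):

```latex
\text{utilization} = \frac{\text{achieved FLOP/s}}{\text{peak FLOP/s}},
\qquad \text{e.g.}\ \frac{5.5 \times 10^{12}}{6.0 \times 10^{12}} \approx 92\%.
```

Once a kernel is in that regime, there simply isn't much headroom left for further low-level tuning, which is the sense in which the low-hanging fruit was picked early.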

The development of specialized hardware and instruction sets for certain kernels leads to fracturing and incentivizes incremental development, since newer kernels will be unoptimized and consequently slower 

This could be true, I suppose. But I'm doubtful, because those hardware designs are being produced by companies that have studied the workloads and are convinced they can do better. If anything, competition may incentivize all hardware manufacturers to spend more time optimizing kernel performance than they otherwise would.

intermediate programs (interpreters, compilers, assemblers) are used to translate human programming languages into increasingly repetitive and specific languages until they become hardware-readable machine code. This translation is typically done through strict, unambiguous rules, which is good from an organizational and cleanliness perspective, but often results in code which consumes orders of magnitude more low-level instructions (and consequently, time) than if they were hand-translated by a human. This problem is amplified when those compilers do not understand that they are optimizing for machine learning: compilation protocols optimized to render graphics, or worse for CPUs, are far slower.

This is at best an imperfect description of how compilers work. I'm not sure what you mean by "repetitive", but yeah, the purpose is to translate high-level languages to machine code. However:

  • Hardware does not care about code organization and cleanliness, nor does the compiler. When designing a compiler/hardware stack the principal metrics are correctness and performance. (Performance is very important, but in relative terms is a distant second to correctness.)
  • The number of instructions in a program, assembly or otherwise, is not equivalent to runtime. As a trivial example, "while(1)" is a short program with infinite runtime. Some optimizations, such as loop unrolling, increase instruction count while reducing runtime (see the unrolling sketch after this list).
  • Such optimizations are trivial for a compiler, and tricky but possible for a human to get right. 
  • "often results in code which consumes orders of magnitude more low-level instructions": not sure what this means. Compilers are pretty efficient, you can play around with source code and see the actual assembly pretty easy (e.g. Godbolt is good for this). There's no significant section of dead code being produced in the common case. 

    (Of course the raw number of instructions increases from C or whatever language, this is simply how RISC-like assembly works. "int C = A + B;" turns into "Load A. Load B. Add A and B. Allocate C on the stack. Write the computed value to C's memory location." See the add-lowering sketch after this list.)
  • Humans can sometimes beat the compiler (particularly for tight loops), but compilers in 2023 are really good. I think the senior/junior engineer vs. compiler example is wrong. I would say (for a modest loop or critical function): the senior engineer (who has much more experience and knows which tools, metrics, and techniques to use) can gain a modest improvement by spending significant time. The junior engineer would probably spend even more time for only a slight improvement.
  • "This problem is amplified when those compilers do not understand that they are optimizing for machine learning": Compilers never know the purpose of the code they are optimizing; as you say they are following rule-based optimizations based on various forms of analysis. In LLVM this is basically analysis passes which produce data for optimization passes. For something like PyTorch, "compilation" means PyTorch is analyzing the operation graph you created and mapping it to kernel operations which can be performed on your GPU.
  • "compilation protocols optimized to render graphics, or worse for CPUs, are far slower": I don't understand what you mean by this. What is a compilation protocol for graphics? Can you explain in terms of common compiler/ML tools? (E.g. LLVM MLIR, PyTorch, CUDA?)
  • I honestly don't understand how the power plant/flashlight analogy corresponds to compilers. Are you saying this maps to something like LLVM analysis and optimization passes? If so, this is wrong; running multiple passes with different optimizations increases performance. Running multiple optimization passes was historically (i.e. circa the early 2000s) hard for compilers to do, but (LLVM author) Chris Lattner's key idea was to perform all the optimizations on a simple intermediate representation (IR) before lowering to machine code.
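A minimal C sketch of the loop-unrolling point above (the function names and the factor of 4 are arbitrary; compilers such as GCC and Clang will often do this transformation themselves at higher optimization levels):

```c
#include <stddef.h>

/* Baseline: one add plus loop bookkeeping (compare, branch, index update)
 * per element. */
float sum_simple(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by 4: more instructions in the binary, but only a quarter of the
 * branch/bookkeeping overhead per element and more instruction-level
 * parallelism, so it typically runs faster. (The reassociation into four
 * partial sums slightly changes floating-point rounding, which is why
 * compilers usually need a flag like -ffast-math to do this for floats.) */
float sum_unrolled4(const float *x, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; i++)   /* leftover elements when n is not a multiple of 4 */
        s += x[i];
    return s;
}
```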
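And a minimal sketch of the add-lowering point; exact output depends on compiler, flags, and target ISA, so the comments only describe the rough shape (Godbolt makes it easy to check for yourself):

```c
/* Roughly what "int c = a + b;" becomes. */
int add(int a, int b) {
    int c = a + b;   /* unoptimized (-O0), RISC-style: load a and b from the */
    return c;        /* stack, add, store c, reload c for the return value   */
}                    /* optimized (-O2): the whole function is a single add plus a return */
```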