As Robin Hanson says, the sheer variety of products we build is actually bad, because it increases unit costs. This is especially clear in laptops: there are far too many models with too little to distinguish them, each with its own nonsense minor issues. As such, I think we need a new streamlined and harmonized lineup of all laptops:
- Cheapest Possible Technically Functional Laptop
- Mediocre Office and Home Laptop (to be issued to most office workers and people who want to edit spreadsheets or emails and such)
- CEO Laptop (reasonably fast, expensive, big battery for CEO activities)
- Programmer Laptop (ThinkPad-like focused on CPU performance and reasonable portability)
- Gamer Laptop (16" Legion-like with middling battery life and decently high-powered CPU/GPU)
- Gamer Laptop (Big) (17"-18" desktop replacement)
- Technician Laptop (smallish thick and rugged laptop with many ports)
- Multimedia Laptop (Mediocre Office and Home Laptop with a nicer display and better graphics)
There would also be a version number updated whenever new components are available, of course. There can perhaps be two or three variants of each (with the same chassis, board, etc but different components) with different pricing, but no more.
Anyone optimistic about society adapting sanely to AGI should look at the uptake of IPv6.
Why do all three of the reasonably okay AI music tools (Udio, Suno, Riffusion) have fairly similar artifacts? Except for, I think, older versions of Udio, they all sound consistently off in some way I don't know enough music theory to explain, particularly in metal vocals and/or complex instrumentals. Do they all use the same autoencoders or something?
Street-Fighting Mathematics is not actually related to street fighting, but you should read it if you like estimating things. There is much power in being approximately right very fast, and it contains many clever tricks which are not obvious in advance but are widely applicable. My favourite part so far is this exercise - you can uniquely (up to a dimensionless constant) identify this formula just from some ideas about what it should contain and a small linear algebra problem!
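For flavour, here is the dimensional-analysis-as-a-small-linear-algebra-problem trick applied to the classic pendulum period (a stand-in example of my own, not the book's actual exercise):

```python
from fractions import Fraction

# Guess that the period depends on length, gravity and mass: T ~ l^a * g^b * m^c.
# Matching dimensions (metres, seconds, kilograms) on both sides gives a tiny
# linear system in the exponents:
#   metres:    a + b = 0
#   seconds:  -2b    = 1
#   kilograms: c     = 0
b = Fraction(-1, 2)  # from the seconds equation
a = -b               # from the metres equation
c = Fraction(0)      # mass drops out entirely
print(f"T ~ l^{a} * g^{b} * m^{c}")  # T ~ sqrt(l/g), up to a dimensionless constant
```

Three constraints pin down all three exponents, so the formula is forced before you do any physics.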
People are claiming (I don't know much RL) that DeepSeek-R1's training process is very simple (based on the paper: https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf) - a boring standardish (for LLMs) RL algorithm optimizing for reward on some ground-truth-verifiable tasks (they don't say which). So why did o1 not happen until late 2024 (public release) or late 2023 (rumours of Q*)? "Do RL on useful tasks" is a very obvious idea. I think the relevant algorithms are older than that.
The paper says that they tried applying it to smaller models and it didn't work nearly as well, so "base models were bad then" is a plausible explanation, but it doesn't hold up - GPT-4-base is probably a generally better (if costlier) model than 4o, which o1 is based on (though o1 could be distilled from a secret bigger one); and LLaMA-3.1-405B used a somewhat similar posttraining process and is about as good a base model, but is not competitive with o1 or R1. So I don't think it's that.
What's going on here? Is the process simple-sounding but filled with pitfalls DeepSeek don't mention? What has changed between 2022/23 and now such that we have at least three decent long-CoT reasoning models around?
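For reference, the recipe as the paper describes it (GRPO with rule-based rewards on verifiable tasks) really is simple at its core. A toy sketch, with a deliberately dumb verifier of my own invention standing in for their answer-checking:

```python
# Toy sketch of the verifiable-reward part of R1-style RL (GRPO-flavoured).
# The rule-based verifier and the group-normalized advantages are the core:
# no learned reward model, no value network.

def reward(completion: str, ground_truth: str) -> float:
    """Rule-based reward: 1 if the final answer matches the ground truth, else 0."""
    final_answer = completion.strip().split()[-1]
    return 1.0 if final_answer == ground_truth else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize rewards within a group of samples
    for the same prompt, instead of training a value network as a baseline."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid dividing by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# A group of sampled completions for "What is 2 + 2?":
group = ["2 + 2 = 4 so the answer is 4", "I think it is 5", "the answer is 4"]
rs = [reward(c, "4") for c in group]
advs = group_advantages(rs)
```

The actual policy update (clipped ratios, a KL penalty against a reference model) sits on top of these advantages, but the absence of any learned reward signal is much of why the pipeline sounds so simple.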
Religion has progressed, historically, from:
- there are a very large number of widely dispersed gods and you don't know about the vast majority of them
- there are quite a few gods, but a bounded number
- there is exactly one god
- there are exactly zero gods
By extrapolation, we can conclude that the next step is that humanity has negative one god, i.e. is in theological debt and must build a god to continue. This is where the EY-style "aligned singleton" came from. But people are now moving toward "we need everyone to have pocket gods" because they are insane, in line with the pattern. The next step is of course "we need to build gods and put them in everything".
It annoys me that my bank makes it so onerous to send payments ever. Five confirm screens and an 8-character base36 OTP I can't fit in working memory. I get why (they are required to reimburse you if you get defrauded and happen to use the bank's push payments while being defrauded, in some circumstances) but this is a very silly consequence.
I finally got round to watching the political documentary "Yes, Minister". It would be very funny if it were fictional, which I am told it is not.
DeepSeek V3 was unexpectedly released recently. It's a decently big (685 billion parameters) model and apparently outperforms Claude 3.5 Sonnet and GPT-4o on a lot of benchmarks. And they release the base model! Very cool. Some notes:
- They don't make this comparison, but the GPT-4 technical report has some benchmarks of the original GPT-4-0314 where it seems to significantly outperform DSv3 (notably, WinoGrande, HumanEval and HellaSwag). I can't easily find evaluations of current-generation cost-optimized models like 4o and Sonnet on this. Is this just because GPT-4 benefits lots from posttraining whereas DeepSeek evaluated their base model, or is the model still worse in some hard-to-test way? GPT-4 is reportedly ~1.8T parameters trained on about as much data.
- It's conceivable that GPT-4 (the original model) is still the largest (by total parameter count) model (trained for a useful amount of time). The big labs seem to have mostly focused on optimizing inference costs, and this shows that their SOTA models can mostly be matched with ~600B. We cannot rule out larger, better models not publicly released or announced, of course.
- DeepSeek has absurd engineers. They have 2048 H800s (slightly crippled H100s for China). LLaMA 3.1 405B is roughly competitive in benchmarks and apparently used 16384 H100s for a similar amount of time. This is due to some standard optimizations like Mixture of Experts (though their implementation is finer-grained than usual) and some newer ones like Multi-Token Prediction - but mostly because they fixed everything making their runs slow. They avoid tensor parallelism (interconnect-heavy) by carefully compacting everything so it fits on fewer GPUs, designed their own optimized pipeline parallelism, wrote their own PTX (roughly, Nvidia GPU assembly) for low-overhead communication so they can overlap it better, fixed some precision issues with FP8 in software, casually implemented a new FP12 format to store activations more compactly, and have a section suggesting hardware design changes they'd like made.
- It should in principle be significantly cheaper to host than LLaMA-3.1-405B, which is already $0.8/million tokens.
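On the hosting-cost point, the back-of-envelope is straightforward: forward-pass compute scales with *active* parameters, and V3 activates only ~37B of its parameters per token. A rough sketch, using the standard ~2 FLOPs per active parameter per token approximation (and ignoring attention details, batching and memory bandwidth, which matter a lot in practice):

```python
# Rough arithmetic for why an MoE model can be much cheaper to serve than a
# dense one of similar benchmark performance. Parameter counts are from the
# respective public reports.

def flops_per_token(active_params: float) -> float:
    return 2 * active_params  # ~2 FLOPs per active parameter per token

dsv3_active = 37e9    # DeepSeek V3: ~37B active per token (of ~671B total)
llama_active = 405e9  # LLaMA 3.1 405B: dense, so every parameter is active

ratio = flops_per_token(llama_active) / flops_per_token(dsv3_active)
print(f"~{ratio:.0f}x fewer forward-pass FLOPs per token")  # ~11x
```

Serving economics don't track FLOPs exactly (you still have to hold all ~685B parameters in memory), but an order of magnitude in compute per token is hard to ignore.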
Mass-market robot dogs now beat biological dogs in TCO.
When analyzing algorithms, O(log n) is actually the same as O(1), because log n ≤ 64. Don't believe me? Try materializing 2^64 things on your computer. I dare you.
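Concretely, with 2^64 as a generous bound on anything addressable:

```python
import math

# If you can materialize n items at all, n fits in a 64-bit address space,
# so any "log n" factor is bounded by a small constant.
n_max = 2 ** 64
print(math.log2(n_max))     # 64.0
print(math.log2(10 ** 12))  # ~39.9 - even a trillion items is well under 64
```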
https://pmc.ncbi.nlm.nih.gov/articles/PMC10827157/
What other things are hiding in underanalyzed sequence data?
This paper is kind of hilarious: https://www.nber.org/papers/w31047
Apparently "hyperbolic discounting" - the phenomenon where humans incorrectly weight future rewards ("incorrectly" in that if you use any curve which isn't exponential you will regret it at some point) - isn't necessarily some kind of issue of "self-control", or due to uncertain future gains. It results from humans being really bad at calculating exponentials.
It's always "exciting" when you have a problem and it turns out that your problem is addressed by some research from the last year.
The posthuman technocapital singularity is reaching backward in time to give itself a good soundtrack: https://www.youtube.com/watch?v=86fZ50TysOg
(thanks to Dmytro and MusicPerson and I guess Udio's engineers.)
It begins.
Real computers pull several kilowatts and can be heard from several rooms away. Real computers need GPU power viruses to even out variations in power draw in order to not take down the grid. Real computers have to have staggered boot sequences to avoid destabilizing the radiation pressure/gravity equilibrium in the Sun.
Apparently the CalDAV server I use, Radicale, can in some circumstances permanently lock up and begin rejecting all requests to add or edit events with a 400 error, which it then doesn't explain due to poorly configured logging, and the cause of which turns out to be buried three layers deep in libraries. In other news, I'm wiping that install and switching to an alternative, ideally one not written in Python.
Georgism is not going far enough. We need to apply Georgism to the akashic records and all mathematical abstractions in order to land-value-tax domain names, copyright, etc.
This is a very clean explanation of much of the modern media ecosystem: https://cameronharwick.com/writing/high-culture-and-hyperstimulus/. My read is basically that hard-to-replicate entertainment is higher-status because if you enjoy easy-to-produce things you're more open to exploitation (spending too many resources on those easy things).