
Anders Thoresson

Additional Thoughts from Simon Willison's 2024 Review

Time to read: 3 minutes (3,401 characters)

Simon Willison's summary of AI developments in 2024 makes for compelling reading, as I have previously shared on this blog. After rereading it and working through my own notes, here are some additional thoughts.

As AI models grow more sophisticated, they place increasing demands on users. Yet, as Willison observes and I’ve already linkblogged about, most people are simply presented with a prompt input box without proper guidance on how to use it effectively.

A drum I’ve been banging for a while is that LLMs are power-user tools—they’re chainsaws disguised as kitchen knives. They look deceptively simple to use—how hard can it be to type messages to a chatbot?—but in reality you need a huge depth of both understanding and experience to make the most of them and avoid their many pitfalls.

What are we doing about this? Not much. Most users are thrown in at the deep end. The default LLM chat UI is like taking brand new computer users, dropping them into a Linux terminal and expecting them to figure it all out.

It's an important democratic issue to ensure that the gap between those who can use the models smartly and those who cannot remains as small as possible.

Most people have heard of ChatGPT by now. How many have heard of Claude? The knowledge gap between the people who actively follow this stuff and the 99% of the population who do not is vast. The pace of change doesn’t help either.

An important step in getting real value from these models is developing methods for evaluating their responses. The major global benchmarks measure general capabilities; for users, whether organizations or individuals, there is good reason to think about how best to evaluate both different models and different system prompts against their own tasks.

It’s become abundantly clear over the course of 2024 that writing good automated evals for LLM-powered systems is the skill that’s most needed to build useful applications on top of these models. If you have a strong eval suite you can adopt new models faster, iterate better and build more reliable and useful product features than your competition.
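To make that concrete, here is a minimal sketch of what a homegrown eval loop might look like. The test cases, the scoring rule and the model stub are placeholders I have made up for illustration; a real suite would use far more cases and probably a proper framework, but the shape is the same: fixed inputs, an automatic pass/fail check, and a score per model or system prompt.

```python
# Minimal sketch of an automated eval loop for comparing system prompts.
# Test cases, scoring rule, and call_model() are illustrative placeholders.

from dataclasses import dataclass


@dataclass
class Case:
    question: str
    must_contain: str  # crude pass/fail criterion: answer must include this


CASES = [
    Case("In which year did the Apollo 11 moon landing take place?", "1969"),
    Case("What is the chemical symbol for gold?", "Au"),
]

SYSTEM_PROMPTS = {
    "terse": "Answer in one short sentence.",
    "careful": "Answer briefly, and say you are not sure if you are uncertain.",
}


def call_model(system_prompt: str, question: str) -> str:
    """Stand-in for a real model call; replace with your provider's API."""
    # Canned answers so the script runs end to end without network access.
    return "1969." if "Apollo" in question else "The symbol for gold is Au."


def run_suite() -> None:
    for name, system_prompt in SYSTEM_PROMPTS.items():
        passed = sum(
            case.must_contain.lower()
            in call_model(system_prompt, case.question).lower()
            for case in CASES
        )
        print(f"{name}: {passed}/{len(CASES)} cases passed")


if __name__ == "__main__":
    run_suite()
```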

The environmental aspects of the models pull in two directions simultaneously:

On one hand, the models are becoming increasingly resource-efficient, which is reflected in their usage costs:

These price drops are driven by two factors: increased competition and increased efficiency. The efficiency thing is really important for everyone who is concerned about the environmental impact of LLMs. These price drops tie directly to how much energy is being used for running prompts.

Here’s a fun napkin calculation: how much would it cost to generate short descriptions of every one of the 68,000 photos in my personal photo library using Google’s Gemini 1.5 Flash 8B (released in October), their cheapest model?[...] That’s a total cost of $1.68 to process 68,000 images. That’s so absurdly cheap I had to run the numbers three times to confirm I got it right.
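For anyone who wants to check that kind of napkin math, here is a rough sketch of the arithmetic. The per-photo token counts and per-million-token prices are assumptions for illustration, chosen so the total lands near the $1.68 Willison reports; they are not quoted rates.

```python
# Rough napkin math for captioning a photo library with a cheap hosted model.
# Token counts and prices are illustrative assumptions, not quoted rates.

photos = 68_000
input_tokens_per_photo = 260       # assumed tokens to encode one image + prompt
output_tokens_per_photo = 100      # assumed tokens for a short description

price_per_million_input = 0.0375   # USD per million input tokens (assumed)
price_per_million_output = 0.15    # USD per million output tokens (assumed)

input_cost = photos * input_tokens_per_photo / 1_000_000 * price_per_million_input
output_cost = photos * output_tokens_per_photo / 1_000_000 * price_per_million_output

print(f"Input:  ${input_cost:.2f}")                 # ≈ $0.66
print(f"Output: ${output_cost:.2f}")                # ≈ $1.02
print(f"Total:  ${input_cost + output_cost:.2f}")   # ≈ $1.68
```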

For less efficient models I find it useful to compare their energy usage to commercial flights. The largest Llama 3 model cost about the same as a single digit number of fully loaded passenger flights from New York to London. That’s certainly not nothing, but once trained that model can be used by millions of people at no extra training cost.

On the other hand, gigantic data centers are currently being built around the world – and not always in locations where conditions for minimizing environmental impact are optimal:

The much bigger problem here is the enormous competitive buildout of the infrastructure that is imagined to be necessary for these models in the future.

In this context, it's interesting that increasingly powerful models can now be run on a regular computer.

That same laptop that could just about run a GPT-3-class model in March last year has now run multiple GPT-4 class models!

Just before writing this blog post, I installed Llama 3.2 Vision on my own computer (a MacBook Pro M1) and had it write captions for some of my photographs. It's slower and the results aren't quite as good as with one of the cloud services, but everything happens on my own hardware. And there are many use cases where speed doesn't matter and the quality of the text is good enough.
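For anyone who wants to try something similar, here is a minimal sketch of a captioning loop. It assumes the model is run locally through Ollama, which is one common way to run Llama 3.2 Vision on a Mac; the directory path and the prompt are placeholders.

```python
# Minimal sketch: caption local photos with Llama 3.2 Vision via Ollama.
# Assumes Ollama is installed and the model has been pulled with
# `ollama pull llama3.2-vision`; the directory path is a placeholder.

from pathlib import Path

import ollama  # pip install ollama

PHOTO_DIR = Path("~/Pictures/to-caption").expanduser()  # placeholder path

for photo in sorted(PHOTO_DIR.glob("*.jpg")):
    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{
            "role": "user",
            "content": "Write a short, factual caption for this photo.",
            "images": [str(photo)],
        }],
    )
    print(f"{photo.name}: {response['message']['content'].strip()}")
```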