AIQ - a proposed benchmark standard

Originally posted on 2023-04-21

AI is everywhere and this time it is for real. Computer vision took a massive jump a few years ago, but in my mind it’s nothing like the leap we’ve just seen in large language models (LLMs). OpenAI’s ChatGPT with GPT-3.5 can do some programming tasks pretty well, and ChatGPT with GPT-4 is even better. Literally every single day I use it, it impresses me with something. This has been non-stop, every day, for at least two weeks now.

Unfortunately, running these models on your home computer isn’t really feasible. They require too much storage, too much RAM, and on top of that they’re not publicly available (despite the name OpenAI having the word “Open” in it…).

There are open-source models you can use, but in my testing they don’t really compare. I still need to try a few more, but there are some very big problems with the current open-source LLM ecosystem.

Complexity

Good luck getting one of these things up and running. You’ll find lots of guides, lots of dead ends, and you will spend a LOT of time tweaking things. When you get tired of it, you’ll realize you could have just used ChatGPT to do whatever you wanted, faster.

Performance

Even once you think you have something working, you’ll have to wait a long time for each run just to find out whether it was worth it. A smaller model took about 88 seconds on my M1 Mac to respond to the prompt “Make a todo list”. And the response was gibberish.

The larger model I tried took 50 minutes (not a typo, 3000 seconds) to respond to the same prompt. That response looked more like English but was still completely incoherent and unusable.

Um, benchmarks?

Yes, you’ve finally gotten to my point here. I propose a new benchmark called AIQ that will give people an idea of how useful a particular configuration will be when they’re working with AI systems. I say “AI systems” instead of LLMs because who knows if the next big leap will be another LLM or not.

Let’s start with LLMs though. The easiest values to compare between models are the number of parameters and the numerical precision.

Let’s now create a hypothetical benchmark. Assume the model is called “SPeCTRE”, which stands for “Semantic Parsing, Contextualization, and Text Reasoning Engine”. It has 56 billion parameters stored at 32-bit floating-point precision, and this is the first version of the model. The specific task will be the TODOLISTv1 task. We would name this benchmark AIQ-TODOLISTv1-LLM-SPeCTREv1-56B-32bFP.

This would refer to a specific model running on bare metal. If the model is running inside a container, we would add the suffix -CONTAINER.
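As a rough illustration, here is a minimal Python sketch of how such a name could be assembled from its parts. Everything in it is hypothetical; build_aiq_name is just a name I made up for this example.

def build_aiq_name(task, model_type, model, params, precision, containerized=False):
    """Assemble a hypothetical AIQ benchmark identifier from its parts."""
    name = f"AIQ-{task}-{model_type}-{model}-{params}-{precision}"
    if containerized:
        name += "-CONTAINER"  # suffix for a model running inside a container
    return name

# The hypothetical SPeCTRE example from above:
print(build_aiq_name("TODOLISTv1", "LLM", "SPeCTREv1", "56B", "32bFP"))
# AIQ-TODOLISTv1-LLM-SPeCTREv1-56B-32bFP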

What values would we expect to get out of this? This is where it gets fun. The obvious ones we can pull from langchain are:

  • Load time - time
  • Sample time - time per run
  • Prompt evaluation time - time per token
  • Evaluation time - time per run
  • Total time - time
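Here’s a minimal sketch of how those numbers might be collected into a report, assuming you record the timings yourself. The TimingReport class and its field names are my own invention for this post, not anything pulled from an existing library.

from dataclasses import dataclass

@dataclass
class TimingReport:
    """Hypothetical container for the raw timing metrics, all in milliseconds."""
    load_ms: float                    # load time
    sample_ms_per_run: float          # sample time, per run
    prompt_eval_ms_per_token: float   # prompt evaluation time, per token
    eval_ms_per_run: float            # evaluation time, per run
    total_ms: float                   # total time

    def render(self, system, benchmark):
        """Format the metrics in the report layout used later in this post."""
        return "\n".join([
            f"System type: {system}",
            f"Benchmark: {benchmark}",
            f"Load time: {self.load_ms:.2f} ms",
            f"Sample time: {self.sample_ms_per_run:.2f} ms per run",
            f"Evaluation time: {self.eval_ms_per_run:.2f} ms per run",
            f"Prompt evaluation time: {self.prompt_eval_ms_per_token:.2f} ms per token",
            f"Total time: {self.total_ms:.2f} ms",
        ])

Filled in with the numbers from my own run, this produces exactly the kind of report shown further down.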

But I’m more interested in putting the model through an actual time-limited IQ test. Obviously people would eventually game the system by training their models on the IQ test, and obviously the test will hit its ceiling at some point, but let’s set that aside for now since we can adjust for it later. It would be really interesting to see how much higher a faster computer scores than a slower one. We’d name these benchmarks similarly, but instead of TODOLISTv1 we’d have AIQv1 or something, and we could expand the names if we want to use different versions of the test.
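To make the idea concrete, here is a very rough sketch of what a time-limited test runner could look like. Everything in it is made up for illustration: ask_model stands in for whatever inference call you’re using, the questions are assumed to be (prompt, expected answer) pairs, and the score mapping is a placeholder rather than a real psychometric scale.

import time

def run_aiq_test(ask_model, questions, time_limit_s=1800.0):
    """Administer a time-limited battery of questions and return a toy AIQ score.

    ask_model: callable that takes a prompt string and returns the model's answer.
    questions: list of (prompt, expected_answer) pairs.
    """
    deadline = time.monotonic() + time_limit_s
    correct = 0
    for prompt, expected in questions:
        if time.monotonic() >= deadline:
            break  # out of time; unanswered questions simply score zero
        answer = ask_model(prompt)
        if expected.strip().lower() in answer.strip().lower():  # crude answer matching
            correct += 1
    # Placeholder scoring: fraction correct over the whole battery, scaled to 0-100.
    return round(100 * correct / len(questions)) if questions else 0

The point of the time limit is that a slower machine runs out of time sooner, answers fewer questions, and therefore scores lower, which is exactly the hardware difference I want the benchmark to capture.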

So the report might look like this:

System type: M1 Mac Mini
Benchmark: AIQ-TODOLISTv1-LLM-SPeCTREv1-56B-32bFP
Load time: 12418.97 ms
Sample time: 0.97 ms per run
Evaluation time: 10614.80 ms per run
Prompt evaluation time: 1932.82 ms per token
Total time: 3115398.08 ms

AIQ results
AIQ-AIQv1-LLM-SPeCTREv1-56B-32bFP: 11 

Since I never really got the model to do anything exciting, I think an IQ of 11 might be a little generous, but I had to put something.

Some day I hope I can shop for computers and sort them by IQ.