A Very, Very Surface-Level Look At ChatGPT
A window into AI
If you have been following the news (barring everything about Trump), you have probably heard that there is now a waitlist for the GPT-4 API, but that you can use the ChatGPT API right now. From the official OpenAI site:
"The ChatGPT model family we are releasing today, gpt-3.5-turbo, is the same model used in the ChatGPT product. It is priced at $0.002 per 1k tokens, which is 10x cheaper than our existing GPT-3.5 models." — Source
Your OpenAI account may already have free tokens (mine didn't). I have a graph that displays API usage, and I can set limits on how large a bill my account can rack up before it freezes API usage. In other words, this still relies on reaching OpenAI's servers over the Internet: in spite of Medium articles like this, there is no easy way to run ChatGPT locally. I had a very short look at Dolly, an open-source alternative to ChatGPT, but it requires eight A100 GPUs, preferably accessed via a Databricks account.
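For the curious, here is a minimal sketch of what a call to that API looked like with the openai Python package as of this writing. It assumes your API key lives in the OPENAI_API_KEY environment variable and that your account has credit; the prompt itself is just a placeholder.

```python
import os
import openai

# Read the API key from the environment rather than hard-coding it.
openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # the ChatGPT model, $0.002 per 1k tokens
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in one sentence."},
    ],
)

print(response["choices"][0]["message"]["content"])
# Token usage is what the per-1k-token price applies to:
print(response["usage"]["total_tokens"])
```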
This post is going to be a surface-level look at how ChatGPT works, primarily drawing from a blog post by Stephen Wolfram and a second blog post by Molly Ruby. It will contain quotes and summaries, but will in no way be comprehensive.
If you have not heard about ChatGPT, I have a blog post here about it solving a rudimentary coding interview question…but at this point in time, Medium is saturated with ChatGPT content.
Quick Overview
ChatGPT is a generative, pre-trained transformer (that is what the "GPT" stands for) tuned for chat. At the simplest level, it strings sentences together using probabilities. Every time you add a word to a sentence, you can generate a ranked list of probabilities for which words are likely to come next. There are two obstacles: One, simply adding the highest-probability word every time produces surprisingly flat, boring text. Two, we do not have the computational resources to calculate probabilities for every possible combination of every possible word in the English language. ChatGPT uses a Large Language Model to estimate those probabilities.
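To make obstacle number one concrete, here is a toy sketch (the vocabulary and probabilities are invented purely for illustration) contrasting greedy decoding with temperature sampling, the usual fix for flat text:

```python
import random

# Toy next-word distribution (made-up numbers, purely illustrative).
next_word_probs = {
    "the": 0.30, "a": 0.20, "other": 0.18, "my": 0.15,
    "that": 0.10, "purple": 0.05, "quantum": 0.02,
}

# Greedy decoding: always take the single most likely word.
greedy = max(next_word_probs, key=next_word_probs.get)

# Temperature sampling: rescale the probabilities, then draw at random.
# Lower temperature pushes toward greedy; higher gives more surprising text.
def sample(probs, temperature=0.8):
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs.keys()), weights=weights)[0]

print(greedy)                    # "the", every single time
print(sample(next_word_probs))   # varies from run to run
```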
Neural nets were first invented in the 1940s, and today they are the most successful approach to problems such as image recognition. The human brain has roughly 100 billion neurons in a complex net, and the connections within this network have different "weights." Wolfram goes into more detail in his post, articulating how a neural net could be used to find the closest of three points given an (x, y) input. Wolfram then segues to image recognition and explains a major advantage of machine learning: a program can distinguish between a cat and a dog without needing explicit programming to look for features such as…say…cat ears. Instead, it learns by example. ChatGPT is a giant neural net with 175 billion weights that is particularly set up to deal with language. Its most notable architectural feature is its transformer.
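For reference, here is that closest-of-three-points task solved the explicit, non-neural way (the point coordinates are my own made-up example). The pitch of machine learning is that a net can learn to approximate this mapping from examples without ever being handed the distance formula:

```python
import math

# Three fixed points; given an (x, y) input, say which one is closest.
points = [(0.0, 0.0), (1.0, 2.0), (-2.0, 1.0)]

def closest_point(x, y):
    # Explicit rule: compute every distance and take the minimum.
    return min(range(len(points)),
               key=lambda i: math.dist((x, y), points[i]))

print(closest_point(0.9, 1.8))  # -> 1, the point at (1.0, 2.0)
```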
I will take this opportunity to transition into Molly Ruby's post in Towards Data Science. A Large Language Model digests huge quantities of text data, infers relationships between words, and predicts which word should come next in a sequence. Earlier sequence models had trouble keeping track of long-range context, so in 2017 Google Brain introduced the transformer. Transformers have "attention": the model learns which parts of the input sequence matter most for the current prediction, and those parts receive more weight.
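As a rough sketch of what "attention" computes, here is the scaled dot-product attention at the core of a transformer, written in plain NumPy with random toy embeddings:

```python
import numpy as np

# Scaled dot-product attention: each position scores every other position,
# and positions with higher scores contribute more to the output.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how relevant is each position?
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V               # weighted blend of the values

# Three toy positions with 4-dimensional embeddings (random, illustrative).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
print(attention(x, x, x))  # self-attention: queries, keys, values all from x
```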
All GPT models use the transformer architecture. InstructGPT was GPT-3's successor, and OpenAI documented its training in this 68-page paper. InstructGPT incorporated human feedback into the training process: 40 contractors fine-tuned GPT-3 by writing appropriate responses to prompts, effectively creating a supervised training set in which the desired output for a given input is known. OpenAI then trained a reward model on human rankings of the model's outputs from best to worst.
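The reward model in the InstructGPT paper learns from pairwise comparisons: for any two responses a labeler ranked, the model should score the preferred one higher. Here is a sketch of that ranking loss, with invented stand-in scores; in reality the scores come from a neural net reading the prompt and response:

```python
import math

# Pairwise ranking loss: -log(sigmoid(difference)).
# Small when the preferred response scores well above the rejected one.
def ranking_loss(score_preferred, score_rejected):
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

print(ranking_loss(2.0, -1.0))  # low loss: model agrees with the human
print(ranking_loss(-1.0, 2.0))  # high loss: model disagrees
```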
The final step was reinforcement learning. The model received random prompts and returned responses according to a policy. A "policy" is a mapping from the current environment observation to a probability distribution over the actions to be taken; a much simpler definition is that a policy is a strategy used to achieve a goal. The reward model scored each response, and those rewards were used to update the policy (via an algorithm called Proximal Policy Optimization).
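Here is a toy policy in exactly the sense defined above: a function from an observation to a probability distribution over actions. The observations and actions here are made up; for a language model, the observation would be the text so far and the actions would be candidate next tokens:

```python
import random

# A "policy": observation in, probability distribution over actions out.
def policy(observation):
    if observation == "user_seems_confused":
        return {"explain_more": 0.70, "give_example": 0.25, "end_reply": 0.05}
    return {"explain_more": 0.20, "give_example": 0.30, "end_reply": 0.50}

# Acting under the policy means sampling from that distribution.
dist = policy("user_seems_confused")
action = random.choices(list(dist), weights=list(dist.values()))[0]
print(action)  # usually "explain_more"
```

Reinforcement learning then nudges these probabilities so that high-reward actions become more likely.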
Taking A Step Back
CGPGrey made this YouTube video before ChatGPT came out, then edited the title to have ChatGPT in it.
A bit like the Wolfram blog post, CGPGrey used the example of image recognition. Their contribution, besides the aesthetically pleasing graphics, was this analogy: you have a bunch of dumb bots, and you train them to distinguish between pictures of 3s and pictures of bees. You do this by giving them the answer keys. Some of them are really bad at their jobs, seemingly guessing at random; you evaluate which ones are bad and destroy them, then continue to work with the ones that are more successful. Eventually one of your bots is more than just lucky, so you clone that bot and keep deriving new bots from it.
And now, here's the catch: in this analogy you have trained a very capable bot that can distinguish between 3s and bees. What you have built works well, but you cannot really explain how it works. You know your training data. You know your training methodology. But no explicit code exists that tells you exactly what criteria were used to decide between 3s and bees.
Closing Thoughts
Wolfram’s blog post was somewhat optimistic — or pessimistic, depending on your viewpoint. He was impressed by ChatGPT, but considers it an exploration of how essay-writing is “shallower” than we had anticipated. Put another way — and these are my words, not his — he thinks ChatGPT demonstrates how a dumb AI can write passable essays, not how ChatGPT is a truly smart AI that has unlocked the essence of human reasoning.
GPT-4, on the other hand…