The company hired a panel of professors to create writing prompts for essays on US history, research methods, creative writing, and law.
It then fed the prompts to GPT-3, and also gave them to a group of recent college graduates and undergraduate students.
The anonymized papers were then marked by the panel, to test whether AI can get better grades than human pupils.
Some of the results could unnerve professors — and excite unscrupulous students. But others showed GPT-3 still has a lot to learn.
Its human rivals earned similar marks for their history papers: a B and a C+. But only one of three students got a higher grade than the AI for the law assignment.
GPT-3 also received a solid C for its research methods paper on COVID-19 vaccine efficacy, while the students got a B and a D.
However, the AI’s creative writing abilities couldn’t match its technical skills. Its story received the model’s solitary fail, while the student writers’ grades ranged from A to D+.
Overall, GPT-3 showed an impressive grasp of grammar, syntax, and word frequency. But it failed to craft a strong narrative for the creative writing assignment.
Project manager Sam Larson told TNW that this could be due to how GPT-3 recalls information:
“The creative task asked for memories and stories using the five senses, which GPT-3 has no direct experience with, so it (probably) would have to iterate through a different type of information search, which the prompt was not designed to activate. AI is drawing from a repository of events, history, and law information (because it’s drawing its language prediction from readily available databases). So for law and history topics, this is useful and relevant to cite this data for those topics. But creative writing rests more on imagination and synthesis of thoughts into ideas, versus a regurgitation of pre-existing data.”
Still, what GPT-3 lacked in craft it made up for in speed. The model spent between three and 20 minutes generating content for each assignment, while the humans took three days on average.
Assessing the assessment
EduRef stressed that the experiment was only an exploratory study. GPT-3’s outputs were lightly edited for length and repetition, although its content, factual information, and grammar were left untouched.
In addition, the AI produced two papers for each of the history, research, and law assignments. Larson then picked which ones to use:
“Whichever one was more essay-like was the one I picked to expand on. One of the two usually ended up simply restating the prompt or outputting a longer version of the prompt, so picking the more essay-like ‘answer’ was pretty straightforward.”
Larson said the creative writing task required additional human interference:
“Eventually I gave in and encouraged it with a ‘Once upon a time…’ leader and it got the hint I wanted an actual story. After a couple of generations, I had two variations that looked like creative essays, but one got stuck in a loop describing the same scene on repeat so I picked the other!”
Larson, who is himself an academic, was nonetheless impressed by GPT-3’s performance. He hopes that this type of AI-generated content gives instructors and policy-makers pause for thought about how they quantify what makes a successful student.
But students may be more interested in AI’s ability to lend them a devious helping hand.