“That’s when we first noticed it, with Woody.”
“[Larry Cutler] was in that directory and happened to be talking about installing a fix to Woody or Woody’s hat. He looked at the directory and it had like 40 files, and he looked again and it had four files.”
“Then we saw sequences start to vanish as well and we were like, “Oh my god”
“I grabbed the phone…unplug the machine!””
That’s Oren Jacob, former Chief Technical Officer of Pixar—then an associate technical director for Toy Story 2—recounting the moment they discovered that the movie was being deleted off of the company’s servers after an erroneous command was executed, erasing two months and hundreds of man-hours worth of work.
You might have heard something about this lately, as a clip from the special features of the movie has been making the rounds after being posted on Tested. It’s narrated by Jacob himself, and the movie’s Supervising Technical Director Galyn Susman.
The story struck me as interesting, so I reached out to Jacob, who is now the CEO of ToyTalk, a digital entertainment startup that is in the process of readying its first project for launch. I wanted to get the story right from the horse’s mouth, to see if the situation was really as dramatic as it had sounded, how the staff coped and whether or not they ever discovered exactly who deleted the files in the first place. As you can hear in the video, Jacob has a hyperkinetic conversational patois and, despite what he says, a great memory for the details of the situation.
A huge chunk of Toy Story 2 was indeed deleted and was only recovered by a stroke of luck and the intense efforts of the Pixar staff.
But what most people don’t know is that the whole movie was actually tossed out again, not by the computers, but by the filmmakers themselves. It was then completely remade with mere months to go before a release date that was set in stone, cementing Pixar’s legacy as a crucible of commitment to quality.
The story that Jacob shared with me ended up containing some interesting lessons for people working with large amounts of technical data, but more than that, it has a lot to say about just how much of what makes Pixar’s movies so great has to do with the people who work there and their insane dedication to making things great.
/bin/rm -r -f *
The story likely takes place in 1998, though Jacob admits he’s foggy on the exact date. The Toy Story 2 crew, about 150 people in the animation, lighting and modeling departments of Pixar, had been hard at work for some time on the movie. Simultaneously, another 200-250 people were at work finishing up A Bug’s Life, which would be released that fall.
One day, Jacob was in the office of Larry Cutler, along with Larry Aupperle, who was also an associate Technical Director working under Susman. In a crazy stroke of luck, they happened to be looking at a directory in which the assets for the character Woody were stored when they noticed, on a refresh, that there were suddenly fewer and fewer files.
“He had an error, I forget the exact [one]. It was like, “Directory no longer valid,” because he’s in a place that had just been deleted. Then he thought to walk up [a directory] and he walked back up and then we saw Hamm, Potato Head and Rex. Then we looked at it again and there was just Hamm and then nothing.”
The command that had been run was most likely ‘rm -r -f *’, which—roughly speaking—commands the system to begin removing every file below the current directory. This is commonly used to clear out a subset of unwanted files. Unfortunately, someone on the system had run the command at the root level of the Toy Story 2 project and the system was recursively tracking down through the file structure and deleting its way out like a worm eating its way out from the core of an apple.
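To make the behavior of that command concrete, here is a minimal Python sketch that mimics what `rm -r -f *` does, run against a throwaway temporary directory rather than anything real. The function names and the tiny project layout (`characters/woody` and so on) are made up for illustration; nothing here reflects Pixar's actual file structure or tooling.

```python
import shutil
import tempfile
from pathlib import Path


def remove_everything_below(directory: Path) -> list[str]:
    """Mimic `rm -r -f *` run inside `directory`; return the names removed."""
    removed = []
    for entry in sorted(directory.iterdir()):
        if entry.name.startswith("."):
            continue  # the shell's `*` glob skips hidden entries by default
        if entry.is_dir() and not entry.is_symlink():
            # -r: descend into the directory, -f: never stop to ask
            shutil.rmtree(entry, ignore_errors=True)
        else:
            entry.unlink(missing_ok=True)
        removed.append(entry.name)
    return removed


def build_demo_project(root: Path) -> None:
    """Create a tiny, hypothetical project tree to delete."""
    for character in ("woody", "hamm", "rex"):
        (root / "characters" / character).mkdir(parents=True)
        (root / "characters" / character / "model.dat").write_text("geometry")
    (root / "sequences").mkdir()
    (root / "sequences" / "shot_001.dat").write_text("animation")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        project = Path(scratch)
        build_demo_project(project)
        print("removed:", remove_everything_below(project))
        print("left behind:", [p.name for p in project.iterdir()])
```

The point of the sketch is that nothing in the command knows or cares how much lives underneath the directory it is run from; the only thing that limits the damage is where you happen to be standing.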
That’s when the panicked call was made to the machine room where the main server was located, and the instruction was given to just yank the power and network connection of the server. This is simply not done in environments with hundreds of clients connected to the machine; it’s as if someone asked you to throw your main breaker to shut off your blender.
“The master machine goes down,” says Jacob. “Some people are animating a shot and they can work for like a minute or five minutes, but eventually you’ll have to pull data from the master machine for some reason or another, which your machine will freeze.”
“Eventually every animator, every TD, everyone working on the show goes, “Oh, all machines down. Let’s go to lunch. Ha, ha.””
The machine was eventually brought up a few hours later and they took stock of the damage. When a size command was run on the Toy Story 2 directory, it was only 10% of the size it should have been.
90% of the movie had been deleted by the stray command.
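The kind of size check described above can be sketched in a few lines of Python. This is an illustrative stand-in, not the command the team ran: it adds up apparent file sizes the way a `du`-style tool roughly would, and compares the total with a figure recorded before the deletion. The function names and the demo file sizes are assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path


def tree_size_bytes(root: Path) -> int:
    """Sum the apparent size of every regular file under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                total += path.stat().st_size
    return total


def surviving_fraction(current_bytes: int, expected_bytes: int) -> float:
    """Return how much of the expected data is still present (0.0 to 1.0)."""
    if expected_bytes <= 0:
        raise ValueError("expected_bytes must be positive")
    return current_bytes / expected_bytes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        project = Path(scratch)
        # Ten hypothetical asset files of 1,000 bytes each.
        for i in range(10):
            (project / f"asset_{i}.dat").write_bytes(b"x" * 1000)
        expected = tree_size_bytes(project)

        # Simulate the stray command taking out nine of the ten files.
        for i in range(9):
            (project / f"asset_{i}.dat").unlink()

        fraction = surviving_fraction(tree_size_bytes(project), expected)
        print(f"{fraction:.0%} of the project is left")
```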
“The show’s been trashed”
When this story originally started making the rounds, one of the other big questions was ‘how did this happen?’
I asked Jacob about the ‘how’ and he told me that it was actually largely a function of how a company like Pixar works on projects.
“You have 400 people on the network and they all have to have like pretty massive access across the board to the whole project, so it’s hard to like, limit the damage,” Jacob said. “It could happen from almost any terminal.”
“Pixar being a wide open Unix environment meant that it was very promiscuous. You could [change directory] ‘slash’, net ‘slash’ or walk across the network and log into Ed Catmull’s machine or Steve Jobs’s machine if you wanted to. Not that Steve ever did do any work on the film directly, but you could do that.”
The common way to prevent an accidental command like this being run on an entire project is to lock users down with permissions to only the files they need. But, because of the way a project like a Pixar film works, almost everyone working on the show needed permissions to read and write to the master machine. This was their job.
Assigning micro-managed permissions would have eaten up administrative resources, especially in crunch time.
So at this point, most of the film had been deleted or otherwise compromised. But that wasn’t a big deal. Things had been deleted before, it’s just something that would happen from time to time. During the production of A Bug’s Life, most of the ants got deleted and had to be restored, which wasn’t a problem because, of course, Pixar backs up its data.
In 1998, the most common way to back up a bunch of data was on tape, which is the system that Pixar was using. Unfortunately, these backups were not continuously tested, as the company does today and as is universally recommended.
Typically, to make sure your backups are good, you have to use them. Every few days or weeks, you swap your backups with your currently running setup and keep going, in order to make sure that your data is all there. This is a practice called ‘live backups’.
Pixar, at the time, did not continuously test its backups. This is where the trouble started: the backups were written to tape, and once the backup file hit 4 gigabytes it had reached its maximum size. The error log, which would have told the system administrators about the full volume, was itself located on that full volume, and was zero bytes in size.
This meant that new data was being written to the drive, but it was ‘pushing’ the older files off. No one at Pixar knew this yet.
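One way to catch a silently incomplete backup is to restore it somewhere and compare the result with the live data, file by file. The Python sketch below does that with SHA-256 checksums. It is a generic illustration of the idea of testing backups, not a description of Pixar's tape system then or now, and the names `verify_restore` and `file_digest` are invented for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_restore(source: Path, restored: Path) -> dict[str, list[str]]:
    """Report files that are missing from, or differ in, the restored tree."""
    report: dict[str, list[str]] = {"missing": [], "mismatched": []}
    for original in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = original.relative_to(source)
        copy = restored / relative
        if not copy.is_file():
            report["missing"].append(relative.as_posix())
        elif file_digest(original) != file_digest(copy):
            report["mismatched"].append(relative.as_posix())
    return report


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as live, tempfile.TemporaryDirectory() as restore:
        live_root, restore_root = Path(live), Path(restore)
        # Hypothetical live data: three small files.
        for name in ("woody.mdl", "hamm.mdl", "rex.mdl"):
            (live_root / name).write_text(f"model data for {name}")
        # A "restore" that lost one file and truncated another.
        (restore_root / "woody.mdl").write_text("model data for woody.mdl")
        (restore_root / "hamm.mdl").write_text("model data")
        print(verify_restore(live_root, restore_root))
```

A check like this only helps if it runs regularly and reports somewhere other than the volume being backed up, which is exactly where the zero-byte error log went wrong.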
It’s worth mentioning at this point, because some of you may be wondering, that the whole movie encompassed no more than around 10 gigabytes of information. That may seem crazy considering the size of textures for many newer flicks, but you have to remember that the backup tape had a file-size limit of 4GB and it didn’t become a problem for many months on the project. The entirety of the data for the movie could have fit on a couple of dual-layer DVDs.
So, they grabbed the backups, went to work and restored the show. Within a couple of days they had what they thought was a completely restored version of the files for TS:2.
To test it, they submitted around 2,000 frames to render, one for each ‘shot’ in the movie (the bits between ‘cuts’). This would effectively pull on every resource involved in the film because those stills would need all of the models, lighting and textures in order to render properly.
Everything looked fine. “We lost a week of work,” Jacob says. “So those last 10 shots are the last week, but other than that…O.K.”
Fast-forward to the end of that week. The crew has been back at work using the newly restored files for many days now. But, over the course of that week, there had been a few oddities. Weird ‘attach’ errors kept cropping up.
An attach is when a character, like Woody for instance, takes off his hat. The hat transfers from being a part of his head to being a part of his hand. This is a tricky procedure and very ‘fragile’.
“We started doing comparisons of the shots and realized that the show was incomplete. How it had worked for that week and how such renders came out, I can’t explain.”
By the end of that week enough things had broken here and there that the team realized there was a problem. In addition to the attaches, some people working on a version of their shot noticed that the current version was far lower than where they had left off. They were working on number 420 and now it said it was version 20. Something was up.
This is when the tape backup issue was discovered, after a full week’s work.
“That work is definitely wasted, because it’s on top of an unreliable restoral,” recalls Jacob. “Now sadly, what’s happened is that there is zero confidence in any solution, because the restoral is bad, the work on it is bad, the deletion was horrible, and the backup tapes are busted.”
“All possible directions to move are broken and, maybe worse. We don’t quite understand how they’re broken. If only 10 percent of the show is not on the tape, which 10 percent? I don’t know.”
“That was the big meeting, in the conference room back in Bugville (Pixar’s corporate complex). All the big brains in the studio are like, “Uh, I don’t know. Oh my God!””
That’s when Susman said, “I have a machine back at my house.”
The $100,000,000 Volvo
Susman, the Supervising Technical Director on Toy Story 2, had given birth to her son Eli shortly before, and had been working from home. This meant that she had a Silicon Graphics workstation at her house. It was either an Indigo 2 or an Octane, and it was loaded up with a full copy of the movie.
In order for her to work on the movie while out, they had hooked the machine up to the local network and copied the whole file tree over. Then she would receive incremental updates over her ISDN internet connection. For those not in the know, that was like two 56kbps modems duct taped together (welcome to 1998).
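A rough calculation shows why the full tree was copied over the local network first and only the increments went over ISDN. The sketch assumes the roughly 10 gigabyte project size quoted earlier, counted in decimal gigabytes, and a link rate of 128 kbit/s (two bonded 64 kbit/s ISDN channels, which is an assumption about her line and not something Jacob specified). Protocol overhead is ignored, so real transfers would be slower.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, bits_per_second: float) -> float:
    """Return how many days it takes to move `size_bytes` at a given link rate."""
    if bits_per_second <= 0:
        raise ValueError("bits_per_second must be positive")
    seconds = size_bytes * 8 / bits_per_second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    project_bytes = 10e9      # assumed: about 10 GB, per the figure in the article
    isdn_rate = 128_000       # assumed: two 64 kbit/s channels bonded together
    days = transfer_days(project_bytes, isdn_rate)
    print(f"Full copy over ISDN: about {days:.1f} days of continuous transfer")
```

Under those assumptions the full copy works out to roughly a week of nonstop transfer, while a day's worth of changed files is small enough to trickle over the same line.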
The last update that her machine had gotten could have been as old as a couple of weeks, but at this point the Pixar team had an incomplete backup and a corrupt tree full of files, and they needed anything they could get their hands on to fix the problem. This was the difference between rebuilding every missing file from scratch and, well, shipping the movie on time.
So Jacob and Susman hopped into her Volvo and shot back over the bridge from Richmond to her house to retrieve the computer. They hauled it out to the car and carefully placed it in the back seat, wrapping it in blankets and strapping it in tightly with seatbelts.
“There was nothing else to do,” Jacob says of the session described earlier. “We were dead. We’d been in the meeting for like 45 minutes. There was 30 of us, all the biggest brains Pixar can bring to the problem.”
That’s when Susman remembered her machine at home.
“She and I just stood up and walked out, back to her Volvo, drove across the bridge, got the machine, got some blankets, I hugged it with seatbelts, across the back seat. Drove at like 35 with blinking lights on, hoping to get a police escort. No cops saw us, so it didn’t help us.”
At that point, the Volvo had become a $100M machine, as the entirety of the team’s efforts so far on the project were ensconced on its drives.
They made it back to Richmond in safety. “Eight people met us with a plywood sheet out in the parking lot and, like a sedan carrying the Pharaoh, walked it into the machine room.”
They sweated as the machine booted up, as that’s exactly when most drives crash. It booted. They didn’t pass go, they just plugged it into the network and copied the entire drive off immediately, then started picking apart what they had.
The backup was about two weeks old, but they were able to make it the ‘B’ tree to compare with the ‘A’ backup from two months ago and a third ‘C’ source that was cobbled together out of any local backups animators or modelers had made on their personal terminals, collected by ‘groveling’ for .old, .sav, .bak and any other old file they could find.
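The ‘groveling’ step amounts to walking every machine's disk and collecting files whose names end in a leftover-copy suffix. Here is a minimal Python sketch of that search. The three suffixes come from the account above; the function name `grovel`, the directory layout and the file names are hypothetical, and the team's real scripts are not documented here.

```python
import tempfile
from pathlib import Path

# Suffixes named in the account; a real sweep would likely include more.
BACKUP_SUFFIXES = (".old", ".sav", ".bak")


def grovel(root: Path, suffixes: tuple[str, ...] = BACKUP_SUFFIXES) -> list[Path]:
    """Return every file under `root` whose final suffix marks it as a stray copy."""
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        home = Path(scratch)
        (home / "shots").mkdir()
        # Hypothetical leftovers on one animator's workstation.
        (home / "shots" / "shot_042.anim").write_text("current")
        (home / "shots" / "shot_042.anim.bak").write_text("yesterday")
        (home / "woody_hat.mdl.OLD").write_text("last month")
        for found in grovel(home):
            print(found.relative_to(home))
```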
They managed to verify about 70,000 files, leaving 30,000 that had to be checked by hand. “We worked from Friday till Monday morning, nonstop through, in rotating shifts with food and sleeping bags, with about 10 or 12 of us,” recalls Jacob.
“When people came in on Friday, when they showed up. We’d hand them a printout of, “Here are the 500 things you’re checking in the next eight hours. Start running xdiff commands. Go for it.”
“Quickly, within a couple of hours, the scripting nerds had put together scripts that would ingest the list and spawn off XF windows, 20 deep. Close them all, 20 deep. Close them all, so you could look at them really fast.”
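The split between files that could be verified automatically and files that needed human eyes can be sketched as a comparison across the three trees. The rule used below (a file counts as verified only if it exists in every tree with identical contents) is an assumption made for illustration; the account does not say exactly how the team decided which files were safe, and the names `triage` and `digest` are invented.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a (small) file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def triage(trees: dict[str, Path]) -> tuple[list[str], list[str]]:
    """Split relative paths into (agreeing everywhere, needs manual review)."""
    all_paths: set[str] = set()
    for root in trees.values():
        all_paths.update(
            p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
        )

    agreeing, needs_review = [], []
    for relative in sorted(all_paths):
        copies = [root / relative for root in trees.values() if (root / relative).is_file()]
        same_everywhere = (
            len(copies) == len(trees) and len({digest(c) for c in copies}) == 1
        )
        (agreeing if same_everywhere else needs_review).append(relative)
    return agreeing, needs_review


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b, \
            tempfile.TemporaryDirectory() as c:
        trees = {"A": Path(a), "B": Path(b), "C": Path(c)}
        for root in trees.values():
            (root / "hamm.mdl").write_text("unchanged model")
        # Woody differs between sources and is missing from one of them.
        (trees["A"] / "woody.mdl").write_text("version 20")
        (trees["B"] / "woody.mdl").write_text("version 420")
        verified, by_hand = triage(trees)
        print("verified automatically:", verified)
        print("check by hand:", by_hand)
```

Anything in the second list is what would end up on a printout for someone to diff by eye.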
The remaining files all had to be looked at with human eyes, to see which ones were shorter or more current. They did it over the next few days. The feeling of sympathy and support is what Jacob recalls most vividly: not just that employees had to sacrifice a weekend with their families, staying up late and even sleeping on site, but the sense of ‘digging in’ to fix the problem.
“We dug so deep at that point in time. People who were on the “Toy Story” crew, people on “A Bug’s Life,” and the folks in the studio at large. The whole community, whether that was supporting people who were working late, or on the keyboards typing, or folks sending us food by messenger pigeon.”
“At some point, even some of the neighboring small little sandwich shops in Point Richmond said, “You need free food today? I know you guys aren’t sleeping right now.””
The insane amount of focus needed to pull off comparing those files shows just how deep the dozen or so people on the project had to dig. The experience transcended employment and moved into the realm of true dedication to the movie, to their digital friends and to each other.
“The parts of the weekend that I can remember specifically are the trays of cookies, the lemonade, the pizza, and flowers that were sent,” Jacob recalls. “Somebody hired a masseuse on a Sunday to walk around for a while. Some other person happened to work at an emergency shelter and brought in blankets.”
They then rebuilt and tested the project and it ended up working, sort of. To this day, Jacob can’t explain the fact that several thousand files were still missing from the tree by the time they were done.
“Where the files went, we don’t know. The fact that it still worked without them is totally unexplainable.”
But the project worked, the frames rendered, Toy Story 2 lived again.
One of the big questions that I wanted to pose was whether or not it was ever discovered who was responsible, and whether they were punished. Normally when something like this happens, there’s a cry for accountability. Item one on the agenda is typically ‘who do we blame?’ Not so at Pixar.
“There was no attempt to hide it,” says Jacob. “We sent email out like 10 minutes later to everyone in the building. “Help. Holy s**t!””
Aside from the fact that there was immediate chatter about who might have made such a dumb move, the discussion quickly moved right on to how to fix the problem.
“Let’s put the witch hunt away. We’ve got to get the show back first. Let’s not go spend a week of our time trying to kill somebody. Where’s the movie?”
“Obviously, five minutes in the meeting, you’re all sweating and red-faced. And somebody will say, “Let’s go kill somebody and lynch them. Now,” says Jacob, “I support lynching on our agenda. But, number one is, just get the movie back and work on Buzz and Woody again. We’ve lost our friends.”
With this many man-years, or even man-decades, worth of work on a project, the temptation to find someone to blame, to expend effort on hunting down the person responsible, is intense.
But that kind of negative thought process doesn’t help anyone and it just removes focus from what matters most: moving forward.
The systems administrators definitely “went through some deep soul-searching” about the backup plans and came to the big production meeting with a new backup plan in place, which was talked over very thoroughly. But there weren’t any summary firings or screaming matches.
Jacob can’t recall who on the executive staff was on site the day that the backups were being restored, but he says that whoever was there, whether Steve Jobs, founder Ed Catmull or the rest of the executive staff, was very supportive of the restoral efforts, rather than focused on slashing and burning staff over the error. They bought the team pizza that weekend and got them anything they wanted.
During the big meeting over the backup problem Catmull, who is known for leading an ‘incredibly calm and zen-like existence’, simply asked what the team was doing about the issue.
Jacob recalls the exchange:
“Ed, we’re doing everything we can right now. OK?”
“You guys keep on that problem?”
“OK, thank you, Ed.”
The thing about a disaster like this one is that the technical directors and staff at Pixar had to trust one another to fix the issue, even though there were several mistakes made and one of them was responsible. “If you can’t sit down and calmly engage that meeting, you can’t be in that meeting with them,” says Jacob. “Because the circumstances were so incredibly unusual. Black Swan events do occur.”
Instead of dwelling on pinning the blame or lamenting the loss of time and effort, the team made sure to alter the backup strategy so that something like that didn’t happen again, and it went about making up for lost time.
Toy Story 2 gets trashed again
After the deletion and restoration of Toy Story 2, the team was likely hoping for an uneventful path to release, but it was not to be.
Around Christmas of ’98, after A Bug’s Life had been released and its promotional tour was done, John Lasseter, Andrew Stanton, Pete Docter and legendary story man Joe Ranft all came to the production team to take a look at Toy Story 2.
It was not a good film. They dedicated the winter vacation to re-writing the project almost entirely from the ground up. Production shut down on December 15th and came back after New Year’s in January, when the story team re-pitched the movie.
Lasseter and Lee Unkrich ended up co-directing the film along with Ash Brannon as it was seen in the theaters.
Among the things that stayed? The main characters, of course. Buzz, Woody, Hamm, Potato Head, Rex. Andy’s room stayed. The Al’s Toy Barn sequence stayed. That’s it; nearly everything else you see in the film is new.
Jacob explains what was added, including the entire character and animation for Buster the dog:
Effectively all animation was tossed. Effectively, all layout was tossed. So all camera work would start from scratch. Lighting was in the film a little bit, but that was tossed as well. We had to build new characters.
So at that point, Buster showed up at that point. And that character went from being out to being in the screenplay to in the final screen in nine months.
That’s a fully animated quadruped…On the fly. And most of the humans in the film and show. All the background extras in the airport at the end.
They were all built and assembled then. And all the effects work was added to the film. The opening of the film, which is Buzz, Buzz playing with the robots, which I spent a lot of my time working on, where Buzz blows up a quarter-million robots with that crystal…that explosion. That was all added in that pitch as well. It started from ground zero in January.
So the story, effectively. And the film. And that was probably one of the biggest tests of what Pixar was as a company and a culture we ever went through.
The big deal about re-building the movie? It had a hard-set release date of November 22, 1999. That date was set in stone. A big-budget movie like Toy Story 2 has countless marketing tie-ins, promotional efforts and more that had to be timed perfectly with the release of the movie.
Moving the release date of the movie within a year is insanely difficult. Moving it within 6 months is impossible. This meant that the team had to re-make Toy Story 2 in 9 months. All because they wanted to make the best thing they could possibly make.
“At that point, you’ve still got to go, “How far are we going to take this?” Right?,” Jacob adds. “I mean, that’s balls to the wall.”
“January to September ’99 was an unbelievable Herculean effort to pull that show off from ground up again. That was one of the defining cornerstones to what Pixar as a corporate culture meant to itself inside. And what it could produce was that. Just because of that.”
Pulling a hundred-hour week now and then is tough enough on morale and physical well-being. But to do it for 9 months in a row, that was beyond the call of duty.
Pixar was also an independent and publicly traded company at that point in time. To blow a film like Toy Story 2, not release it on time, would have decimated the studio’s credibility, and detonated the Disney movie economy.
“Save Buzz and Woody. Save the franchise. Save the movie. Save the company. It was an all-in bet.”
Toy Story 2 was indeed finished and released on time. It grossed nearly $500M worldwide, was nominated for an Academy Award and cemented Pixar’s reputation as the studio that wouldn’t compromise.
In closing, Jacob told me that the most important thing that he took away from it was the sense of camaraderie from the crew at Pixar at the time.
“I’d never really experienced that before, at that level, because it was such a loss that you didn’t need to have a meeting to explain how bad it was. People just knew. And both in the company but also in families and friends around us, and in the people in Point Richmond too. Maybe we got it back as good as we did, as close as we did, because of that. That’s a very emotional thing to say. It’s not technical.”
The thing that I take away about these experiences is that the spontaneity of the communal support speaks to the culture of Pixar the rest of the time. That kind of thing just doesn’t happen all of a sudden. You can’t have a disaster and instantly develop this kind of community and camaraderie.
It has to seep out. It has to be in the soil. You don’t just plant it and watch it grow in a day. It has to be cultivated over time, as it obviously was at Pixar.
Jacob agrees, citing the focus needed to restore the film. “To be there and not blink for 60 hours straight, while staying sharp, is effectively impossible. But suddenly you find food. You find a blanket, you find someone’s throwing you into a shower and taking you back out again. You’re like, “How’d that happen?!””
“It just worked out that way, without thinking about it. The lasting memory of the experience is the friendships that were formed through that. That journey together through that was one of the community binding together.”
“I’ll never forget ever being a part of Toy Story two. I was very lucky,” says Jacob. “I had that chance to work on a level of impact that helped keep Buzz and Woody, and “Toy Story” and the franchise, and Pixar all be a thing we talk about today.”