In a simple, residential neighborhood of San Francisco, a former church for Christian Scientists has been transformed into the library of the future. Behold the Internet Archive, the world’s only repository of its 279 billion web pages—and many other delights.
By TIM REDMOND
Photography by JESSICA BRANDI LIFLAND
“How big is the Internet? How much does it weigh?”
Brewster Kahle is talking quickly, moving his hands around and typing on a keyboard while I watch a big screen in his office. He is giddy, smiling like a boy who knows he has one of the coolest toys on the planet. He pulls up a photo from 2009—it’s a basic shipping container, and you would have no idea from the outside that it housed, for a moment, perhaps the most important experiment in the preservation of human knowledge since Egypt’s Royal Library of Alexandria in the Third Century, B.C.
“See?” he says. “It’s eight feet by eight feet by 20 feet, and weighs 26,000 pounds. That was the whole Internet,” he says. “It took up three petabytes.” Before googling that figure, imagine the amount of information in just one petabyte: The smallest measure of digital data is a bit, which is either a one or a zero; eight bits make a byte (the amount of data needed, roughly, to make a single typed character). A thousand bytes make a kilobyte (a few paragraphs of this story); a thousand kilobytes make a megabyte, which amounts to about 850 pages of text, or four good-sized books. A thousand megabytes make a gigabyte, and a thousand gigabytes make a terabyte, which is about 4.5 million books. Multiply that by another thousand and you get one petabyte. That’s almost 5 billion books, or some 600 million web pages.
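For readers who want to check the arithmetic, here is a minimal sketch of that unit ladder in Python. It assumes decimal units, where each step is a factor of 1,000, and includes the gigabyte, the step that sits between the megabyte and the terabyte. The names `UNITS`, `bytes_in`, and `convert` are made up for this illustration and are not anything the Archive uses.

```python
# Decimal storage units: each step up the ladder is 1,000 times the last.
# (Assumption: powers of 1,000, not the binary 1,024 some tools report.)
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte"]


def bytes_in(unit: str) -> int:
    """Return how many bytes make up one of the named unit."""
    return 1000 ** UNITS.index(unit)


def convert(amount: float, from_unit: str, to_unit: str) -> float:
    """Convert an amount from one unit to another."""
    return amount * bytes_in(from_unit) / bytes_in(to_unit)


if __name__ == "__main__":
    # One petabyte, written out in bytes.
    print(f"{bytes_in('petabyte'):,} bytes in a petabyte")
    # The three-petabyte shipping container, counted in terabytes.
    print(convert(3, "petabyte", "terabyte"), "terabytes in the 2009 container")
```

By this count, the three petabytes in Kahle's shipping container come to 3,000 terabytes.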
The Internet, of course, has gotten way bigger since then. It’s now about 60 petabytes, holding 279 billion web pages, which the Archive stores along with 11 million texts; three million movies and videos; and four million audio items, among them 160,000 live concerts.
All of that information is now stored—every bit, every byte, every web site, every scrap of public data in this massive human experiment in accumulation and communication—in a rack of servers that line the back of an old Christian Science church in a leafy, residential corner of San Francisco. It’s all here, Kahle says (with backup copies in Richmond, California; Amsterdam, and yes, even Alexandria, Egypt) for one reason: The Internet is not built to last.
Strangely, the average web site exists for no more than 100 days before it changes or shuts down completely. This means that vast amounts of culture, science, and history can vanish before many people have even seen them. That sounds odd, or irrelevant, in these days of “the cloud.” But the Internet is not just a bunch of digits floating around in the air; it is data that is recorded on physical servers—data that can get lost, damaged, or just disappear.
I understand this from personal experience. The best essay I ever wrote on technology and business churn has vanished. I submitted it in 1996, just as the Archive was getting started, to an online magazine called Rewired. It crystallized a lot of my thinking about the need for stability in civilization. Rewired posted it, lots of people read it—but then the site shut down, and all of Rewired’s data was lost. I wrote the story on an old PC, saved it to a hard drive that soon crashed, and backed it up as I was supposed to—on a 5-1/4” floppy disc, which can no longer be read by any machine now on the market. As it happens, I made a first-draft printout that I have somehow saved. At this point, it seems, paper is the only record that lasts.
Kahle is looking to change that. He is building the library of the future, a place where the electronic data bits that increasingly define our cultural information are housed along with digital versions of the past. “We are,” Kahle says, “about preserving universal access to all human knowledge.”
In 1980, when Steve Jobs was still trying to figure out how to sell personal computers and the first mass-market modem was still a year away, Kahle was sitting around with his friends at MIT and talking about the value of technology. He was already a pioneer, a computer prodigy, looking at a world that few others could imagine. And a friend asked him the question that’s directed his life ever since:
What will all this digital technology do to make the world a better place?
“I thought about it, and I had two answers,” says Kahle, who at 56 still has the sparkle in his eye of an obsessive genius. “The first was to protect privacy. The second was to put the vision of the old Library of Alexandria, or today’s Library of Congress, all the world’s knowledge, on your desktop.”
It turned out that privacy was rather difficult—“it was expensive, and the only ones who could afford it were the mob, big corporations, and the government, and they didn’t need my help.” So he moved on to preserving knowledge. “I was pretty sure someone else would do it,” he says. By the mid-1980s, nobody was stepping up.
After inventing a scanning system for Dow Jones in 1989, the dawn of the Internet era, Kahle invented Wide Area Information Server (WAIS). The system, which long served as an industry standard, was the first online publishing and search system, eventually purchased by America Online in 1995. “They had something we didn’t have,” Kahle says with a laugh. “They had a business plan.”
With some cash in his pocket (he and his co-founder wound up with $15 million), he could now move on to his dream. In 1996, he developed an early search tool called Alexa Internet and launched the Internet Archive.
I remember reading newspaper stories about the idea, and a lot of them went like this: A guy named Kahle wants to download the entire Internet. How is that possible? It’s like, a terabyte of data! Yes, it was a terabyte—the size of today’s typical desktop back-up drive. “I used to carry it around in a box of tapes,” Kahle remembers. These days, lights blink off and on in six stacks of servers that line the back of the church, each stack measuring two feet by 6.5 feet. The collection inside these boxes is huge, and growing at 12 terabytes a day. “It’s good that they keep making bigger hard drives,” Kahle says.
A tour of the Internet Archive starts on the main floor, just inside the enormous Doric columns that frame the entrance to the building. What you see first isn’t high-tech at all—in fact, the piece of technology that Kahle is excitedly demonstrating runs with no electricity at all.
“How many of you have ever played a 78-rpm record?” he asks. The tour group this week includes a lot of college students; most of them have no idea what he’s talking about. But Kahle is really into 78s these days. They have a lot of data on them—rare music from a long-ago era—and he wants to save it all.
“I listened to a 78 that had all of the bugle calls with different meanings,” he says. “One of them means ‘abandon ship.’ It’s scary to think that happens often enough that they need a special bugle call for it.”
Kahle drops a thick vinyl disc onto one of the world’s first record players—a Victrola, which comprises a turntable below a giant speaker. He asks a visitor to turn the crank on the side, then puts down a metal arm with a large needle. Out comes scratchy dance music. “Amazing how loud it comes out with no electric power,” he says. “There were about 3 million of these 78s produced,” he says. “We’re going to be able to save about 400,000.”
A lot of what the Archive is doing these days goes beyond saving web sites. The first time I came by the Archive, my tour guide noticed my Grateful Dead t-shirt and asked me to name my favorite show ever. Hartford, Connecticut, May 1980, I said—spring of my senior year in college. After a quick search I was listening to the same 11-minute version of Uncle John’s Band that I barely remembered after 37 years. The same goes for every Grateful Dead show ever performed. The Dead, unlike a lot of musicians of their time, never tried to prevent their fans from recording shows. The collectors of these tapes—some nearly concert quality, some sounding like they were captured by an ancient $20 tape recorder stuck inside a backpack—have donated every one of them to the Archive.
Old educational films, dating as far back as the 1960s, are part of the archive too. So is TV. The facility houses more than 1.3 million television news shows produced since 2009. (The shows are all searchable online via excerpted clips; since this material is copyrighted, however, you have to watch these the way you did in the old days—on DVDs in the library.) There are also 166,000 versions of old games: Through an emulator, you can play almost any of the old coin-operated video arcade games from the 1970s through the 1990s.
To preserve all this information properly, Kahle has doffed his hat to the old “Rocky and Bullwinkle” cartoon shows of the 1950s, which came up with an invention that was deliciously reprised by Christopher Lloyd in “Back to the Future.” Kahle has created a real, live Wayback Machine—an online tool that captures every website made since 1996 in the form in which it was created. In other words, an Internet time capsule.
The Archive is also redefining the concept of a library. For thousands of years, libraries have been fundamentally physical—stout buildings, typically in big cities and at universities, where reams of paper were stored and read. Early libraries were limited to scholars, but by the 1700s, the concept of the public library, open to all, started to emerge.
In 1800, the U.S. Library of Congress was established, and over the next century public libraries were built in cities and towns all over the country. Today the U.S. boasts more than 119,000 public libraries. Most still operate on the same model as they always did: They acquire and store physical things. This makes them limited by the size and space of the physical world. A few years ago, when I became the owner of the entire paper archive of the San Francisco Bay Guardian, the S.F. Public Library declined to take copies because they would fill too much space.
The Internet Archive is unlimited (and, incidentally, will soon host scanned copies of all 48 years of Bay Guardian back issues). It is also, potentially, indestructible: British troops burned the Library of Congress in 1814, during the War of 1812, but with enough backups in enough places, the data in the Internet Archive can survive almost anything.
And anyone, anywhere, can access it, without so much as walking down the street (much less traveling to a distant city). The result is that information that could be lost forever (think: Environmental Protection Agency scientific records that the Trump Administration would love to see disappear) can be put in a place where the whole universe can view it, and there’s nothing that Mr. Trump can do to stop it.
The basement of the Internet Archive looks like a typical tech office: couches, monitors, little workspace nooks. Two flights up, it’s a different world. Here, the Archive team decided to preserve what was once the sanctuary in a church that, ironically, was founded on the belief that the material world is an illusion and that sickness is just a mental, not a physical, condition. The church’s cavernous main room is still filled with enough pews to hold at least 1,000 worshippers. The servers in the back hold most of today’s Alexandria Library. It’s amazing how little space this takes up, even these days. The entire Internet, it appears, now fits into one corner of a church.
The rest is open space—room for lectures, discussions, or listening to the Archive’s massive collection of music through an amazing sound system. “Let’s dance it in,” Kahle says as we walk around looking at the artwork. Bruce Springsteen’s “Blinded by the Light” surrounds us as he shows us the artistic side of the upper room.
In 2009, when Kahle was traveling through China, he saw the famous exhibit of the Terra Cotta Army—the collection of more than 10,000 small statuettes depicting soldiers, chariots, horses, acrobats, and other members of the Chinese Army, circa 210-209 BCE. Kahle came back transfixed. “I told my wife I had a fantastic idea,” he says. “We should do terra cotta statues of everyone who’s ever worked here.”
It’s the librarian speaking again: Tech companies are full of workers who come and go in a culture where company loyalty is as ephemeral as the web itself. But the employees of the archive are here for the operation’s public-interest work; there are no stock options and instant riches in preserving the intellectual labors of humanity. “They deserve some kind of permanent recognition,” Kahle says.
Today (thanks to the work of artist Nuala Creed), the sanctuary’s pews are lined by 120 figurines, each about three feet tall, of anyone who has worked at the Archive for more than two years. It is, Kahle says, “either really cool or really creepy.”
At one point, Kahle asks himself the question that’s at the heart of his venture. “What really happened to the Library of Alexandria?” he says. “Well, the idea of access to information started to be seen as elitist, and it was driven out of existence.” Some modern version of a similar mistake is obviously Kahle’s biggest fear. “You have bit rot,” he says—if you don’t move things around every three or four years, the data tends to decay.
But that’s just technology; the real challenge is greater. It’s not about better hard drives and whatever new data storage systems the tech world develops in the future. “It’s about access,” Kahle says. “If you keep it circulating, people will keep it alive.”
Visit The Internet Archive’s website. If you’re looking for old websites, start with the Archive’s Wayback Machine, which typically delivers multiple capture dates of any website, complete with a graph showing peak activity days. Just be forewarned: the Archive’s navigation systems and user-friendliness are still somewhat under construction. So if some persistence and creative searching doesn’t deliver what you’re looking for, the Archive welcomes inquiries at: firstname.lastname@example.org