Whatever your native language, you've probably noticed that city people speak it differently than do country folk. But so what? It's also true that Chicagoans speak a bit differently than do Baltimoreans, and the French of Marseilles is not that of Paris. When it comes to differences in accent, grammar and vocabulary, you might expect that region, culture, social class and gender would count for more than the size of your town. So the people of, say, Caracas, should sound more like their fellow Venezuelans than like people in Miami. But according to this paper, you would be wrong. "The Spanish language," its authors write, "is split into two superdialects"—a city dialect in which Caracas and Miami have a lot in common, versus a dialect of rural regions and small towns.
As novel as the finding is the method that Bruno Gonçalves and David Sánchez used to distinguish the dialects: They analyzed every tweet made in Spanish over two years for which geolocation data was also available (they don't say which years). Breaking down these 50 million tweets according to different words used for "computer," "car," and other key concepts revealed the boundaries of the two dialects.
The researchers used Spanish because it is widely spoken and widely spread across several continents. Spanish also has plenty of Twitter users (unlike Chinese) to supply evidence. And written Spanish is logical: the letters you see represent the sounds you'd hear. In English, on the other hand (as noted here), the same letter combo can represent five different sounds ("Though I cough through the day, this rough bough comforts me"). Likewise, whole words that are spelled identically can be pronounced differently ("Archer, I bow to your bow, and I will lead you to the mines of lead"). That sort of thing, which has incensed sensible people for centuries, messes up textual analysis.
The researchers divided up the Spanish-tweeting world into cells of approximately 25 square kilometers each, and recorded for each cell the most commonly used word for each of 131 key concepts. That gave them a map distinguishing, for example, places where the word for "computer" is "computadora" from those where the word is "computador" or "ordenador." They then applied their algorithms to identify cells that are closely related to each other. In this way, they discovered "a profound correlation" between one widespread dialect and areas of high population density. In other words, one of their superdialects is spoken mostly in cities, even cities as widely scattered about the globe as Buenos Aires, San Diego and San Juan. The other is spoken outside major urban centers. "This suggests a natural lexical bipartition of Spanish into two superdialects," they write. "Superdialect α is utilized by speakers in main American and Spanish cities and corresponds to an international variety with a strongly urban component while superdialect β is comprised mostly of rural areas and small towns."
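For readers who like to see the mechanics, here is a minimal Python sketch of that bin-and-compare idea. It is not the authors' code, and this post does not describe their actual algorithms. The function names, the one-degree cell size, the toy tweets and the simple "agreement" score are all my own assumptions, chosen only to illustrate the general approach.

```python
from collections import Counter, defaultdict

# Assumption: a 1-degree grid, picked only to keep the toy example readable.
CELL_DEG = 1.0


def cell_of(lat, lon, size=CELL_DEG):
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (int(lat // size), int(lon // size))


def dominant_words(tweets, size=CELL_DEG):
    """Return {cell: {concept: most common word}}.

    `tweets` is an iterable of (lat, lon, concept, word) records,
    a hypothetical pre-processed format, not Twitter's own.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    for lat, lon, concept, word in tweets:
        counts[cell_of(lat, lon, size)][concept][word] += 1
    return {
        cell: {concept: c.most_common(1)[0][0] for concept, c in by_concept.items()}
        for cell, by_concept in counts.items()
    }


def agreement(a, b):
    """Fraction of shared concepts on which two cells pick the same word.

    A stand-in similarity measure, not the one used in the paper.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return sum(a[c] == b[c] for c in shared) / len(shared)


# Invented toy data: three cells, two concepts. Not real tweets.
toy_tweets = [
    (40.40, -3.70, "computer", "ordenador"),
    (40.50, -3.60, "computer", "ordenador"),
    (40.45, -3.65, "computer", "computadora"),
    (40.40, -3.70, "car", "coche"),
    (19.40, -99.10, "computer", "computadora"),
    (19.45, -99.15, "car", "carro"),
    (25.80, -80.20, "computer", "computadora"),
    (25.70, -80.30, "car", "carro"),
]

cells = dominant_words(toy_tweets)
```

With a table like `cells` in hand, any off-the-shelf clustering method could group cells whose word choices agree. Which method Gonçalves and Sánchez used is a detail to look up in the paper itself.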
Why cities? Because people who move to cities want to communicate with one another (and, I am guessing, want to sound as if they didn't just step off the boat from Nowheresville). For the sake of efficiency and identity, then, city-dwellers are inclined to drop the more idiosyncratic parts of their speech. They come to talk like their fellow city-dwellers, not Mom and Pop back home. "This leveling process," write Gonçalves and Sánchez, is present throughout the Spanish-speaking cities, where it "is reinforced by the rapid increase of worldwide social ties and the powerful influence of mass media precisely located in important metropolitan areas (Madrid, Mexico City, Miami)."
That Twitter can be used to find heretofore unrecognized dialects surprised me (who knew 140-character utterances could be so revealing?), but Gonçalves and Sánchez believe it's likely to be a rich Big Data source of insights into language. In fact, they think, the abundance of tweets worldwide, combined with GPS data, could soon permit linguists to track language differences in real time, as they arise and evolve among different regions.
I was tempted to call their paper a "Big Data" approach to language analysis. But the term is almost a misnomer. They made a new finding not because their data was abundant but because it was different. Instead of having to go out and interview (often male, often rural) people to ask about their language use, the researchers had an immense river of language use ready and waiting for them. This is the new kind of data all of us are generating every day, in tweets, Facebook likes, YouTube clicks and so on. Where once we had to be asked about a topic, and think about our answers, we now reveal ourselves without thinking. This may not be great for our notions of personal autonomy, but it is going to be a great source of insight into human behavior for a long time to come.
Illustration: Geographical distribution of the dominant word for the concepts Computer (left) and Car (right), from the paper.
Follow me on Twitter: @davidberreby