When you spend hours each day hunting for news reports and other public digital traces of protest events, you become acutely sensitive to the many ways in which the information you find may fail to tell the whole story, the accurate story, or even the story at all.
There are lots of reasons to care about these gaps in the record, but one that should concern scholars, data scientists, and journalists who try to learn things from protest event data is selection bias. If these bits of information were missing completely at random, we could consider our sample to be representative in spite of them and ignore the gaps when analyzing the data at scale. If, however, the gaps result from implicit or explicit filtering processes that allow certain types of information to seep through more often than others, then we have to worry about how these omissions could bias the inferences we draw.
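The difference between those two cases can be sketched with a quick simulation. All numbers here are invented for illustration (hypothetical event counts and reporting rates, not CCC estimates): when reports go missing completely at random, the observed share of disruptive events stays close to the truth, but when reporting probability depends on how disruptive an event is, the observed share gets inflated.

```python
import random

random.seed(42)

# Hypothetical universe: 10,000 protest events, 20% of them "disruptive".
events = [{"disruptive": random.random() < 0.20} for _ in range(10_000)]
true_share = sum(e["disruptive"] for e in events) / len(events)

# Scenario 1: missing completely at random -- every event has the same
# 50% chance of being reported, regardless of its features.
mcar_sample = [e for e in events if random.random() < 0.50]
mcar_share = sum(e["disruptive"] for e in mcar_sample) / len(mcar_sample)

# Scenario 2: a filtering process -- disruptive events are far more
# likely to be reported (90%) than polite marches (40%).
filtered_sample = [
    e for e in events
    if random.random() < (0.90 if e["disruptive"] else 0.40)
]
filtered_share = (
    sum(e["disruptive"] for e in filtered_sample) / len(filtered_sample)
)

print(f"true share disruptive:  {true_share:.3f}")
print(f"MCAR sample share:      {mcar_share:.3f}")      # near the truth
print(f"filtered sample share:  {filtered_share:.3f}")  # inflated
```

With these made-up rates, the filtered sample makes disruption look nearly twice as common as it really is (expected observed share ≈ 0.18 / (0.18 + 0.32) ≈ 0.36 versus a true 0.20), even though most events still make it into the data.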
So, what are some of those filtering processes that distort the picture we see of protest activity in the United States? Here is a non-exhaustive list of sources of selection bias that come up in the Crowd Counting Consortium’s work, and that we design our collection strategies to overcome or to mitigate as much as we can.
- If it bleeds, it leads. In local TV news, gruesome stories often get top billing. The broader principle here is that sensational events are more likely to draw audiences’ attention, ergo to draw journalists’ attention, ergo to get covered. Other things being equal, a group of people marching politely with signs is less interesting than a similar-sized group shouting at diners while they march, or blocking an intersection, or brandishing guns. That means we’re more likely to hear about the latter than the former, and that selection effect distorts our view—not just of the incidence of protest activity overall, but also of the prevalence of confrontational or disruptive behavior within it.
- Squirrel! Novelty draws attention. The other side of this coin is that familiar and routine things do not. In press coverage of protest activity and related conflict processes, this means that waves of activism often garner a lot of attention when they first emerge, but that attention tends to wane over time. So, other things being equal, events early in the wave are more likely to get reported (and thus encoded in datasets like ours) than later ones. With bursts of activism like the George Floyd uprising, this selection effect can make it harder to tell how much of the observed ebbing of activism represents a real decrease in the frequency of protest activity and how much is just the press (or their viewers and readers) getting bored and moving on to the next new thing.
- Copaganda. Some news outlets (hi, New York Post) adopt a pro-police editorial line in their reporting on protest activity, and I think most readers and viewers know it. What many readers and viewers may not know is that coverage of supposedly unruly protest activity in many other news outlets also tilts towards the local police departments’ understanding and description of it, in no small part because the police department is sometimes the main or even the only source of detailed information about protest events. Of course, everyone’s got an angle, including protest participants. What matters here, though, is that they aren’t usually the ones issuing press releases that news outlets read and sometimes regurgitate. (For an excellent longer and broader discussion of this type of bias in protest reporting, see this June 2020 essay by Kendra Pierre-Louis.)
- Paywalls. This is kind of boring to point out, but it’s not trivial: some news sources are paywalled (with good reason; journalism is not cheap); CCC operates on a shoestring, so we can’t afford subscriptions to every informative source; and we can’t encode what we can’t read. We try to minimize this problem with a blanketing strategy, searching as many national, local, and social-media sources as we can. Sometimes, though, an event only gets reported in one source, and that source is paywalled, so we either can’t see enough of it to encode it at all, or we can only capture some of the relevant information.
- The social media Red Queen’s Race. One way to mitigate the bias caused by the filtering processes described above is to scour social media and public aggregators for reports of events that news outlets didn’t cover, or for additional information and perspectives about events they did cover. For CCC, that means following open Twitter and Instagram accounts of scores of activist organizations, aggregators of information about local protest scenes, and many different independent journalists, whose reporting often provides some of the richest coverage of these events. This strategy helps quite a bit, but it also takes a lot of time to sustain, because the array of relevant sources and the platforms themselves are constantly evolving. The result is a Red Queen’s Race in which—on top of doing the actual reading and encoding—you can never stop working at identifying useful accounts and figuring out how to extract information from them. Partial automation of this process can greatly reduce the workload, but shifts in relevant social-media ecosystems will eventually cause data drift, implying that the pipeline itself should be maintained or even overhauled at frequent intervals. So, that work never really ends, either. We would love to experiment with developing open-source tools to track protest activity via these platforms, but the start-up costs are steep. So, for now, we do it the old-fashioned way and continue to identify and review by hand as many of these accounts as we can.