What Really Happened With the DNC’s “Datagate”?

The definitive explanation of the Democratic National Committee's "Datagate" scandal and what the mainstream media got wrong.

Last Wednesday morning around 10:40 AM, NGP VAN, the company whose software hosts the Democratic National Committee’s voter file, released a routine software update. The update introduced a bug that allowed members of Hillary Clinton and Bernie Sanders’s presidential campaigns, among others, to filter the voter records they share using “scores” they do not share (about which more shortly).

For the next hour or so, members of Sanders’s staff ran twenty-five searches using scores generated by the Clinton campaign; their intentions in doing so are now the subject of heated dispute. By noon, NGP VAN staff were aware of the issue and had taken steps to fix it.

By Friday, the DNC — which brokers access to VAN/VoteBuilder and mediates disputes between its users — went public with the story and ordered NGP VAN to deny the Sanders campaign access. Clinton campaign manager Robby Mook accused the Sanders campaign of deliberately stealing data to gain a competitive edge. The Sanders campaign fired its data director, Josh Uretsky, and Uretsky, taking full responsibility for the actions of his subordinates, insisted they had only intended to document the problem. Hoping for an injunction to regain access, the Sanders camp sued the DNC in federal court.

By Friday evening, the DNC had given the go-ahead for access to be restored, and by Saturday morning it had been. As a practical matter, the story ends here. But recriminations continue.

I was a software developer for NGP VAN from the summer of 2011 through the spring of 2015, having first heard of the company as a volunteer in the 2008 Obama primary campaign. While at NGP VAN, I contributed to a system through which the scores at the center of this dispute can be loaded into the voter file. The Voter Activation Network (“the VAN”), which the DNC brands as “VoteBuilder,” is also employed by foreign political parties, the AFL-CIO, NGOs, and — surprisingly, for a brief period in 2013 — Uber.

Amusing as it’s been to find that some of the technical minutiae of my old job have become a hot topic of conversation, coverage of the story so far has been tendentious and often plainly inaccurate. The Clinton campaign has exploited the obscurity of the software and the institutional context in which it’s used to grossly mischaracterize the actions of Sanders staffers. That said, Uretsky’s statements of intent do not accord with the logs released by NGP VAN.

To assess the plausibility of the competing narratives on offer — from the campaigns and Uretsky himself — a bit of background on the mechanics of a contemporary Democratic voter outreach operation is needed.

What Is the VAN?

As early as the 1960s Democrats began systematically assessing which precincts should be allocated campaign resources using statistics aggregated over fairly wide geographic areas. By the 1990s, the precinct was being supplanted by the individual voter as the unit of analysis, just as wall maps and clipboards were giving way to web applications and Palm Pilots.

The Help America Vote Act of 2002, which imposed standard formatting on voter registration information collected by the states, paved the way for party-maintained, nationally comprehensive registries of voters. By the 2006 midterm elections, Sasha Issenberg writes in The Victory Lab, Voter Activation Network’s eponymous system had “emerged from a pack of state-specific interfaces to become the national standard for voter contact on the left.” The VAN was a cornerstone of Obama’s unprecedentedly large and sophisticated Get Out The Vote (GOTV) efforts in 2008, ensuring its centrality to party infrastructure for years to come.

VoteBuilder is comparable to a shared Google spreadsheet, each row of which represents one of the tens of millions of persons currently registered to vote. The DNC is responsible for compiling this list from voter registration data that state governments make available to the public, and for cleaning it up in various ways (e.g., removing duplicate records and the records of the deceased).

The ideal upshot of this process is a unique “VANID” for each registered voter in the country that can be used to track the person from election to election, potentially from state to state, and determine whether and how to attempt to persuade the voter to support a particular candidate. At a minimum, each record will include the voter’s name and address. In some states, party affiliation, voting history (whether, but not how, one voted), gender, and phone number are also available. This is the core data to which all Democratic campaigns have access; think of them as the columns in the spreadsheet everyone in the party can see.

It’s worth noting that NGP VAN’s relation to this data is exactly analogous to Google’s relation to your data when you paste it into Google Docs. The VAN provides a bespoke means of accessing and manipulating the voter file, but the DNC retains all intellectual property rights.

Particular campaigns can, so to speak, add columns to the spreadsheet that only their staff can see and act upon. For example, when a volunteer contacts a voter, she’ll typically rate the voter’s level of enthusiasm for the candidate on a numeric scale. It is not unusual to see a half-dozen such ratings associated with the record of a voter in an important district.

Additional columns are often derived from data on consumer habits (e.g., magazine subscriptions) available from data brokers or polling done at the behest of the campaign. Large campaigns may have an in-house team to enrich their voter files, but most of this work is conducted by outside firms specializing in political marketing analytics like Catalist or TargetSmart.

The considerable cost of compiling and maintaining this information is justified by the promise of efficient employment of the campaign’s volunteer resources in the pursuit of vote totals. When field staffers in a regional office are confronted with fifty volunteers three months before a primary, they turn to the VAN to generate marching orders. Who should get a phone call? Who should get a visit from a volunteer?

As Election Day nears, the pace at which such decisions are made quickens, with the available data and prevailing strategic thinking (e.g., “ought we target women more aggressively?”) changing all the while.

What Are Scores?

Which brings us to scores, the central term of art in this story and the columns of the shared spreadsheet most often hidden from others. A score is an estimate of the probability that a voter possesses some attribute. The two essential scores are “support” (how likely the voter is to be supportive of the candidate) and “turnout” (how likely the person is to vote); the product of the two is the likelihood that the voter turns out and votes for the campaign’s candidate.
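As a small worked example of that arithmetic, here is a minimal sketch. It assumes both scores are expressed as probabilities between 0 and 1 and are independent of one another, which is what multiplying them implies; the function name is invented for illustration and is not drawn from the VAN.

```python
def expected_vote_probability(support: float, turnout: float) -> float:
    """Chance that a voter both turns out and votes for the candidate.

    Assumes the two scores are independent probabilities in [0, 1].
    """
    for name, value in (("support", support), ("turnout", turnout)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return support * turnout


# A voter 80% likely to support the candidate but only 50% likely to vote.
print(expected_vote_probability(0.8, 0.5))  # 0.4
```

A strong supporter who rarely votes can thus be worth less to a campaign than a lukewarm one who always does.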

Beyond these, scores are computed for a dizzying array of voter attributes salient to campaigns: Democratic Party Support, Likely College Graduate, Frequency of Church Attendance, Likely Gun Owner, Source of TV (Likely Cable, Likely Satellite, Likely Broadcast), Spanish Language Preference, Fiscal Progressive, Choice Support, Immigration Progressive, Climate Change Priority, Progressive Activist Score, Unmarried Score, etc. (these examples are drawn from Clarity Campaigns and a VAN training manual). A former VAN engineer recalled scores assessing likelihood of dog ownership.

The stock-in-trade of campaign field staff is finding ways to narrow the universe of voters. Here is how a Democratic Party of Wisconsin training manual describes the process:

Creating a universe for a national or statewide campaign is a complex process that campaigns hire teams of experts to create — think the data team from the Obama campaign in 2012 — but spending time deciding on a universe for your campaign is an important planning step. You’ll want to make sure that you think through why you contact certain voters and certain times. While there is no generalizable universe that will work for all campaigns, a good place for a Democrat running in a non-partisan race to start would be with all voters identified as Strong Democrats, Leaning Democrats, Undecided, or Leaning Republican for their Likely Party. Adding in voters who are listed as “Unknown” or “No Data” can help expand beyond the current information in VoteBuilder, but will inherently mean that you risk speaking to more strong opponents than otherwise.

Let’s say you’ve been tasked with sending canvassers to find young people in Columbus, OH who might themselves volunteer to canvass. The volunteers you have on hand are in the Iuka Ravine neighborhood, so you begin by restricting your query to the surrounding area. The VAN informs you that your query resulted in a list of several thousand doors you could knock on. That being too many, you begin narrowing down your query: residents of the area surrounding Iuka Ravine, younger than thirty, who have at least a 90 percent chance of being Democratic voters and an 8 percent chance of being progressive activists, and who have not yet been visited by a volunteer.

If the count is still too high for your liking, you might lower the age or raise a score threshold. Below are screenshots, taken from a publicly available training manual, of the interface used to do this.

[Screenshots: “Create a New List”; Scores; Age]

What staff are doing here is no more remarkable than selecting rows in a large spreadsheet by the values in a few columns. As a walk-on volunteer for the Obama campaign in Northern Virginia and Ohio during the 2008 primary race, I spent a great deal of time running such queries and segmenting them geographically to generate lists of addresses to hand to canvassers (a process known as “cutting turf”). Logged in using the credentials of a low-level staffer who’d taken a semester off college to work on the campaign, I could filter using a dozen of the campaign’s proprietary scores.
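The narrowing described above can be sketched in a few lines. This is a toy model under stated assumptions, not the VAN’s interface or schema: the `Voter` fields, the score names, and the five records are all made up for illustration, and the thresholds simply echo the Columbus example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voter:
    van_id: int
    neighborhood: str
    age: int
    dem_score: float       # hypothetical "likely Democrat" score, 0-100
    activist_score: float  # hypothetical "progressive activist" score, 0-100
    visited: bool


# Made-up records standing in for rows of the shared voter file.
VOTERS = [
    Voter(1, "Iuka Ravine", 24, 95.0, 85.0, False),
    Voter(2, "Iuka Ravine", 27, 92.0, 40.0, False),
    Voter(3, "Iuka Ravine", 45, 97.0, 90.0, False),  # too old
    Voter(4, "Iuka Ravine", 22, 91.0, 88.0, True),   # already visited
    Voter(5, "Elsewhere", 25, 99.0, 95.0, False),    # wrong area
]


def build_list(voters, neighborhood, max_age, min_dem, min_activist):
    """Return the VANIDs of voters who pass every filter."""
    return [
        v.van_id
        for v in voters
        if v.neighborhood == neighborhood
        and v.age < max_age
        and v.dem_score >= min_dem
        and v.activist_score >= min_activist
        and not v.visited
    ]


first_pass = build_list(VOTERS, "Iuka Ravine", 30, 90, 8)
narrower = build_list(VOTERS, "Iuka Ravine", 30, 90, 50)  # raise one threshold
print(len(first_pass), len(narrower))  # 2 1
```

Raising a single threshold shrinks the list, which is all that “narrowing the universe” amounts to mechanically.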

In VAN, there are two ways one can “save” the list produced by such a query. The first stores a representation of the query itself, the second a list of VANIDs. Using the former method, the set of voters included in the list may vary as, for example, scores are updated. Whichever of these Sanders staff used, no VANIDs, much less complete voter records, would leave NGP VAN servers. They would, however, have the number of voters who satisfied their query parameters.
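The difference between the two ways of saving can be illustrated with a short sketch. It assumes a made-up score table keyed by VANID and an invented `run_query` helper; it says nothing about how the VAN stores either kind of list internally.

```python
# Hypothetical VANID -> support score table (0-100).
scores = {101: 72.0, 102: 55.0, 103: 91.0}


def run_query(score_table, cutoff):
    """Return the sorted VANIDs whose score meets the cutoff."""
    return sorted(vid for vid, s in score_table.items() if s >= cutoff)


saved_search = {"cutoff": 70.0}       # method 1: store the query itself
saved_list = run_query(scores, 70.0)  # method 2: store the resulting VANIDs

scores[102] = 80.0                    # a later score update

print(run_query(scores, **saved_search))  # [101, 102, 103]
print(saved_list)                         # [101, 103]
```

The saved search picks up the voter whose score changed; the saved list is a snapshot and does not.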

What Happened on Wednesday Morning?

NGP VAN CEO Stu Trevelyan described the incident this way:

On Wednesday morning, there was a release of VAN code. Unfortunately, it contained a bug. For a brief window, the voter data that is always searchable across campaigns in VoteBuilder included client scores it should not have, on a specific part of the VAN system. So for voters that a user already had access to, that user was able to search by and view (but not export or save or act on) some attributes that came from another campaign.

In other words: after the update that introduced the bug, someone in the Sanders campaign noticed that Clinton scores, including support scores, were available in the query builder described and depicted above. Uretsky and his staff then filtered lists of voters that all Democratic campaigns share using these scores.
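To make the nature of such a bug concrete, here is a toy model of column visibility. Nothing in it reflects NGP VAN’s actual code: the column names, the campaign labels, and the `enforce_ownership` flag are all invented, and the flag merely imitates the effect Trevelyan describes, in which an ownership check that should apply to score columns is skipped.

```python
SHARED = None  # owner marker for columns every campaign may filter on

# Hypothetical column -> owning campaign mapping.
COLUMN_OWNERS = {
    "name": SHARED,
    "state": SHARED,
    "hfa_support": "clinton",
    "bernie_support": "sanders",
}


def filterable_columns(campaign, enforce_ownership=True):
    """Return the columns a campaign may use as list filters.

    With enforce_ownership=False the ownership check is skipped, so
    every campaign's private score columns become filterable.
    """
    return sorted(
        col
        for col, owner in COLUMN_OWNERS.items()
        if owner is SHARED or owner == campaign or not enforce_ownership
    )


print(filterable_columns("sanders"))
# ['bernie_support', 'name', 'state']
print(filterable_columns("sanders", enforce_ownership=False))
# ['bernie_support', 'hfa_support', 'name', 'state']
```

Note that the rows themselves never change hands in this model; what leaks is the ability to select shared rows by someone else’s column.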

The access logs the VAN automatically generates have not found their way into the press, and this isn’t surprising, as they would be near-unintelligible to all but a few NGP VAN engineers. What has been released is a summary of the logs, presumably written by just such an employee.

They tell us that Uretsky and his staff created a series of lists with names like “HFA Support <30”, which we can assume is a list of voters whose Clinton support score is below 30 percent. The searches are semi-systematic: in one series of queries for voters in South Carolina, they first include supporters who meet a very lax cutoff (60–100 percent), then a marginally more stringent one (50–100 percent), then those least likely to support (<30 percent), and finally those who could go either way (30–70 percent).

In addition to the standard turnout and support scores, they employed variants titled “Primary Prioritization” and “Combined Persuasion.” They only looked at states with primaries before or on Super Tuesday (Alabama, Texas, New Hampshire, South Carolina, Iowa, Colorado, Arkansas, Virginia, and Tennessee) or shortly afterward (Utah and Florida).

In an interview with MSNBC’s Steve Kornacki, Uretsky justified the actions of Sanders staff this way:

Kornacki: “What we’re able to see from these documents is that people from your campaign — for over forty minutes – were able to access . . . were able to look at, search, and make copies of . . . Clinton supporter lists, from her side of the wall. What is the justification for doing that?”

Uretsky: “. . . So I guess what I want to say is that we knew that what we were doing was trackable and we wanted to create a clear record of the problem before reporting it so we could make sure we weren’t crying wolf . . . so that the extent of the exposure of our data, to the other political entities . . . We had to assume that our data was equally exposed and updated reports prove . . . show that it was. We wanted to document and understand the scope of the problem so that we could report it accurately.”

Similarly, Uretsky told CNN:

In retrospect, I got a little panicky because our data was totally exposed, too. We had to have an assessment, and understand of how broad the exposure was and I had to document it so that I could try to calm down and think about what actually happened so that I could figure out how to protect our stuff.

That he was concerned the problem was symmetrical (that Clinton’s staff could see their scores if they could see Clinton’s) is believable enough. It beggars belief, however, to claim that twenty-five searches, with somewhat systematically varied thresholds, were conducted in the very states where Sanders field staff would be most anxious about their opponent’s prospects, all to file an accurate bug report with NGP VAN. The expected documentation under these circumstances would be a screenshot of the score filtering interface showing Clinton scores.

A reference to a list or two created using the scores might be helpful, but once it has been established that one is in fact able to filter, additional lists are not in any way useful to the engineer debugging. If the thinking was that the score access bug had only affected certain states, surely after the third or fourth attempt it would be clear this was not the case.

Later in the interview, Uretsky took issue with the characterization of his actions as “making copies”:

Kornacki: You were making copies of her voter list, weren’t you?

Uretsky: I guess you could phrase it that way, but we never . . . but those systems were all within the VoteBuilder/VAN system . . . the Voterbuilder/DNC system . . . it was all within their custodianship. If that makes any sense. So we didn’t . . . at least to my knowledge, we did not export any records of voter file data that were based on those scores. So yes we did establish proof there was a problem so that A: we understood what that problem was and B: we could accurately report that up the chain.

Uretsky was entirely justified in pushing back on this. Running queries to build lists is not necessarily a step toward exporting them.

But the focus on the possibility of an export is misplaced. NGP VAN’s official statement makes clear that he did not have sufficient permissions to export. Even if he had, the scores are only useful for filtering VAN entries, so there is no reason to export them. It would not be feasible for the Sanders campaign to write its own software to make use of such an export in time for the primaries. It was the VAN or nothing.

It would have been useful to have lists within the VAN of, for example, those voters who were fervent Hillary supporters, such that they could be excluded from future campaign interventions. And Uretsky and his staff created several such lists.

But as we’ve seen, the VAN automatically logs information about queries executed, such that any lists generated with the Clinton scores could have been identified as such. Uretsky claimed to know as much in the MSNBC interview, and to judge by his LinkedIn profile, this must be true — he has been working as a VAN administrator on and off since at least 2008.

Clinton Spin

From the beginning, the Clinton camp has portrayed Wednesday’s events as a heist. Here’s spokesman Brian Fallon on CNN this past Friday:

“I’d be happy to send a copy to [Sanders campaign manager] Jeff Weaver . . . of the audit reports that show multiple attempts, twenty-four in fact, by four different employees . . . by the Sanders campaign to steal data from the Clinton campaign on Wednesday of this week. It was an egregious breach, it was in violation of the rules.”

Presumably reading from the same memo, Robby Mook told reporters:

This was not an inadvertent glimpse into our data, it was not as the Sanders campaign has described it, as a mistake . . . They have, in fact, stored the data that they found on the file, data that belonged to us . . . The Sanders campaign has tried to downplay what this means, so I want to be very, very clear: This was data that took millions of dollars and hundreds of thousands of volunteer hours to build.

According to Politico, he went on to say the Sanders campaign had access to the “fundamental keys of [the Clinton] campaign,” its “strategic road map.” David Plouffe, once of Obama for America and now president of policy and strategy at Uber, likened the queries to a transgression of bourgeois norms of fair play in commercial competition: “Think if one company accessed and stole another’s customer data. This is no small thing.”

To hear the Clinton camp tell it, something analogous to discovering an open filing cabinet and frantically photocopying its contents took place. As the foregoing explanation should make clear, this is a false equivalence.

The repeated calls for the Sanders campaign to prove, in the words of DNC CEO Amy Dacey, that it is no longer “in possession” of the “data that was inappropriately accessed” compounded confusion. Only NGP VAN administrators could ensure this, first by fixing the bug, thereby preventing any future filtering from occurring, and then by deleting those lists that were the product of filtering in the interim.

What Was Uretsky Thinking?

To summarize: the Sanders campaign had no feasible way to exploit the bug to more effectively employ campaign volunteers by, for example, excluding strong Hillary supporters from their canvassing walk lists after the bug was fixed. Uretsky likely knew this, and would have expected the bug to be quickly remedied. And finally, the logs show searches that, in quantity and character, are not consistent with an effort to document the problem.

So what was Uretsky thinking? He most likely ran the queries to sound out the Clinton campaign’s estimation of its chances in upcoming races. With every query they ran, Sanders staff learned something about what the Clinton campaign thought the distribution of support among voters in early primary states looked like.

Their query for voters with support scores less than 30 percent in South Carolina might tell them how many voters the campaign had given up on, just as one for voters with a score above 70 percent might tell them roughly how many votes they felt were assured. After running queries in New Hampshire, they navigated to a crosstab view (pictured below) repeatedly to see a detailed breakdown of the results by demographic variables like age and ethnicity.

[Screenshot: crosstab view]
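How much a handful of list counts can reveal is easy to illustrate. The sketch below is an assumption-laden toy: the ten scores are made up, the band labels borrow the language used above, and `count_matching` stands in for whatever count the list builder reports. The querying side never sees an individual score, yet the shape of the distribution emerges.

```python
# Made-up support scores that the querying campaign never sees directly.
hidden_scores = [5, 12, 28, 33, 47, 51, 66, 74, 88, 95]


def count_matching(scores, low, high):
    """Return only the number of scores in [low, high], as a list count would."""
    return sum(1 for s in scores if low <= s <= high)


bands = {
    "given up on": (0, 29),
    "could go either way": (30, 70),
    "assured": (71, 100),
}
picture = {
    label: count_matching(hidden_scores, low, high)
    for label, (low, high) in bands.items()
}
print(picture)
# {'given up on': 3, 'could go either way': 4, 'assured': 3}
```

Three queries are enough here to sketch where the other side believes its support lies, without a single record leaving the system.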

Perhaps Uretsky really did begin searching with the intent of assessing the scope of the breach and his curiosity simply got the better of him. In doing this, he touched a very sensitive nerve indeed. But his actions were less comparable to copying sheets from the opposing team’s playbook than opportunistic eavesdropping on its pre-game chatter.