
Group Call Summary 2017 02 23

dcwalk edited this page Mar 9, 2017 · 1 revision

From @ambergman on February 24, 2017 9:46

I wanted to summarize my thoughts here after a great conversation with @danielballan last night, and after hearing about the great work he and @Mr0grog are doing to coordinate their efforts. Apologies to @Mr0grog, @danielballan, and others if this issue gets in the way of your work at all - happy to take it down and let you all lead:

The conversation about building EDGI's web monitoring software, and a diff database in particular, has been framed by @titaniumbones, me, and others as a migration from using Versionista to using PageFreezer's snapshots. I want to suggest that this framing may have been a mistake and that, instead of a migration, we'd actually like to integrate our two sources and build a diff database that can store data coming from Versionista, PageFreezer, and any other credible source. This will be important in the short term: for many of the pages we're watching, our Versionista snapshot history goes further back and at a higher frequency, so we don't want to lose that information after we start using PageFreezer's snapshots. As conversations with the Internet Archive progress, we'd definitely also like to make sure all of the material from the Wayback Machine can be read into our DB as well.

Because Versionista and PageFreezer output different data in different formats, reading in data from the two sources will require different interfaces. So it's great that the two interfaces are being developed separately in different apps for now - see @Mr0grog's repo for the Versionista app here - and it's great that Dan and Rob are working together to determine how to combine their efforts. In both cases the HTML data taken in can be converted to a series of diffs, and those diffs can be stored in one big database, with one additional column to denote where the material used to produce each diff came from. Down the road, we can even decide to store diffs made from two HTML snapshots from two different sources, but I think we can save that for later, perhaps once we've loaded everything into one snapshot database at IA at some point.
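To make the "one big database with a source column" idea a bit more concrete, here is a minimal sketch using Python's standard `sqlite3` and `difflib` modules. Everything specific in it is an assumption for illustration only: the `diffs` table, its column names, the `make_diff`, `create_db`, and `store_diff` helpers, and the use of line-based unified diffs are all hypothetical, and none of it reflects the actual Versionista or PageFreezer apps or any schema that has been agreed on.

```python
import difflib
import sqlite3


def make_diff(old_html: str, new_html: str) -> str:
    """Return a line-based unified diff between two HTML snapshots."""
    lines = difflib.unified_diff(
        old_html.splitlines(),
        new_html.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    )
    return "\n".join(lines)


def create_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) shared diffs table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS diffs (
            id              INTEGER PRIMARY KEY,
            page_url        TEXT NOT NULL,
            source          TEXT NOT NULL,  -- e.g. 'versionista', 'pagefreezer'
            old_captured_at TEXT NOT NULL,  -- timestamp of the earlier snapshot
            new_captured_at TEXT NOT NULL,  -- timestamp of the later snapshot
            diff_text       TEXT NOT NULL
        )
        """
    )
    return conn


def store_diff(conn, page_url, source, old_snapshot, new_snapshot):
    """Diff two (captured_at, html) snapshots and store the result.

    Both snapshots are assumed to come from the same source.
    Returns the id of the new row.
    """
    old_time, old_html = old_snapshot
    new_time, new_html = new_snapshot
    cur = conn.execute(
        "INSERT INTO diffs "
        "(page_url, source, old_captured_at, new_captured_at, diff_text) "
        "VALUES (?, ?, ?, ?, ?)",
        (page_url, source, old_time, new_time, make_diff(old_html, new_html)),
    )
    conn.commit()
    return cur.lastrowid
```

With a layout like this, each source-specific interface would only need to turn its own output into `(captured_at, html)` pairs before calling `store_diff`, and a query such as `SELECT * FROM diffs WHERE source = 'versionista'` would separate the histories again. Storing diffs made from snapshots of two different sources would need separate columns for the old and new source, which this sketch leaves out.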

So, in short, I think it would be great to think about how to integrate the Versionista and PageFreezer diff databases, not just migrate between them. I know I haven't been specific about interfaces at all here, so I'm sure this isn't all that helpful for deciding what schema to actually use to integrate the two sources - but that's probably the topic of a series of other issues. Let me know what you all think.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#29
