Wikipedia Metadata

Wikipedia Pageview Stats (importing TSV)

This dataset is as easy as it gets. Well, that is, until you hit the brick wall of having to work around strings with broken encodings.

It is a .tsv file with columns for lang_and_project, page_id, request_count, and transferred_bytesize. Since it's a TSV, parsing is as easy as defining the model and calling from_tuple:

class Wikipedia::RawPageview < Wikipedia::Base
  field :lang_and_project,     String
  field :id,                   String
  field :request_count,        Integer
  field :transferred_bytesize, Integer
end
mapper do
  input > from_tsv >
    ->(vals   ){ Wikipedia::RawPageview.from_tuple(vals) } >
    to_json > output
end

We’re going to make the following changes:

  • split the lang_and_project attribute into in_language and wp_project. They’re different properties, and there’s no good reason to leave them combined.

  • add the numeric id of the article

  • add the numeric id of the redirect-resolved article: this will make it easy to group page views under their topic (see the sketch below)
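That last change is a join against the redirect table. As a minimal sketch (with a hypothetical filename and layout: a two-column TSV mapping page_id to its redirect-resolved page id, small enough to hold in memory), an in-memory hash join looks like this:

# A minimal sketch, not the book's script: load the redirect map, then tack
# the resolved page id onto each pageview record as it streams past.
redirects = {}
File.foreach('redirects.tsv') do |line|
  page_id, resolved_id = line.chomp.split("\t")
  redirects[page_id] = resolved_id
end

$stdin.each_line do |line|
  fields  = line.chomp.split("\t")
  page_id = fields[1]                        # second column holds the page id
  fields << (redirects[page_id] || page_id)  # a non-redirect resolves to itself
  puts fields.join("\t")
end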

Assembling the namespace join table

  • Take the pages metadata table,

    • get just the distinct pairs

    • verify

  • 101 is "Book" — at least in English it is; in XX wikipedia it’s XX, and in XX it’s XX

    • pull in header from top of XML file (see the sketch after this list)

    • FIXME: is there a simpler script
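For the "pull in the header" step, here is a minimal sketch. It assumes the MediaWiki XML dump's siteinfo header lists each namespace as a line like <namespace key="100" case="first-letter">Portal</namespace> (the main namespace, key 0, is a self-closing tag with no name, and this sketch skips it):

# A minimal sketch, not the book's script: read the header at the top of the
# XML dump and emit one (namespace key, namespace name) pair per line,
# stopping as soon as the namespace list ends.
namespaces = {}
$stdin.each_line do |line|
  if line =~ %r{<namespace key="(-?\d+)"[^>]*>([^<]*)</namespace>}
    namespaces[$1.to_i] = $2
  end
  break if line.include?('</namespaces>')
end
namespaces.sort.each { |key, name| puts [key, name].join("\t") }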

Getting file metadata in a Wukong (or any Hadoop streaming) Script

TODO:
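One piece worth sketching now: Hadoop streaming hands every task its job configuration as environment variables, with dots turned into underscores, so a mapper can recover the name of the file behind its current input split. A minimal sketch (the helper name is ours):

# A minimal sketch: Hadoop streaming exports job configuration values as
# environment variables (dots become underscores), including the filename
# of the current input split.
def current_input_file
  ENV['mapreduce_map_input_file'] ||  # Hadoop 2.x name
    ENV['map_input_file']         ||  # Hadoop 1.x name
    '(none: running locally)'         # wu-local sets neither
end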

Translation

The translation is light, and the original model is of little interest, so we can just put the translation code into the raw model. First, we’ll add the in_language and wp_project properties as virtual accessors:

class Wikipedia::RawPageview < Wikipedia::Base
  # ... (cont) ...

  def in_language() lang_and_project.split('.')[0]       ; end  # 'en' from 'en.mw'
  def wp_project()  lang_and_project.split('.')[1] || 'a' ; end # no suffix means the main wikipedia project, coded 'a'

  def to_wikipedia_pageview
    Wikipedia::WikipediaPageview.receive(
      id: id, request_count: request_count, transferred_bytesize: transferred_bytesize,
      in_language: in_language, wp_project: wp_project)
  end
end

mapper do
  input > from_tsv >
    ->(vals   ){ Wikipedia::RawPageview.from_tuple(vals) } >
    ->(raw_rec){ raw_rec.to_wikipedia_pageview }           >
    to_json > output
end

Wikipedia Article Metadata (importing a SQL Dump)

link:code/munging/wikipedia/wp_pagelinks_extract.rb[role=include]
link:code/munging/wikipedia/wp_pages_extract.rb[role=include]

Necessary Bullcrap #76: Bad encoding

Encoding errors (TODO: note that setting LC_ALL=C makes commandline tools such as sort and cut treat their input as plain bytes rather than choking on it).

Scrub illegal UTF-8 byte sequences from the contents.

link:code/munging/wikipedia/char_filter.rb[role=include]
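As a minimal sketch of the core idea (not the char_filter.rb script above; it assumes Ruby 2.1 or later, whose String#scrub replaces invalid byte sequences):

# A minimal sketch, not char_filter.rb: treat each line as UTF-8 and replace
# any invalid byte sequences with the Unicode replacement character.
$stdin.each_line do |line|
  puts line.force_encoding('UTF-8').scrub("\uFFFD")
end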

Wikipedia Page Graph

The Wikipedia raw dumps distribute the pagelinks table as a MySQL dump full of multi-row INSERT statements:

INSERT INTO `pagelinks` VALUES (11049,0,'Rugby_union'),(11049,0,'Russia'),(11049,0,'Scottish_Football_Association'),(11049,0,'Sepp_Blatter'),(11049,0,'Simon_Hill'),...

The Wikipedia datasets carry a bug that is unfortunately common and appallingly difficult to remediate: strings with broken character encodings.

Perhaps, people from the future, Wikipedia will have cleaned up its dumps by the time you read this. Until then, expect errors like these from commandline tools and from Ruby:

cut: stdin: Illegal byte sequence
-e:1:in `<main>': invalid byte sequence in UTF-8 (ArgumentError)

class Wikipedia::Pagelink < Wikipedia::Base
  field     :name, String
end
  • SQL parser to TSV (see the sketch after this list)

    • no need to assemble source domain model (yet) so don’t

    • emits from_page_id into_namespace into_title

  • pig script to tack on page name for the from, page id for the into

    • emits from_page_id into_page_id from_namespace from_title into_namespace into_title
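A minimal sketch of the extractor (not the book's wp_pagelinks_extract.rb): scan the tuples out of each INSERT line with a regex, emit them as TSV, and leave the join to the Pig script.

# A minimal sketch: each INSERT line packs many
# (from_page_id, into_namespace, 'into_title') tuples; scan them out with a
# regex (allowing for backslash-escaped quotes inside titles) and emit TSV.
TUPLE_RE = /\((\d+),(\d+),'((?:[^'\\]|\\.)*)'\)/

$stdin.each_line do |line|
  next unless line.start_with?('INSERT INTO `pagelinks` VALUES')
  line.scan(TUPLE_RE) do |from_page_id, into_namespace, into_title|
    puts [from_page_id, into_namespace, into_title].join("\t")
  end
end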

Target Domain Models

First step is to give some thought to the target domain model. There’s a clear match to the Schema.org Article type, itself a subclass of CreativeWork, so we’ll use the property names and descriptions:

class Wikipedia::WpArticle < Wikipedia::Base
  field      :id,                     String,    doc: "Unique identifier for the article; it forms the tail-end of the traditional URL"
  field      :wp_page_id,             Integer,   doc: "Numeric serial ID for the page (as opposed to the article's topic)"
  field      :name,                   String,    doc: "Topic name (human-readable)"
  field      :description,            String,    doc: "Short abstract of the content"
  field      :keywords,    Array, of: String,    doc: "List of freeform tags for the topic"
  field      :article_body,           String,    doc: "Contents of the article"
  field      :coordinates, Geo::Coordinates,     doc: "Primary location for the topic"
  field      :content_location, Geo::Place,      doc: "The location of the topic"
  field      :in_language,            String,    doc: "Language identifier"
  collection :same_as_ids,            String,    doc: "Articles that redirect to this one"
  collection :wp_links,               Hyperlink, doc: "Links to Wikipedia Articles from this article"
  collection :external_links,         Hyperlink, doc: "Links to external sites from this article"
  field      :wp_project,             String,    doc: "Wikipedia project identifier; main wikipedia a, wikibooks b, wiktionary d, wikimedia m, wikipedia mobile mw, wikinews n, wikiquote q, wikisource s, wikiversity v, mediawiki w"
  field      :extended_properties,    Hash,      doc: "Interesting properties for this topic, extracted by DBpedia. For example, the topic 'Abraham Lincoln' has properties vice_president: \"Andrew_Johnson\", spouse: \"Mary_Todd_Lincoln\" and so forth."
end
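As a quick smoke test of the model (made-up values; receive is the same constructor the raw pageview model used earlier):

# A minimal sketch with made-up values, just to exercise the model.
article = Wikipedia::WpArticle.receive(
  id:          'Abraham_Lincoln',
  name:        'Abraham Lincoln',
  in_language: 'en',
  wp_project:  'a')
puts article.name    # => "Abraham Lincoln"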