
Contributing CharacterSubstituters

Florian Hanke edited this page Apr 14, 2011 · 5 revisions

Glad you’d like to add a character substituter to Picky!

What is it?

A character substituter normalizes single characters. For example, you might want your indexer or query to convert umlauts into their non-umlaut form:
ü -> ue (Yes, a single character can be normalized into multiple ones.)
This substitution is already built in, and you can use it as follows:

  # For indexing:
  indexing substitutes_characters_with: CharacterSubstituters::WestEuropean.new

  # For querying:
  searching substitutes_characters_with: CharacterSubstituters::WestEuropean.new

However, the West European character substituter only changes ö into oe, ç into c, and similar. For example, there is no conversion defined for Polish or Russian characters.

So you might want to write your own. How to do it? It’s easy.

How to do it

  1. Check the available character substituters to see whether the one you need already exists.
  2. If yes, use (and improve) it :)
  3. If not, fork the repository and follow the instructions below.
  4. Add a new load statement to lib/picky/loader.rb

Every character substituter should implement the substitute(text) method. This method is called by the indexer and/or query.

  • substitute(text) # Substitute characters in the text and return a new text.
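
To make the contract concrete, here is a minimal sketch of a substituter for Polish characters. The class name `PolishSubstituterSketch` and the mapping table are assumptions made up for illustration; they are not part of Picky, and a real contribution would likely need a more complete mapping and a spec.

```ruby
# Hypothetical example class: the name and the mapping are illustrative
# assumptions, not something that ships with Picky.
class PolishSubstituterSketch
  # Each Polish letter maps to a plain ASCII replacement.
  MAPPING = {
    'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
    'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
    'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
    'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z'
  }.freeze

  # One pattern that matches any character listed in the mapping.
  PATTERN = Regexp.union(MAPPING.keys)

  # Returns a new string; the text passed in is left unchanged.
  def substitute(text)
    text.gsub(PATTERN, MAPPING)
  end
end
```

An instance of such a class could then be passed to `substitutes_characters_with:` in the same way as the West European substituter. The sketch assumes the input uses precomposed characters; decomposed input (a base letter followed by a combining mark) would pass through untouched.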

Example

This is how the west european character substituter implements the substitute method (at the time of this writing). See also the spec to see what it does.

  def substitute text
    trans = @chars.new(text).normalize(:kd)

    # substitute special cases
    #
    trans.gsub!('ß', 'ss')

    # substitute umlauts (of A,O,U,a,o,u)
    #
    trans.gsub!(/([AOUaou])\314\210/u, '\1e')

    # get rid of acutes, graves and …
    #
    trans.unpack('U*').select { |cp|
      cp < 0x0300 || cp > 0x035F
    }.pack('U*')
  end
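
The three steps above (decompose, handle special cases, strip combining marks) can also be tried outside Picky. The following is a standalone sketch, not Picky's code: it swaps the `@chars` wrapper, whose class is not shown here, for Ruby's built-in `String#unicode_normalize` (available from Ruby 2.2), and the method name `substitute_sketch` is hypothetical.

```ruby
# Standalone sketch, not Picky's implementation. String#unicode_normalize
# (Ruby 2.2+) stands in for the @chars wrapper used in the original method.
def substitute_sketch(text)
  # Decompose: "ü" becomes "u" followed by U+0308 (combining diaeresis).
  trans = text.unicode_normalize(:nfkd)

  # Special case: ß has no decomposition, so replace it explicitly.
  trans = trans.gsub('ß', 'ss')

  # Umlauts of A, O, U, a, o, u: base vowel plus U+0308 becomes vowel plus "e".
  trans = trans.gsub(/([AOUaou])\u0308/, '\1e')

  # Drop the remaining combining marks in U+0300..U+035F,
  # the same code point range the original method filters out.
  trans.unpack('U*').reject { |cp| cp.between?(0x0300, 0x035F) }.pack('U*')
end
```

Because the umlaut replacement runs before the combining marks are stripped, ü turns into ue, while a diaeresis on any other letter (for example ï) is simply removed along with accents and cedillas.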