- Result: Know when to stop
- (1) Popular kerning pairs
- (2) Potential kerning pairs
- (3) Frequently occurring kerning pairs
- (4) Relevant kerning pairs
- (5) Sample Text
- Personal Whitelisting
- Notes
Find a list of kerning pairs below, ordered by relevance. The relevance score is determined by frequency of occurrence in texts, popularity among type designers and the necessary kerning value. This repository is intended for type designers looking for sample texts for the most complete kerning necessary. This list will help you (or your neural network) assess which kerning pairs you should look at and where to stop. It takes into account the Unicode blocks from Basic Latin to Latin Extended A (plus all kinds of quotes), 24 languages so far and various typeface styles.
- 🍀✨ result/relevant_kerning_raw.json
- 🍀✨ result/relevant_kerning_with_padding.txt
- 🍀✨ result/relevant_kerning_clustered.txt
- 🍀✨ result/relevant_words.txt
The first 1000 to 2000 pairs should have you covered sufficiently in all languages, see (4) for details. Depending on your typeface’s style you don't have to kern each and everyone of them, of course.
In case you want to reproduce parts of this work: All scripts are self-explanatory and numbered according to the following article. Some parts may also be useful for other works, e.g. the large sample size of texts in many languages.
Let’s start by looking at kern tables of existing fonts to collect more or less common kerning pairs statistically. As a sample I was content with the about 2700 font files of the Google Fonts Archive. But you can apply this script to any font collection you might want to look at yourself. It uses parts of the Adobe Type Tools, see directory "kernDump". 72377 unique kerning pairs were used in these fonts. This means, for the set of 300 characters considered, dazzling 80 % of the possible 90000 combinations are (more or less) used.
This use count contains a lot of noise caused by subjective decisions and/or group based kerning, possibly. It helped a lot to cut off the long tail at some place. In the end, I removed all values lower than 20 percent of the maximum value. Which sounds a lot, but it leaves us with 11899 remaining most popular kerning pairs. In Script 3G you can decide for yourself where you would like to make this cut, if desired. Find the uncut list here. Screenshot. See docs/01-kern-dump.html
Another approach to determine possible kerning pairs is combinatorial. The form of each letter can be described schematically by simple numerical values. The values on the adjacent sides or two neighboring letters can be summed up to determine whether this is a possible kerning pair. Like everything that is considered here, this also depends on the design of the font. For a grotesque sans serif typeface—as a proof of concept—a resolution of 3 by 5 units was sufficient for each side (-1, 0 or 1 for ascender, x-height, half-x-height, baseline, descender). More complex (serif) typefaces might need a higher resolution. 20554 kerning pairs could be generated this way. Compared to the 11899 most "popular" kerning pairs, 11924 other possible ones could be added to the list. I found this approach to generate way less noise than the popular vote before. Excerpt. Find full set here: docs/00-letter-shapes.html
To bring down the runaway number of 23823 pairs to something more manageable let's count how frequent they occur in real-world texts. All samples were fetched from Wikipedia, mostly within their "Featured Article" category to assure a certain level of quality. In total, the letter pairs of about a 1 GB of text (respectively 2683 books) in 24 languages were counted so far. Unfortunately GitHub truncates directories to 1000 files, so I had to group the individual text files to work around that.
Language | Code | Total Count* | … in books | Letter Pairs* | Common Kerning Pairs | … in % |
---|---|---|---|---|---|---|
Czech | CS | 44338357 | ≈ 148 | 12288 | 4379 | ≈ 36 % |
Danish | DA | 22486111 | ≈ 75 | 8753 | 3164 | ≈ 36 % |
German | DE | 115862487 | ≈ 386 | 13209 | 4656 | ≈ 35 % |
English | EN | 146375753 | ≈ 488 | 11681 | 4171 | ≈ 36 % |
Spanish | ES | 57164706 | ≈ 191 | 10044 | 3587 | ≈ 36 % |
Estonian | ET | 6948357 | ≈ 23 | 6870 | 2501 | ≈ 36 % |
Finnish | FI | 23858390 | ≈ 80 | 8438 | 3070 | ≈ 36 % |
French | FR | 95981287 | ≈ 320 | 11468 | 4139 | ≈ 36 % |
Croatian | HR | 19644191 | ≈ 65 | 8505 | 3048 | ≈ 36 % |
Hungarian | HU | 62246898 | ≈ 207 | 12476 | 4382 | ≈ 35 % |
Italian | IT | 47654490 | ≈ 159 | 10126 | 3626 | ≈ 36 % |
Lithuanian | LT | 1963754 | ≈ 7 | 5852 | 2188 | ≈ 37 % |
Latvian | LV | 1397605 | ≈ 5 | 4658 | 1617 | ≈ 35 % |
Dutch | NL | 27789615 | ≈ 93 | 9099 | 3241 | ≈ 36 % |
Norwegian | NO | 32943445 | ≈ 110 | 10134 | 3656 | ≈ 36 % |
Polish | PL | 28486143 | ≈ 95 | 9202 | 3294 | ≈ 36 % |
Portuguese | PT | 20138072 | ≈ 67 | 7550 | 2758 | ≈ 37 % |
Romanian | RO | 10043888 | ≈ 33 | 7671 | 2834 | ≈ 37 % |
Sami | SE | 541248 | ≈ 2 | 4160 | 1571 | ≈ 38 % |
Slovak | SK | 6673205 | ≈ 22 | 7763 | 2812 | ≈ 36 % |
Slovenian | SL | 6978876 | ≈ 23 | 7156 | 2575 | ≈ 36 % |
Albanian | SQ | 8481442 | ≈ 28 | 7276 | 2710 | ≈ 37 % |
Swedish | SV | 10436272 | ≈ 35 | 7936 | 2973 | ≈ 37 % |
Turkish | TR | 6338028 | ≈ 21 | 6598 | 2320 | ≈ 35 % |
* All counts cleaned: Characters outside the mentioned Unicode blocks got filtered out.
The more text you examine, the more letter pairs you find. The total count is directly proportional to the number of letter pairs found. For each language the sorted counts form a hyperbola with a very long tail, see the exemplary chart below.
Screenshot: Final result. 3736 pairs only, tail already cut off (it gets up to 4 times as long) docs/04-relevant-count.html
The long tail of less frequent pairs gets the longer the more you count in total. All values are normalized to an average number per book to better compare the languages. One book stands for 100 pages of text of 3000 characters each. I decided to cut off all unusual pairs which occur less than once per book. You can move that line if desired.
More frequently occurring pairs of letters quickly find a more stable place in the head of this ranking. The accuracy of this count for the more common pairs increases less and less the more text you look at. Looking at the ("only") couple of thousand kerning pairs I end up with, it is not necessary to examine more texts.
All lists of kerning pairs by language get merged into a global total. This resulting list contains the high-score values to avoid discrimination of local peculiarities. For example, the potential kerning pair "ij" appears remarkably frequently in Dutch only and should be given appropriate attention, even if it appears negligibly rarely on the global average.
Uppercase variants of all letter pairs are also taken into account.
8558 of these kerning pairs occur at least once per book, 5298 at least once in 10 pages, 2542 at least once per page.
[…] Juxtaposition of kerning value (size of left bar), popularity (color intensity of left bar) and frequency of occurence in all languages (size of right bars). See full count here: docs/02-count-by-language.html
From the previous statistics I conclude that the relevance of a potential kerning pair should be judged by occurrence, popularity and kerning value itself.
All three variables get normalized before adding them to a relevance score, that I call Kern Score. This means that the values of each variable is scaled to the corresponding total average value of 1. And a Kern Score of 3 could mean the sum of just average occurrence + average popularity + average kerning value (1 + 1 + 1). It could also mean an extremely rare pair, which requires a very strong kerning and is considered by many typographers for some reason (0.00001 + 1.499999 + 1.5). Both are equally ranked in the charts.
Looking at the resulting chart by frequency values only, it is noticeable that about the first 1000 kerning pairs occur at least once per page; About the first 2000 once in 10 pages; And the total 3736 once in 100 pages. You decide where to stop and how much group based kerning you want to apply.
[…] Excerpt. After the upper 400 pairs occurence is not dominating the other summands and the little personal whitelist bonus gets noticeable. See full chart here: docs/03-kern-scores.html
The sample texts are composed with the most common words in the most respective languages for each kerning pair. I am aiming for reasonably compression here. These compositions should resemble real life examples of texts. They get more fuzzy/messy/noisy towards the end due to the more noisy "long tail" discussed before. This noise could be reduced best by manual editing, I think.
Useful by-products might be the lists of most common words in each language, for instance: count/by_language/en/words.json
From personal experience of setting cultural studies texts most often, I would like to pay more attention to the quotation marks. First all of their stylistic alternates should be taken into account. So I summed up all comparable pair counts and generated all possible variants. In the end I wasn't fully convinced by their frequency of occurrence in these sample texts from Wikipedia. Different kinds of texts – other than encyclopedic – would be necessary to cover more special cases. Instead, in order to keep this project manageable, I prefer a personal whitelist of kerning pairs, which I would like to see ranked higher. A bonus value 0.5 added to the score works just fine for my purposes. This does not affect the top of the ranking, as you can see in the charts. Feel free to edit this whitelist in script 4A.
Let's try to avoid over-engineering.
It is not necessary to kern all possible kerning pairs. A relatively small fraction of them gets the job done just fine. For ever smaller marginal gains the corresponding effort rises more and more unreasonably. This is called diminishing returns in economics.
Besides the many unknown kerning pairs added by foreign languages, I am surprised especially by pairs including the space character. Secondly uppercase letter pairs are more relevant, of course, followed by inter-punctuation ones.
Screenshot of docs/05-pie-charts.html
The Google Fonts kern dump collected 80 % of all possible letter pairs to be potentially kerned! This rather overwhelming number allows the question of whether spacing a font can be solved in a fundamentally different way. The current system is still based on movable type letterpress printing. Made out of metal or wood the sides of a glyph were made straight to keep them in line/register. In a digital format it might make more sense to describe the sides of a glyph as a more complex shape (by the designer) to allow for an (automatically generated) advanced spacing without any kerning necessary.
This project's approach combines more objective factors (occurrence, kerning value) and more subjective ones (popularity, personal whitelist). Popularity is measured statistically though and the whitelist bonus is rather small.
- Sorting out the essential must-haves/misfits for a less intimidating, beginner friendly version
- I might favor HTML tables over these canvas based charts to add more interactive features with DataTables.js ... not sure if it's more snappy on weaker machines though.