
SVG and Typography: Characters

May 12, 2004

Fabio Arciniegas A.

In the second part of our discussion of SVG and typography we explore some time-honored practices of typographic excellence; as we go along, each “type issue” will lead to the discussion of relevant technical aspects of SVG. The typography issues covered are listed below. Beside each one of them is the associated technical SVG issue discussed:

  • Quotes, Hyphens, Ellipses (Character References)
  • Fonts (Embedding SVG Fonts, Creating SVG fonts from True Type)
  • Non-Latin Scripts (Proper fonts vs. cheating fonts, encodings, bidirectionality)
  • Ligatures and the Euro Sign

Quotes, Hyphens, Ellipses

“The devil is in the details,” details such as the quotes surrounding the previous phrase.

Fig 1. Smart vs. dumb quotes

There are two kinds of quotes: “straight” or “dumb” quotes and “curly” or “smart” quotes. As you can see in Fig. 1 on the left, using smart quotes gives the text a professional look. This is because the paired quotation marks are specifically designed for each font, unlike the neutral “dumb” marks which are often considered a faux pas in professional typesetting.

The common misuse of dumb quotes is just the most popular version of a larger problem: using characters that are not appropriate but look similar to the correct ones and are easier to type. For some, editing software such as MS Word takes care of this problem by automatically converting straight quotes to curly ones; but since a large part of SVG development is manual or the output of our own custom software, we cannot count on an editor to fix it. We need to understand the details of how these characters are included.

Numeric Character References

The correct way to introduce curly quotes in SVG documents is through numeric character references. The Left Double Quotation Mark is character U+201C in the Unicode standard, so it can be included in your SVG document via the decimal numeric character reference &#8220;; its right counterpart, character U+201D, is included via &#8221;. It is also possible to use hexadecimal character references, in which case you would use &#x201C; and &#x201D;. The SVG code for Fig. 2 (quotes.svg) mixes the two approaches.




<?xml version="1.0"?>
<svg xmlns="http://www.w3.org/2000/svg"
     width="220" height="220" version="1.1">
  <rect x="1" y="20" width="110" height="110"
        style="fill:black;"/>
  <text x="80" y="40" style="font-family:Arial; font-size: 10pt; fill:red;">Eno</text>
  <text x="3" y="110" style="font-family:Arial; font-size: 12pt; fill:white;"> &#8220; Fabrication &#x201D; </text>
</svg>

Fig 2. Quotes.svg

There are several reasons for using numeric character references in SVG instead of other methods such as HTML character entity references. Let's examine them briefly:

  • HTML character entity references (&ldquo; and &rdquo; for double curly quotes) are not defined in SVG. If you were to try to include them in an SVG document, compliant viewers such as the Adobe SVG viewer v3.01 will not show the character, because such entities are not pre-defined like they are in HTML.

  • Specifying an encoding such as UTF-8 and including the character directly in the document is a technically valid alternative. However, many programming tools have difficulties showing and manipulating UTF-8 and other encodings. This difficulty is relevant not only for curly quotes but also for any character that is difficult to input or display in common programming tools, including non-Latin alphabetic characters.
    Part of the beauty of SVG is that you can write simple programs to generate it and use common text programs to manipulate it; however, that simplicity is blurred when common operations like searching from the command line become difficult because the characters in question are not supported by your input methods. In other words, you can use any good old terminal to write grep -n "&#x0424;" code/*.svg, but you would have to go through contortions to type grep -n "Ф" code/*.svg

  • Non-standard character sets such as Microsoft windows-1252 are being deprecated and should be avoided because of their conflict with Unicode. For a more detailed explanation of the problems related to windows-1252, please refer to David Wheeler's article “Curling Quotes in HTML, SGML, and XML”, which also mentions (in a slightly different light) the two points above.

XML allows both decimal and hexadecimal numeric character references, so, just as shown in Fig. 2, you can use either one in SVG. I prefer and recommend hexadecimal references. Some respectable sources advocate the exclusive use of decimal references to keep “maximum backwards compatibility with SGML” because, before XML, SGML only supported decimal references. In practice, however, one is more likely to use XML tools to process SVG, and most modern SGML tools have ways to enable hexadecimal references. More important is the argument that the Unicode standard and its literature refer to every character by its hexadecimal code, making hexadecimal references very convenient.
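Since much SVG is generated by custom software, the conversion to hexadecimal references is easy to automate. The following is a minimal Python sketch, not part of the original examples; the function name to_hex_references is made up for illustration, and it assumes the input is text content rather than markup that must stay untouched.

```python
def to_hex_references(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )


# Curly quotes around a word, written with escapes so the source stays ASCII.
print(to_hex_references("\u201cFabrication\u201d"))  # &#x201C;Fabrication&#x201D;
```

The output is plain ASCII, so it can be searched with grep from any terminal regardless of the input methods available.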

Commonly Bungled Characters and their Correct Codes

Now that we know the how and why of inserting special characters in SVG, let's go back to typography and some variations of the bungled curly quotes syndrome, including single quotes, hyphens, and ellipses.

Character(s): Single Quotes
Common Error: Using the ASCII grave accent (U+0060) and a “corresponding” acute accent (U+00B4) is a common error, which looks about as bad as using two apostrophes (U+0027). The correct single quote marks are U+2018 and U+2019 (except when writing code that uses apostrophes).
Examples:
  • `this is a hack from typewriter days´
  • 'This is also wrong'
  • ‘Nice, no?,’ she asked

Character(s): Double Quotes
Common Error: Using the ASCII quotation mark (a.k.a. dumb quotes) when quoting text is a common mistake. Instead, use smart quotes, characters U+201C and U+201D, inserted in SVG documents via their corresponding hexadecimal numeric character references &#x201C; and &#x201D;.
Examples:
  • "this is a common typographic typo too"
  • print OUT "in code it is ok";
  • “This is not an exit,” Pat says

Character(s): Hyphen, n-dash, m-dash
Common Error: The character U+002D is the plain hyphen on your keyboard. Its typographic purpose is to break words at the end of a line (to hyphenate); however, the hyphen is commonly abused to indicate ranges or to break the flow of a sentence. The correct characters for such purposes are, respectively, the n-dash (U+2013) and the m-dash (U+2014). The hyphen is the shortest of the three characters; the n-dash is longer and commonly about as wide as the letter “n”. The m-dash is the longest of the three and should not be replaced by two hyphens, as I'm sure you've seen done before.
Examples:
  • Using an n-dash in March 3–8 is a subtle but elegant improvement over March 3-8
  • this is wrong -- and ugly --
  • Boggart&#x2014;or at least his presence&#x2014;will remain. (m-dashes written as character references)

Character(s): Ellipses
Common Error: Although many people are used to creating faux ellipses with three dots (something that some packages like MS Word automatically correct), the horizontal ellipsis has its own character, U+2026, and we must include it explicitly using &#8230;.
Examples:
  • Isn't that special…
  • This isn't...

The SVG graphic in Figure 3 and its associated code illustrate the points above.

<?xml version="1.0"?>

<svg xmlns="http://www.w3.org/2000/svg" height="400" width="400" 

        xmlns:xlink="http://www.w3.org/1999/xlink">



<image xlink:href="triceratops.png"

        width="303" height="216" x="1" y="1"/>



<text x="55" y="45" style="font-family:Arial; font-size: 24pt;
fill:#F8431C;"> sands of time&#8230; </text>
<text x="2" y="60" style="font-family:Times New Roman;
font-size:14pt; fill:#F8431C;"> &#8220;Not an experience&#x2014;a revelation&#x201D; </text>
<text x="125" y="185" style="font-family:Times New Roman;
font-size: 14pt; fill:#F8431C;"> Stefan George Institute </text>
<text x="210" y="205" style="font-family:Times New Roman;
font-size: 14pt; fill:#F8431C;"> June 10&#x2013;24 </text>
<image height="16" width="16" y="192" x="99"
xlink:href="triceratops.png"/>
</svg>

Fig 3. triceratops.svg

Before moving on, a word of caution about smart quotes: always use curly quotes except when showing code. String literals in programming languages, attributes in XML, and other such technical code are only correctly presented in dumb quotes, the way they would compile or parse. Using curly quotes to show code is not only incorrect but looks cluelessly affected, roughly similar to eating a Snickers bar with fork and knife.
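When the SVG text comes out of your own software, the substitutions discussed in this section can be applied before the text is written. The sketch below is a deliberately naive Python illustration under stated assumptions: the function name smarten is hypothetical, the rules only cover the cases listed above, and it must never be run over code samples, which keep their dumb quotes.

```python
import re


def smarten(text: str) -> str:
    """Swap common keyboard stand-ins for their typographic characters."""
    text = text.replace("...", "\u2026")                   # three dots -> ellipsis
    text = text.replace("--", "\u2014")                    # double hyphen -> m-dash
    text = re.sub(r"(?<=\d)-(?=\d)", "\u2013", text)       # hyphen between digits -> n-dash
    text = re.sub(r'"([^"]*)"', "\u201c\\1\u201d", text)   # paired straight double quotes
    text = re.sub(r"(?<=\w)'(?=\w)", "\u2019", text)       # apostrophe inside a word
    return text


print(smarten('"This is not an exit," Pat says'))
print(smarten("June 10-24"))
```

The digit rule is the crudest of the five: it would also rewrite phone numbers and ISO dates, so a real tool would need more context than this sketch has.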

Fonts

So far, we have avoided the problem of character availability in specific typefaces by using typefaces such as Arial, which are available on most computers. Now we need to see how to embed fonts if we are to create specific work that can be reliably displayed in all SVG-compliant viewers. For this purpose, SVG provides a font element that describes “SVG fonts”, which are collections of font outlines described in SVG paths.

Our three goals regarding fonts are to see how an SVG font can be embedded, to note a couple of important extra sub-elements such as missing-glyph, and finally to see how to convert True Type Fonts into SVG fonts.

Embedding an SVG font

An SVG font is embedded in a document via the font tag. Inside the font tag, there is information about the name of the font, an ID we can use to reference it, and the definition of each glyph.

The following Listing shows how to embed the font “Space Toaster” by Chank Diesel (originally designed for the Cartoon Network's “Space Ghost Coast to Coast”). Note how we only need to embed the glyphs for the characters we are going to use.


<?xml version="1.0"?> 

<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">

<defs>

<font id="SpaceToaster" horiz-adv-x="730">

  <font-face

    font-family="Space Toaster"

    units-per-em="2048"

    panose-1="2 0 5 9 0 0 0 0 0 5"

    ascent="1835"

    descent="-547"

    alphabetic="0"/>



<glyph unicode="!" glyph-name="exclam" horiz-adv-x="420" d="M408 1532Q367
1170 203 520L147 512Q174 860 174 1251Q174 1398 168 1497L408 1532ZM184
434Q217 434 234 408T252 340Q252 266 203 196T104 125Q70 125 50 160T29
238Q29 309 80 371T184 434Z"/>
<glyph unicode="W" glyph-name="W" horiz-adv-x="1069" d="M1040 1446L870
-152L487 604L279 -111L-4 852L147 1020L328 252L487 786L850 285L954
1409L1040 1446Z"/>
<glyph unicode="o" glyph-name="o" horiz-adv-x="715" d="M344 555Q344 583
392 645T459 720T504 738T555 745Q626 745 655 682T684 526Q684 380 631 254T445
40T221 -47Q166 -47 101 32T35 252Q35 345 71 479T199 706T442 831L389 764Q259
709 181 593T102 287Q102 103 199 92L223 90Q353 90 448 221T543 483Q543 533
529 571T499 629T467 649Q442 649 382 590Q355 563 344 555Z"/>
<missing-glyph horiz-adv-x="2048" d="M256 0V1536H1792V0H256ZM384
128H1664V1408H384V128Z"/>
</font>
</defs>
</svg>

Missing Glyphs

As you can see in the code above, each glyph is defined separately, using the glyph element. Now when we use the font in any text element, the correct glyphs will be shown, as demonstrated in the graphic below (the glyph definitions have been omitted from the listing to avoid repetition). This selective inclusion of characters allows us to keep the file as small as possible. On the other hand, if we use a character that is not defined in the SVG font, the figure defined by the missing-glyph element will appear:


<?xml version="1.0"?> 

<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">

<!-- font definition omitted -->



<g style="font-family: Space Toaster; 

             font-size:30pt; fill:black; 

             filter:url(#shadow)"> 

        <text x="10" y="40">  WoW!  </text> 

</g>



<g style="font-family: Space Toaster; 

             font-size:20pt; fill:black;"> 

        <text x="10" y="90">  Whoa!  </text> 

</g>



</svg>


Fig 4. wow.svg
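Because only some glyphs are embedded, it can be useful to check a string against the font before publishing. Here is a rough Python sketch of such a check; the function name missing_characters is hypothetical, the path data in the sample font is a placeholder, and the check ignores whitespace and multi-character glyph entries for simplicity.

```python
import xml.etree.ElementTree as ET


def missing_characters(svg_font_source: str, text: str) -> list:
    """Return the characters of text that have no <glyph> in the SVG font."""
    root = ET.fromstring(svg_font_source)
    defined = set()
    for element in root.iter():
        # Compare local names so the check works with or without a namespace.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "glyph" and element.get("unicode"):
            defined.add(element.get("unicode"))
    return sorted({ch for ch in text if not ch.isspace() and ch not in defined})


# Same three glyphs as the Space Toaster listing, with placeholder outlines.
FONT = """<svg><defs><font id="SpaceToaster" horiz-adv-x="730">
  <glyph unicode="!" d="M0 0"/>
  <glyph unicode="W" d="M0 0"/>
  <glyph unicode="o" d="M0 0"/>
  <missing-glyph horiz-adv-x="2048" d="M0 0"/>
</font></defs></svg>"""

print(missing_characters(FONT, "WoW!"))   # []
print(missing_characters(FONT, "Whoa!"))  # ['a', 'h']
```

The second call reports exactly the two characters that fall back to the missing-glyph figure in wow.svg.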

Note that there is more metadata about the font encoded as attributes in the font element. For details about these attributes, refer to the SVG specification, in particular section 20 (Fonts); normally, however, you don't need to concern yourself with these values, because they are generated by the tools we use to create the SVG font from True Type.

Generating SVG fonts from True Type

Currently the best known tool for the generation of SVG fonts is Apache Project's Batik. Batik has a very simple command-line utility that allows you to create SVG Fonts from a True Type font.

Batik can be downloaded at no cost. Using its ttf2svg utility is simple. The following invocation converts a font called “Hollywood Hills” to an SVG font with the id HollywoodHills, saves it to the specified file, and includes a testcard, that is, text elements showing each converted glyph:


 java -jar "c:\lib\batik-1.5.1\batik-ttf2svg.jar" "c:\windows\fonts\HOLLH__.ttf" -id HollywoodHills -o HollywoodHills.svg -testcard
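If you convert several fonts, the invocation above can be assembled by a small script. The Python sketch below only builds the argument list, using the same options shown above (-id, -o, -testcard); the helper name ttf2svg_command is an assumption for illustration, and the paths are the ones from the example.

```python
def ttf2svg_command(jar_path, ttf_path, font_id, output_path, testcard=True):
    """Build the argument list for Batik's ttf2svg utility."""
    command = ["java", "-jar", jar_path, ttf_path, "-id", font_id, "-o", output_path]
    if testcard:
        command.append("-testcard")
    return command


command = ttf2svg_command(
    r"c:\lib\batik-1.5.1\batik-ttf2svg.jar",
    r"c:\windows\fonts\HOLLH__.ttf",
    "HollywoodHills",
    "HollywoodHills.svg",
)
print(command)
```

Passing the list to subprocess.run(command, check=True) would run the conversion, provided Java and the Batik jar are installed at those locations.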

A final word of advice: although it's very easy to create SVG fonts from True Type, you must be careful to check that you have permission to do so. Most great fonts are not free.

Non-Latin Scripts

So far we've talked about characters outside the Latin alphabet as an exception, for which we have recommended hexadecimal numeric character references. But what if your SVG document contains a large portion of text in some other script, like Cyrillic or Hebrew?

There are three main technical issues associated with type in the non-Latin SVG content scenario: appropriate fonts, encodings, and bidirectionality.

Appropriate Fonts

To include large portions of non-Latin text in SVG documents, we have two options: the first one, the legitimate one, is to use the correct characters and specify a font that has glyphs for them. The second option is to use Latin characters with a font that matches Latin characters to non-Latin glyphs (e.g. you write in your XML document an “X” and it shows the Cyrillic glyph “Ж”). The second option is a bad hack that you should avoid completely as it goes against the notions of search, select, and copy available through SVG.

Without the right characters, all the niceties of SVG text, such as being indexable by search engines and searchable within the browser, are lost. If one is to cheat with the original characters, one might as well use a JPG of the rasterized glyphs.

The good news, on the other hand, is that many great (and free) fonts have support for non-Latin scripts. Take for example the following SVG document, which uses the Cyrillic support of Arial to show Russian text:


<?xml version="1.0" encoding="utf-8"?>

<svg xmlns="http://www.w3.org/2000/svg" height="221" 

    width="237" xmlns:xlink="http://www.w3.org/1999/xlink">

<image height="221" width="237" y="0" x="0" xlink:href="lo.png"/>

  <text x="10" y="150"

           style="font-family: Arial; font-size:20pt;">

     И как они сумели  

   </text>

  <text x="70" y="190"

           style="font-family: Arial; font-size:30pt;fill:red">

        Лолите <tspan style="fill:black">?</tspan></text>

  <text x="10" y="170"

           style="font-family: Arial; font-size:20pt;">

   сделать фильм о </text>

</svg>

Rasterized version for users without Russian support
Fig 5. lolita.svg

Note how the SVG source above uses real Cyrillic characters to write the Russian slogan, which in turn is real, selectable text in the final display (that is, if your system has an Arial font that supports Cyrillic characters; otherwise you will likely see only a question mark). To see the nice side effects of using real text, try selecting the text and copying and pasting it into an application such as MS Word. Search engines also benefit from the fact that it is real text, which they can index and search. By the way, in case you are curious, the tagline reads “How did they ever make a film about Lolita?”

Encodings

Since we talk about including non-Latin characters directly in SVG we must talk of encodings. I'm confident you know XML documents can be written in a variety of encodings such as US-ASCII, UTF-8, and ISO-8859-1 and the encoding of your document must match the value specified in the XML declaration (e.g. <?xml version="1.0" encoding="utf-16"?>). The real question for our discussion is: “which encodings are supported by SVG implementations?”, or in other words “which encodings can I reliably use?”

The answer may depend on the viewer you expect to use, but for the sake of portability you should limit yourself to UTF-8, UTF-16, ISO-8859-1, and US-ASCII. These encodings are supported in versions 2.0 and 3.0 of Adobe's SVG viewer and in Batik, and can reasonably be expected in all future implementations.
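The rule that the declared encoding must match the actual bytes is easy to enforce when a program writes the file. The following Python sketch is one way to do it, under the assumption that the body contains no XML declaration of its own; the names serialize_svg and PORTABLE_ENCODINGS are made up for this illustration.

```python
PORTABLE_ENCODINGS = {"utf-8", "utf-16", "iso-8859-1", "us-ascii"}


def serialize_svg(body: str, encoding: str = "utf-8") -> bytes:
    """Prepend a matching XML declaration and encode the document."""
    if encoding.lower() not in PORTABLE_ENCODINGS:
        raise ValueError(f"{encoding!r} is outside the portable set")
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>\n'
    # Characters the target encoding cannot represent become numeric
    # character references instead of raising an error.
    return (declaration + body).encode(encoding, errors="xmlcharrefreplace")


print(serialize_svg("<text>\u0424</text>", "us-ascii"))
```

Note that Python's xmlcharrefreplace handler emits decimal references (&#1060; for the Cyrillic letter above), which are equivalent to the hexadecimal form recommended earlier.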

Bidirectionality

So far we've used scripts like Latin and Cyrillic which have a left to right direction. It is important to briefly discuss, however, typesetting of scripts written right to left like Arabic or Hebrew, and how to deal with them in mixed-language environments.

The short version of the story is that this is something well handled automatically and a bad idea to tinker with. More precisely, Unicode defines (in section 3.12, page 55 of the 3.0 print copy) a very precise way to deal with directionality which includes an implicit mechanism based on the characters used and the possibility to override it using the characters RLO (right to left override) and LRO (left to right override).

Unicode provisions for directionality are complex and work consistently well in most scenarios. If you should ever need to override the defaults (I don't recommend it even for aesthetic design purposes), the way to do it is via the direction and unicode-bidi properties, as shown in the graphic below:


<?xml version="1.0" encoding="utf-8"?>

<svg width="400px" height="300px" version="1.1"

 xmlns = 'http://www.w3.org/2000/svg'>  

 

  <text x="10" y="40"

           style="font-family: 'Arial'; font-size:20pt; ">



   <b> Arabic text: يونِكود </b>



 </text>

  <text x="10" y="70"

           style="direction:rtl; unicode-bidi:bidi-override; 

           font-family: 'Arial'; font-size:20pt; ">

   <b>Arabic text: يونِكود </b>

  </text>

  <text x="10" y="100"

           style="font-family: 'Arial'; font-size:20pt; ">



   <b>Arabic text: 

    <tspan style="direction:rtl; unicode-bidi:embed;">يونِكود

    </tspan></b>

</text>

</svg>

Rasterized version for users without Arabic support
Fig 6. arabic.svg

In the graphic above, the Arabic text is displayed correctly every time: in the first line, because the Unicode algorithm automatically lays out the Arabic characters correctly, from right to left; in the second line, because we force everything to run from right to left; and in the third line, because we redundantly wrap the Arabic text in a tspan that is embedded right to left.

The moral of the story is to not complicate matters when dealing with bidirectional text unless you really have a reason to override Unicode's solid provisions. To explicitly override the Unicode bidirectionality algorithm in SVG, specify your desired direction (ltr or rtl) and set unicode-bidi to bidi-override.
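The implicit mechanism mentioned above works from a directional class assigned to every character, and Python's standard unicodedata module exposes those classes. The sketch below is a simplified illustration, not the full Unicode algorithm; the function names are hypothetical, and the first-strong-character rule shown here ignores explicit directional formatting characters.

```python
import unicodedata

# Strong right-to-left classes: "R" (Hebrew-type letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """True if any character has a strong right-to-left class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def base_direction(text: str) -> str:
    """Direction of the first strong character, 'ltr' if there is none."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return "ltr"


mixed = "Arabic text: \u064a\u0648\u0646\u0650\u0643\u0648\u062f"
print(contains_rtl(mixed), base_direction(mixed))  # True ltr
```

A generator could use a check like contains_rtl to flag text that deserves a visual review, while still leaving the layout itself to the viewer.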

Other Special Characters

To wrap up this section, let's review two other important kinds of characters and their issues inside SVG: ligatures and the euro sign.

Ligatures

Ligatures are one of the favorite tools used by typographers to make text readable and beautiful. A ligature is a special character that mixes two or three characters like ‘f’ and ‘i’ to produce a more visually appealing combination, in this case ‘fi’.

Depending on the script used you will find different examples of ligatures, but let's assume you are working with Latin characters, in which case the most common and important ligatures are the f-ligatures. The f-ligatures are specified in Unicode as characters U+FB00 through U+FB04; the neighboring U+FB05 is the long s-t ligature. The following graphic illustrates their use:


<?xml version="1.0" encoding="utf-8"?>

<svg width="150px" height="200px" version="1.1"

 xmlns = 'http://www.w3.org/2000/svg'>  

<g style="font-family: 'Times new roman'; font-size:20pt;">

  <text x="10" y="20">f + f = &#xFB00;</text>



  <text x="10" y="50">f + i = &#xfb01;</text>



  <text x="10" y="80">f + l = &#xfb02;</text>



  <text x="10" y="110">f + f + i = &#xfb03;</text>



  <text x="10" y="140">f + f + l = &#xfb04;</text>



  <text x="10" y="170">&#x17F; + t = &#xfb05;</text>

</g>

</svg>


Fig 7. ligatures.svg

Unfortunately, as you can see in figure 7, support for ligature characters is limited even in very popular fonts. To correctly display all ligatures you will probably need to use the embedded fonts mechanism as described in the Fonts section.
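One side effect worth knowing when you use ligature characters: they are distinct code points, so a plain search for “office” will not match text written with U+FB03. The Python sketch below shows how a program might fold ligatures back to their component letters for searching, using Unicode compatibility decomposition; the function name expand_ligatures is an assumption for illustration.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Replace Latin presentation ligatures with their component letters."""
    return "".join(
        # NFKD maps a compatibility character to its decomposed letters.
        unicodedata.normalize("NFKD", ch) if "\uFB00" <= ch <= "\uFB06" else ch
        for ch in text
    )


for code_point in range(0xFB00, 0xFB05):
    ligature = chr(code_point)
    print(f"U+{code_point:04X}", unicodedata.name(ligature), expand_ligatures(ligature))
```

Keeping the ligatures in the displayed text and the expanded form in an index preserves both the typography and the searchability.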

The Euro Sign

The Euro sign was introduced to Unicode in 1998. Its code point is U+20AC, so its hexadecimal numeric character reference is &#x20AC; (which displays like this: €).

Avoid using the letter e as a replacement for the Euro Sign in your SVG. Avoid also using the letters EUR as a replacement for €. By using the tspan element you can include the character using a font that supports it, even if the main font you are using doesn't. This technique is illustrated in the graphic below:


<?xml version="1.0" standalone="no"?> 

<svg xmlns="http://www.w3.org/2000/svg" width="280" height="100">

<defs >

<font id="macswiss" horiz-adv-x="904" ><font-face

    font-family="E-font"

    units-per-em="2048"

    panose-1="2 11 6 4 2 2 2 2 2 4"

    ascent="1854"

    descent="-434"

    alphabetic="0" />

    <missing-glyph horiz-adv-x="1536" 

    d="M256 0V1280H1280V0H256ZM288 32H1248V1248H288V32Z" />



<glyph unicode="€" glyph-name="Euro" horiz-adv-x="1139" 

d="M790 1325Q622 1325 508 1244Q440 1196 385

 1108Q329 1017 319 935H1001L974 801H303Q302 780 

302 761Q302 684 303 669H947L919 535H324Q366 306 526 

210Q641 141 775 141Q962 141 1067 239V33Q942

-25 791 -25Q339 -25 180 351Q148 427 125 535H-28L0 669H105Q102

711 102 760Q102 780 103 801H-28L0 935H116Q178 1261 439 1403Q600 

1491 794 1491Q980 1491 1107 1410L1067 1224Q945 1325 790 1325Z" />



</font>

</defs>



<g style="font-family: FontWithNoEuro; font-size:18;fill:black"> 

<text x="20" y="60"> Only <tspan style="font-family:E-font">&#x20AC;</tspan>29.99!! 

 for a limited time</text> 



</g>

</svg>


Fig 8. euro.svg
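When prices are generated by a program, the tspan wrapping can be applied automatically. This is a small Python sketch of the idea; the function name wrap_euro is hypothetical, the default font name E-font is taken from the listing above, and the input is assumed to be text content that is already XML-escaped.

```python
def wrap_euro(text: str, euro_font: str = "E-font") -> str:
    """Wrap each Euro sign in a tspan that selects a font containing it."""
    span = f'<tspan style="font-family:{euro_font}">&#x20AC;</tspan>'
    return text.replace("\u20ac", span)


print(wrap_euro("Only \u20ac29.99!!"))
# Only <tspan style="font-family:E-font">&#x20AC;</tspan>29.99!!
```

The result is meant to be placed inside a text element whose own font lacks the Euro glyph, as in euro.svg.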

This concludes the second part of our exploration of SVG and typography. All the examples from this installment can be downloaded as a zip archive.

The final two articles of this series deal with techniques to make effects on both static and animated type. There is plenty more fun ahead so stay tuned and see your SVG grow.