What is a script?
The written word is pervasive and scripts are the basis on which we communicate in writing. With the publication of Unicode 14.0.0 on 14 September 2021, the standard now supports 159 scripts.
This is how Unicode defines a script: “A collection of letters and other written signs used to represent textual information in one or more writing systems. For example, Russian is written with a subset of the Cyrillic script; Ukrainian is written with a different subset. The Japanese writing system uses several scripts” (Glossary of Unicode terms). Within this framework, the different scripts in the world, historical and contemporary, present wide variations.
About two-thirds of the writing systems in the world today use alphabetical scripts.
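The point that one writing system can draw on several scripts can be sketched in Python. The standard library's unicodedata module does not expose the Unicode Script property, so this sketch assumes that the first word of a character's Unicode name is a good enough stand-in for its script. That is a rough approximation only: Han characters, for instance, are reported as "CJK". The function names rough_script and scripts_in are hypothetical, chosen for illustration.

```python
import unicodedata

def rough_script(ch):
    """Guess a script label from the first word of the character's Unicode name.

    This is a heuristic, not the real Unicode Script property.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def scripts_in(text):
    """List the distinct script labels of the letters in text, in order of appearance."""
    seen = []
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        label = rough_script(ch)
        if label not in seen:
            seen.append(label)
    return seen

# A Japanese phrase mixing Han ideographs, hiragana and katakana
print(scripts_in("日本語のテキスト"))  # ['CJK', 'HIRAGANA', 'KATAKANA']
```

For precise results, a library that implements the Script property itself would be the better choice; the heuristic here only illustrates the idea.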
Latin script
This is the script that you are likely to use most often: it is the one in which English and many European languages are written, and is the most widely used script today.
Scripts can be defined according to a number of characteristics, using Unicode or typographical terminology. Here are a number of these characteristics in the case of Latin script:
- Latin script is alphabetical
- it is bicameral, meaning that it has upper-case and lower-case characters, and is case-sensitive — so, we recognize brown to be an adjective and Brown to be a proper noun
- it is a left-to-right script
- it uses spaces as word-separators
- Latin script uses hyphenation
- Latin script uses what is termed a mid-baseline, with some characters having elements that descend below the baseline
- Latin script has what are termed its own native digits, or numerals
You can see in the example above that the character d is used in its upper-case form. The bounding boxes allow us to see that words are separated by spaces and that some characters, for example, d, h, k, b and f, have ascenders, or parts of the character that extend above what is termed the script’s x-height. Likewise, j, p, y and g have descenders. Note too that the question mark extends above x-height.
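Two of the characteristics listed for Latin script, case and native digits, can be checked programmatically. The following is a minimal sketch using Python's built-in string methods and the unicodedata module; the helper names has_case and native_digit_value are hypothetical.

```python
import unicodedata

def has_case(ch):
    """True when a character has distinct upper- and lower-case forms."""
    return ch.upper() != ch.lower()

def native_digit_value(ch):
    """Return the value of a decimal digit, or None for any other character."""
    return unicodedata.decimal(ch, None)

print(has_case("d"), "d".upper())               # True D
print("brown" == "Brown")                       # False: comparison is case-sensitive
print([native_digit_value(c) for c in "2021"])  # [2, 0, 2, 1]
```

Visual properties such as ascenders, descenders and x-height belong to fonts rather than to character data, so they cannot be inspected this way.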
Arabic script
Arabic script presents several more distinguishing features than Latin script. After Latin, it is the second most widely used script in the world.
- Arabic is a right-to-left mid-baseline script
- the script directly represents only consonants and long vowel sounds; in other words, it is an abjad
- short vowel sounds and other phonetic information are denoted by diacritics
- it is a cursive script; in other words, the characters “join up”
- the shape of cursive characters can be determined by the characters to which they are joined
- characters can also overlap
- unlike Latin script, it is not case-sensitive
- like Latin script, it has native digits
- spaces are used as word-separators
Note the Arabic numerals on the second row from the top of this lower case keyboard.
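Several of the Arabic features listed above are recorded in the Unicode character database and can be inspected from Python. This is a sketch using the standard unicodedata module; the sample characters (the letter beh, the diacritic fatha and the digit three) are chosen here purely for illustration. The compatibility "presentation form" code points are used only to show that one letter has several contextual shapes; in ordinary text the shaping is done by the rendering engine, not by choosing different code points.

```python
import unicodedata

beh = "\u0628"    # ARABIC LETTER BEH
fatha = "\u064E"  # ARABIC FATHA, a short-vowel diacritic
three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# Directionality: 'AL' is the bidirectional class for Arabic letters (right-to-left)
print(unicodedata.bidirectional(beh))

# Short vowels are nonspacing combining marks (general category 'Mn')
print(unicodedata.category(fatha), unicodedata.combining(fatha) > 0)

# No case distinction: upper- and lower-casing leave the letter unchanged
print(beh.upper() == beh.lower())

# Native digits carry the same numeric values as 0-9
print(unicodedata.decimal(three))

# The four contextual shapes of beh, as compatibility presentation forms
for cp in range(0xFE8F, 0xFE93):
    print(f"U+{cp:04X}", unicodedata.decomposition(chr(cp)))
```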
CJK scripts
The term CJK scripts refers to the Chinese, or Han, ideographs used in the writing systems of Chinese and Japanese, and to a more limited extent in Korean. Unicode supports more than eighty thousand Han characters.
Here is an example of a sentence using the Simplified Chinese script.
Can you identify the use of Traditional Chinese quotation marks here, as well as the European comma and full point?
- Han scripts are ideographic, with characters usually representing a spoken syllable
- for this reason, Han script is also referred to as a logosyllabary
- Japanese script features both syllabic and ideographic-syllabic text, with word-spacing being used with the former
- CJK scripts generally are left-to-right and can also be written vertically
- they are not case-sensitive
- Han does not use spaces as word-separators, though the justification of lines leads to adjustments in the placement of characters within their frames
- Korean, by contrast, does feature spacing between words
- both Han and Japanese script use a centred baseline, whereas in Korean a bottom baseline is used
A Han ideograph can be thought of as contained in a uniform square frame: here, the characters are displayed in visible bounding boxes to illustrate the absence of features like case and word separation. Note that punctuation marks imported from European scripts are full-width rather than half-width, and therefore do not require additional spacing.
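The distinction between full-width and half-width characters is recorded in Unicode as the East Asian Width property, which Python exposes through unicodedata.east_asian_width. The sketch below compares an ASCII comma with its full-width counterpart and with two CJK characters; the sample characters and the helper name width_class are choices made here for illustration.

```python
import unicodedata

def width_class(ch):
    """Return the East Asian Width class: 'F' fullwidth, 'W' wide, 'Na' narrow, and so on."""
    return unicodedata.east_asian_width(ch)

samples = {
    "ASCII comma": ",",
    "fullwidth comma": "\uFF0C",
    "Han ideograph": "\u4E2D",
    "ideographic full stop": "\u3002",
}

for label, ch in samples.items():
    print(f"{label}: {width_class(ch)} ({unicodedata.name(ch)})")

# Han characters have no case: upper- and lower-casing change nothing
print("\u4E2D".upper() == "\u4E2D".lower())  # True
```

Characters in classes 'F' and 'W' conventionally occupy a full square cell in East Asian typography, which is why the full-width comma needs no extra space after it.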
A historical script: Ogham
Unicode also extends to historical scripts that are no longer current, one example being the medieval script of Ogham, which was widely used for inscriptions in Ireland, and also in parts of Britain. Here is an example of an Ogham inscription.
- Ogham is an alphabetical script, with incisions corresponding to characters in the Latin alphabet
- it is a left-to-right script, with a mid-baseline
- many original Ogham inscriptions are vertical, reading from bottom to top, as in this example
- Ogham inscriptions did not make use of word-spacing
- Ogham forms a block in the Basic Multilingual Plane in Unicode
- the Noto Project includes a font for Ogham
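The Ogham block can be explored with a short Python sketch. It assumes the block range U+1680 to U+169F and relies on the standard unicodedata module; the names OGHAM_BLOCK and letters are hypothetical.

```python
import unicodedata

# The Ogham block, assumed here to span U+1680..U+169F
OGHAM_BLOCK = range(0x1680, 0x16A0)

# Collect the assigned characters whose names mark them as letters
letters = [
    chr(cp)
    for cp in OGHAM_BLOCK
    if unicodedata.name(chr(cp), "").startswith("OGHAM LETTER")
]

print(len(letters))
print(unicodedata.name(letters[0]))           # OGHAM LETTER BEITH
print(unicodedata.bidirectional(letters[0]))  # L, i.e. left-to-right

# Every code point in the block lies in the Basic Multilingual Plane
print(all(cp <= 0xFFFF for cp in OGHAM_BLOCK))  # True
```

Displaying these characters still requires a font that covers the block, such as the Noto font mentioned above.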