Getting Started: Basic Concepts

From Symgate Developer's Wiki

About Symgate

Symgate is the name for Widgit's suite of online technologies, deployed to bring symbol content to the web.

Its main component is the symboliser, which takes unsymbolised text (in plain-text or HTML format) and symbolises it. It does this using an intermediate format called CML. (See 'About CML')

The symboliser uses Widgit's "Smart Symbolising" technology, which applies linguistic techniques to determine which symbol is most appropriate when the language is ambiguous. However, there are many occasions where the best symbol cannot be automatically detected, and the user must make a decision. (See 'Understanding Ambiguity')

CML cannot be understood directly by the end user. It contains all of the information on the meaning of a document required to generate a page of text with symbols, but not the information on how that document should look. This is because users often have different requirements when viewing symbolised content, such as a different preferred symbol system. Therefore, CML must be rendered before it can be viewed. The symboliser provides several functions to assist with this process. (See 'Rendering CML')

As the symboliser is part of a live web service, it is accessible from several platforms including .NET and PHP. 'Using The Symboliser In Practice' contains links to the technical information you will need to start building your first Symgate-enabled application.

About CML

Concept Markup Language (CML) is an HTML-like, XML-based language designed to facilitate the transfer of symbolised information over the internet.

Its main function is to provide documents in which the text is mapped to a series of concepts, which can be rendered as symbols.

CML documents do not contain information on the actual symbols themselves. Instead, text is mapped to concepts, identified primarily by a concept code. The reason for this is that a concept can be rendered differently depending on what symbol system is being used, and CML is designed to provide maximum portability across languages and symbol systems.

CML does, however, support basic formatting information such as bold, italics, and size changes within a document, if required.

An example CML Document:

The following document could be created by the symboliser from the input text "The red clock.":

<cml>
 <body>
  <cp size="0" language="English_UK" >
   <cs>
    <cmap>
     <cc base="the" tags="Det+Def+SP" code="10900100010000000" pos="DET" />
     <text>The</text>
    </cmap>
    <cmap>
     <cc base="red" tags="Adj" code="40340010140000000" pos="ADJ" />
     <text>red</text>
    </cmap>
    <cmap>
     <cc base="clock" tags="Noun+Sg" code="30310700050000000" pos="NOUN" />
     <text>clock.</text>
    </cmap>
   </cs>
  </cp>
 </body>
</cml>

You can get more information on the elements in the above document by looking at the WSDL and CML Schema Reference Documentation, but here is a quick overview:

Every CML document starts with a top-level cml element, which must contain a body element. A CML body contains a list of paragraphs, identified by cp elements, which in turn contain a list of sentences (cs).

Sentences then contain a series of elements, describing the content. These are usually either cmap or altlist elements (although it is also possible to have certain formatting-related elements here, see here for a list of what's permitted).

The cmap element is a mapping from some text onto one or more concepts, which will later be rendered as symbols. The altlist element defines a list of possible linguistic alternatives for that section of the document (see Understanding Ambiguity).

Inside a cmap are zero or more cc elements describing the concepts, and a single text element containing the text that is associated with the concepts. For most uses it is not necessary to understand what the attributes of the cc element mean, as you will normally simply pass them back to the symboliser to generate a symbol.
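The structure described above can be walked with any XML parser. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It assumes the CML has no XML namespace, as in the example document. The function name list_mappings is hypothetical and is not part of the symboliser API.

```python
import xml.etree.ElementTree as ET

# The example document for "The red clock." from above.
CML = """<cml>
 <body>
  <cp size="0" language="English_UK">
   <cs>
    <cmap>
     <cc base="the" tags="Det+Def+SP" code="10900100010000000" pos="DET" />
     <text>The</text>
    </cmap>
    <cmap>
     <cc base="red" tags="Adj" code="40340010140000000" pos="ADJ" />
     <text>red</text>
    </cmap>
    <cmap>
     <cc base="clock" tags="Noun+Sg" code="30310700050000000" pos="NOUN" />
     <text>clock.</text>
    </cmap>
   </cs>
  </cp>
 </body>
</cml>"""


def list_mappings(cml_source):
    """Return (text, [concept codes]) for every cmap, in document order.

    Note: this visits every cmap in the tree, including any nested inside
    altlist/calt elements. The example document has none of those.
    """
    root = ET.fromstring(cml_source)
    mappings = []
    for cmap in root.iter("cmap"):
        text = cmap.findtext("text", default="")
        # A cmap may hold zero or more cc elements.
        codes = [cc.get("code") for cc in cmap.findall("cc")]
        mappings.append((text, codes))
    return mappings


mappings = list_mappings(CML)
for text, codes in mappings:
    print(text, codes)
```

Each tuple pairs a piece of text with the concept codes you would later pass back to the symboliser.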

Understanding Ambiguity

Written language can often mean more than one thing. For example, the word match can have several different meanings, including a football match, or something that you use to light fires. When we read just the written text, this is not a problem, as we are able to work out the meaning for ourselves. However, when symbolising, we have to ensure that the correct symbol is displayed, or the meaning is muddled. (The symbols for the two above versions of match are quite different.)

The symboliser solves this problem by using alternatives. Several alternatives are presented within an altlist element, and each individual alternative is defined by a calt element. The following example shows this in practice. It is an example sentence, symbolised from the phrase "The match.":

   <cs>
    <cmap>
     <cc base="the" tags="Det+Def+SP" code="10900100010000000" pos="DET" />
     <text>The</text>
    </cmap>
    <altlist>
     <calt priority="0" >
      <cmap>
       <cc base="match" tags="Noun+Sg" code="30900400470000000" pos="NOUN" />
       <text>match.</text>
      </cmap>
     </calt>
     <calt priority="1" >
      <cmap>
       <cc base="match" tags="Noun+Sg" code="30230020390000000" pos="NOUN" />
       <text>match.</text>
      </cmap>
     </calt>
     <calt priority="2" >
      <cmap>
       <cc base="match" tags="Verb+Trans+Pres+Non3sg" code="20230020390000000" pos="VPRES" />
       <text>match.</text>
      </cmap>
     </calt>
    </altlist>
   </cs>

Initially, a single cmap is presented for the word "The". This is because there is only one possible symbol for this word.

For the word "match", there are three alternatives. They are all presented within an altlist element. For each possible alternative, there is one calt element. A single calt from the altlist should be displayed when the document is rendered.

Widgit's "Smart Symbolising" technology attempts to suggest the most appropriate alternative for each list. calt elements are assigned a priority when they are generated. The calt with the lowest priority value (0 in the example above) is the one the symboliser thinks is most appropriate, and should be displayed by default.

If you wish to select a particular alternative to use for the document, as the result of a user selection, you may use the preferred attribute of the calt. By setting 'preferred="true"' for the calt in question, any renderer will know to display that particular calt (instead of the one with the lowest priority).

For example, if you wished to use the second symbol for 'match' from the above document, you would modify it like so:

    <altlist>
     <calt priority="0" >
      <cmap>
       <cc base="match" tags="Noun+Sg" code="30900400470000000" pos="NOUN" />
       <text>match.</text>
      </cmap>
     </calt>
     <calt priority="1" preferred="true" >
      <cmap>
       <cc base="match" tags="Noun+Sg" code="30230020390000000" pos="NOUN" />
       <text>match.</text>
      </cmap>
     </calt>
     <calt priority="2" >
      <cmap>
       <cc base="match" tags="Verb+Trans+Pres+Non3sg" code="20230020390000000" pos="VPRES" />
       <text>match.</text>
      </cmap>
     </calt>
    </altlist>
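The selection rule described above (a preferred calt wins, otherwise the lowest priority) could be sketched as follows. This is an illustration in Python, not the reference renderer's implementation. The function name select_calt is hypothetical, and treating a missing priority attribute as 0 is an assumption.

```python
import xml.etree.ElementTree as ET

# The modified altlist for "match." from above, with the second calt preferred.
ALTLIST = """<altlist>
 <calt priority="0">
  <cmap>
   <cc base="match" tags="Noun+Sg" code="30900400470000000" pos="NOUN" />
   <text>match.</text>
  </cmap>
 </calt>
 <calt priority="1" preferred="true">
  <cmap>
   <cc base="match" tags="Noun+Sg" code="30230020390000000" pos="NOUN" />
   <text>match.</text>
  </cmap>
 </calt>
 <calt priority="2">
  <cmap>
   <cc base="match" tags="Verb+Trans+Pres+Non3sg" code="20230020390000000" pos="VPRES" />
   <text>match.</text>
  </cmap>
 </calt>
</altlist>"""


def select_calt(altlist):
    """Pick the calt a renderer should display for this altlist.

    A calt marked preferred="true" wins; otherwise the calt with the
    lowest priority value is used. Raises ValueError if the altlist
    contains no calt elements.
    """
    calts = altlist.findall("calt")
    for calt in calts:
        if calt.get("preferred") == "true":
            return calt
    # Assumption: a missing priority attribute is treated as 0.
    return min(calts, key=lambda calt: int(calt.get("priority", "0")))


altlist = ET.fromstring(ALTLIST)
chosen_code = select_calt(altlist).find("cmap/cc").get("code")
print(chosen_code)  # code of the preferred (second) calt

# Without a user selection, the lowest priority value is used instead.
del altlist.findall("calt")[1].attrib["preferred"]
default_code = select_calt(altlist).find("cmap/cc").get("code")
print(default_code)  # code of the priority="0" calt
```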

It is important to note that one calt may contain more than one cmap, and the number of cmaps may differ between calts within the same altlist. For example, in the phrase "frying pan" we have a single symbol for "frying pan" itself, but the symboliser also provides an alternative containing the symbols for "frying" and "pan" separately, in case this is more appropriate.

The following example demonstrates this:

    <altlist>
     <calt priority="0" >
      <cmap>
       <cc base="frying pan" tags="Noun+Sg" code="30410010070000000" pos="NOUN" />
       <text>frying pan</text>
      </cmap>
     </calt>
     <calt priority="1" >
      <cmap>
       <cc base="fry" tags="VProg+Adj" code="20410100080000000" pos="ADJ" />
       <text>frying</text>
      </cmap>
      <cmap>
       <cc base="pan" tags="Noun+Sg" code="30410010360000000" pos="NOUN" />
       <text>pan</text>
      </cmap>
     </calt>
    </altlist>

Note that the first calt has one cmap, while the second one has two.
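Because the number of cmaps varies between calts, code that processes a sentence should treat the chosen calt as a sequence of cmaps rather than a single item. The sketch below, in Python, flattens a cs element into the list of cmaps to render. The function name resolve_sentence and its pick parameter are hypothetical, and wrapping the "frying pan" altlist in a cs element is an assumption made to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# The "frying pan" altlist from above, wrapped in a cs for illustration.
SENTENCE = """<cs>
 <altlist>
  <calt priority="0">
   <cmap>
    <cc base="frying pan" tags="Noun+Sg" code="30410010070000000" pos="NOUN" />
    <text>frying pan</text>
   </cmap>
  </calt>
  <calt priority="1">
   <cmap>
    <cc base="fry" tags="VProg+Adj" code="20410100080000000" pos="ADJ" />
    <text>frying</text>
   </cmap>
   <cmap>
    <cc base="pan" tags="Noun+Sg" code="30410010360000000" pos="NOUN" />
    <text>pan</text>
   </cmap>
  </calt>
 </altlist>
</cs>"""


def resolve_sentence(cs, pick):
    """Flatten a cs element into the list of cmap elements to render.

    pick is a callable that takes an altlist element and returns the
    calt to display. Other child elements (for example formatting
    elements) are ignored in this sketch.
    """
    resolved = []
    for child in cs:
        if child.tag == "cmap":
            resolved.append(child)
        elif child.tag == "altlist":
            # The chosen calt may contribute one cmap or several.
            resolved.extend(pick(child).findall("cmap"))
    return resolved


cs = ET.fromstring(SENTENCE)
first = [c.findtext("text") for c in resolve_sentence(cs, lambda a: a.findall("calt")[0])]
second = [c.findtext("text") for c in resolve_sentence(cs, lambda a: a.findall("calt")[1])]
print(first)
print(second)
```

Choosing the first calt yields one cmap, while choosing the second yields two.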

Rendering CML

Once you have used the symboliser to convert your text data into CML, you will want to display it to the user.

For web-based platforms, the symboliser contains a reference XHTML renderer, which takes a CML document and converts it to XHTML, suitable for displaying in a web browser, or embedding into a web page. Using this process is a simple matter of calling the cmlToXHTML symboliser method, and displaying the result. (Note that the necessary CSS file must be included in your pages.)

This mechanism is somewhat limited for more advanced applications, however. It also does not provide the user with any mechanism to select alternatives; it simply displays the one with the lowest priority (or the one which has its preferred attribute set).

We currently do not have a reference implementation for enabling the end user to select alternatives, so you will have to provide your own. (We are working on a browser-based CML editor which should provide this functionality, see the Development Roadmap for more information.) To do this, you will need to parse the CML yourself as it is being generated or edited, and display the available choices to the user.
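As a starting point for your own selection mechanism, the sketch below (in Python) lists the choices in an altlist and records the user's selection using the preferred attribute. Both function names are hypothetical, and the label format is only one possible way of presenting the alternatives. A real application would show symbols rather than part-of-speech labels.

```python
import xml.etree.ElementTree as ET

# The altlist for "match." before any user selection has been made.
ALTLIST = """<altlist>
 <calt priority="0">
  <cmap>
   <cc base="match" tags="Noun+Sg" code="30900400470000000" pos="NOUN" />
   <text>match.</text>
  </cmap>
 </calt>
 <calt priority="1">
  <cmap>
   <cc base="match" tags="Noun+Sg" code="30230020390000000" pos="NOUN" />
   <text>match.</text>
  </cmap>
 </calt>
 <calt priority="2">
  <cmap>
   <cc base="match" tags="Verb+Trans+Pres+Non3sg" code="20230020390000000" pos="VPRES" />
   <text>match.</text>
  </cmap>
 </calt>
</altlist>"""


def describe_choices(altlist):
    """Return one text label per calt, in document order."""
    labels = []
    for calt in altlist.findall("calt"):
        texts = [cmap.findtext("text", default="") for cmap in calt.findall("cmap")]
        pos = [cc.get("pos", "?") for cc in calt.iter("cc")]
        labels.append(f"{' '.join(texts)} [{', '.join(pos)}]")
    return labels


def set_preferred(altlist, index):
    """Mark the calt at the given index as preferred, clearing any earlier choice."""
    calts = altlist.findall("calt")
    for calt in calts:
        calt.attrib.pop("preferred", None)
    calts[index].set("preferred", "true")


altlist = ET.fromstring(ALTLIST)
for number, label in enumerate(describe_choices(altlist)):
    print(number, label)

# Suppose the user picks the third alternative (index 2).
set_preferred(altlist, 2)
print(ET.tostring(altlist, encoding="unicode"))
```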

To display these choices, you will need to render the individual cc elements in your CML documents as symbols. Calling the conceptToSymbol symboliser operation will supply you with the URLs for the graphic files necessary to display the symbol.

Symbols sometimes have qualifiers, which are additional symbols displayed next to the main graphic to add meaning. For example, there are plural and past-tense qualifiers. It is important to render these in the correct place, so that the symbol content is displayed as the user expects.

To best understand how you should render symbol qualifiers, please look at the How To Render Symbols section for a more detailed explanation.

Using The Symboliser In Practice

The Reference Documentation contains detailed information on the operations and types available as part of the symboliser API. However, this is quite a complex document and you may find the following starting points useful:

The main symboliser operations you will be interested in are textToCML (or htmlToCML), cmlToXHTML, and conceptToSymbol. Getting Started With Visual Studio gives some simple tutorials for using these with Microsoft .NET based languages. Getting Started With PHP shows similar examples for use with PHP.

Alternatively, if you want to dive straight in, you might want to look at the Example Clients for a quick introduction.
