Notes on Data Types in Haskell



Data Types

Let us try to answer the following questions in the most general and simple manner (it is possible to give very precise mathematical definitions, but we will not do that).
What is a Data Type?
Well, it is a set of elements and a set of operations over these elements. That's all.
What does it mean for a programming language to provide a Data Type?
Well, it means that the language provides you syntactical means to denote the elements of the set and to denote, or define, the operations on these elements.
C is a language providing you the Natural Numbers, since you have the possibility of denoting its elements (3, 665, 14, 2378, ...), denoting some operations (+,*,-,..) or define them (you can write a C function that computes n to the fifth, etc). Scheme has also heterogeneous lists (list of arbitrary elements) as Data Type

Sometimes, even if a language has not a particular Data Type, it can give you the possibility to define it by yourself.
How could one define a new Data Type?
Well, by describing how to denote its elements and by enabling you to denote and define the possible operations on them.
The Data Types one can define in a functional language are usually Inductive Data Types (you can call them also recursive Data Types or Algebraic Data Types).



Inductive data Types

An Inductive Data Type is a Data Type whose elements can be "built" starting from some basic elements by means of some "data constructors". The constructors can use elements of the inductive Data Type itself as arguments (that's why we call them "recursive" Data Types; in Haskell they are also called "algebraic" Data Types), but also elements of of other data types. As example, let us consider the Data Type of Binary Trees with Numerical Labels (Natural Numbers as labels).
It is easy to see that all its elements can be built out of the Emptytree using the constructor that, given two trees and a natural number, "builds" the tree whose root is labeled by the given number and whose subtrees are the given trees. Haskell is a programming language that really enables you to define a new Data Type not already present in the language. In fact it is possible to define the above described Data Type by means of this simple definition:

data NumBinTree = EmptyNBTree |
                  MkNBTree Integer NumBinTree NumBinTree

It is easy to see that what this definition gives you:

  • a way to denote the basic element, the empty tree (EmptyNBTree)
  • a way to denote a tree built out of a number and two trees
    For instance the expression (MkNBTree 4 expT1 expT2) denotes the tree whose root has label 4 and whose subtrees are expT1 and expT2, where expT1 and expT2 are expressions denoting two other trees.
    And what about the functions working on elements of NumBinTree ?
    When dealing with inductive Data Types, the most important functions are those that return the "components" an element of the Data Type is built from, and the predicates to check whether an element is a basic element or a "compound" element. The first sort of functions are usually called "selectors".
    For NumBinTree the only predicate is the one that checks whether a given tree is the emptytree, while the three selectors are those that, given a non empty tree, return, respectively, the label of the root, the left subtree and the right subtree.
    In Haskell, predicates and selectors of an Inductive Data Type are implicitly defined by the mechanism of pattern matching. Let us see with an example (the function for the inorder visit of a tree) how pattern matching works.

    inorder EmptyNBTree = [] 
    inorder (MkNBTree n t1 t2) = (inorder t1)++[n]++(inorder t2)
     

    where ++ is the predefined concatenation (append) operator in Haskell.

    You notice that, by the possibility of describing the "shape" of the input of the function, we can easily refer to the components of the input inside the body of the function.

    Exercise:write the functions for the other sorts of visits for a tree.

    Let us see a different definition in Haskell for NumBinTree

    data  NumBinTree = EmptyNBTree |
                       Leaf  Integer |  
                       MkNBTree Integer  NumBinTree  NumBinTree 
    

    If we took out "EmptyNBTree" from the above definition, we would get full binary trees (i.e. binary trees in which nodes have always zero or two children). By doing that we would not consider emptytrees to exist anymore, and hence the basic elements of the Data Types would be the leaves.

    It is important to stress that, once we have defined a data type (and its selectors if we are in Scheme), all the possible functions on the data type, can be defined by the user using only the selectors.

    Haskell is a language that really gives you the possibility of defining new Data Types.
    A language really enables you to define a new Data Type, when it enables you to be (in the small context of a programming language, of course) like God. God gave to the man the numbers. We can represent numbers, we can work with numbers, we can perform operations on numbers, but... does anybody know what a number is?
    The same happens with the above NumBinTree. We can denote trees by means of expressions, work with them but... do we know anything else about a tree besides that it is made out of emptytrees using constructors and numerical labels?
    No, we know nothing else. But what we know is enough. The human beings have always been able to use numbers by simply knowing that a number is either zero or the successor of a number, nothing more.

    The above discussion is a very rough and intuitive view of Data Types, but it is useful to keep it in mind when working with languages that do not really enable us to define new Data Types, like Scheme. In fact in Scheme, if we wish to introduce a Data Type like NumBinTree, we first need to decide what is their internal representation and afterwards we have to define the basic elements, the constructors, the selectors and the predicates in terms of such an internal representation. The internal representation of the Data Type will always remain visible to the user and such visibility is completely useless, if not harmful sometimes.
    In a sense, it is like we knew that the Numbers that God gave us are, in their essence, heaven-photon-sun-blasted-rays. Would that be of any use for us? And even if one found out that it is possible to get 2 by phito-conspiring the heaven-photon-sun-blasted-ray of the number 1, would it not be better to simply add 1 to 1?

    So Scheme does not allow to define new Data Types. What one can do is just to "represent" a Data Type with one of the predefined Data Types and work with such a representation.

    Being trees inductively defined, it is not difficult to understand that the best way to define functions on trees is by recursion.

    Let us write an expression denoting a tree.

     MkNBTree 3 EmptyNBTree (MkNBTree 4 EmptyNBTree EmptyNBTree)
    

    The tree denoted by this expression is

                  3 
                   \  
                    4 
    
    

    Remind that in functional languages any function, and hence also the constructors of a Data Type, returns a new object and never the input modified.
    That is why it is sometimes a bit strange to work in functional languages with data structures that are intrinsically imperative, like the stacks. In functional languages the operator push does not modify the stack given as input, but produces a new stack, similar to the input one, but for its top.
    That's why, intuitively, garbage collection plays a relevant role in the implementation of functional languages.

    Of course we could also implement modifiable objects in functional languages, but that can be achieved only by using the imperative features that sometimes functional languages have. In Scheme the keywords related to imperative operators are followed by a "!", like set!
    In Haskell it is possible to treat modifiable object in a pure functional style by means of the use of Monads.
    It is not advisable to use the imperative features provided by functional languages unless we have strong efficiency motivations or unless what we have to implement cannot be realized otherwise.

    Let us define the Data Type of stacks of numbers in Haskell.

    data NumStack =  EmptyNStack |  PushNS Integer  NumStack
     


    The data type of sets of natural numbers in Haskell:

    data SetNum  =  SetN [Integer]
     

    The only constructor is SetN that takes a list of elements of numbers and returns the set of the numbers contained in the list.

    Another choice can be

    data SetNum =  EmptySN |  InsertSN Integer SetNum  
    


    We define now the set of the lambda-terms in Haskell

    We first define the data type of variables. This could be done by means of an enumerated type

    data Var =  x |  y |  z |  w
    

    but so we would have only a finite number of variabless. It is better to define the data type of variables as follows, using the intuition of looking at the variables as x1, x2, x3, ...

    data Var = MkVar Int (there is no base element, only a constructor)

    We can now define the lambda-terms as

    data LambdaTerm =  MkLamdaTerm Var |  
                       Appl LambdaTerm LambdaTerm |
    		   LambdaAbs Var LambdaTerm 
    


    Of course the general idea behind any inductively defined data type is the same. Also the following one is the description of a set of rather strange objects that indeed are elements of an inductive data type:
    A sarchiapone can be either the manzanillo or the mininollo. Otherwise a sarchiapone can be the tirulero of a natural number and a sarchiapone. Another possibility for a sarchiapone is being the zumpappero of two sarchiapones and a numerical list. You see, the sarchiapones, whatever they be, form a recursively defined data type. We can define this data type in Haskell and we can define recursive functions working on sarchiapones.

    data Sarchiapone = Manzanillo |
                       Minollo |
                       Tirulero Integer 
    		            Sarchiapone |
                       Zumpappero Sarchiapone 
    		              Sarchiapone 
    			      [Integer]
    

    Parametric Data Types

    In Haskell we can define data types parametrized over all possible types. For instance we can define not only the data type of binary trees with numerical labels, but it is possible to define it for any possible type of label, by simply writing

    data LabBinTree a  = EmptyLBTree |
                         MkLBTree a 
    		              (LabBinTree a) 
    			      (LabBinTree a)
    

    where a is a type variable, representing the type of the label of the data type.

    A generic stack can be defined as

    data  Stack  a =  EmptyStack |  Push  a (Stack  a)
     

    In Scheme it is not possible to define such parametric data type, since in Scheme, you know, it is not even possible to really define a new data type. In fact the stack we define in Scheme can contain even heterogeneous elements. And this is something you cannot have in Haskell. Well, but we can have a certain amount of flexibility in Haskell, by a smart use of type variables. For instance, by defining the parametric data type below, (AlternatingBinTree Int Bool) denotes the type of labelled binary trees with integer labels on nodes at even depth and boolean labels at odd depth.

    data AlternBinTree a b  = EmptyABTree |
                                   MkABTree a 
    			                (AlternBinTree b a) 
    					(AlternBinTree b a)
    

    EXERCISES:

    • Define the Haskell data type of Simple Types of the typed Lambda-Calculus.
    • Implement in Haskell the Unification algorithm for Simple Types.
    • Implement in Haskell the Principal Pair algorithm.



    Types in Haskell

    One can wonder why we have used the word "Data Type" instead of simply "Type". Well the motivation is that a type is something different. A type is indeed something more than a set of element, it is a partial specification of a program. In Haskell we have particular types, providing a particular sort of partial specifications. These partial specifications describe mainly the domain and the codomain of a function. It is not a complete specification, but the use of these sort of types helps in preventing a whole bunch of errors in our programs.
    In Haskell, before writing a function, you can declare its type. This consists in declaring that the program you are about to write is meant to satisfy the partial specification described by the type. The interpreter will be able to check whether your function satisfies the partial specification you have written.

    Let see this through a few examples

    Before writing the factorial function in Haskell we can declare its type

    fact :: Integer -> Integer
    
    fact 0 = 1
    fact n = n * (fact (n - 1))
    

    The interpreter when it evaluates this type declaration and this function definition, checks whether the fact you have defined is really defined on integers and returns an integer. In fact, if we had written

    fact :: Integer -> Integer
    
    fact 0 = []
    fact n = n * (fact (n - 1))
    

    ([] denotes the empty list)
    we would have got an error message from the interpreter!

    Another possibility in Haskell is to first define a function and then ask the interpreter what is its type(that is its partial specification.)

    We can write

    fact 0 = 1
    fact n = n * (fact (n - 1))
    

    and, if no problem is encountered by the interpreter, we can write

    :type fact
    

    and the interpreter will return us

    fact :: Integer -> Integer
    

    (actually it returns something more general, but the essence is the above one)

    What happens when we define the following function

    id x = x
    

    and afterwards we ask the interpreter which is its partial specification?

    :type id
    

    Well, we get

    id :: a -> a
    

    In Haskell letters like a, b, c etc. are intended to represent type variables, and a type variable is a variable that can be replaced by any type. So, what does the interpreter wish to tell us by writing "id :: a -> a"?
    Well, it tells us that id is a function that can be applied to elements of any type T and as result will give us an element of type T.
    Types like a->a are called "polymorphic types"
    Notice how the partial specification a -> a for the function id is "the most general" among those satisfied by the function id.

    The notion of type and polymorphic type is not present just in Haskell, but in many other functional languages.
    That is why we introduced the notion of type in the in the context of the lambda-calculus: in order to study types and their properties independently from a particular programming language.

    Read the Notes about polymorphsm in Haskell [HPF] available in the reading material.

    Types in Scheme.

    Well, Scheme is an untyped functional language! Also for Data Types we know that it is impossible to really define Data Types in Scheme. So, no types in Scheme, sorry.
    This is a disadvantage because types help to prevent writing some errors in programs and enable us to partially check whether a function behaves how we intend it to behave.
    Is there any advantage in not having types?
    Well, you are more "free" in what you write. For instance, a stack defined in Scheme can have any possible sort of elements stacked in it. The absence of types gives you also more freedom in writing "tricky" programs. But.... be careful! errors are difficult to detect....

  •