In the topics we covered so far, we hardly ever talked about primitive types in Idris. They were around and we used them in some computations, but I never really explained how they work and where they come from, nor did I show in detail what we can and can't do with them.
module Tutorial.Prim
import Data.Bits
import Data.String
%default total
According to Wikipedia, a compiler is "a computer program that translates computer code written in one programming language (the source language) into another language (the target language)". The Idris compiler is exactly that: A program translating programs written in Idris into programs written in Chez Scheme. This Scheme code is then parsed and interpreted by a Chez Scheme interpreter, which must be installed on the computers we use to run compiled Idris programs.
But that's only part of the story. Idris 2 was designed from the beginning to support different code generators (so-called backends), which allows us to write Idris code to target different platforms, and your Idris installation comes with several additional backends available. You can specify the backend to use with the --cg command line argument (cg stands for code generator). For instance:
idris2 --cg racket
Here is a non-comprehensive list of the backends available with a standard Idris installation (the name to be used in the command line argument is given in parentheses):
- Racket Scheme (racket): This is a different flavour of the Scheme programming language, which can be useful when Chez Scheme is not available on your operating system.
- Node.js (node): This converts an Idris program to JavaScript.
- Browser (javascript): Another JavaScript backend, which allows you to write web applications running in the browser in Idris.
- RefC (refc): A backend compiling Idris to C code, which is then further compiled by a C compiler.
I plan to at least cover the JavaScript backends in some more detail in another part of this Idris guide, as I use them pretty often myself.
There are also several external backends not officially supported by the Idris project, amongst which are backends for compiling Idris code to Java and Python. You can find a list of external backends on the Idris Wiki.
A primitive data type is a type that is built into the Idris compiler together with a set of primitive functions, which are used to perform calculations on the primitives. You will therefore not find a definition of a primitive type or function in the source code of the Prelude.
Here is again the list of primitive types in Idris:
- Signed, fixed precision integers:
  - Int8: Integer in the range [-128,127]
  - Int16: Integer in the range [-32768,32767]
  - Int32: Integer in the range [-2147483648,2147483647]
  - Int64: Integer in the range [-9223372036854775808,9223372036854775807]
- Unsigned, fixed precision integers:
  - Bits8: Integer in the range [0,255]
  - Bits16: Integer in the range [0,65535]
  - Bits32: Integer in the range [0,4294967295]
  - Bits64: Integer in the range [0,18446744073709551615]
- Integer: A signed, arbitrary precision integer.
- Double: A double precision (64 bit) floating point number.
- Char: A unicode character.
- String: A sequence of unicode characters.
- %World: A symbolic representation of the current world state. We learned about this when I showed you how IO is implemented. Most of the time, you will not handle values of this type in your own code.
- Int: This one is special. It is a fixed precision, signed integer, but the bit size is somewhat dependent on the backend and (maybe) platform we use. For instance, if you use the default Chez Scheme backend, Int is a 64 bit signed integer, while on the JavaScript backends it is a 32 bit signed integer for performance reasons. Therefore, Int comes with very few guarantees, and you should use one of the well specified integer types listed above whenever possible.
It can be instructive to learn where in the compiler's source code the primitive types and functions are defined. This source code can be found in folder src of the Idris project, and the primitive types are the constant constructors of data type Core.TT.Constant.
All calculations operating on primitives are based on two kinds of primitive functions: The ones built into the compiler (see below) and the ones defined by programmers via the foreign function interface (FFI), about which I'll talk in another chapter.
Built-in primitive functions are functions known to the compiler; their definitions cannot be found in the Prelude. They define the core functionality available for the primitive types. Typically, you do not invoke these directly (although it is perfectly fine to do so in most cases) but via functions and interfaces exported by the Prelude or the base library.
For instance, the primitive function for adding two eight bit unsigned integers is prim__add_Bits8. You can inspect its type and behavior at the REPL:
Tutorial.Prim> :t prim__add_Bits8
prim__add_Bits8 : Bits8 -> Bits8 -> Bits8
Tutorial.Prim> prim__add_Bits8 12 100
112
If you look at the source code implementing interface Num for Bits8, you will see that the plus operator just invokes prim__add_Bits8 internally. The same goes for most of the other functions in primitive interface implementations.
For instance, every primitive type with the exception of %World comes with primitive comparison functions. For Bits8, these are: prim__eq_Bits8, prim__gt_Bits8, prim__lt_Bits8, prim__gte_Bits8, and prim__lte_Bits8.
Note that these functions do not return a Bool (which is not a primitive type in Idris), but an Int. They are therefore not as safe or convenient to use as the corresponding operator implementations from interfaces Eq and Ord. On the other hand, they do not go via a conversion to Bool and might therefore perform slightly better in performance critical code (which you can only identify after some serious profiling).
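To recover a Bool from such a primitive comparison, all that's needed is a small conversion function. Here is a minimal sketch of what this might look like (intToBool is assumed to come from the Prelude, where it maps 0 to False and everything else to True; the name bits8Eq is just for illustration):
-- converts the Int returned by the primitive comparison to a Bool
bits8Eq : Bits8 -> Bits8 -> Bool
bits8Eq x y = intToBool (prim__eq_Bits8 x y)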
As with primitive types, the primitive functions are listed as constructors in a data type (Core.TT.PrimFn) in the compiler sources. We will look at most of these in the following sections.
Primitive functions and types are opaque to the compiler in most regards: They have to be defined and implemented by each backend individually; the compiler therefore knows nothing about the inner structure of a primitive value nor about the inner workings of primitive functions. For instance, in the following recursive function, we know that the argument in the recursive call must be converging towards the base case (unless there is a bug in the backend we use), but the compiler does not:
covering
replicateBits8' : Bits8 -> a -> List a
replicateBits8' 0 _ = []
replicateBits8' n v = v :: replicateBits8' (n - 1) v
In these cases, we must either be content with just a covering function, or we use assert_smaller to convince the totality checker (the preferred way):
replicateBits8 : Bits8 -> a -> List a
replicateBits8 0 _ = []
replicateBits8 n v = v :: replicateBits8 (assert_smaller n $ n - 1) v
I have shown you the risks of using assert_smaller before, so we must be extra careful to make sure that the new function argument is indeed smaller with respect to the base case.
While Idris knows nothing about the internal workings of primitives and related functions, most of these functions still reduce during evaluation when fed with values known at compile time. For instance, we can trivially prove that for Bits8 the following equation holds:
zeroBits8 : the Bits8 0 = 255 + 1
zeroBits8 = Refl
Having no clue about the internal structure of a primitive nor about the implementations of primitive functions, Idris can't help us prove any general properties of such functions and values. Here is an example to demonstrate this. Assume we'd like to wrap a list in a data type indexed by the list's length:
data LenList : (n : Nat) -> Type -> Type where
MkLenList : (as : List a) -> LenList (length as) a
When we concatenate two LenLists, the length indices should be added. That's how list concatenation affects the length of lists. We can safely teach Idris that this is true:
0 concatLen : (xs,ys : List a) -> length xs + length ys = length (xs ++ ys)
concatLen [] ys = Refl
concatLen (x :: xs) ys = cong S $ concatLen xs ys
With the above lemma, we can implement concatenation of LenList:
(++) : LenList m a -> LenList n a -> LenList (m + n) a
MkLenList xs ++ MkLenList ys =
rewrite concatLen xs ys in MkLenList (xs ++ ys)
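For instance, here is a small usage sketch (the name lenEx is just for illustration): concatenating a three-element and a two-element LenList yields a value whose length index reduces to 5 at compile time:
-- 3 + 2 reduces to 5, so this typechecks without further ado
lenEx : LenList 5 Bits8
lenEx = MkLenList [1,2,3] ++ MkLenList [4,5]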
The same is not possible for strings. There are applications where pairing a string with its length would be useful (for instance, if we wanted to make sure that strings are getting strictly shorter during parsing and will therefore eventually be wholly consumed), but Idris cannot help us get these things right. There is no way to implement and thus prove the following lemma in a safe way:
0 concatLenStr : (a,b : String) -> length a + length b = length (a ++ b)
In order to implement concatLenStr, we have to abandon all safety and use the ten ton wrecking ball of type coercion: believe_me. This primitive function allows us to freely coerce a value of any type into a value of any other type. Needless to say, this is only safe if we really know what we are doing:
concatLenStr a b = believe_me $ Refl {x = length a + length b}
The explicit assignment of variable x in {x = length a + length b} is necessary, because otherwise Idris will complain about an unsolved hole: It can't infer the type of parameter x in the Refl constructor. We could assign any value to x here, because we are passing the result to believe_me anyway, but I consider it to be good practice to assign one of the two sides of the equality to make our intention clear.
The higher the complexity of a primitive type, the riskier it is to assume even the most basic properties to hold for it. For instance, we might act under the delusion that floating point addition is associative:
0 doubleAddAssoc : (x,y,z : Double) -> x + (y + z) = (x + y) + z
doubleAddAssoc x y z = believe_me $ Refl {x = x + (y + z)}
Well, guess what: That's a lie. And lies lead us straight into the Void:
Tiny : Double
Tiny = 0.0000000000000001
One : Double
One = 1.0
wrong : (0 _ : 1.0000000000000002 = 1.0) -> Void
wrong Refl impossible
boom : Void
boom = wrong (doubleAddAssoc One Tiny Tiny)
Here's what happens in the code above: The call to doubleAddAssoc returns a proof that One + (Tiny + Tiny) is equal to (One + Tiny) + Tiny. But One + (Tiny + Tiny) equals 1.0000000000000002, while (One + Tiny) + Tiny equals 1.0. We can therefore pass our (wrong) proof to wrong, because it is of the correct type, and from this follows a proof of Void.
Module Data.String in base offers a rich set of functions for working with strings. All these are based on the following primitive operations built into the compiler:
- prim__strLength: Returns the length of a string.
- prim__strHead: Extracts the first character from a string.
- prim__strTail: Removes the first character from a string.
- prim__strCons: Prepends a character to a string.
- prim__strAppend: Appends two strings.
- prim__strIndex: Extracts a character at the given position from a string.
- prim__strSubstr: Extracts the substring between the given positions.
Needless to say, not all of these functions are total. Therefore, Idris must make sure that invalid calls do not reduce at compile time, as otherwise the compiler would crash. If, however, we force the evaluation of a partial primitive function by compiling and running the corresponding program, this program will crash with an error:
Tutorial.Prim> prim__strTail ""
prim__strTail ""
Tutorial.Prim> :exec putStrLn (prim__strTail "")
Exception in substring: 1 and 0 are not valid start/end indices for ""
Note how prim__strTail "" is not reduced at the REPL and how the same expression leads to a runtime exception if we compile and execute the program. Valid calls to prim__strTail are reduced just fine, however:
tailExample : prim__strTail "foo" = "oo"
tailExample = Refl
Two of the most important functions for working with strings are unpack and pack, which convert a string to a list of characters and vice versa. This allows us to conveniently implement many string operations by iterating or folding over the list of characters instead. This might not always be the most efficient thing to do, but unless you plan to handle very large amounts of text, they work and perform reasonably well.
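As a small sketch of this pattern (the name dropSpaces is just for illustration), here is a function removing all whitespace from a string by unpacking it, filtering the list of characters, and packing the result again:
-- keeps only the non-whitespace characters of a string
dropSpaces : String -> String
dropSpaces = pack . filter (not . isSpace) . unpack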
Idris allows us to include arbitrary string expressions in a string literal by wrapping them in curly braces, the first of which has to be escaped with a backslash. For instance:
interpEx1 : Bits64 -> Bits64 -> String
interpEx1 x y = "\{show x} + \{show y} = \{show $ x + y}"
This is a very convenient way to assemble complex strings
from values of different types.
In addition, there is interface Interpolation, which allows us to use values in interpolated strings without having to convert them to strings first:
data Element = H | He | C | N | O | F | Ne
Formula : Type
Formula = List (Element,Nat)
Interpolation Element where
interpolate H = "H"
interpolate He = "He"
interpolate C = "C"
interpolate N = "N"
interpolate O = "O"
interpolate F = "F"
interpolate Ne = "Ne"
Interpolation (Element,Nat) where
interpolate (_, 0) = ""
interpolate (x, 1) = "\{x}"
interpolate (x, k) = "\{x}\{show k}"
Interpolation Formula where
interpolate = foldMap interpolate
ethanol : String
ethanol = "The formulat of ethanol is: \{[(C,2),(H,6),(O, the Nat 1)]}"
In string literals, we have to escape certain characters like quotes, backslashes or new line characters. For instance:
escapeExample : String
escapeExample = "A quote: \". \nThis is on a new line.\nA backslash: \\"
Idris allows us to enter raw string literals, where there is no need to escape quotes and backslashes, by pre- and postfixing the wrapping quote characters with the same number of hash characters. For instance:
rawExample : String
rawExample = #"A quote: ". A backslash: \"#
rawExample2 : String
rawExample2 = ##"A quote: ". A backslash: \"##
With raw string literals, it is still possible to use string interpolation, but the opening curly brace has to be prefixed with a backslash and the same number of hashes as are being used for opening and closing the string literal:
rawInterpolExample : String
rawInterpolExample = ##"An interpolated "string": \##{rawExample}"##
Finally, Idris also allows us to conveniently write multiline strings. These can be pre- and postfixed with hashes if we want raw multiline string literals, and they also can be combined with string interpolation. Multiline literals are opened and closed with triple quote characters. Indenting the closing triple quotes allows us to indent the whole multiline literal. Whitespace used for indentation will not appear in the resulting string. For instance:
multiline1 : String
multiline1 = """
And I raise my head and stare
Into the eyes of a stranger
I've always known that the mirror never lies
People always turn away
From the eyes of a stranger
Afraid to see what hides behind the stare
"""
multiline2 : String
multiline2 = #"""
An example for a simple expression:
"foo" ++ "bar".
This is reduced to "\#{"foo" ++ "bar"}".
"""#
Make sure to look at the example strings at the REPL to see the effect of interpolation and raw string literals and compare it with the syntax we used.
In these exercises, you are supposed to implement a bunch of utility functions for consuming and converting strings. I don't give the expected types here, because you are supposed to come up with those yourself.
- Implement functions similar to map, filter, and mapMaybe for strings. The output type of these should always be a string.
- Implement functions similar to foldl and foldMap for strings.
- Implement a function similar to traverse for strings. The output type should be a wrapped string.
- Implement the bind operator for strings. The output type should again be a string.
As listed at the beginning of this chapter, Idris provides different fixed-precision signed and unsigned integer types as well as Integer, an arbitrary precision signed integer type. All of them come with the following primitive functions (given here for Bits8 as an example):
- prim__add_Bits8: Integer addition.
- prim__sub_Bits8: Integer subtraction.
- prim__mul_Bits8: Integer multiplication.
- prim__div_Bits8: Integer division.
- prim__mod_Bits8: Modulo function.
- prim__shl_Bits8: Bitwise left shift.
- prim__shr_Bits8: Bitwise right shift.
- prim__and_Bits8: Bitwise and.
- prim__or_Bits8: Bitwise or.
- prim__xor_Bits8: Bitwise xor.
Typically, you use the functions for addition and multiplication through the operators from interface Num, the function for subtraction through interface Neg, and the functions for division (div and mod) through interface Integral. The bitwise operations are available through interfaces Data.Bits.Bits and Data.Bits.FiniteBits.
For all integral types, the following laws are assumed to hold for numeric operations (x, y, and z are arbitrary values of the same primitive integral type):
- x + y = y + x: Addition is commutative.
- x + (y + z) = (x + y) + z: Addition is associative.
- x + 0 = x: Zero is the neutral element of addition.
- x - x = x + (-x) = 0: -x is the additive inverse of x.
- x * y = y * x: Multiplication is commutative.
- x * (y * z) = (x * y) * z: Multiplication is associative.
- x * 1 = x: One is the neutral element of multiplication.
- x * (y + z) = x * y + x * z: The distributive law holds.
- y * (x `div` y) + (x `mod` y) = x (for y /= 0).
Please note that the officially supported backends use the Euclidean modulus for calculating mod: For y /= 0, x `mod` y is always a non-negative value strictly smaller than abs y, so that the law given above does hold. For instance, under these rules both (-7) `mod` 3 and (-7) `mod` (-3) equal 2, since -7 = (-3) * 3 + 2. If x or y are negative numbers, this is different from what many other languages do, but for good reasons, as explained in the following article.
The unsigned fixed precision integer types (Bits8, Bits16, Bits32, and Bits64) come with implementations of all integral interfaces (Num, Neg, and Integral) and the two interfaces for bitwise operations (Bits and FiniteBits). All functions with the exception of div and mod are total. Overflows are handled by calculating the remainder modulo 2^bitsize. For instance, for Bits8, all operations calculate their results modulo 256:
Main> the Bits8 255 + 1
0
Main> the Bits8 255 + 255
254
Main> the Bits8 128 * 2 + 7
7
Main> the Bits8 12 - 13
255
Like the unsigned integer types, the signed fixed precision integer types (Int8, Int16, Int32, and Int64) come with implementations of all integral interfaces and the two interfaces for bitwise operations (Bits and FiniteBits). Overflows are handled by calculating the remainder modulo 2^bitsize and subtracting 2^bitsize if the result is still out of range. For instance, for Int8, all operations calculate their results modulo 256, subtracting 256 if the result is still out of bounds:
Main> the Int8 2 * 127
-2
Main> the Int8 3 * 127
125
Module Data.Bits exports interfaces for performing bitwise operations on integral types. I'm going to show a couple of examples on unsigned 8-bit numbers (Bits8) to explain the concept to readers new to bitwise arithmetic. Note that this is much easier to grasp for unsigned integer types than for the signed versions. Those have to include information about the sign of numbers in their bit pattern, and it is assumed that signed integers in Idris use a two's complement representation, the details of which I will not cover here.
An unsigned 8-bit binary number is represented internally as a sequence of eight bits (with values 0 or 1), each of which corresponds to a power of 2. For instance, the number 23 (= 16 + 4 + 2 + 1) is represented as 0001 0111:
23 in binary: 0 0 0 1 0 1 1 1
Bit number: 7 6 5 4 3 2 1 0
Decimal value: 128 64 32 16 8 4 2 1
We can use function testBit
to check if the bit at the given
position is set or not:
Tutorial.Prim> testBit (the Bits8 23) 0
True
Tutorial.Prim> testBit (the Bits8 23) 1
True
Tutorial.Prim> testBit (the Bits8 23) 3
False
Likewise, we can use functions setBit
and clearBit
to
set or unset a bit at a certain position:
Tutorial.Prim> setBit (the Bits8 23) 3
31
Tutorial.Prim> clearBit (the Bits8 23) 2
19
There are also operators (.&.)
(bitwise and) and (.|.)
(bitwise or) as well as function xor
(bitwise exclusive or)
for performing boolean operations on integral values.
For instance x .&. y
has exactly those bits set, which both x
and y
have set, while x .|. y
has all bits set that are either
set in x
or y
(or both), and x `xor` y
has those bits
set that are set in exactly one of the two values:
23 in binary: 0 0 0 1 0 1 1 1
11 in binary: 0 0 0 0 1 0 1 1
23 .&. 11 in binary: 0 0 0 0 0 0 1 1
23 .|. 11 in binary: 0 0 0 1 1 1 1 1
23 `xor` 11 in binary: 0 0 0 1 1 1 0 0
And here are the examples at the REPL:
Tutorial.Prim> the Bits8 23 .&. 11
3
Tutorial.Prim> the Bits8 23 .|. 11
31
Tutorial.Prim> the Bits8 23 `xor` 11
28
Finally, it is possible to shift all bits to the right or left by a certain number of steps by using functions shiftR and shiftL, respectively (overflowing bits will just be dropped). A left shift can therefore be viewed as a multiplication by a power of two, while a right shift can be seen as a division by a power of two:
22 in binary: 0 0 0 1 0 1 1 0
22 `shiftL` 2 in binary: 0 1 0 1 1 0 0 0
22 `shiftR` 1 in binary: 0 0 0 0 1 0 1 1
And at the REPL:
Tutorial.Prim> the Bits8 22 `shiftL` 2
88
Tutorial.Prim> the Bits8 22 `shiftR` 1
11
Bitwise operations are often used in specialized code or certain high-performance applications. As programmers, we have to know they exist and how they work.
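As a small illustration of combining these operations (the function names are just for this example), we can split a Bits8 value into its low and high nibble (a group of four bits) using a bitwise and and a right shift:
-- the lowest four bits: mask out everything else
lowNibble : Bits8 -> Bits8
lowNibble x = x .&. 15

-- the highest four bits: shift them down by four positions
highNibble : Bits8 -> Bits8
highNibble x = x `shiftR` 4
For the value 23 (0001 0111 in binary) from the examples above, lowNibble yields 7 and highNibble yields 1.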
So far, we always required an implementation of Num
in order to
be able to use integer literals for a given type. However,
it is actually only necessary to implement a function fromInteger
converting an Integer
to the type in question. As we will
see in the last section, such a function can even restrict
the values allowed as valid literals.
For instance, assume we'd like to define a data type for representing the charge of a chemical molecule. Such a value can be positive or negative and (theoretically) of almost arbitrary magnitude:
record Charge where
constructor MkCharge
value : Integer
It makes sense to be able to sum up charges, but not to multiply them. They should therefore have an implementation of Monoid but not of Num. Still, we'd like to have the convenience of integer literals when using constant charges at compile time. Here's how to do this:
fromInteger : Integer -> Charge
fromInteger = MkCharge
Semigroup Charge where
x <+> y = MkCharge $ x.value + y.value
Monoid Charge where
neutral = 0
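With these definitions in place, integer literals can be used wherever a Charge is expected. Here is a small usage sketch (the name totalCharge is just for illustration), summing up a few constant charges via the Monoid implementation:
-- 1 + 3 + 5 = 9, so this is equal to MkCharge 9
totalCharge : Charge
totalCharge = concat [1, 3, 5]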
In addition to the well known decimal literals, it is also possible to use integer literals in binary, octal, or hexadecimal representation. These have to be prefixed with a zero followed by a b, o, or x for binary, octal, and hexadecimal, respectively:
Tutorial.Prim> 0b1101
13
Tutorial.Prim> 0o773
507
Tutorial.Prim> 0xffa2
65442
- Define a wrapper record for integral values and implement Monoid so that (<+>) corresponds to (.&.). Hint: Have a look at the functions available from interface Bits to find a value suitable as the neutral element.
- Define a wrapper record for integral values and implement Monoid so that (<+>) corresponds to (.|.).
- Use bitwise operations to implement a function, which tests if a given value of type Bits64 is even or not.
- Convert a value of type Bits64 to a string in binary representation.
- Convert a value of type Bits64 to a string in hexadecimal representation. Hint: Use shiftR and (.&. 15) to access subsequent groups of four bits.
We often do not want to allow all values of a type in a certain context. For instance, String, as an arbitrary sequence of UTF-8 characters (several of which are not even printable), is too general most of the time. Therefore, it is usually advisable to rule out invalid values early on, by pairing a value with an erased proof of validity.
We have learned how we can write elegant predicates, with which we can prove our functions to be total, and from which we can - in the ideal case - derive other, related predicates. However, when we define predicates on primitives, they are to a certain degree doomed to live in isolation, unless we come up with a set of primitive axioms (implemented most likely using believe_me), with which we can manipulate our predicates.
String encoding is a difficult topic, so in many low level routines it makes sense to rule out most characters from the beginning. Assume, therefore, we'd like to make sure the strings we accept in our application only consist of ASCII characters:
isAsciiChar : Char -> Bool
isAsciiChar c = ord c <= 127
isAsciiString : String -> Bool
isAsciiString = all isAsciiChar . unpack
We can now refine a string value by pairing it with an erased proof of validity:
record Ascii where
constructor MkAscii
value : String
0 prf : isAsciiString value === True
It is now impossible to create a value of type Ascii, at runtime or at compile time, without first validating the wrapped string. With this, it is already pretty easy to safely wrap strings in a value of type Ascii at compile time:
hello : Ascii
hello = MkAscii "Hello World!" Refl
And yet, it would be much more convenient to still use string literals for this, without having to sacrifice safety. To do so, we can't use interface FromString, as its function fromString would force us to convert any string, even an invalid one. However, we actually don't need an implementation of FromString to support string literals, just like we didn't require an implementation of Num to support integer literals. What we really need is a function named fromString. Now, when string literals are desugared, they are converted to invocations of fromString with the given string value as its argument. For instance, literal "Hello" gets desugared to fromString "Hello". This happens before type checking and filling in of (auto) implicit values. It is therefore perfectly fine to define a custom fromString function with an erased auto implicit argument as a proof of validity:
fromString : (s : String) -> {auto 0 prf : isAsciiString s === True} -> Ascii
fromString s = MkAscii s prf
With this, we can use (valid) string literals for coming up with
values of type Ascii
directly:
hello2 : Ascii
hello2 = "Hello World!"
In order to create values of type Ascii at runtime from strings of an unknown source, we can use a refinement function returning some kind of failure type:
test : (b : Bool) -> Dec (b === True)
test True = Yes Refl
test False = No absurd
ascii : String -> Maybe Ascii
ascii x = case test (isAsciiString x) of
Yes prf => Just $ MkAscii x prf
No contra => Nothing
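For instance, a small (hypothetical) helper reading a line of user input and refining it at runtime could look like this:
-- reads a line from standard input and tries to refine it,
-- yielding Nothing for non-ASCII input
readAscii : IO (Maybe Ascii)
readAscii = ascii <$> getLine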
For many use cases, what we described above for ASCII strings can take us very far. However, one drawback of this approach is that we can't safely perform any computations with the proofs at hand.
For instance, we know it will be perfectly fine to concatenate two ASCII strings, but in order to convince Idris of this, we will have to use believe_me, because we will not be able to prove the following lemma otherwise:
0 allAppend : (f : Char -> Bool)
-> (s1,s2 : String)
-> (p1 : all f (unpack s1) === True)
-> (p2 : all f (unpack s2) === True)
-> all f (unpack (s1 ++ s2)) === True
allAppend f s1 s2 p1 p2 = believe_me $ Refl {x = True}
namespace Ascii
export
(++) : Ascii -> Ascii -> Ascii
MkAscii s1 p1 ++ MkAscii s2 p2 =
MkAscii (s1 ++ s2) (allAppend isAsciiChar s1 s2 p1 p2)
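With this, two validated strings can be concatenated without ever leaving the safety of type Ascii. A small usage sketch reusing the two literals defined above:
-- concatenation of two validated ASCII strings
helloHello : Ascii
helloHello = hello ++ hello2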
The same goes for all operations extracting a substring from a given string: We will have to implement corresponding rules using believe_me. Finding a reasonable set of axioms to conveniently deal with refined primitives can therefore be challenging at times, and whether such axioms are even required very much depends on the use case at hand.
Assume you write a simple web application for scientific discourse between registered users. To keep things simple, we only consider unformatted text input here. Users can write arbitrary text in a text field and upon hitting Enter, the message is displayed to all other registered users.
Assume now a user decides to enter the following text:
<script>alert("Hello World!")</script>
Well, it could have been (much) worse. Still, unless we take measures to prevent this from happening, this might embed a JavaScript program in our web page that we never intended to have there! What I described here is a well known security vulnerability called cross-site scripting. It allows users of web pages to enter malicious JavaScript code in text fields, which will then be included in the page's HTML structure and executed when it is being displayed to other users.
We want to make sure that this cannot happen on our own web page. In order to protect ourselves from this attack, we could for instance disallow certain characters like '<' or '>' completely (although this might not be enough!), but if our chat service is targeted at programmers, this would be overly restrictive. An alternative is to escape certain characters before rendering them on the page.
escape : String -> String
escape = concat . map esc . unpack
where esc : Char -> String
esc '<' = "<"
esc '>' = ">"
esc '"' = """
esc '&' = "&"
esc '\'' = "'"
esc c = singleton c
What we now want to do is store a string together with a proof that it was properly escaped. This is another form of existential quantification: "Here is a string, and there once existed another string, which we passed to escape and arrived at the string we have now". Here's how to encode this:
record Escaped where
constructor MkEscaped
value : String
0 origin : String
0 prf : escape origin === value
Whenever we now embed a string of unknown origin in our web page,
we can request a value of type Escaped
and have the very
strong guarantee that we are no longer vulnerable to cross-site
scripting attacks. Even better, it is also possible to safely
embed string literals known at compile time without the need
to escape them first:
namespace Escaped
export
fromString : (s : String) -> {auto 0 prf : escape s === s} -> Escaped
fromString s = MkEscaped s s prf
escaped : Escaped
escaped = "Hello World!"
In this massive set of exercises, you are going to build a small library for working with predicates on primitives. We want to keep the following goals in mind:
- We want to use the usual operations of propositional logic to combine predicates: negation, conjunction (logical and), and disjunction (logical or).
- All predicates should be erased at runtime. If we prove something about a primitive number, we want to make sure not to carry around a huge proof of validity.
- Calculations on predicates should make no appearance at runtime (with the exception of decide; see below).
- Recursive calculations on predicates should be tail recursive if they are used in implementations of decide. This might be tough to achieve. If you can't find a tail recursive solution for a given problem, use what feels most natural instead.
A note on efficiency: In order to be able to run computations on our predicates, we try to convert primitive values to algebraic data types as often and as soon as possible: Unsigned integers will be converted to Nat using cast, and strings will be converted to List Char using unpack. This allows us to work with proofs on Nat and List most of the time, and such proofs can be implemented without resorting to believe_me or other cheats. However, the one advantage of primitive types over algebraic data types is that they often perform much better. This is especially critical when comparing integral types with Nat: Operations on natural numbers often run with O(n) time complexity, where n is the size of one of the natural numbers involved, while with Bits64, for instance, many operations run in fast constant time (O(1)). Luckily, the Idris compiler optimizes many functions on natural numbers to use the corresponding Integer operations at runtime. This has the advantage that we can still use proper induction to prove stuff about natural numbers at compile time, while getting the benefit of fast integer operations at runtime. However, operations on Nat do run with O(n) time complexity at compile time. Proofs working on large natural numbers will therefore drastically slow down the compiler. A way out of this is discussed at the end of this section of exercises.
Enough talk, let's begin! To start with, you are given the following utilities:
-- Like `Dec` but with erased proofs. Constructors `Yes0`
-- and `No0` will be converted to constants `0` and `1` by
-- the compiler!
data Dec0 : (prop : Type) -> Type where
Yes0 : (0 prf : prop) -> Dec0 prop
No0 : (0 contra : prop -> Void) -> Dec0 prop
-- For interfaces with more than one parameter (`a` and `p`
-- in this example) sometimes one parameter can be determined
-- by knowing the other. For instance, if we know what `p` is,
-- we will most certainly also know what `a` is. We therefore
-- specify that proof search on `Decidable` should only be
-- based on `p` by listing `p` after a vertical bar: `| p`.
-- This is like specifying the search parameter(s) of
-- a data type with `[search p]` as was shown in the chapter
-- about predicates.
-- Specifying a single search parameter as shown here can
-- drastically help with type inference.
interface Decidable (0 a : Type) (0 p : a -> Type) | p where
decide : (v : a) -> Dec0 (p v)
-- We often have to pass `p` explicitly in order to help Idris with
-- type inference. In such cases, it is more convenient to use
-- `decideOn pred` instead of `decide {p = pred}`.
decideOn : (0 p : a -> Type) -> Decidable a p => (v : a) -> Dec0 (p v)
decideOn _ = decide
-- Some primitive predicates can only be reasonably implemented
-- using boolean functions. This utility helps with decidability
-- on such proofs.
test0 : (b : Bool) -> Dec0 (b === True)
test0 True = Yes0 Refl
test0 False = No0 absurd
We also want to run decidable computations at compile time. This
is often much more efficient than running a direct proof search on
an inductive type. We therefore come up with a predicate witnessing
that a Dec0
value is actually a Yes0
together with two
utility functions:
data IsYes0 : (d : Dec0 prop) -> Type where
ItIsYes0 : {0 prf : _} -> IsYes0 (Yes0 prf)
0 fromYes0 : (d : Dec0 prop) -> (0 prf : IsYes0 d) => prop
fromYes0 (Yes0 x) = x
fromYes0 (No0 contra) impossible
0 safeDecideOn : (0 p : a -> Type)
-> Decidable a p
=> (v : a)
-> (0 prf : IsYes0 (decideOn p v))
=> p v
safeDecideOn p v = fromYes0 $ decideOn p v
Finally, as we are planning to refine mostly primitives, we will at times require a sledgehammer to convince Idris that we know what we are doing:
-- only use this if you are sure that `decideOn p v`
-- will return a `Yes0`!
0 unsafeDecideOn : (0 p : a -> Type) -> Decidable a p => (v : a) -> p v
unsafeDecideOn p v = case decideOn p v of
Yes0 prf => prf
No0 _ =>
assert_total $ idris_crash "Unexpected refinement failure in `unsafeRefineOn`"
- We start with equality proofs. Implement Decidable for Equal v. Hint: Use DecEq from module Decidable.Equality as a constraint and make sure that v is available at runtime.
- We want to be able to negate a predicate:

  data Neg : (p : a -> Type) -> a -> Type where
    IsNot : {0 p : a -> Type} -> (contra : p v -> Void) -> Neg p v

  Implement Decidable for Neg p using a suitable constraint.
- We want to describe the conjunction of two predicates:

  data (&&) : (p,q : a -> Type) -> a -> Type where
    Both : {0 p,q : a -> Type} -> (prf1 : p v) -> (prf2 : q v) -> (&&) p q v

  Implement Decidable for (p && q) using suitable constraints.
- Come up with a data type called (||) for the disjunction (logical or) of two predicates and implement Decidable using suitable constraints.
- Prove De Morgan's laws by implementing the following propositions:

  negOr : Neg (p || q) v -> (Neg p && Neg q) v
  andNeg : (Neg p && Neg q) v -> Neg (p || q) v
  orNeg : (Neg p || Neg q) v -> Neg (p && q) v

  The last of De Morgan's implications is harder to type and prove, as we need a way to come up with values of type p v and q v and show that not both can exist. Here is a way to encode this (annotated with quantity 0 as we will need to access an erased contraposition):

  0 negAnd : Decidable a p => Decidable a q => Neg (p && q) v -> (Neg p || Neg q) v

  When you implement negAnd, remember that you can freely access erased (implicit) arguments, because negAnd itself can only be used in an erased context.
So far, we implemented the tools to algebraically describe and combine several predicates. It is now time to come up with some examples. As a first use case, we will focus on limiting the valid range of natural numbers. For this, we use the following data type:
-- Proof that m <= n
data (<=) : (m,n : Nat) -> Type where
ZLTE : 0 <= n
SLTE : m <= n -> S m <= S n
This is similar to Data.Nat.LTE, but I often find operator notation to be clearer. We can also define and use the following aliases:
(>=) : (m,n : Nat) -> Type
m >= n = n <= m
(<) : (m,n : Nat) -> Type
m < n = S m <= n
(>) : (m,n : Nat) -> Type
m > n = n < m
LessThan : (m,n : Nat) -> Type
LessThan m = (< m)
To : (m,n : Nat) -> Type
To m = (<= m)
GreaterThan : (m,n : Nat) -> Type
GreaterThan m = (> m)
From : (m,n : Nat) -> Type
From m = (>= m)
FromTo : (lower,upper : Nat) -> Nat -> Type
FromTo l u = From l && To u
Between : (lower,upper : Nat) -> Nat -> Type
Between l u = GreaterThan l && LessThan u
- Coming up with a value of type m <= n by pattern matching on m and n is highly inefficient for large values of m, as it will require m iterations to do so. However, while in an erased context, we don't need to hold a value of type m <= n. We only need to show that such a value follows from a more efficient computation. Such a computation is compare for natural numbers: Although this is implemented in the Prelude with a pattern match on its arguments, it is optimized by the compiler to a comparison of integers, which runs in constant time even for very large numbers. Since Prelude.(<=) for natural numbers is implemented in terms of compare, it runs just as efficiently.

  We therefore need to prove the following two lemmas (make sure not to confuse Prelude.(<=) with Prim.(<=) in these declarations):

  0 fromLTE : (n1,n2 : Nat) -> (n1 <= n2) === True -> n1 <= n2

  0 toLTE : (n1,n2 : Nat) -> n1 <= n2 -> (n1 <= n2) === True

  They come with a quantity of 0, because they are just as inefficient as the other computations we discussed above. We therefore want to make absolutely sure that they will never be used at runtime!

  Now, implement Decidable Nat (<= n), making use of test0, fromLTE, and toLTE. Likewise, implement Decidable Nat (m <=), because we require both kinds of predicates.

  Note: By now, you should be able to figure out yourself that n must be available at runtime and how to make sure that this is the case.
- Prove that (<=) is reflexive and transitive by declaring and implementing corresponding propositions. As we might require the proof of transitivity to chain several values of type (<=), it makes sense to also define a short operator alias for this.
- Prove that IsSucc n follows from n > 0 and vice versa.
- Declare and implement safe division and modulo functions for Bits64 by requesting an erased proof that the denominator is strictly positive when cast to a natural number. In case of the modulo function, return a refined value carrying an erased proof that the result is strictly smaller than the modulus:

  safeMod : (x,y : Bits64) -> (0 prf : cast y > 0) => Subset Bits64 (\v => cast v < cast y)
- We will use the predicates and utilities we defined so far to convert a value of type Bits64 to a string of digits in base b with 2 <= b && b <= 16. To do so, implement the following skeleton definitions:

  -- this will require some help from `assert_total`
  -- and `idris_crash`.
  digit : (v : Bits64) -> (0 prf : cast v < 16) => Char

  record Base where
    constructor MkBase
    value : Bits64
    0 prf : FromTo 2 16 (cast value)

  base : Bits64 -> Maybe Base

  namespace Base
    public export
    fromInteger : (v : Integer) -> {auto 0 _ : IsJust (base $ cast v)} -> Base

  Finally, implement digits, using safeDiv and safeMod in your implementation. This might be challenging, as you will have to manually transform some proofs to satisfy the type checker. You might also require assert_smaller in the recursive step.

  digits : Bits64 -> Base -> String
We will now turn our focus on strings. Two of the most obvious ways in which we can restrict the strings we accept are by limiting the set of characters and limiting their lengths. More advanced refinements might require strings to match a certain pattern or regular expression. In such cases, we might either go for a boolean check or use a custom data type representing the different parts of the pattern, but we will not cover these topics here.
- Implement the following aliases for useful predicates on characters. Hint: Use cast to convert characters to natural numbers, use (<=) and InRange to specify regions of characters, and use (||) to combine regions of characters.

  -- Characters <= 127
  IsAscii : Char -> Type

  -- Characters <= 255
  IsLatin : Char -> Type

  -- Characters in the interval ['A','Z']
  IsUpper : Char -> Type

  -- Characters in the interval ['a','z']
  IsLower : Char -> Type

  -- Lower or upper case characters
  IsAlpha : Char -> Type

  -- Characters in the range ['0','9']
  IsDigit : Char -> Type

  -- Digits or characters from the alphabet
  IsAlphaNum : Char -> Type

  -- Characters in the ranges [0,31] or [127,159]
  IsControl : Char -> Type

  -- An ASCII character that is not a control character
  IsPlainAscii : Char -> Type

  -- A latin character that is not a control character
  IsPlainLatin : Char -> Type
- The advantage of this more modular approach to predicates on primitives is that we can safely run calculations on our predicates and get the strong guarantees from the existing proofs on inductive types like Nat and List. Here are some examples of such calculations and conversions, all of which can be implemented without cheating:

  0 plainToAscii : IsPlainAscii c -> IsAscii c
  0 digitToAlphaNum : IsDigit c -> IsAlphaNum c
  0 alphaToAlphaNum : IsAlpha c -> IsAlphaNum c
  0 lowerToAlpha : IsLower c -> IsAlpha c
  0 upperToAlpha : IsUpper c -> IsAlpha c
  0 lowerToAlphaNum : IsLower c -> IsAlphaNum c
  0 upperToAlphaNum : IsUpper c -> IsAlphaNum c

  The following (asciiToLatin) is trickier. Remember that (<=) is transitive. However, in your invocation of the proof of transitivity, you will not be able to apply direct proof search using %search, because the search depth is too small. You could increase the search depth, but it is much more efficient to use safeDecideOn instead.

  0 asciiToLatin : IsAscii c -> IsLatin c
  0 plainAsciiToPlainLatin : IsPlainAscii c -> IsPlainLatin c

Before we turn our full attention to predicates on strings, we have to cover lists first, because we will often treat strings as lists of characters.

- Implement Decidable for Head:

  data Head : (p : a -> Type) -> List a -> Type where
    AtHead : {0 p : a -> Type} -> (0 prf : p v) -> Head p (v :: vs)
- Implement Decidable for Length:

  data Length : (p : Nat -> Type) -> List a -> Type where
    HasLength : {0 p : Nat -> Type} -> (0 prf : p (List.length vs)) -> Length p vs
- The following predicate is a proof that all values in a list of values fulfill the given predicate. We will use this to limit the valid set of characters in a string.

  data All : (p : a -> Type) -> (as : List a) -> Type where
    Nil  : All p []
    (::) : {0 p : a -> Type} -> (0 h : p v) -> (0 t : All p vs) -> All p (v :: vs)

  Implement Decidable for All.

  For a real challenge, try to make your implementation of decide tail recursive. This will be important for real world applications on the JavaScript backends, where we might want to refine strings of thousands of characters without overflowing the stack at runtime. In order to come up with a tail recursive implementation, you will need an additional data type AllSnoc witnessing that a predicate holds for all elements in a SnocList.
- It's time to come to an end here. An identifier in Idris is a sequence of alphanumeric characters, possibly separated by underscore characters (_). In addition, all identifiers must start with a letter. Given this specification, implement predicate IdentChars, from which we can define a new wrapper type for identifiers:

  0 IdentChars : List Char -> Type

  record Identifier where
    constructor MkIdentifier
    value : String
    0 prf : IdentChars (unpack value)

  Implement a factory method identifier for converting strings of unknown source at runtime:

  identifier : String -> Maybe Identifier

  In addition, implement fromString for Identifier and verify that the following is a valid identifier:

  testIdent : Identifier
  testIdent = "fooBar_123"
Final remarks: Proving stuff about the primitives can be challenging, both when deciding on what axioms to use and when trying to make things perform well at runtime and compile time. I'm experimenting with a library that deals with these issues. It is not yet finished, but you can have a look at it here.