Values, Equals, and Hashcodes

The issues of hash codes, equals methods, and what object identity really means is quite fundamental, and quite fraught. Certainly the Java solution of programmers defining equals and hash methods – according to some rather strict guidelines – has proven to have many problems in practice, and James has found it an annoying two lectures to teach for Java or C++.

William Cook (OOPSLA 2009) has argued that if objects are truly abstractions, equals and hash-code should be left up to each object to implement: there should be “no built-in notion of equality” because this would break the abstractions offered by objects. On the other hand, it’s not clear a language would be useful without some kind of equals for at least some kind of objects. Haskell offers another option: make an explicit interface (like the Eq type class) and only objects that provide equality should confirm to that interface. Hashcodes could be treated in the same way. While this would mean not every object had to participate in equality, we would still have to teach students about the necessarily difficult contracts of those methods. We could be more extreme: following Derek Rayside & Daniel Jackson and generate equality methods automatically based on explicit class invariants. But that seems a little far for a practical language today.

We propose that Grace will adopt a much more straightforward approach. Grace’s equals will be based on Henry Baker’s Egal predicate – structural equality of immutable state – thus making it similar to Clojure. Any mutable object (an object with at least one field) will only equal itself. Immutable objects (comprising only constants and methods, but no writeable fields) will be equal to other objects with the same structure (constants and methods) and the same values for all of those constants. Grace will provide just one equality predicate, written = (or !=) using this definition, and programmers will not be able to override it. Compatible hashcodes will also be provided.

Compared with Java or Smalltalk, this is hopefully much simpler, “does the right thing” for many common cases, and encourages immutable objects. The two main difficulties are first, that it cannot cope with objects whose abstract state may be equal but their “concrete” state is not. A linked list and an array list with identical contents will not be =, even if they are immutable. The second difficulty is that two mutable objects, two array lists, say, will never be = even if they have identical contents and are two instances of the same class. = in Grace is stable, if e1 = e2 is ever true, it will always be true. This cannot apply to mutable objects (unless e1 and e2 are actually the same object) . To compare or index using collections of objects, programmers will need use some kind of immutable objects: Grace will provide immutable tuple literals and methods to convert collections into tuples.

A key advantage of this definition of = is that immutable objects can act as pure values. Especially for (so-called) value objects — immutable objects containing only other immutable objects — Grace programs cannot distinguish between different instances with the same value, just as in most languages integers are compared by value, not by reference (you can’t tell the difference between the integer “1” even at different memory addresses). A Grace implementation can support value objects using whatever implementation is most efficient: either passing by reference always, by passing some types by value, or even by inlining fields into their containing objects, and doing a field-update if the containing object assigns in a new value.

In this way, Grace hopes to support a simple and stable definition of object equality, hashcodes, and value types – all without requiring Grace programmers to write any code.

Some references:

Comments

In most current programming languages, comments are treated as spaces; they separate lexemes, just as spaces do, but otherwise are ignored.
This is in spite of the fact that many people feel that comments are one of the most significant parts of a program. Correctly used, comments can make a program much more readable; badly used comments are an apology for bad code.

How did we get to a place where one of the most significant aspects of our languages are treated as—well—as insignificant?

In Fortran, a comments was any line with a “C” in column 6. This had the great advantage that the whole comment line could be thrown away without being processed, thus saving valuable core memory. Algol 60 followed a similar strategy, declaring that the sequence of symbols

; comment < any sequence not containing ; > ;

was equivalent to ; Both of these conventions have the problem that it is not clear if a comment refers to the statement proceeding it or the statement following it; this has to be inferred from the contents of the comment, which the computer is unlikely to understand. But they save memory and simplify the parser — obviously more important properties than clarity of intent!

So here’s an idea: suppose that we make the placement of comments syntactically meaningful? Might this help the next human reader to understand their intent?

If you think that I’m overestimating this difficulty, take a look at a paper by Michael Van De Vanter called “Preserving The Documentary Structure of Source Code” [1]. Van De Vanter concluded that in general, it’s just not possible to start with the “comments as spaces” convention, and do anything intelligent about inferring what the comment refers to.

With the advent of refactoring tools, which automatically re-arrange our code, it has become important that the computer does know which statement the comment refers to, for otherwise it can’t correctly position the comment in the refactored code. So, in Grace, comments are regarded as annotations that are attached to expressions and to statements. We note that Newspeak also attaches comments to known places in the program, so that they can be preserved during refactoring.

In an IDE, it’s a simple matter to attach a comment to a syntactic element: select that element, and use a context menu selection or a button to add a comment. We can imagine the comment appearing in a balloon, and popping up on mouse hover; we can also imagine some subtle shading or discrete tagging that will reveal the presence of latent comments. However, we also want Grace to have a representation in pure text, both as an interchange format and to permit editing in legacy text editors. So, we need some rules that will tell us exactly what syntactic element a comment is attached to when Grace code is turned into text. (I’m tempted to call this process ugly-printing, since it takes a Grace program that’s as pretty as an IDE can make it, and produces a plain-text version.)

After comments, layout is the other non-syntactically-meaningful element that is used in conventional programming languages. This implies that whatever comment convention we adopt, it should not require a particular layout. In other words, a “comment from here to the end of the line” convention won’t do, because it might require the addition of extra line breaks just to correctly terminate comments.

Thus, we are left with the need to invent a pair of comment delimiters, and a rule that tells the ugly-printer where to put the comments. The simplest rule that I can imagine is:

a comment is delimited by /* and */, and immediately follows the syntactic element to which it applies. When several nested syntactic elements end with the same character, the comment applies to the largest. Parentheses can be used to avoid this problem.

For example:

For this to work, it’s important that the heading of a method be a comment-able syntactic element For example,

Here a comment that describes the behavior of the method as a whole is attached to the method header. Note that requires clauses and types are other forms of annotation; these too must be unambiguously attached to a specific syntactic element if they are to reach their full potential.

/* and */ as comment delimiters have the advantage that they are really ugly. Whatever symbol is used to represent a comment in the textual representation of a program won’t be available as an operator symbol, so it makes sense to choose an ugly one.

For software engineering reasons, Grace will also support comments to end-of-line, probably by the // character familiar from C++ and Java.
Many (but not all) of the above examples also work with // comments:

Metadata

Postmodern programs need more than just human-readable comments – in fact, human-readable comments can be seen as as subset of wide universe of arbitrary metadata that can be attached to program elements. Experience with Java, C#, (and Scala) has shown the advantage of an extensible and flexible metadata facility in a programming language. Given that a key design goal of Grace is to support extensions via libraries, many library writers will need to interpret metadata attached to their client programs.

Grace will have a metadata facility based as a generalisation of its support for comments. We’re not yet sure how this will work.
Most likely it will follow C# in having multiple delimiters rather than Java’s @ symbol – say /: :/ and //: The problem is that might look too horrible for general use.

Perhaps permit more direct metadata in particular places in programs

  • wherever we’re writing a type
  • at the “top level??”

So e.g. rather than

or something, one could write

Other issues

For both comments and (especially) metadata, we need to consider the difference between:

  • annotating a method
  • annotating a return type of a method
  • annotating the receiver of the method.

The catch is that with our Pascal

syntax, how can be distinguish these cases?


References

[1] Michael L. Van de Vanter, “Preserving the Documentary Structure of Source Code in Language-Based Transformation Tools,” SCAM, pp.0133, First IEEE International Workshop on Source Code Analysis and Manipulation, 2001. http://research.sun.com/jackpot/COM.sun.mlvdv.doc.scam_nov01.paper_pdf.pdf

What’s in a Name: Variable and Constant bindings

As in most languages, Grace lets you bind names to values. In Grace’s case, the values are always objects. Some bindings are constant, that is, they don’t change until the names goes out of scope; others are variable, which means that the name can be re-bound to a new value. This re-binding is accomplished using assignment.

Constant Bindings

Constant bindings are used to bind method parameters to the arguments of the message that started the method execution. Grace supports constant bindings with the keyword const.

Const requires a name to define, and either or both an initialiser and a type. Const can be used to give a name to a literal, or to any other object.

If there is no initialiser, the identifier must be assigned (using =) before it is used.

The point of definitions like “const x : Rational” or const x — without an initialiser — is that in class definitions, they will be used to define per-instance constants that will be initialised when the object is constucted. For example, a point in Grace would declare two constants “x” and “y” — which would differ for each instance.

Uninitialised variables (of any type) are given a special “uninitialised” value; accessing this value is an error (caught either at run time or at compile time, depending on the cleverness of your implementor).

Variables

Naturally enough, Grace supports variable declarations via the “var” keyword.

Unlike constant name bindings, variables can be re-bound using assignment “:=”

Note that const uses = initialisation, while var uses := for both the initial and subsequent bindings.

As in Java, both const and var declarations may occur anywhere within a method or block: their scope is to the end of their defining block or method.

Fields

Within object or class constructors, fields are declared by the field keyword rather than var.

The idea here is that classes will contain methods defined by the method keyword, and fields defined by the field keyword. We are hoping that we can choose Grace’s keywords to guide the terminology used in discussing the language.

Const declarations an also appear in object and class constructors, to define per-instance constants.

Single Namespace

Grace has a single namespace for methods and variables. This means that methods should be able to override fields, and fields override methods.

A declaration of field or a constant named “x” creates a reader method called “x”. A declaration of a (writable) field creates a setter method called “set_x”. All code of the form “e1.x := e2” is executed by sending the “set_x” setter method to the object e1, passing e2 as an argument.

The upshot of this is, if a Grace program declares a pair of methods in an object:

clients (and subclasses) of that object are unable to distinguish this from a field declaration

This ensures that fields and constants are evaluated in Grace using the same message send evaluation rule as method sends.

Types

We haven’t mentioned types. That’s because we haven’t worked out the even the initial details – but that’s what we’ll look at soon.

Decisions

This note outlines our current design of Grace. We have a few questions or concerns about this design:

  1. Should Grace distinguish between var to declare temporary variables and field to declare instance fields – or if not, which one should we pick?
  2. Should Grace distinguish between := for mutable fields and variables, and = for constants?
  3. How should Grace name setter methods?  “set_x” or “s:=” or something else.

BNF


Operators are Messages Too

As well as named messages, Grace also allows operator symbols for binary messages. This allow us to use conventional symbols like + and -, and also allows the programmer to define her own operator symbols from single characters or multiple character sequences. Since the advent of Unicode, there are lots of characters to choose from.

Note that we are using the term binary message in the same sense as Smalltalk: a message with a receiver and one argument. The evaluation order is the obvious one: parenthesized sub-expressions first; then messages are sent left to right.

In the absence of parenthesis, it would be nice to leave the order of evaluation of a # b # d # e to the implementation: maybe the machine can do a # b and d # e in parallel? In that case, wouldn’t it be nice to allow it to do so? The problem is that this makes sense only for associative operators, and we have no way of knowing what # actually does. What about a – b – c? We can’t very well leave it up to the implementation to choose between (a – b) – c and a – (b – c)!

Grace’s message evaluation rules for operators also follow Smalltalk: apart from the special syntax, they follow exactly the same evaluation rules as any other message. We hope that this single evaluation rule, straightfoward syntax, and lack of implicit calls or coercions, will ensure that Grace programs that use operators will remain comprehensible.

1 + 2
1 + (2 * 3)
0 – 1
“Hello” ++ ” ” ++ “World”
bezerk !@#$%^&* istan

But this is where things get more tricky. As mentioned above, Grace operators are evaluated left-to-right. In our current design, unlike Smalltalk, (but like Self) it is a syntax error for two different operator symbols to appear in an expression without parenthesis to indicate order or evaluation — all Grace operators have the same precedence. The same operator symbol can be sent more than once without parenthesis.

1 + 2 + 3 // evaluates to 6
1 + (2 * 3) // evaluates to 7
(1 + 2) * 3 // evaluates to 9
1 + 2 * 3 // syntax error in Grace; would evaluate to 9 in Smalltalk

Named message sends bind more tightly than operator message sends: The following examples show first the Grace expressions as they would be written, followed by the parse

1 + 2.i                                     1 + (2.i)
(a * a) + (b * b).sqrt                 (a * a) + ((b *b).sqrt)
((a * a) + (b * b)).sqrt               ((a * a) + (b *b)).sqrt
a * a + b * b                           // syntax error
a + b + c                                 (a + b) + c
a – b – c                                   (a – b) – c

The One True Message Send has implications that carry over to operator messages also: an object can have only one method to respond to any given message, whether that message’s name is an operator symbol or a textual name. For example, Number can have only one definition for the “+” operator.

A second detail of this current design is that Grace has no unary operators. There can be negative numeric literals, like -2 or -345.34, but to negate an expression, programmers have to write either “e.negative” (using a named message) or “0 – e” using the binary “-” message.


We have more questions about the design for operators than some other parts of Grace’s design. The trade-offs seem more acute here than elsewhere for some reasons. This design has the key advantage that arbitrary operators can be defined, and the disadvantage that we do not support operator precedence.

Eric asks: whether it wouldn’t make more sense to require all operator sends to be parenthesized, rather than permitting multiple left associative calls. James thinks this worked OK in Self, and e.g. “a + b + c” isn’t generally a problem.

Kim says: the no-precedence design would be very confusing – we have to let people write normal arithmetic statements with the intuitive meanings.

James says: that there are advantages to requiring parentheses for novice programmers: they can be easily inserted by an IDE, many style guides recommend parens for complex expressions, and it’s worth it to permit arbitrary operators.

James thinks: if we do – after all – need operator precedence, the obvious design to fall back to is like Blue or C# – there are a fixed set of operators, with fixed precedence. The One True Message Send Rule will still apply; so operator methods would still be defined just like named methods, and operator message sends would behave just like named message sends.


We even have some draft BNF for expressions, ripped shamlessly from the Self manual in the first instance:

{ … } means zero or more occurrences of the symbols inside the braces.
[ … ] means zero or one occurrences of the symbols inside the braces.

expression ::= constant | named-send | operator-send | ‘(‘ expression ‘)’
constant ::= “self” | numberLiteral | stringLiteral | blockLiteral | objectLiteral
named-send ::= named-message | receiver “.” named-message | “super” “.” named-message
named-message ::= identifier arglist {identifier arglist}
arglist ::= “(” [{expression”,” } expression] “)”] | blockLiteral
operator-send ::= receiver operator-message | “super” operator-message
operator-message ::= operator expression
operator ::= op-char {op-char}
receiver ::= expression
blockLiteral ::= “{” [ param-decl-list => ] code “}”
code ::= { statement “;” } [“return”] expression

Built-in Objects: Booleans, Strings, and Numbers

Grace has three types of built-in objects: Booleans, Strings, and Rationals, the latter being exact. Grace will also support approximate IEEE 64-bit floating point numbers: particular implementations may support many other types of numbers. All numeric types will be sub-types of an abstract type Number.

The need for exact computation should be obvious to anyone who has tried to explain to a novice why 1/3 * 3 isn’t 1. Once ordinary arithmetic operations are exact, we don’t see a need for separate Natural or Integer types — Javascript and (early) Basic get by quite fine with only one numeric type – floating point numbers. And we don’t see how to eliminate inexact numbers, since operations like square root and sin are going to introduce them again.

Our concern is to minimize the basic types people need to learn: we shouldn’t have to teach the difference between integers & rationals & reals & inexact numbers on day 1. At some point it is important to teach these differences: in the design of Grace we want to support different pedagogical approaches to when this must be taught.

These names illustrate another (minor) design decision: Grace type names are spelled-out in full, rather than being abbreviated. So we have class Number rather than Num, and Boolean rather than Bool. So, what should we call the floating point type? FloatingPoint is ambiguous, because over time the library is likely to contain many floating point formats. DoublePrecision is rather wordy as well as being archaic; Binary64 is the official IEEE name, is shorter and more descriptive – and horrible for novices!

What does it mean for these objects to be “built in”? It means that there are denotations for them in the language syntax. (This is not true for Binary64s; there are, however, methods to generate Binary64s from other Numbers.) So Booleans in Grace are represented by the global names “true” and “false”, and programmers won’t be able to re-define what those names mean.

Grace will provide the usual operations on Booleans — which will be represented as messages. Binary operations will generally use single symbols, such as “&” for and, and “|” for or. We don’t think that the syntax will allow single-symbol unary operators, so Boolean negation will be the named message “not”.

Unlike C, Python, or Groovy, Grace supports no implicit conversions, so Numbers, Strings, Objects, or any other type cannot be used in contexts where Booleans are expected. Rather, Grace programs must explicitly test for empty Strings, Lists, and so on.

String literals in Grace are written between double quotes, as in C, Java, or Python. Strings literals support a range of escape characters such as “\t\b\n\f\r\v\\\””, and also escapes for Unicode.

Like Python, there is no Character type in Grace — rather, characters can be represented by Strings of length 1 where necessary. Like Java, Strings are immutable, and literals may be interned. Grace’s standard library will include mechanisms to support efficient incremental string construction. Strings will also conform to the protocol of an immutable IndexableCollection.

“Hello World!”
“\t”
“The End of the Line\n”
“A”

The Grace type Number is the supertype of all numeric types: we encourage programmers to use this type when writing most programs. Although Grace has no implicit conversions, subtyping means that any numeric type can be stored in a variable or passed as a parameter or return value of type Number.

Grace has three forms of Number literal: ordinary strings of digits, assumed to represent decimal (radix-10) numbers; literals with an explicit radix, indicated by a (decimal) number between 2 and 35 and a leading x; and base-exponent numerals using e as the exponent indicator, also always in decimal. All numeric literals evaluate to exact Rational numbers, so 0.2 will be exactly 1/5. So, there are no literals for inexact numbers.

1
-1
42
3.14159265
13.343e-12
-414.45e3
16xF00F00
2×10110100
0xDEADBEEF // Radix zero treated as 16

Grace will support all the usual binary arithmetic operators. However, as mentioned above, we are currently planning on not allowing unary operator symbols, so negation will be the named method “negative”. (Similarly, the Boolean “not” operator is another message send).

Messages sent to numbers support explicit conversions between number types; for example, sending the message “b64” to any Number will convert it to the 64 bit binary floating point.

Grace’s libraries may support a range of additional numeric types, such as machine integers, bytes, longer and shorter floating point numbers, and complex numbers. These types don’t need to be built-in: one of our design goals is to make library classes as convenient to use as built-in classes, so not being built-in does not mean that they are “second-class” in any way. These additional Number classes are likely to be required in some programs for efficiency, for calling external libraries, or for accessing external data formats. The aim of this design is to present beginners with well-behaved Numbers (and Booleans and Strings), while allowing their instructors to introduce more specialised subclasses as they are needed.

The plan is that most classes in the libraries will accept Numbers as arguments; for example, Indexed Collections, such as Arrays and Strings, will accept Number arguments describing string positions, but will raise an error if the index is not an integer, just as it will if the index is out of bounds.

Grace will have mechanisms (like Java’s final, Scala’s case classes or Lime’s extendedby) that will indicate that classes or interfaces may not be further extended – this should support special case code generation for efficiency.

For the hardcore, here is the BNF for numberLiterals:

numberLiteral ::= [-][radix x][digits][.digits] | [digits][.digits][e][-][digits]
radix ::= 0 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 23 | 33 | 34 | 35
digits :: = digit | digit digits
digit :: = 0 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z