General
Some tschwinge comments regarding your proposal. Which is very good, if I may say so again!
Of course, everyone is invited to contribute here!
I want to give the following methodology a try, instead of only having email/IRC discussions -- for the latter are again and again showing a tendency to be dumped and deposited into their respective archives, and be forgotten there. Of course, email/IRC discussions have their usefulness too, so we're not going to replace them totally. For example, for conducting discussions with a bunch of people (who may not even be following these pages here), email (or, as applicable, the even more interactive IRC) will still be the medium of choice. (And then, the executive summary should be posted here, or incorporated into your proposal.)
Also, if you disagree with this suggested procedure right away, or at some later point begin to feel that this thing doesn't work out, or simply takes too much time (I don't think so: writing emails takes time, too), just say so, and we can reconsider.
Of course, as this wiki is a passive medium rather than an active one as IRC and email are, it is fine to send notices like: I have updated the wiki page, please have a look.
One idea is that your proposal evolves alongside with the ongoing work, and represents (in more or less detail) what has been done and what will be done. Also, we can hopefully use parts of it for documentation purposes, or as recipes for similar work (enabling other programming languages on the Hurd, for example).
For this, I suggest the following procedure: as applicable, you can either address any comments in here (for example, if they're wrong :-), or if they require further discussion; think: email discussion), or you can address them directly in your propoal and remove the comments from here at the same time (think: bug fix).
Generally, you can assume that for things I didn't comment on (within some reasonable timeframe/upon asking me again) that I'm fine with them. Otherwise, I might say: I don't like this as is, but I'll need more time to think about it.
There is also a possibility that parts of your proposal will be split off; in cases where we think they're valuable to follow, but not at this time. (As you know, your proposal is not really a trivial one, so it may just be too much for one person's summer.) Such bits could be moved to open issues pages, either new ones or existing ones, as applicable.
GSoC Site Discussion
Discussion items from http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/jkoenig/1 should be copied here:
technical bits (obviously);
also the why do we want Java bindings reasoning;
CLISP findings should also be documented somewhere permanently.
- We should probaby open up a languages for Hurd section on the web pages (open issue documentation).
Java Native Interface (JNI)
- http://en.wikipedia.org/wiki/Java_Native_Interface
- http://download.oracle.com/javase/7/docs/technotes/guides/jni/
- http://java.sun.com/products/jdk/faq/jnifaq.html
- http://java.sun.com/docs/books/jni/
Java Native Access (JNA)
This is a different approach, and while some attention is paid to performance, correctness and ease of use take priority.
As we plan on only having a few native methods (for invoking mach_msg
,
essentially), JNA is probably the wrong approach: portability and ease of use
is not important, but performance is.
Compiled Native Interface (CNI)
- http://gcc.gnu.org/onlinedocs/gcj/About-CNI.html
- http://per.bothner.com/papers/UsenixJVM01/CNI01.pdf
Probably faster than JNI, but only usable with GCJ.
Given that we have very few JNI calls, it might be interesting to take a "dual" approach if CNI actually improves performance when compiling to native code. --jkoenig 2011-07-20
IRC, freenode, #hurd, 2011-07-13
<jkoenig> Yes, I guess so. Maybe start investigating mig because it may
have repercussions on what the best approach would be for some aspects of
the Mach bindings.
<tschwinge> I still think that making MIG emit Java code is not too
difficult, once you have the required Java infrastructure (like what
you're writing at the moment).
<tschwinge> On the other hand, if there's another approach that you'd like
to use, I'm not trying to force using MIG.
<braunr> i still have a problem understanding your approach
<braunr> at which level are your bindings located ?
<jkoenig> I expect mig it will be the easiest route, but of course possibly
it won't.
<tschwinge> jkoenig: Yeah, be give some high-level to low-level overview?
<jkoenig> ok, so
<jkoenig> at the very core, low-level, we have a very thin amount of JNI
code to access (proper) system calls.
<jkoenig> by "proper" I mean things like mach_task_self, mach_msg and
mach_reply_port, which are actually system calls rather than RPCs to the
kernel.
<braunr> right
<jkoenig> at this level, we manipulate port names as integers, and the
message buffers for mach_msg are raw ByteBuffers (from the java.nio
package)
<jkoenig> actually, so-called /direct/ ByteBuffers, which are backed by
memory allocated outside of the Java heap, rather than as a byte[] array
<jkoenig> we can retreive the pointer from the JNI code and use the buffer
directly.
<jkoenig> (so, good for performance and it's also portable.)
<braunr> ok
<braunr> i'm more interested in the higher level bindings :)
<jkoenig> ok so, higher up.
<jkoenig> design goal from my proposal: "the memory safety of Java should
be maintained and extended to Mach primitives such as port names and
out-of-line memory regions"
<jkoenig> so integer port names are not "safe" in the sense that they can
be forged and misused in all kinds of way
<jkoenig> which is why I have a layer of Java code whose job is to wrap
this kind of low-level Mach stuff into safe abstractions
<jkoenig> and ideally the user should only use these safe abstractions.
<tschwinge> (Not to restrict the programmer, but to help him write correct
code.)
<jkoenig> right.
<braunr> so you can't use mach RPCs directly
<jkoenig> tschwinge, also to actually restrict them, in a Joe-E /
object-capability context, but that's not the primary concern right now
;-)
<braunr> or you force your wrappers to have these abstractions as input
<jkoenig> braunr, well, actually at this level you still have Mach RPC
<jkoenig> but for instance, port names are encapsulated into "MachPort"
objects which ensure they are handled correcly
<tschwinge> As I understand it, you use these abstractions to prepare a
usual mach_msg message, and then invoke mach_msg.
<braunr> ok
<jkoenig> and message buffers are wrapped into "MachMsg" objects which both
help you write the messages into the ByteBuffer and prevent you from
doing funky stuff
<jkoenig> and ensure the ports which you send/receive/pseudo-receive after
an error/... are deallocated as required, etc.
<braunr> what's the interface to use IPC ?
<tschwinge> Is MIG doing that, too, I think? (And antrik once found some
error there, which is still to be reviewed...)
<jkoenig> braunr, so basically as a user you would be free to use either
one of these layers, or to use MIG-generated classes which would
construct and exchange messages for you using the second (safe) layer.
<braunr> ok, let's just finish with the low level layer before going
further please
<jkoenig> tschwinge, MIG does some type checking on the received message
and saves you the trouble of constructing/parsing them yourself, but I'm
not sure about how mach_msg errors are handled
<braunr> what are the main methods of MachMsg for example ?
<jkoenig> braunr, you may want to have a look at
http://jk.fr.eu.org/hurd-java/doc/html/classorg_1_1gnu_1_1mach_1_1MachMsg.html
<braunr> right, sorry
<braunr> grabbed the code at work and forgot here
<jkoenig> and also
https://github.com/jeremie-koenig/hurd-java/blob/master/HelloMach.java
which uses it
<jkoenig> but roughly, you'd use setRemotePort, setLocalPort, setId to
write your message's header
<jkoenig> then use one of the putFoo() methods to add data items to the
message
<braunr> ok, the mapping with the low level C interface is very clear
<braunr> that's good for me
<jkoenig> the putFoo() methods would write the appropriate type
descriptors, then the actual data.
<braunr> we can go on with the MiG part if you want :)
<jkoenig> right,
<jkoenig> so here you may want to look at the UML class diagram from
http://www.bddebian.com/~hurd-web/user/jkoenig/java/proposal/
<jkoenig> so in the C case, mig generates 3 files
<jkoenig> a header file which has the prototypes of the mig-generated
stubs,
<jkoenig> a *User.c which has their actual implementation
<jkoenig> and a *Server.c which handles demultiplexing the incoming
messages and helps with implementing servers.
<jkoenig> so we would do something along these lines, more or less:
<jkoenig> mig would generate the code for a Java interface in lieu of the
*.h file.
<jkoenig> a generated FooUser class would implement this interface by doing
RPC
<jkoenig> (so basically you would pass a MachPort object to the
constructor, and then you could use the resulting object to do RPC with
whatever is on the other end)
<jkoenig> and the generated FooServer class would do the opposite,
<braunr> ok
<braunr> issues with threads ?
<jkoenig> you would pass an object implementing the Foo interface to the
constructor,
<braunr> i'm guessing the demux part may have to create threads, right ?
<jkoenig> and the resulting object would handle messages by using the
object you passed.
<jkoenig> braunr, right, so that would be more a libports kind of code,
<braunr> the libports-like library, i see
<jkoenig> to which you could pass Server objects (for instance the
FooServer above), and it would handle incoming messages.
<braunr> how is message content mapped to a java interface ?
<jkoenig> this would be determined from the .defs files and MIG would
generate the appropriate code, hopefully.
<braunr> so the demux part would handle rpc integer identifiers ?
<jkoenig> right.
<braunr> but hm
<jkoenig> also mapping .defs files to Java interfaces might prove to be
tricky. data types conversion and all
<antrik> tschwinge: my mamory is rather hazy. IIRC the issue was that the
MIG-generated stubs deallocate out-of-line port arrays after the
implementation returns, before returning to the dispatcher
<braunr> i'll just overlook this specific implementation detail
<jkoenig> but we could use some annotation-based system if we need to
provide more information to generate the java code.
<antrik> but the Hurd (or rather glibc) RPC handling also automatically
deallocates everything if an error occurs
<antrik> so I changed the MIG code to deallocate only when no error occurs
<braunr> jkoenig: ok, we'll talk about that when there is more progress and
you have a better view of the problem
<antrik> at that time I was pretty sure that this is a correctly working
solution, but it always seemed questionable conceptually... however, I
wasn't able to come up with a better one, and nobody else commented on it
<braunr> antrik: shouldn't the hurd be changed not to deallocate something
it didn't allocate in the first place ?
<antrik> braunr: no, the server has to deallocate stuff before returning to
the client. the request message is destroyed before returning the reply.
<tschwinge> jkoenig, braunr: That's what I had in mind where MIG might be a
bit awkward. Then we can indeed either add annotations to the .defs
files, or reproduce them in some other format. That's some work, but
it's mostly a one-time work.
<tschwinge> After all, the RPC interface is a binary one, and there may be
more than one API for creating these messages, etc.
<antrik> jkoenig: actually, at least in the Hurd, server-side and
client-side headers are separate -- so MIG actually creates four files
<jkoenig> tschwinge, wrt to annotations I was more thinking about Java
ones, such as: @MIGDefsFile("mach/task.defs") @MIGCType("task_t") public
interface Task { }
<jkoenig> antrik, oh, ok, it makes sense.
<braunr> jkoenig: anything else ?
<jkoenig> braunr, nothing that I can think of
<braunr> ok
<antrik> tschwinge: I think it would be a *very* bad idea to introduce
redundancy regarding RPC definitions
<braunr> thanks for the tour :)
<antrik> (the _request.defs/_reply.defs mess is bad enough...)
<jkoenig> did I speak about the "Unsafe" pseudo-exception? that's
interesting :-)
<tschwinge> jkoenig: Also, virtual memory abstractions?
<braunr> jkoenig: you didn't
<tschwinge> antrik: Well, then we could create some other super-format.
But that's just a detail IMO.
<jkoenig> ok, so wrt virtual memory, a page we received can be wrapped with
some JNI help into a (direct) ByteBuffer object.
<jkoenig> deallocating sent pages will be tricky, though.
<tschwinge> antrik: To put it this way: for me the .defs files are just one
way of expressing the RPC interfaces' contracts. (At the same time, they
happen to be the actual reference for these, too. But the specification
itself could just as well be a textual one.)
<jkoenig> on approach I've been thinking about would be to "wrap" the
ByteBuffer object into an object which has the sole reference to it, so
that when it's deallocated the reference can be replaced with "null", and
further attempts to access the buffer would throw exceptions.
<braunr> sounds reasonable
<jkoenig> but that's still in flux in my head, we may end up needing our
own implementation of ByteBuffer-like objects.
<tschwinge> The problem being that there is no mechanism to ``revoke'' an
object once a reference to it has been shared.
<jkoenig> right.
<tschwinge> A wrapper is one possibility indeed.
<antrik> tschwinge: they are called interface *definitions* for a reason
:-)
<tschwinge> This is a very similar problem as with capabilities when there
is no revoke operation for these, too.
<tschwinge> antrik: Yes, because they define MIG's input. :-P
<tschwinge> Isn't that what is called a membrane in the capability world?
<antrik> I do not say that we have to consider the format of the .defs to
be set in stone; but I do insist on using a canonical machine-parsable
source for all language bindings
<tschwinge> attenuation
<jkoenig> tschwinge, you mean the revokable proxy contruct ? (It's the same
principle indeed)
<tschwinge> A common design pattern in object-capability systems: given
one reference of an object, create another reference for a proxy object
with certain security restrictions, such as only permitting read-only
access or allowing revocation. The proxy object performs security checks
on messages that it receives and passes on any that are allowed. Deep
attenuation refers to the case where the same attenuation is applied
transitively to any
<tschwinge> objects obtained via the original attenuated object,
typically by use of a "membrane".
<tschwinge> http://en.wikipedia.org/wiki/Object-capability_model
<tschwinge> Yes.
<tschwinge> Good. I understood something. ;-)
<tschwinge> antrik: OKAY! :-P
<tschwinge> jkoenig: And hopefully the JVM will optimize away all the
additional indirection... :-D
<tschwinge> jkoenig: Is there anything more to say about the VM layer?
<jkoenig> tschwinge, "hopefully", yes :-)
<tschwinge> Like, the data that I'm sharing -- is it untyped, isn't it?
<jkoenig> tschwinge, you mean that within the received/sent pages ?
<tschwinge> Yes.
<tschwinge> But that'S how it is, indeed.
<jkoenig> well actually the type descriptor should indicate what they
contain.
<tschwinge> I cannot trust anything I receive from externally.
<jkoenig> it's most often used for MACH_MSG_TYPE_CHAR items I guess, and it
will be type checked when retreive
<tschwinge> Yeah, and that then just *is* arbitrary data, like a block read
from a disk file.
<jkoenig> you would have something like: ByteBuffer
MachMsg.getBuffer(MachMsg.Type expected), and MachMsg would check the
type descriptor against that which you specified
<tschwinge> Or a packet transmitted over the network.
<tschwinge> OK, yes.
<antrik> jkoenig: in theory ints should be used quite often too. the whole
purpose of the type descriptors is to allow byte order swapping when
messages are passed between hosts with different architecture...
<jkoenig> tschwinge, right, except for out-of-line port arrays, which need
to be handled differently obviously.
<antrik> (which is totally irrelevat for our purposes -- especially since
the actual network IPC code doesn't exist anymore ;-) )
<jkoenig> antrik, oh, interesting
<tschwinge> Yes, that was one original idea.
<jkoenig> actually my litmus test for what the bindings should be, is you
should be able to implement such a proxy in Java :-)
<tschwinge> antrik: And hey, you now have processors that can switch
between different modes during runtime... :-)
<jkoenig> (although arguably that's a little bit ambitious)
<braunr> tschwinge: there should be bits in page tables to indicate the
endianness to use on a page .. :)
<tschwinge> Hehe!
<tschwinge> jkoenig: Don't worry -- you're already known for ambitious
projects. One more can't hurt.
<jkoenig> Also, actually the word size is not something that I've been able
to abstract so far, so I'll be hardcoding little-endian 32 bits for now.
<braunr> why is that ?
<antrik> some of the Hurd RPC break the idea anyways BTW
<jkoenig> the org.vmmagic package (from Jikes RVM and JNode) could help
with that, but GCJ does not support it unfortunately (not sure about
OpenJDK)
<jkoenig> braunr, Java does not allow us to define new unboxed types
<braunr> jkoenig: does it have its own definition of the word size ?
<jkoenig> braunr, nope.
<jkoenig> (although, maybe, and also we could use JNI to query it)
<braunr> even if virtual, i'd expect a machine to have such a defnition
<jkoenig> braunr, maybe it has, but basically in Java nothing depends on
the word size
<jkoenig> 'int' is 32 bits, 'long' is 64 and that's it.
<braunr> oh right, i remember most types are fixed size, right ?
<jkoenig> right.
<braunr> if not all
<jkoenig> now Jikes RVM's "org.vmmagic" provides an interface to defined
new unboxed types which can depend on the actual word size, but Jikes RVM
is its own JVM so obviously they can use and provide whatever extensions
they need :-)
<jkoenig> (but maybe they've implemented them in OpenJDK for bootstrap
purposes, I'm not sure)
<tschwinge> I'm missing this detail: where does the word size come into
play here?
<jkoenig> anyway, I _could_ indiscriminately use 'long' for port names, and
sparkle the code with word size tests but that would be very clumsy
<braunr> jkoenig: port names are actually ints :/
<jkoenig> tschwinge, the actual format of the message header and type
descriptors, for instance.
<braunr> jkoenig: ok, got your point
<jkoenig> braunr, by 'long' I mean 64-bits integers (which they are on
64-bits machines I think?)
<braunr> :)
<braunr> jkoenig: port names are as large as the word size
<braunr> but in C at least, they're int, not long
<braunr> it doesn't change many things, but you get lots of warnings if you
try with a long :)
<tschwinge> What is the reason that port names are an
architecture-dependent word size's width, and not simply 32 bit?
<jkoenig> "4 billions of port names should be enough for everyone" :-)
<braunr> tschwinge: an optimization is to use them as pointers in the
kernel
<antrik> tschwinge: the machine's native word size is what it can process
most efficiently, and what should be used for most normal
operations... it makes sense to define stuff as int, except for network
communication
<tschwinge> jkoenig: Well, yeah, but if you want to communicate with a
peer, you have to agree on the maximum number anyway (not for port names,
though, which are local).
<braunr> antrik: int isn't the word size everywhere
<braunr> antrik: the most common type matching the word size is long, at
least on ILP32/LP64 data models
<antrik> braunr: that's just because some idiots assumed int would always
be 32 bits, and consequently when 64 architectures came up the compiler
guys chickened out ;-)
<braunr> without int, you wouldn't have a 32 bits type
<antrik> that's not true for all architectures and/or operating systems
though AFAIK
<braunr> or a 16 bits one
<braunr> antrik: windows guys got even more scared, so windows 64 is LLP64
<antrik> BTW, I haven't checked, but it's quite possible that 32 bit
numbers are actually preferable even on AMD64...
<tschwinge> jkoenig: So, back on track. :-)
<tschwinge> jkoenig: You didn't find anything yet in Mach's VM interfaces
as well a MemoryObject, etc., that can't be used/implemented in the Java
world?
<braunr> antrik: they consume less memory, but don't have much effect on
performance
<jkoenig> tschwinge, once we have the basic system calls and the
corresponding abstractions in place, I don't think anything else
fundamentally problematic could possibly show up
<antrik> braunr: if you really *need* a type of a certain bit size, you
should use stdint types. so not having a 16 or 32 bit type in the
short/int/long canon is *not* an excuse
<tschwinge> jkoenig: That speaks for the Mach designers!
<braunr> antrik: right
<jkoenig> tschwinge, on trick is that for instance, mach_task_self would
still be unsafe even if it returned a nicely wrapped Task object, because
you could still wreck your own address space and threads with it. So we
would need the "attenuation" pattern mentionned above to provide a safe
one.
<jkoenig> (which would disallow thinks such as the port/thread/vm calls)
<braunr> jkoenig: you mentioned the unsafe pseudo exception earlier
<jkoenig> braunr, right, so the issue is with distinguishing safe from
unsafe methods
<antrik> braunr: BTW, the Windows guys actually broke a lot of stuff by
fixing long at 32 bits -- this way long doesn't match size_t and pointer
types anymore, which was an assumption that was true for pretty much any
system so far...
<tschwinge> jkoenig: Yes. (And again hope for the JVM to optim...)
<braunr> antrik: that's right :)
<braunr> antrik: that's LLP64
<braunr> antrik: long long and pointers
<jkoenig> braunr, so basically the idea is that unsafe methods are declared
as "throws Unsafe"
<jkoenig> the effect is that if you use such a method you must either
"throw Unsafe" yourself,
<jkoenig> or if you're building a safe abstraction on top of Unsafe
methods, you'll "catch" the "exception" in question to tell the compiler
that it's okay.
<jkoenig> it's more or less inspired from the "semantic regimes" idea from
the org.vmmagic paper which is referenced in my original proposal,
<jkoenig> only implementing by hijacking the exception checking machinery,
which has a behaviour similar to what we want.
<braunr> ok
<braunr> but hmm this seems pretty normal, what's the tricky part ? :)
<tschwinge> braunr: The idea is that the programmer explicitly has to
acknowledge if he'S using an unsafe interface.
<braunr> tschwinge: sounds pretty normal too
<jkoenig> braunr, the trick is that you would not usually declare
exceptions which are never actually thrown (and actually since the
compiler does not know it's never thrown, I need to work around it in a
few places)
<braunr> oh, ok
<braunr> jkoenig: that's interesting indeed
<jkoenig> braunr, the org.vmmagic paper provides an example which uses some
annotations called @UncheckedMemoryAccess and @AssertSafe to the same
effect (which is kind of cleaner), but it would be a headache to
implement without help from the compiler I think (as far as I can tell
the annotation processor would have to inspect the bytecode)
<braunr> but hm
<braunr> what's the true problem about this ?
<jkoenig> (the paper advocates "high-level low-level programming" and is a
very interesting read I think,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.151.5253&rep=rep1&type=pdf,
for what it's worth)
<braunr> what's wrong if you just declare your methods unsafe and don't
alter anything else ?
<tschwinge> Yes, I read it and it is interesting. Unfortunately, it seems
I forgot most of it again...
<jkoenig> braunr, declare? alter?
<jkoenig> you mean just tag them with an annotation?
<braunr> just stating a method "throws Unsafe"
<jkoenig> braunr, well some compiler will output a warning because they can
tell there's no way the method is going to throw such an exception.
<jkoenig> and then some other compiler will complain that my
@SuppressWarnings("unused") does not serve any purpose to them :-)
<jkoenig> also, when initializing final fields, I need to work around the
fact that the compiler thinks "Unsafe" might be thrown.
<jkoenig> see for instance MachPort.DEAD
<braunr> jkoenig: ok
<jkoenig> braunr, but I'm more than willing to accept this in exchange for
a clear, compiler-enforced materialization of the border between safe an
unsafe code.
<jkoenig> actually another question I have is the amount of static typing I
should add to the safe version, for instance should I subclass MachPort
into MachSendRight, MachReceiveRight and so on. I don't want to depart
from the C inteface too much but it could be useful.
<braunr> jkoenig: can't answer that :)
<braunr> jkoenig: keep them in mind for later i think
<tschwinge> jkoenig: What's the safety concern w.r.t. having MachPort (not)
final?
<jkoenig> tschwinge, actually I'm partly wrong in that we only need name()
and a couple other methods to be final
<tschwinge> jkoenig: That's what I was thinking. :-)
<tschwinge> I though I'm missing something here.
<jkoenig> tschwinge, the idea is that the user (ie., the adversary :-)
could extend MachPort and inject their own fake port name into messages
<jkoenig> by overriding name() or clear()
<tschwinge> Yeah, but if these are final, that's not possible.
<jkoenig> right.
<tschwinge> And that *should* be enough, I think.
<tschwinge> Unless I'm missing something.
<jkoenig> I don't think so. Also I hope it is, because as mentionned above
there might be some value in subclassing MachPort.
<tschwinge> Yep.
<jkoenig> incidentally, declaring the class or the method final will allow
the JVM to inline them I think.
<tschwinge> It will help the JVM, yes. It can also figure that out without
final, though. (And may have to de-optimize the code again in case there
are additional classes loaded during run-time.)
<tschwinge> jkoenig: The reference counting in MachPort. I think I'm
beginning to understand this.
<jkoenig> oh ok
<jkoenig> tschwinge, yes the javadoc is maybe a bit obscure so far.
<jkoenig> but basically you don't want the port name you acquire to become
invalid before you're done using it.
<tschwinge> But how is this different from the C world?
<jkoenig> here my goal is to provide some guarantees if you use only safe
methods
<jkoenig> like, you can't forge a port name and things like that
<jkoenig> so basically it should never be possible to include an invalid
port name in a message if you use only safe methods.
<tschwinge> Ah, I see!
<tschwinge> Now that does make sense.
<jkoenig> but the mechanism in itself is similar to the Hurd port cells and
user_link structures
<tschwinge> It's again ``only'' helping the programmer.
<jkoenig> right, no object-capability ulterior motives :-)
<jkoenig> another assumption which the javadoc does not state yet it that
basically there should be exactly one MachPort object for each mach-level
port name reference (in the sense of mach_port_mod_refs)
<tschwinge> Yes, I figured out that bit.