vendredi, 5 août 2011

Cassandra, hector, maven and tomcat

Cassandra (v0.7 and 0.8), hector (version compatible with Cassandra), maven and tomcat bound together result in a warning at tomcat startup :

INFO: validateJarFile(/...<webappspath>.../WEB-INF/lib/servlet-api-2.5-20081211.jar) - jar not loaded. See Servlet Spec 2.3, section 9.7.2. Offending class: javax/servlet/Servlet.class

Nothing harmful here, but as we want to do things as clean as possible we should avoid violating Servlet Spec 2.3.

So running a dependency:tree help to find out the guilty dependency :

[INFO] +- me.prettyprint:hector-core:jar:0.8.0-2:compile
[INFO] |  +- org.apache.cassandra:cassandra-all:jar:0.8.1:compile
[INFO] |  |  +- org.apache.cassandra.deps:avro:jar:1.4.0-cassandra-1:compile
[INFO] |  |  |  \- org.mortbay.jetty:jetty:jar:6.1.22:compile
[INFO] |  |  |     \- org.mortbay.jetty:servlet-api:jar:2.5-20081211:compile

Hum hum, org.mortbay.jetty:servlet-api:jar:2.5-20081211. We found the responsible of the problem.

Using <dependencymanagement> in our pom.xml file we will be able to mark this dependency as provided and the problem should flight away :
  <dependencymanagement>
    <dependencies>
      <dependency>
        <groupId>org.mortbay.jetty</groupId>
        <artifactId>servlet-api</artifactId>
        <version>2.5-20081211</version>
        <scope>provided</scope>
      </dependency>
    </dependencies>
  </dependencymanagement>
Build, deploy, and that's it !

samedi, 16 juillet 2011

More Interview Questions

All interviews are different, all candidates are different, all positions are different. The goal here is to provide some general questions to test engineering skills of candidates. These questions are common in recruiting process in technical companies like Google or Verisign.

How could you detect a loop in a linked list ? 

It's say linked list, to be more precise you only have a pointer to the head of the list, and you know how to move to the next element of the list.
Of course you are not allowed to modify the list at all. And obviously you should try to propose an algorithm that do it in the most efficient way, let say you should try to target O(n).

In more programmer oriented : write a function A that take as parameter the first element of a List and return a boolean with value true if the given List has a loop, false otherwise. You know how to pass to the next element of the List : just call element.next.

For those who want to try to answer by themself, stop reading here.

One good answer is to have 2 iterators that walk through the elements of the List, but at different speed. One iterator move from 1 element per iteration, the second one move from 2 elements per iteration, and check if it meet the first iterator during it's move. How fast goes the second iterator can be configurable.
If at one point the second iterator (fast moving) meet the first one, there is a loop in the List. If the second iterator reach the end of the List, there is obviously no loop.

In pseudo code, it would be :

function A (head element of List list) return boolean :
it1 is initialized to the head element of list
it2 is initialized to the head element of list
while it2 has not reach end list
    for number of element it2 will move
        it2 move to the next element.
        if the element of it2 is equal to the element of it1
            there is a loop ! return true
    it1 move to the next element
there is no loop. return false
As targeted, this algorithm will be on complexity of O(n), as the worst case is the last element looping to itself. When the last element loop to the first one, the algorithm will detect the loop in n / speed of the second iterator.
And finally, the memory footprint is very low.

What are the advantages of a LinkedList over an ArrayList ? 

This question is more often ask in a different way : what are the differences between a LinkedList and an ArrayList. But turning the question up side down like here is a bit more interesting. So what are the advantages ?

The basic advantage of the LinkedList is its O(1) insertion and deletion complexity for any given element, while ArrayList need to move back or forth the elements after the given element. Another advantage is the linearly increasing memory footprint of a LinkedList, when ArrayList memory (still linear but) increases by steps, and then perform sometimes costly memory copies. With such memory copies ArrayList have an amotized complexity in O(1), but this leads to non deterministic operations, which are unwanted when dealing with tight SLAs.
One more hidden advantage of the LindekList is still in the topic of memory usage : it does not require to allocate blocks of memory, elements stored in LinkedList do not require to be allocated in contiguous memory. Just notice that fragmentation of the memory is not always seen as good point.

lundi, 7 mars 2011

NoSQL Overview @Webmardi

Tuesday 1st of March 2011 I gave an overview of NoSQL at the Webmardi monthly event. The presentation was hosted by Webdoc. Thanks all for the warm welcome !

Some pictures :




The slides are available on Slideshare : NoSQL Overview, implementation free.

It is called "Implementation free" because I didn't want to focus on a particular implementation of a NoSQL datasore, but rather generalize concepts of "What is NoSQL". And it's a tricky answer to formulate :)

So what is NoSQL ? It's a family of datastores that does not have SQL as main data access pattern.

The presentation was hosted by Webdoc : http://www.webdoc.com.

jeudi, 6 janvier 2011

Eventually Consistency demystified

In my crusade into the NoSQL world, Eventually Consistency is everywhere. I want to demystify this property a little bit here.

But let's begin with an example to have the same base for the discussion :

  • Let "Node1", "Node2" and "Node3" be three nodes (servers) that are part of our distributed datastore.
  • Let "User A", "User B", "User c" be three users wanting to read and write data in our fictive distributed datastore.
At time (1), "User A" write the value "A" to "Node1". "Node1" will replicate asynchronously this value to both "Node2" and "Node3" (specific to my example).
At time (2) the write call of "Node A" returns. But the replication of value "A" hasn't been completely propagate to "Node2" and "Node3".
At time (3), "User B" and "User C" will read value "A" from "Node1" and "Node2" respectively. "User B" got the latest value (because it reads the node which initiate the update), "User C" will read either the old or the new version of "A", but without any guarantee regarding what it will read.



In a future time (5), "User B" and "User C" re-read value "A" and then got the same value. At this point of time, the datastore is consistent.

Immediate Consistency

In a Immediate Consistency, opposing Eventually Consistency, the write call from "User A" should wait till the replication is done on other nodes before returning, and replica nodes ("Node2" and "Node3") should be synchronized to expose the new value at the same time.

Moreover, if "Node1" is unable to talk to "Node2", the write replication will probably fail then the write call from "User A" will fail.

As we can notice, Immediate Consistency is hard to scale (see two-phase commit or paxos algorithm), because it increases the latency of the writes and makes the system not redundant to failure.

Trade-off for scaling writes

Eventually Consistency is then a trade-off for scaling writes that seems reasonable in certain use-cases.