Benoit Perroud : concepts and cloud ready applications: patterns

Affichage des articles dont le libellé est patterns. Afficher tous les articles

samedi, 16 juillet 2011

More Interview Questions

All interviews are different, all candidates are different, all positions are different. The goal here is to provide some general questions to test engineering skills of candidates. These questions are common in recruiting process in technical companies like Google or Verisign.

How could you detect a loop in a linked list ?

It's say linked list, to be more precise you only have a pointer to the head of the list, and you know how to move to the next element of the list.
Of course you are not allowed to modify the list at all. And obviously you should try to propose an algorithm that do it in the most efficient way, let say you should try to target O(n).

In more programmer oriented : write a function A that take as parameter the first element of a List and return a boolean with value true if the given List has a loop, false otherwise. You know how to pass to the next element of the List : just call element.next.

For those who want to try to answer by themself, stop reading here.

One good answer is to have 2 iterators that walk through the elements of the List, but at different speed. One iterator move from 1 element per iteration, the second one move from 2 elements per iteration, and check if it meet the first iterator during it's move. How fast goes the second iterator can be configurable.
If at one point the second iterator (fast moving) meet the first one, there is a loop in the List. If the second iterator reach the end of the List, there is obviously no loop.

In pseudo code, it would be :

function A (head element of List list) return boolean :
it1 is initialized to the head element of list
it2 is initialized to the head element of list
while it2 has not reach end list
    for number of element it2 will move
        it2 move to the next element.
        if the element of it2 is equal to the element of it1
            there is a loop ! return true
    it1 move to the next element
there is no loop. return false

As targeted, this algorithm will be on complexity of O(n), as the worst case is the last element looping to itself. When the last element loop to the first one, the algorithm will detect the loop in n / speed of the second iterator.
And finally, the memory footprint is very low.

What are the advantages of a LinkedList over an ArrayList ?

This question is more often ask in a different way : what are the differences between a LinkedList and an ArrayList. But turning the question up side down like here is a bit more interesting. So what are the advantages ?

The basic advantage of the LinkedList is its O(1) insertion and deletion complexity for any given element, while ArrayList need to move back or forth the elements after the given element. Another advantage is the linearly increasing memory footprint of a LinkedList, when ArrayList memory (still linear but) increases by steps, and then perform sometimes costly memory copies. With such memory copies ArrayList have an amotized complexity in O(1), but this leads to non deterministic operations, which are unwanted when dealing with tight SLAs.
One more hidden advantage of the LindekList is still in the topic of memory usage : it does not require to allocate blocks of memory, elements stored in LinkedList do not require to be allocated in contiguous memory. Just notice that fragmentation of the memory is not always seen as good point.

jeudi, 6 janvier 2011

Eventually Consistency demystified

In my crusade into the NoSQL world, Eventually Consistency is everywhere. I want to demystify this property a little bit here.

But let's begin with an example to have the same base for the discussion :

Let "Node1", "Node2" and "Node3" be three nodes (servers) that are part of our distributed datastore.
Let "User A", "User B", "User c" be three users wanting to read and write data in our fictive distributed datastore.

At time (1), "User A" write the value "A" to "Node1". "Node1" will replicate asynchronously this value to both "Node2" and "Node3" (specific to my example).
At time (2) the write call of "Node A" returns. But the replication of value "A" hasn't been completely propagate to "Node2" and "Node3".
At time (3), "User B" and "User C" will read value "A" from "Node1" and "Node2" respectively. "User B" got the latest value (because it reads the node which initiate the update), "User C" will read either the old or the new version of "A", but without any guarantee regarding what it will read.

In a future time (5), "User B" and "User C" re-read value "A" and then got the same value. At this point of time, the datastore is consistent.

Immediate Consistency

In a Immediate Consistency, opposing Eventually Consistency, the write call from "User A" should wait till the replication is done on other nodes before returning, and replica nodes ("Node2" and "Node3") should be synchronized to expose the new value at the same time.

Moreover, if "Node1" is unable to talk to "Node2", the write replication will probably fail then the write call from "User A" will fail.

As we can notice, Immediate Consistency is hard to scale (see two-phase commit or paxos algorithm), because it increases the latency of the writes and makes the system not redundant to failure.

Trade-off for scaling writes

Eventually Consistency is then a trade-off for scaling writes that seems reasonable in certain use-cases.

mardi, 9 janvier 2007

Optimisation du nombre de requêtes SQL dans les collections d'objets et relation n-m

L'orientation objet de PHP n'est plus contestable, mais les problèmes d'optimisation persistent.

Par exemple le malheureusement célèbre n+1 pattern, qui fait que pour afficher une liste de n objets, le + 1 étant la requête qui sélectionne tous les ids des objets à instancier, n + 1 requêtes SQL seront effectuées.

De même dans des relations n-m, la majorité des implémentations sélectionnent les n objets (en n + 1 requêtes donc), puis pour chacun on sélectionne les m objets de la relation. On obtient donc n * m + 1 requêtes.

Le but de cet article est de présenter deux techniques qui, combinées, réduisent les n * m + 1 requêtes en n + m + 1.

Les deux techniques que je vais illustrer ici se nomment object caching et grouped fetching. Elles se combinent très bien, ce qui permet d'optimiser drastiquement les performances d'un script PHP, du points de vue I/O (moins de requêtes), vitesse d'exécution et même mémoire utilisée (les objets ne sont pas dupliqués).

Object caching :

Le principe de l'object caching est de rendre le constructeur de l'objet privé (au pire protégé) et de l'instancier au moyen d'une factory. Puis on ajoute à la classe un tableau statique dans lequel les références des objets instanciés seront placés. La factory va donc regarder dans le tableau de références si l'objet existe, et si ça n'est pas le cas elle va le créer, l'ajouter au tableau et le retourner.


class A {
 protected static $objects_cache = array();

 public static function factory($id, $class = __CLASS__)
 {
   if (array_key_exists($id, $class::$objects_cache)) {
     return $class::$objects_cache[$id];
   } else {
     $o = new $class($id);
     $class::$objects_cache[$id] = $o;
     return $o;
   }
 }
}

Cette solution pourrait encore être améliorée si les objets pouvaient être partagés entre toutes les instances de PHP. Ce n'est pas le cas à cause de l'architecture share nothing de PHP, et c'est un des points forts des serveurs d'applications.
Un autre désavantage de ce concept, si on a un script qui tourne suffisamment longtemps pour que le grabage collector se lance, est que tous les objets créés sont toujours référencés, même ceux qui pourraient être des candidats potentiels à la finalisation. Ce problème est résolu dans d'autres langages, en Java par exemple grâce aux références faibles (WeakReference).

Grouped fetching

Le principe du grouped fetching, dérivé d'une solution proposée par notre ami Colder, est de charger les données des objets de manière asynchrone et en bloque. Quand un objet est instancié, il est marqué comme non chargé, et sa référence est placée dans un tableau global de la classe. Puis lors d'un accès à un champ non chargé, la classe va sélectionner dans la base de données tous les objets instanciés mais pas encore chargés. Plus on retarde les accès aux attributs d'un objet, plus on va paralléliser les requêtes SQL.


class A {
 private static $_to_load = array();
 private $_is_loaded = false;
 public function __construct($id)
 {
   $this->id = $id;
   self::$_to_load[$id] = $this;
 }

 public function __get($attribut)
 {
   if (!$this->_is_loaded) {
     self::_groupedLoad();
   }
   return $this[$attribut];
 }

 private function _setLoaded()
 {
   $this->is_loaded = true;
 }

 private static function _groupedLoad()
 {
   $ids = implode(', ', self::$_to_load);
   $query = 'SELECT * FROM ' . self::$_table . ' WHERE id IN ( ' . $ids . ' ) ';
   $res = db_query($query);
   while ($row = $res->getNext()) {
     self::_to_load[$row['id']]->_initByArray($row);
     self::_to_load[$row['id']]->_setLoaded;
   }
 }
}

L'overhead de ce concept est très faible (un tableau de références supplémentaire), et le nombre d'accès à la base de données sont grandement réduit. Mais le problème de la finalisation des objets se repose aussi.

Benoit Perroud : concepts and cloud ready applications

samedi, 16 juillet 2011

More Interview Questions

jeudi, 6 janvier 2011

Eventually Consistency demystified

mardi, 9 janvier 2007

Optimisation du nombre de requêtes SQL dans les collections d'objets et relation n-m

Qui êtes-vous ?

RSS Feed(burner)

Private links

Archives du blog

Tags