The integral image is the double integral of the image (first along rows and then along columns).
This representation is used to compute two-rectangle, three-rectangle and four-rectangle features. A two-rectangle feature is the difference between the sums of the pixels within two rectangular regions (horizontally or vertically adjacent); a three-rectangle feature is the sum within two outside rectangles subtracted from the sum in a center rectangle; a four-rectangle feature is the difference between diagonal pairs of rectangles.
The point of this intermediate representation is that these features can be computed very rapidly: the integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive, so any rectangular sum can be computed in four array references (the difference between two rectangular sums in eight references, and so on).
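
The four-reference rectangle sum described above can be sketched in a few lines of Python. This is an illustrative sketch, not the original implementation: the function names (integral_image, rect_sum, two_rect_feature) are hypothetical, and the table is padded with an extra zero row and column (an assumption made here to avoid boundary checks), so entry ii[y][x] holds the sum of pixels strictly above and to the left of (x, y).

```python
def integral_image(img):
    """Build a padded integral image from a 2D list of pixel values.

    Assumption of this sketch: ii has one extra zero row and column,
    so ii[y][x] is the sum of img[r][c] for all r < y and c < x.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # cumulative sum along the current row
        for x in range(w):
            row_sum += img[y][x]
            # value from the row above plus the running row sum
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left pixel (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle feature: left w-by-h block minus right one."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)


if __name__ == "__main__":
    img = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
    ii = integral_image(img)
    print(rect_sum(ii, 1, 1, 2, 2))          # bottom-right 2x2 block
    print(two_rect_feature(ii, 0, 0, 1, 3))  # first column minus second
```

Once the table is built in a single pass over the image, the cost of a rectangle sum no longer depends on the rectangle's size, which is what makes evaluating many features per sub-window affordable.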

Why does the cascade of classifiers make the system so much faster?
Because it allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions.
In more detail, the detector is a structure of 38 classifiers of successively increasing complexity, which has the shape of a degenerate decision tree.
A face detection attentional operator can be learned that filters out over 50% of the image while retaining 99% of the faces for further processing. Sub-windows rejected by the initial classifier are never examined again. The remaining sub-windows pass through the subsequent classifiers in turn, and a window rejected at any stage is likewise never examined again, which saves time dramatically.
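
The early-rejection logic of the cascade can be illustrated with a minimal sketch. Everything here is hypothetical: run_cascade and detect are illustrative names, each stage is modeled as a plain function returning True (pass) or False (reject), and the toy stages in the usage example operate on numbers rather than image sub-windows. Real stages would be boosted classifiers over rectangle features.

```python
def run_cascade(stages, window):
    """Evaluate stages in order and stop at the first rejection.

    Returns (accepted, stages_evaluated) so the saved work is visible.
    """
    evaluated = 0
    for stage in stages:
        evaluated += 1
        if not stage(window):
            # rejected: later, more expensive stages never run
            return False, evaluated
    return True, evaluated


def detect(stages, windows):
    """Keep only the windows that pass every stage of the cascade."""
    return [w for w in windows if run_cascade(stages, w)[0]]


if __name__ == "__main__":
    # Toy stand-ins for classifiers, ordered from cheap to expensive.
    stages = [
        lambda w: w > 0,
        lambda w: w % 2 == 0,
        lambda w: w > 10,
    ]
    for w in [-5, 3, 4, 12]:
        print(w, run_cascade(stages, w))
    print(detect(stages, [-5, 3, 4, 12, 20]))
```

Because most sub-windows in a typical image are background, most of them exit after one or two cheap stages, and only the few promising ones pay for the full chain.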