This paper examines the correlation between low-level audio, visual, and textual features and movie content similarity. To focus on a well-defined and controlled case, we have built a small dataset of movie scenes from three sequel movies. Manual annotations then provided a ground-truth similarity matrix over the adopted scenes. Next, three similarity matrices (one per medium) were computed using Gaussian Mixture Models (audio and visual) and Latent Semantic Indexing (text). We evaluated the automatically extracted similarities along with two simple fusion approaches; the results indicate that low-level features can yield an accurate representation of movie content. Moreover, the fusion approaches outperform the individual modalities, a strong indication that the individual modules capture diverse (content-wise) similarities. Finally, we evaluated the extracted similarities across different groups of human annotators, according to what each group interprets as similar; the results show that different groups of people correlate better with different modalities. This last finding is important, as it can be exploited in future research either by (a) a personalized content-based retrieval and recommender system or (b) a locally weighted fusion approach.
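The per-medium similarity computation and simple fusion described above can be sketched as follows. This is a minimal illustration under assumed details, not the paper's exact method: it fits one GMM per scene on synthetic frame-level features, defines similarity as the symmetrized average log-likelihood of one scene's frames under the other scene's model, and fuses the per-medium matrices by a weighted average. All parameter values (number of GMM components, feature dimensions, weights) are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_similarity(scenes, n_components=2, seed=0):
    """Similarity matrix between scenes, each given as an array of
    frame-level feature vectors. Entry (i, j) is the symmetrized
    average log-likelihood of scene i's frames under scene j's GMM
    and vice versa, min-max normalized to [0, 1]."""
    gmms = []
    for X in scenes:
        gm = GaussianMixture(n_components=n_components, random_state=seed)
        gm.fit(X)
        gmms.append(gm)
    n = len(scenes)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = 0.5 * (gmms[i].score(scenes[j]) + gmms[j].score(scenes[i]))
    # normalize so matrices from different media are on a comparable scale
    return (S - S.min()) / (S.max() - S.min() + 1e-12)

def fuse(mats, weights=None):
    """Simple late fusion: weighted average of per-medium similarity matrices."""
    if weights is None:
        weights = [1.0 / len(mats)] * len(mats)
    return sum(w * M for w, M in zip(mats and weights, mats))

# toy demo: random "audio" and "visual" frame features for three scenes
rng = np.random.default_rng(0)
scenes_audio = [rng.normal(loc=i, size=(50, 4)) for i in range(3)]
scenes_visual = [rng.normal(loc=i, size=(50, 8)) for i in range(3)]
S_audio = gmm_similarity(scenes_audio)
S_visual = gmm_similarity(scenes_visual)
S_fused = fuse([S_audio, S_visual])
```

For the textual modality, an analogous matrix could be obtained by projecting scene subtitles into an LSI space (e.g. TF-IDF followed by truncated SVD) and taking pairwise cosine similarities; the resulting matrix would enter `fuse` alongside the audio and visual ones.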