The CLIL secondary corpus

The corpus was collected to represent the spoken and written language of a school subject which all schools with a bilingual section teach in English: social science, with mainly history topics. This is a subject in which the role of language is paramount, and we believe our study may be useful to a number of practitioners. We collected spoken and written data once a year from the same two classes in two state secondary schools in different areas of Madrid, throughout the four years of compulsory secondary education (ESO). The project started in the academic year 2005-06, and the data collection was completed in 2008-09. In each school, the majority of the children who finished the project, aged 15-16, are those with whom we started when they were 12-13.

The corpus consists of 30,000 words from whole class sessions, 200 short texts written by the same students under controlled conditions, and 20,000 words from individual interviews with six students from each class. Each set of data was produced in response to the same prompt. That is, the conditions of production vary while the topic remains the same.

The project data also include the basic input for each topic: textbook extracts, which provided the students’ written input, and the language the teacher used during the recorded sessions. We have also recorded parallel groups working in their native language, Spanish, on the same topics. In addition, the corpus includes production by native speakers of English of the same age, working in class on similar topics.

The CLIL-AfL Primary corpus

The CLIL-AfL Primary corpus represents spoken discourse from five different bilingual primary schools in the Comunidad de Madrid. This corpus is unique in that it includes full didactic units recorded at the beginning and the end of the 2010/2011 academic year in order to chart student progress. The recordings come from an array of subjects taught in English, including science, citizenship, literacy, drama and art. The participants were Spanish students from Year 4, Year 5 and Year 6 of primary school (ages 9-12), most of whom had been exposed to bilingual education since their entry into primary school. As a result, the students have a high level of English for their age, making detailed discourse analysis of the classes possible.

Teachers include native and non-native English speakers. Two of the five teachers employ a formative assessment technique known as Assessment for Learning (AfL), which is meant to encourage students to recognize and fill learning gaps through effective questioning techniques, statements of the communicative purpose of activities, and effective feedback. The other three teachers use traditional summative assessment.

The corpus consists of 81 classes, with an estimated total word count of 500,000. This total includes six citizenship units, six science units, four art units, two drama units and one literacy unit. Many of the units were recorded in tandem at each school: one unit was filmed at the beginning and one at the end of the school year in order to track student progress. The corpus also includes interviews with three students in selected citizenship and science classes, intended to measure classroom motivation, as well as interviews with each of the five teachers on their background and assessment practices.