Shin'ichiro Ishikawa Laboratory, Kobe University, Activity Report: 2020.5.6 Research memo: On the number of text samples in the BNC spoken component

2020/05/06

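I am currently proofreading an introductory book on corpus linguistics, and I noticed the following along the way, so I am recording it here as a memo.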

From the user guide for the BNC XML edition (Lou Burnard, 2007):
http://www.natcorp.ox.ac.uk/docs/URG/BNCdes.html

Source of figures: the URL above

・At the start of 1.5.2.3 Composition of the spoken component, the number of context-governed text samples is given as follows: "A total of 757 texts (6,153,671 words) make up the context-governed part of the corpus."

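・Table 16 follows immediately and gives the breakdown, but adding up the figures shown there gives 755 context-governed samples, which for some reason does not match the 757 stated in the text.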

・Further down there is a breakdown combining the demographic and context-governed parts, and its total is 908 (Tables 17 and 18 give the same figure).

・Going back to the description of the demographic part in 1.5.1, Tables 13 to 15 give breakdowns of all respondents, and each sums to 153 (the three tables agree).

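・Above those tables is the following explanation:
124 adults (aged 15+) were recruited... Additional recordings were gathered for the BNC as part of the University of Bergen COLT Teenager Language Project. This project used the same recording methods and transcription scheme as the BNC, but selected only respondents aged 16 or below.
In other words, the total is presumably the 124 respondents aged 15+ recruited for the BNC itself plus 29 young respondents collected through COLT, giving 153 (the 29 is inferred from 153 - 124 and is not stated in the guide).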

・On that basis, taking the number of samples on the demographic side to be 153:
　　　Following Table 16:　　　　　　　　153 + 755 = 908
　　　Following the text of 1.5.2.3:　　　153 + 757 = 910
Only the former matches the figure in Tables 17/18.

・Therefore, as a provisional judgment, I will take the count to be 153 + "755". (This still needs to be checked against the actual data.)
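
The arithmetic in this memo can be rechecked with a few lines of Python. This is only a minimal sketch: the constants are the figures quoted above from the user guide, and the variable names and the helper `consistent_candidates` are made up for illustration. It does not read the corpus itself, so it cannot replace the check against the actual data.

```python
# Figures quoted in this memo from the BNC XML edition user guide.
DEMOGRAPHIC = 153        # sum of each of Tables 13-15
CG_FROM_PROSE = 757      # stated in the text of 1.5.2.3
CG_FROM_TABLE16 = 755    # sum of the rows of Table 16
SPOKEN_TOTAL = 908       # total in Tables 17/18


def consistent_candidates(demographic, candidates, total):
    """Return the names of candidates whose count plus `demographic` equals `total`."""
    return [name for name, count in candidates.items()
            if demographic + count == total]


# Hypothetical labels for the two competing context-governed counts.
candidates = {"1.5.2.3 prose": CG_FROM_PROSE, "Table 16": CG_FROM_TABLE16}
matches = consistent_candidates(DEMOGRAPHIC, candidates, SPOKEN_TOTAL)

for name, count in candidates.items():
    print(f"{name}: {DEMOGRAPHIC} + {count} = {DEMOGRAPHIC + count}")
print("Consistent with Tables 17/18:", matches)
```

Only the Table 16 figure reproduces the total of 908, which is the basis for the provisional choice of 755.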
