Croatian Language Corpus
The Croatian Language Corpus (CLC) is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ).

Background

The CLC was initially funded, from May 2005, as a sub-project of the research program Riznica (Croatian Language Repository) by the Ministry of Science, Education and Sports of the Republic of Croatia (MZOŠ) (project no. 0212010). In a second development phase, since 2007, the further extension and development of the CLC has been embedded within the research program The Croatian Language Repository (CLR), which was granted by the MZOŠ (cf. Ćavar and Brozović Rončević, 2012). Because the CLR is a research program (PI Dunja Brozović Rončević) with numerous subsumed independent research projects that make use of the CLC, the corpus is mainly developed as a by-product of those research projects. Currently Dunja Brozović Rončević and Damir Ćavar are in charge of the corpus development.

Goals

One of the main goals of the CLC project is to create a publicly available Croatian corpus that is annotated on multiple levels, i.e. lemmatized, morphologically segmented and morpho-syntactically annotated, phonemically transcribed and syllabified, and syntactically parsed. While the current version of the corpus provides resources from the Croatian language standard, several corpora from different development phases of Croatian are being created as well, including digitizations of manuscripts and Croatian dictionaries.

Format and Availability

From the outset, the collected and digitized texts in the CLC were annotated using the Text Encoding Initiative (TEI) P5 XML standard. Currently approx. 90 million tokens are available in the TEI P5 XML format. The corpus can be accessed online via the PhiloLogic interface (see The ARTFL Project, Department of Romance Languages and Literatures, The University of Chicago). It is virtualized into various sub-corpora, and individual or specific definitions of sub-corpora can be provided on demand.
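As an illustration of what token-level TEI P5 XML can look like and how it might be processed, the following is a minimal Python sketch. It is not taken from the CLC: the sample document, the use of `<w>` elements with a `lemma` attribute, and the function name `count_tokens` are assumptions made for this example, and the actual CLC markup may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI P5 documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document; invented for illustration, not CLC data.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
      <publicationStmt><p>Illustration only</p></publicationStmt>
      <sourceDesc><p>Invented example</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <s>
          <w lemma="pas">Psi</w>
          <w lemma="lajati">laju</w>
          <pc>.</pc>
        </s>
        <s>
          <w lemma="pas">Pas</w>
          <w lemma="spavati">spava</w>
          <pc>.</pc>
        </s>
      </p>
    </body>
  </text>
</TEI>"""


def count_tokens(tei_xml):
    """Return (number of <w> tokens, Counter of their lemma attributes)."""
    root = ET.fromstring(tei_xml)
    # Assumption: word tokens are encoded as <w>, punctuation as <pc>.
    words = root.findall(".//tei:w", TEI_NS)
    lemmas = Counter(w.get("lemma", "") for w in words)
    return len(words), lemmas


total, lemmas = count_tokens(SAMPLE)
print(total)                  # 4
print(lemmas.most_common(1))  # [('pas', 2)]
```

Counting only `<w>` elements excludes punctuation from the token total; whether the figure of approx. 90 million tokens given above includes punctuation is not stated in the source.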

Content

The CLC is assembled from selected texts in Croatian, covering various functional domains and genres. It includes literature and other written sources from the period in which the standardization of the Croatian language began to take its final shape, i.e. from the second half of the 19th century on.

The CLC consists of:
  • fundamental Croatian literature (e.g. novels, short stories, drama, poetry)
  • non-fiction
  • scientific publications from various domains and university textbooks
  • school books
  • literature translated by outstanding Croatian translators
  • online journals and newspapers
  • books from the pre-standardization period of Croatian that have been adapted to present-day standard Croatian

Cooperation

The realization of the CLC was made possible in cooperation with:
  • Školska knjiga d.d.
  • Croatian Academy of Sciences and Arts (HAZU)
  • Stoljeća hrvatske književnosti, Matica hrvatska


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 