Morphological Analysis from the Raw Kashmiri Corpus Using Open Source Extract Tool

Abstract (2. Language): 
Purpose: Morphological information is a key component in the design of any machine translation engine, information retrieval system, or other natural language processing application. Because manual lexicon development is highly time consuming, it is important to investigate how it can be automated while maintaining the quality that makes the lexicon useful to applications. The paper describes how supplying extraction rules together with raw text can guide the automated extraction of morphological information using a tool such as Extract v2.0.
Design/methodology/approach: We used Extract v2.0, an open source tool for extracting linguistic information from raw text, in particular inflectional information about words based on the word forms appearing in the text. The input to Extract is a file containing an unannotated Kashmiri corpus and a file containing the Extract rules for the language. The tool's output is a list of analyses; each analysis consists of a sequence of words annotated with an identifier that describes some linguistic information about the word.
Findings: The study presents the fundamental extraction rules that guide Extract v2.0 in extracting inflectional information and that help in developing a full lexicon usable for building natural language processing applications. The major contributions of the study are:
• Orthography component: a Unicode infrastructure to accommodate the Perso-Arabic script of Kashmiri.
• Morphology component: a type system that covers the language abstraction, and an inflection engine that covers word-and-paradigm morphological rules for all word classes.
Research Implications: The study does not include all the rules, but it can serve as a prototype for extending the functionality of the lexicon. It is an attempt to extract morphological information automatically using the Extract tool.
Originality/Value: Kashmiri is the most widely spoken language in the state of Jammu and Kashmir, yet very few software tools and applications exist for it. The study provides a framework for developing a full-size lexicon for the Kashmiri language from raw text, and is an attempt to provide lexicon support for applications that make use of Kashmiri. It can be extended to develop a spoken lexicon of Kashmiri for use in spoken dialogue systems.
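The workflow described under Design/methodology/approach (raw text plus paradigm rules in, identifier-annotated word sequences out) can be illustrated with a minimal Python sketch. This is not Extract v2.0 or its rule syntax: the rule table `RULES`, the identifiers `noun_s` and `verb_ed`, the function `extract_paradigms`, and the toy English sample are all hypothetical, chosen only to show the general idea of accepting a stem when every word form its paradigm predicts is attested in the corpus.

```python
# Illustrative sketch only; this does NOT reproduce Extract v2.0 or its rule format.
# Hypothetical rules: paradigm identifier -> suffixes that must all be attested.
RULES = {
    "noun_s": ["", "s"],
    "verb_ed": ["", "ed", "ing"],
}


def extract_paradigms(text, rules):
    """Return (identifier, word forms) pairs for stems whose full paradigm occurs in text."""
    # Naive whitespace tokenization; a real corpus would need proper normalization.
    words = set(text.lower().split())
    analyses = []
    for ident, suffixes in rules.items():
        # Collect candidate stems by stripping each suffix from each attested word.
        stems = set()
        for word in words:
            for suffix in suffixes:
                if not suffix:
                    stems.add(word)
                elif word.endswith(suffix) and len(word) > len(suffix):
                    stems.add(word[: -len(suffix)])
        # Keep a stem only if every form predicted by the rule appears in the text.
        for stem in sorted(stems):
            forms = [stem + suffix for suffix in suffixes]
            if all(form in words for form in forms):
                analyses.append((ident, forms))
    return analyses


sample = "walk walked walking book books"
result = extract_paradigms(sample, RULES)
for ident, forms in result:
    print(ident, forms)
```

Requiring every form of a paradigm to be attested is a deliberately strict assumption made for this sketch; it keeps false positives low on small samples at the cost of missing stems whose forms are only partly present in the corpus.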
176-187

