Data Sharing in the Mon-Khmer Languages Project

Third International Conference of Austroasiatic Linguistics

26-28 NOVEMBER 2007, Deccan College Post-Graduate & Research Institute, Poona, India

Doug Cooper

Center for Research in Computational Linguistics

The Mon-Khmer Languages Project is a broad plan to support research in comparative linguistics and lexicography. It was created to provide a practical means for sharing lexicographic data and comparative analysis. This covers confirmed and edited results as well as the ‘dark matter’ of working data and partial results that are, in some cases, our only available resources.

The Project provides two linked, Web-accessible resources with the usual array of search and presentation tools:

The Mon-Khmer languages database is an on-line store of lexicographic data. Drawn from both published and unpublished sources, the database will ultimately provide a snapshot of what is known about each of the Mon-Khmer languages that is relevant for comparative purposes, including glossing and phonetic transcription.
The Mon-Khmer etymology database serves a similar role for analysis. It will initially be based on data extracted from Shorto’s Mon-Khmer Comparative Dictionary (2006), the most extensive such resource and a fitting starting point for this effort.

The MKL Project is intended to be both accessible and extensible. ‘Source filtering’ lets resource sets be defined as narrowly or broadly as desired; for example, searches might include only data from a particular dictionary, or incorporate all data available for a given language. However they are defined, resource sets can also be extracted and downloaded for off-line research.
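As a rough illustration of source filtering, the sketch below defines a resource set by language, by source, or by both, and then writes it out as CSV for off-line use. It is a minimal sketch under stated assumptions, not the Project's implementation: the `Entry` fields, the function names `source_filter` and `export_csv`, and the placeholder records are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One lexical record (hypothetical fields, for illustration only)."""
    language: str
    source: str
    form: str
    gloss: str


# Placeholder records; these are not real lexical data.
ENTRIES = [
    Entry("Language A", "Dictionary 1", "form-1", "water"),
    Entry("Language A", "Field notes 2", "form-2", "water"),
    Entry("Language B", "Dictionary 1", "form-3", "fire"),
]


def source_filter(entries, languages=None, sources=None):
    """Keep entries matching the given languages and sources.

    Passing None for either argument means "no restriction".
    """
    return [
        e for e in entries
        if (languages is None or e.language in languages)
        and (sources is None or e.source in sources)
    ]


def export_csv(entries):
    """Serialize a resource set as CSV text suitable for download."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["language", "source", "form", "gloss"])
    for e in entries:
        writer.writerow([e.language, e.source, e.form, e.gloss])
    return buffer.getvalue()


# Narrow set: a single dictionary. Broad set: everything for one language.
narrow = source_filter(ENTRIES, sources={"Dictionary 1"})
broad = source_filter(ENTRIES, languages={"Language A"})
```

Treating an omitted criterion as "no restriction" lets the same function serve both the narrow and the broad case described above.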

New datasets that follow a simple XML tagging protocol can also be added to the MKL Project databases. Every item is identified by its contributor’s name, so the obvious issue of quality control is dealt with in a transparent, elegant manner: source filtering can include, or just as readily exclude, any individual’s contributions. Thus, only sources the user trusts, or items that have been vetted by scholars the user trusts, will actually figure in responding to any of the user’s queries.
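The idea of contributor-tagged items can be sketched as follows. The abstract does not specify the tagging protocol, so the element and attribute names here (`dataset`, `entry`, `contributor`, `language`, `form`, `gloss`) are assumptions made purely for illustration, as are the function names and the placeholder records.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the Project's actual tagging protocol may differ.
SAMPLE = """\
<dataset>
  <entry contributor="Contributor A" language="Language A">
    <form>form-1</form>
    <gloss>water</gloss>
  </entry>
  <entry contributor="Contributor B" language="Language A">
    <form>form-2</form>
    <gloss>water</gloss>
  </entry>
  <entry contributor="Contributor C" language="Language B">
    <form>form-3</form>
    <gloss>fire</gloss>
  </entry>
</dataset>
"""


def load_entries(xml_text):
    """Parse the dataset into dictionaries, keeping each item's contributor."""
    root = ET.fromstring(xml_text)
    return [
        {
            "contributor": node.get("contributor"),
            "language": node.get("language"),
            "form": node.findtext("form"),
            "gloss": node.findtext("gloss"),
        }
        for node in root.iter("entry")
    ]


def filter_by_contributor(entries, include=None, exclude=()):
    """Keep items from trusted contributors and drop excluded ones.

    include=None means every contributor is accepted unless excluded.
    """
    return [
        e for e in entries
        if (include is None or e["contributor"] in include)
        and e["contributor"] not in exclude
    ]


entries = load_entries(SAMPLE)
trusted_only = filter_by_contributor(entries, include={"Contributor A"})
without_b = filter_by_contributor(entries, exclude={"Contributor B"})
```

Because the contributor travels with every item, inclusion and exclusion are the same operation seen from two sides, and a query only ever sees the items that survive the filter.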

The Mon-Khmer Languages Project is, above all, a collaborative venture. We have received wide support from the linguistics community in planning the project and acquiring its initial data, and generous funding from the U.S. National Endowment for the Humanities to launch it in May 2007. I look forward to describing the project’s implementation, and to soliciting advice and comment on how it can best meet its goal of enabling timely sharing of data and analysis by Mon-Khmer language researchers.