Third International Conference of Austroasiatic Linguistics
26-28 NOVEMBER 2007, Deccan College Post-Graduate & Research Institute,
Data Sharing in the Mon-Khmer Languages Project
Doug Cooper
Center for Research in Computational Linguistics
The Mon-Khmer Languages Project is a broad plan to support research in comparative
linguistics and lexicography. It was
created to provide a practical means for sharing lexicographic data and
comparative analysis, including both confirmed and edited results, and the
‘dark matter’ of working data and partial results that are, in some cases, our
only available resources.
The Project provides two linked,
Web-accessible resources with the usual array of search and presentation tools:
The MKL Project is intended to be both
accessible and extensible. ‘Source
filtering’ lets resource sets be defined as narrowly or broadly as desired; for
example, searches might include only data from a particular dictionary, or incorporate
all data available for a given language.
However they are defined, resource sets can also be extracted and downloaded
for off-line research.
New datasets that follow a simple XML
tagging protocol can also be added to the MKL Project databases. Every item is identified by its contributor’s
name, so the obvious issue of quality control is dealt with in a transparent,
elegant manner: source filtering can
include, or just as readily exclude, any individual’s contributions. Thus, only sources the user trusts, or items
that have been vetted by scholars the user trusts, will actually figure in
responding to any of the user’s queries.
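As a rough illustration of contributor-based source filtering, the following Python sketch reads a small XML dataset and keeps or drops entries by contributor name. The element and attribute names (`dataset`, `entry`, `contributor`, `source`, `language`, `form`, `gloss`), the sample records, and the `filter_entries` function are all hypothetical; they are assumptions made for this example and do not describe the actual MKL Project tagging protocol or software.

```python
import xml.etree.ElementTree as ET

# Hypothetical dataset. Tag names, attribute names, and all values are
# invented for illustration; they are NOT the MKL Project's real protocol.
SAMPLE = """
<dataset>
  <entry contributor="Contributor A" source="Dictionary X" language="Language 1">
    <form>form-1</form><gloss>water</gloss>
  </entry>
  <entry contributor="Contributor B" source="Field notes Y" language="Language 1">
    <form>form-2</form><gloss>water</gloss>
  </entry>
  <entry contributor="Contributor A" source="Dictionary X" language="Language 2">
    <form>form-3</form><gloss>fire</gloss>
  </entry>
</dataset>
"""


def filter_entries(xml_text, include=None, exclude=None):
    """Return entries as dicts, filtered by contributor name.

    include: if given, keep only entries whose contributor is in this set.
    exclude: if given, drop entries whose contributor is in this set.
    """
    root = ET.fromstring(xml_text.strip())
    kept = []
    for entry in root.iter("entry"):
        who = entry.get("contributor")
        if include is not None and who not in include:
            continue
        if exclude is not None and who in exclude:
            continue
        kept.append({
            "contributor": who,
            "source": entry.get("source"),
            "language": entry.get("language"),
            "form": entry.findtext("form"),
            "gloss": entry.findtext("gloss"),
        })
    return kept


if __name__ == "__main__":
    # Search only the data contributed by a scholar the user trusts.
    for item in filter_entries(SAMPLE, include={"Contributor A"}):
        print(item["language"], item["form"], item["gloss"])
```

Because every entry carries its contributor's name, the same function serves both cases described above: passing `include` restricts a query to trusted contributors, while passing `exclude` removes one individual's contributions from an otherwise broad search.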
The Mon-Khmer Languages Project is, above
all, a collaborative venture. We have
received wide support in the linguistics community in planning and acquiring
initial data for the project, and generous funding from the U.S. National
Endowment for the Humanities in launching it in May 2007. I look forward to describing the project’s
implementation, and to soliciting advice and comment on how it can best meet
its goal of enabling timely sharing of data and analysis by Mon-Khmer language
researchers.