[eml-dev] [Bug 585] - internationalization needed in EML
Matt Jones
jones at nceas.ucsb.edu
Tue Dec 9 15:08:02 PST 2008
Hi Éamonn,
Thanks for the INSPIRE document. That clarified things a lot. The section
A.6.3 was particularly useful. The crux seems to be that they have defined
a complex type called PT_FreeText_PropertyType that allows a simple
character string in the default language to be accompanied by one or more
versions in other languages. So, for abstract, they would have:
<abstract xsi:type="PT_FreeText_PropertyType">
<gco:CharacterString>Brief narrative summary of the content of the
resource</gco:CharacterString>
<!--== Alternative value ==-->
<PT_FreeText>
<textGroup>
<LocalisedCharacterString locale="#locale-fr">Résumé succinct
du contenu de la ressource</LocalisedCharacterString>
</textGroup>
</PT_FreeText>
</abstract>
So, the PT_FreeText_PropertyType is very similar in concept to the EML
TextType. We could indeed define a new set of types that use this same
trick, basically allowing textGroup subelements with alternate language
strings. Or we could simply use the definition of PT_FreeText in EML via an
import (except that there may be restrictions on free reuse of ISO
standards, which would prevent us from incorporating such a thing directly
in EML, as redistribution is critical to an open standard).
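
As a rough illustration of how a consumer might read such an element, here is a minimal Python sketch using only the standard library. It assumes the unprefixed elements in the snippet above sit in the ISO 19139 gmd namespace (the INSPIRE excerpt omits the namespace declarations), and the helper name read_free_text is hypothetical, not part of any EML or ISO tooling.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by ISO 19139 instance documents.
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"

# The abstract example from above, with namespace declarations added
# (an assumption: the excerpt in the email leaves them out).
SAMPLE = """\
<abstract xmlns="http://www.isotc211.org/2005/gmd"
          xmlns:gco="http://www.isotc211.org/2005/gco"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:type="PT_FreeText_PropertyType">
  <gco:CharacterString>Brief narrative summary of the content of the resource</gco:CharacterString>
  <PT_FreeText>
    <textGroup>
      <LocalisedCharacterString locale="#locale-fr">Résumé succinct du contenu de la ressource</LocalisedCharacterString>
    </textGroup>
  </PT_FreeText>
</abstract>
"""


def read_free_text(element):
    """Return (default_text, {locale_ref: text}) for one free-text element."""
    default = element.findtext(f"{{{GCO}}}CharacterString")
    translations = {}
    for node in element.iter(f"{{{GMD}}}LocalisedCharacterString"):
        # Collapse internal whitespace so wrapped text compares cleanly.
        translations[node.get("locale")] = " ".join(node.text.split())
    return default, translations


default_text, translations = read_free_text(ET.fromstring(SAMPLE))
```

The default-language string and its translations come back from a single element, which is the property that keeps the alternate-language versions semantically linked.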
I'll add this to the bugzilla bug for a record of the discussion for EML
developers.
Matt
On Tue, Dec 9, 2008 at 1:44 PM, Éamonn Ó Tuama <eotuama at gbif.org> wrote:
> Hi Matt,
>
> I agree about the inaccessibility of ISO standards - I also had to use a
> draft release of ISO 19115. At least the ISO 19139 XSD schemas are freely
> available once you accept a stern copyright notice. You can view them
> here:
> http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/, or download as a zipped archive by going here (one dir up) and searching
> for 19139: http://standards.iso.org/ittf/PubliclyAvailableStandards/ .
>
> Regarding the points you raise below -
>
> 'contact' : details for the metadata writer could be different to those of
> the data custodian, collector, etc.
>
> 'locale' : I presume using the combination of language code and country is
> because one country can have multiple languages and one language can be
> spoken in many countries.
>
> I was using the term "attribute" in the general sense of a property and not
> in the strict XML sense of element vs attribute. 'locale' is actually
> expressed as a repeating element in ISO 19139.
>
> I have attached an incomplete example of an ISO 19139 instance document to
> show you how I understand multiple locales are used with related, translated
> free text elements. I have based this on an example in an INSPIRE document
> "Draft Guidelines – INSPIRE metadata implementing rules based on ISO 19115
> and ISO 19119" (
> http://inspire.brgm.fr/Documents/MD_IR_and_ISO_20080425.pdf)
> See Section A.6, page 34 - 37.
> I'm only beginning to explore the schemas themselves and the multilingual
> aspects of ISO 19139 so apart from what I have copied from the INSPIRE doc,
> the encoding in the example may not be fully correct.
>
> I have joined the eml-dev list so I assume this will now get posted there.
>
> Éamonn
>
>
>
>
> Matt Jones wrote:
>
> Thanks, Éamonn, for the information. Very useful.
>
> The frustrating thing about ISO standards is how impossible they are to
> obtain. I have an old draft copy of ISO 19115, but neither I nor the UC
> library has a copy of the current standard or of the 19139. I have a
> fundamental philosophical problem with standards that are not free and
> open. Nevertheless, I will continue to try to find a copy of these so that
> I can look into it.
>
> In the meantime, a couple of comments below in your note...
>
> On Tue, Dec 9, 2008 at 12:23 AM, Eamonn O Tuama (GBIF) <eotuama at gbif.org> wrote:
>
>> Hi All,
>>
>> I presume any extensions to EML will involve changes to the schemas and
>> therefore versioning. I do not know how complicated that will be – someone
>> familiar with the EML schemas construction is best suited to answering.
>> However, I think we might be able to learn something from the ISO
>> 19115/19139 standard regarding multilingual metadata.
>>
>> First of all, it provides a distinct set of attributes for the metadata
>> document itself (rather than the data the metadata document is describing).
>> These include:
>>
>> 1. fileIdentifier
>>
> same as EML packageID
>
>> 2. language
>>
> see my earlier discussion on this issue
>
>> 3. characterSet
>>
> same as 'encoding' attribute in the XML prolog
>
>> 4. contact
>>
> why would the metadata contact be different from the data contact? We
> have trouble enough keeping one up to date
>
>> 5. dateStamp
>>
> this would be useful, we should consider adding it to EML. Presumably,
> this is the date on which the metadata document was last updated, which
> would probably belong in the 'maintenance' section of EML
>
>> 6. metadataStandardName
>>
> provided in the namespace of EML
>
>> 7. metadataStandardVersion
>>
> provided in the namespace of EML
>
>> 8. locale
>>
> this could be useful, although it seems like providing the language code
> would be just as effective and essentially redundant.
>
>> ISO 19139 (the implementation standard for the conceptual model in ISO
>> 19115) also provides a means for encoding multilingual metadata. This is
>> achieved through use of an optional, repeatable "locale" attribute
>> consisting of language, country and characterset encodings.
>>
> This sounds interesting. So, how does it repeat? XML attributes are not
> repeatable, nor do they have substructure. Is it an element? And if so, is
> it a child element of every other element in the model?
>
>> Multiple instances of locale may be defined for a metadata document and
>> translations representing those locales provided for each metadata element.
>> So, repeatability in multiple languages is built in.
>>
> I don't quite see how this would work. Could you show a brief snippet as
> an example. For example, for the title of the dataset, how would you encode
> two titles, each in both english and spanish, and be able to tell which of
> the elements were semantically linked? Here's one way I could see doing it,
> but it's a bit clunky:
> <title>
> <translation xml:lang="en">Forests of New Mexico</translation>
> <translation xml:lang="es">Bosques del Nuevo México</translation>
> </title>
> <title>
> <translation xml:lang="en">Survey of Plants and Animals</translation>
> <translation xml:lang="es">Estudio de Plantas y Animales</translation>
> </title>
>
> How would the ISO 19139 propose representing this content?
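
A minimal sketch of how the grouping above could be read back, in Python with the standard library only. The <dataset> wrapper and the helper name titles_by_language are assumptions made for illustration; neither <translation> nor this layout is part of EML today.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The two titles from the example above, wrapped in a hypothetical
# <dataset> element so the fragment parses as one document.
SAMPLE = """\
<dataset>
  <title>
    <translation xml:lang="en">Forests of New Mexico</translation>
    <translation xml:lang="es">Bosques del Nuevo México</translation>
  </title>
  <title>
    <translation xml:lang="en">Survey of Plants and Animals</translation>
    <translation xml:lang="es">Estudio de Plantas y Animales</translation>
  </title>
</dataset>
"""


def titles_by_language(dataset):
    """Return one {language: text} mapping per <title> element."""
    return [
        {t.get(XML_LANG): t.text for t in title.findall("translation")}
        for title in dataset.findall("title")
    ]


titles = titles_by_language(ET.fromstring(SAMPLE))
```

Each title yields its own mapping, so the English and Spanish versions of one title stay linked and remain distinct from the second title.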
>
>> The ability to work with multiple languages is seen as a strong
>> advantage in moving from the FGDC metadata standard to the North American
>> Profile (NAP) of ISO 19115. The problem, at the moment, is that a biological
>> profile in ISO 19115 does not exist but it seems that work is underway to
>> express the FGDC Biological Profile in ISO. (I understand that EML based
>> their taxonomic module directly on the FGDC biological profile component.)
>>
> Actually, the BDP standard first got these fields from EML 1.3.x and 1.4.x,
> and then EML 2.x reincorporated the changes from the BDP. Either way, we've
> been looking at replacing the EML taxonomic module with something more in
> line with TDWG standards, in particular with TCS. I have worked out a new
> set of schemas for eml-taxon with Jessie Kennedy and Bob Peet that directly
> incorporate TCS, but I haven't had time to introduce these changes to the
> rest of the EML community. On the TODO list. Nevertheless, as you said,
> there's a lot of compatibility between EML and the BDP.
>
>>
>>
>> The European Union, because of its composition, has always faced the
>> challenge of dealing with multiple languages. A document by the European
>> Committee for Standardisation (CEN) on "Geographic information — Standards,
>> specifications, technical reports and guidelines, required to implement
>> Spatial Data Infrastructure" (can't find URL where I downloaded originally
>> but have PDF if anyone wants it) provides some insights on "Cultural and
>> Linguistic Adaptability" where it places the emphasis on use of multilingual
>> thesauri rather than efforts to translate element contents.
>>
> Interesting. I'd like to see that. So, given a metadata document in
> Chinese, they are arguing that scientists that speak other languages can
> get by with multilingual thesaurus entries in place of the natural language
> metadata? I find this somewhat unconvincing if you really want to re-use
> the data.
>
> Thanks for your comments, Éamonn.
>
> Matt
>
>> See also Nowak et al paper "Issues of multilinguality in creating a
>> European SDI – the perspective for spatial data interoperability"
>>
>> http://www.ec-gis.org/Workshops/11ec-gis/papers/309nowak.pdf
>>
>>
>>
>> Regards,
>>
>>
>>
>> Éamonn
>>
>>
>>
>>
>>
>> *From:* David Blankman [mailto:dblankman1 at gmail.com]
>> *Sent:* 08 December 2008 20:59
>> *To:* Matt Jones
>> *Cc:* inigo san gil; eml-dev at ecoinformatics.org;
>> bugzilla-daemon at ecoinformatics.org; Vivian B Hutchison;
>> burkeker at gate.sinica.edu.tw; chin at tfri.gov.tw; guoxb at igsnrr.ac.cn;
>> hehl at igsnrr.ac.cn; lijh at sdb.cnic.cn; Aikiko Ogawa; Eamonn O Tuama;
>> Kristin Vanderbilt; Schentz Herbert; Shang; Su Wen; Werf, Bert van der
>> *Subject:* Re: [eml-dev] [Bug 585] - internationalization needed in EML
>>
>>
>>
>> As I think back upon the discussions in China and my discussions with Matt
>> at ISEI, it seems to me that my initial thought still holds: multiple language
>> versions of EML documents are probably better handled by creating separate
>> EML documents for each language used. EML is already complex; I see no
>> reason to make it more complex.
>>
>>
>> In the ILTER situation we are asking ILTER member networks to provide a
>> core of EML in English, on the understanding that more complete metadata may
>> be in another language. In this case, should there be an EML module,
>> eml-ilter or eml-language, analogous to eml-access, that specifies the
>> identifier of the "main" eml-document and the language of that document?
>> This module might also include an element to record a brief statement about
>> the amount of data in that foreign language. I am not sure what else might
>> be appropriate for this module. I know that Matt was thinking that there
>> might be some modifications to metacat replication that might be needed.
>>
>> David
>>
>>
>>
>>
>> On Mon, Dec 8, 2008 at 1:34 PM, Matt Jones <jones at nceas.ucsb.edu>
>> wrote:
>>
>> David and I discussed (briefly) some of these issues at ISEI. And we also
>> discussed them at the ILTER meeting in China. The 'language' tag in
>> eml-resource defines the language of the resource, which in the case of
>> eml-dataset resources means the language of the data. Interestingly, we
>> don't really have a language tag per se for the EML document content itself,
>> except that all XML documents can use the built-in "xml:lang" attribute,
>> which is optional for all XML elements (
>> http://www.w3.org/TR/REC-xml/#sec-lang-tag). This allows one to set the
>> language for each and every element in an XML document, such as:
>>
>> <title xml:lang="en">North American Forests</title>
>> <title xml:lang="es">Bosques de Norte Americano</title>
>>
>> Two problems we would need to address with this approach come immediately
>> to mind:
>>
>> 1) Many elements in EML are not repeatable, and therefore it is not
>> possible to have one copy of the element in English and another in a
>> different language. So cardinality would have to be updated throughout the
>> EML schemas, which would make some aspects of validation more confusing.
>> 2) For those elements that are already repeatable or are made repeatable
>> through a revision, there is no mechanism to indicate that the two element
>> nodes are meant to have the same semantic meaning in different languages,
>> as opposed to two semantically different elements that happen to also differ
>> in their language.
>>
>> This second issue is the one that would require more structural changes to
>> EML. For example, one might sometimes want to have more than one title
>> (which is why title is currently repeatable), but other times want to have
>> one title in two different languages. Either way, EML's current structures
>> don't allow these subtleties to be specified.
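
To make the second problem concrete, here is a small Python sketch (standard library only). The <dataset> wrapper, the third title, and the helper name group_by_language are assumptions added for illustration. Grouping flat, repeated titles by xml:lang shows what a reader can recover and what is lost.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Flat, repeated titles as in the xml:lang approach above. The wrapper
# element and the third title are hypothetical additions.
SAMPLE = """\
<dataset>
  <title xml:lang="en">North American Forests</title>
  <title xml:lang="es">Bosques de Norte Americano</title>
  <title xml:lang="en">Survey of Plants and Animals</title>
</dataset>
"""


def group_by_language(dataset):
    """Collect title strings per language tag, in document order."""
    grouped = defaultdict(list)
    for title in dataset.findall("title"):
        # "und" is the BCP 47 tag for an undetermined language.
        grouped[title.get(XML_LANG, "und")].append(title.text)
    return dict(grouped)


grouped = group_by_language(ET.fromstring(SAMPLE))
```

The result holds two English titles and one Spanish title, and nothing in the markup says which English title the Spanish one translates.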
>>
>> Matt
>>
>>
>>
>> On Fri, Dec 5, 2008 at 12:54 PM, inigo san gil <isangil at lternet.edu>
>> wrote:
>>
>>
>> Metadata folks:
>>
>> I think this opens (perhaps re-opens) an interesting discussion.
>>
>> EML's resource (main module) offers us a <language> element that,
>> as I understand it, serves to specify the language used for the document.
>> The cardinality is set to <= 1, so it is optional, and if used, only one
>> language.
>>
>> However, we understood from Kristin Vanderbilt and David Blankman
>> that at a recent ILTER meeting, there was an agreement to provide
>> referential-level EML for all metadata in English (and perhaps
>> richer EML in their native languages).
>> The option David proposes, providing content in two languages,
>> one being English, does not play well with the EML schema as is.
>> There are options in the interim, while we think about whether 'we' tweak
>> the EML schema. Some solutions go in the direction of "duplicating" the
>> original EML record: take what is in the native language, and either
>> have it translated at some minimal-compliance level of EML (ouch) or
>> run it by a translation web service and laugh (or rather cry) at the
>> results.
>>
>> There are of course many other approaches to this problem, Mark
>> Servilla mentioned some in the hallways of the LTER Network Office.
>>
>> The thing is that part of the international community in ecology has
>> expressed formal interest/commitment in using EML to document their
>> metadata. The ILTER group quickly realized the Babelian challenge
>> ahead (see Blankman's ISEI-6 presentation & future paper), and
>> David, Akiko Ogawa and others took on helping ILTER provide
>> basic EML in English (remember ILTER committed to use English
>> -chinglish and spanglish- as the lingua franca for referential level EML,
>> EML level 1: title, creator, abstract, contact at least).
>>
>> Cheers,
>> Inigo
>>
>>
>>
>>
>> bugzilla-daemon at ecoinformatics.org wrote:
>>
>> http://bugzilla.ecoinformatics.org/show_bug.cgi?id=585
>>
>>
>>
>>
>>
>> ------- Comment #2 from mob at icess.ucsb.edu 2008-12-05 09:31 -------
>> This comment from an email from David Blankman:
>> As EML is becoming an international standard, we need to start thinking about
>> ways to make EML more intelligent about multiple languages. While EML allows
>> multiple titles, there is currently no way to indicate that multiple titles
>> are equivalent. For example, if I have:
>> <title> North American Forests </title> AND
>> <title> Bosques de Norte Americano</title>
>>
>> EML currently has no way to indicate that these are the same title, just in a
>> different language.
>>
>> Matt and I were talking about this at the ISEI-Cancun meeting, but I thought
>> that it would be a good idea to get this discussion started within eml-dev and
>> the ILTER group as well.
>> _______________________________________________
>> Eml-dev mailing list
>> Eml-dev at ecoinformatics.org
>> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/eml-dev
>>
>> --
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> Matthew B. Jones
>> Director of Informatics Research and Development
>> National Center for Ecological Analysis and Synthesis (NCEAS)
>> UC Santa Barbara
>> jones at nceas.ucsb.edu Ph: 1-907-523-1960
>> http://www.nceas.ucsb.edu/ecoinfo
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> --
>> Nature is trying very hard to make us succeed, but nature does not depend
>> on us. We are not the only experiment.
>> - R. Buckminster Fuller
>>
>> If I am not for myself, then who will be for me? If I am for myself alone,
>> then who am I? If not now, when?
>> - Rabbi Hillel
>>