A DataONE package is a collection of datasets and other files that are described by a metadata file.
A DataPackage is a dataone R object that can
contain a DataONE package that is either downloaded from DataONE or
newly created locally in R and then uploaded to DataONE.
After a package has been uploaded to a DataONE Member Node, the package submitter or other interested parties may determine that it needs to be updated, for example to add a missing file, replace one file with another, or remove a member from the package.
These types of modifications can be accomplished by downloading the
package from DataONE using the getDataPackage() method to
create a local copy of the package in R, modifying the package contents
locally, then uploading the modified package to DataONE.
The complete example script used to perform a package update is shown below:
library(datapack)
library(dataone)
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- "resource_map_urn:uuid:a9aeefcf-228c-4534-b4ad-b480a937be7d"
pkg <- getDataPackage(d1c, identifier=packageId, lazyLoad=TRUE, quiet=FALSE)
metadataId <- selectMember(pkg, name="sysmeta@formatId", value="eml://ecoinformatics.org/eml-2.1.1")
objId <- selectMember(pkg, name="sysmeta@fileName", value='Strix-occidentalis-obs.csv')
zipfile <- system.file("extdata/Strix-occidentalis-obs.csv.zip", package="dataone")
pkg <- replaceMember(pkg, objId, replacement=zipfile, formatId="application/octet-stream")
auxFile <- system.file("extdata/WeatherInf.txt", package="dataone")
auxObj <- new("DataObject", format="text/plain", filename=auxFile)
pkg <- addMember(pkg, auxObj, metadataId)
newPackageId <- uploadDataPackage(d1c, pkg, public=TRUE, quiet=FALSE)
This example script downloads the example package that was created
and uploaded to DataONE in the vignette upload-data.
Each line of this script will be described in the following sections.
The first step in updating a package is to download the package from
DataONE into R, so that it can be modified locally using methods in the
dataone package. Modifications can be made to the package
such as adding or removing members from the package, or changing the
contents of a package member, as would be the case if the wrong file was
initially uploaded.
The getDataPackage() method downloads all files
belonging to the package specified by the identifier
parameter.
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
pkg <- getDataPackage(d1c, identifier=packageId, lazyLoad=TRUE, quiet=FALSE)
Because packages might contain large files or a large number of
files, it is possible to lazy load the package. This means
that the metadata describing the files is downloaded, but the
file contents, the data bytes, are not. For more information on using
the ‘lazyLoad’ parameter, see the R console help for the
‘getDataPackage()’ function.
For the package update that will be shown, it is not necessary to
download the data bytes of each package member, so
lazy loading is acceptable. An example when lazy loading
would not be appropriate is when the files in a package will be used for
local computational processing.
Note that metadata files, such as the EML in this example, are always
downloaded regardless of the lazyLoad parameter value.
The downloaded package can be viewed by typing the DataPackage object
name at the console, which invokes the R show method for
the object:
pkg
Note that the show output for a DataPackage is condensed
to fit the width of the current R console. If the output is condensed
and more detail is required, set the R console width to a larger value,
for example, options(width=120), or if using
RStudio, widen the console window by clicking and dragging
on the window border.
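As a small illustration of the width setting mentioned above, the following base R sketch widens the console temporarily and then restores the previous value. The variable names oldOpts and widened are introduced here for illustration only and are not part of the example script.

```r
# options() returns the previous values, so they can be restored later
oldOpts <- options(width = 120)
widened <- getOption("width")

# In a real session, display the DataPackage here, for example by typing: pkg

# Restore the console width that was in effect before
options(oldOpts)
```

Saving and restoring the option keeps the wider setting from affecting later output in the same session.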
The original uploaded package included the file
Strix-occidentalis-obs.csv. This file can be replaced
with a different file, as would be necessary if it were determined that
the zipped form of the file should have been used instead.
First, determine which DataObject in the DataPackage pkg
contains the file to be replaced:
objId <- selectMember(pkg, name="sysmeta@fileName", value='Strix-occidentalis-obs.csv')
The selectMember method inspects every DataObject in the
package pkg and checks whether the R S4 slot
specified by the name argument matches the value specified with
the value argument. The identifier of any matching
DataObject is returned.
Documentation for the available DataObject slots can be viewed
with the command help("DataObject-class"). Because a
SystemMetadata object is contained in each DataObject, the
slots of SystemMetadata are also available; their documentation can be viewed with
help("SystemMetadata-class").
Next, update the DataObject in pkg to replace the file
Strix-occidentalis-obs.csv with
Strix-occidentalis-obs.csv.zip using the
replaceMember method:
objId <- selectMember(pkg, name="sysmeta@fileName", value='Strix-occidentalis-obs.csv')
zipfile <- system.file("extdata/Strix-occidentalis-obs.csv.zip", package="dataone")
pkg <- replaceMember(pkg, objId, replacement=zipfile, formatId="application/octet-stream")
The replaceMember method updates the DataPackage
pkg, replacing the data content of the DataObject with
identifier objId and updating the relevant system metadata
slots, such as size and checksum, to reflect
the new contents of the DataObject.
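To see why the size and checksum values must change when the content of a member is replaced, the following base R sketch compares a CSV file with a compressed copy of the same rows. It is only an illustration under stated assumptions: the helper describeFile is hypothetical, gzip is used in place of zip so that no external tool is needed, and MD5 is used because tools::md5sum ships with R. The checksum algorithm recorded in DataONE system metadata may differ.

```r
# Write a small CSV file and a gzip-compressed copy of the same rows
csvFile <- tempfile(fileext = ".csv")
gzFile <- tempfile(fileext = ".csv.gz")
obs <- data.frame(site = rep(c("A", "B"), 50), count = 1:100)
write.csv(obs, csvFile, row.names = FALSE)
con <- gzfile(gzFile, "w")
write.csv(obs, con, row.names = FALSE)
close(con)

# Hypothetical helper: collect the two values that describe the file contents
describeFile <- function(path) {
  list(size = file.size(path), checksum = unname(tools::md5sum(path)))
}

before <- describeFile(csvFile)
after <- describeFile(gzFile)

# Both values differ, even though the two files hold the same observations
sizeChanged <- before$size != after$size
checksumChanged <- before$checksum != after$checksum
```

Because both values are derived from the bytes of the file, they cannot be carried over from the original member when its content is replaced.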
In this example, the file WeatherInf.txt was mistakenly
omitted from the original package upload, so it is added now.
First, the identifier of the metadata DataObject that describes the package members is retrieved from the DataPackage:
metadataId <- selectMember(pkg, name="sysmeta@formatId", value="eml://ecoinformatics.org/eml-2.1.1")
The selectMember method returns the identifier of any
DataObject in the package whose R slot, named in the parameter
name, matches the value specified in the argument
value. In this particular package, there is only one
DataObject that has a formatId of
eml://ecoinformatics.org/eml-2.1.1, so only that identifier
is returned.
Note that the getValue method can be used to retrieve
the values for DataObject slots, for example:
getValue(pkg, name="sysmeta@formatId")
An R named list is returned, with the names being the identifier of each DataObject in the DataPackage and the values being the corresponding slot values.
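The shape of such a named list, and the kind of matching that selectMember is described as performing, can be sketched in base R. The identifiers and format IDs below are hypothetical stand-ins rather than values from the example package, and the filtering code is an illustration, not the datapack implementation.

```r
# Hypothetical identifiers (names) and formatId values, shaped like the list described above
formatIds <- list(
  "urn:uuid:1111" = "eml://ecoinformatics.org/eml-2.1.1",
  "urn:uuid:2222" = "text/csv",
  "urn:uuid:3333" = "application/octet-stream"
)

# Keep the identifiers whose value matches the EML format ID
emlFormat <- "eml://ecoinformatics.org/eml-2.1.1"
isEml <- vapply(formatIds, function(v) identical(v, emlFormat), logical(1))
emlIds <- names(formatIds)[isEml]
```

Here emlIds holds the single identifier whose value matches, which mirrors the one-match situation described for the example package.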
Next, add a new package member that was omitted from the original package:
auxFile <- system.file("extdata/WeatherInf.txt", package="dataone")
auxObj <- new("DataObject", format="text/plain", filename=auxFile)
pkg <- addMember(pkg, auxObj, metadataId)
The modified package can be reviewed before it is uploaded to DataONE by typing pkg at the console.
Now upload the modified package to DataONE. The uploadDataPackage method
inspects each DataObject in the DataPackage: DataObjects that have been
modified are updated, and DataObjects that have been added to the
DataPackage are uploaded.
newPackageId <- uploadDataPackage(d1c, pkg, public=TRUE, quiet=FALSE)
As DataPackage members are modified, the DataObject that holds the metadata that describes the package members may become outdated.
For example, it was shown above that a package member was updated to
contain the file Strix-occidentalis-obs.csv.zip instead of
Strix-occidentalis-obs.csv. After this change, the
DataObject holding the EML metadata should be updated so that the EML
element objectName for this dataset matches the new
filename:
metadataId <- selectMember(pkg, name="sysmeta@formatId", value="eml://ecoinformatics.org/eml-2.1.1")
nameXpath <- '//dataTable/physical/objectName[text()="Strix-occidentalis-obs.csv"]'
newName <- basename(zipfile)
pkg <- updateMetadata(pkg, metadataId, xpath=nameXpath, replacement=newName)
The updateMetadata method updates a DataObject
containing XML, replacing the value at the location in the document specified
with the xpath argument with the value provided in the
replacement argument. In this example, the XML contained in
the DataObject with identifier metadataId is an EML
metadata document. The EML element objectName is updated
with the value Strix-occidentalis-obs.csv.zip.
Note that the xpath argument uses the XPath query
language, which is used to locate elements within an XML document.
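To show how an XPath expression with a text predicate selects an element, the following sketch applies the objectName expression from above to a small XML fragment. It uses the xml2 package, which is an assumption made for illustration: the example script does not use xml2, and this is not how updateMetadata is implemented. The dataset wrapper element and the variable names eml and node are also hypothetical.

```r
library(xml2)

# A minimal stand-in for the relevant part of an EML document
eml <- read_xml(paste0(
  "<dataset><dataTable><physical>",
  "<objectName>Strix-occidentalis-obs.csv</objectName>",
  "</physical></dataTable></dataset>"
))

# The predicate [text()="..."] selects only the objectName with this exact text
nameXpath <- '//dataTable/physical/objectName[text()="Strix-occidentalis-obs.csv"]'
node <- xml_find_first(eml, nameXpath)

# Replace the text of the selected element; xml2 modifies the document in place
xml_text(node) <- "Strix-occidentalis-obs.csv.zip"

updatedName <- xml_text(xml_find_first(eml, "//objectName"))
```

Restricting the match with a predicate matters when a document describes several data tables, since an expression without it would select every objectName element.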
For EML metadata, another element that may need to be updated, if it is present in the original EML file, is the distribution url. This element contains the link that can be used to download the file from DataONE, which includes the DataONE identifier for the object. Because the identifier value is set when the DataObject is created, the EML metadata is out of date and needs to be updated with the new identifier value.
In this example, the section of the metadata that describes the
dataset Strix-occidentalis-obs.csv.zip will be updated.
A portion of the EML from our example looks like this:
<dataTable>
  <entityName>Strix Occidentalis</entityName>
  <entityDescription>A data file that contains only observations of Strix occidentalis</entityDescription>
  <physical>
    <objectName>Strix-occidentalis-obs.csv</objectName>
    <size unit="byte">6017</size>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>text/csv</formatName>
      </externallyDefinedFormat>
    </dataFormat>
    <distribution id="1430343425153">
      <online>
        <url function="download"></url>
      </online>
    </distribution>
  </physical>
  <entityType>Other</entityType>
</dataTable>
The following lines get the current identifier for the DataObject that contains the correct file and use this identifier to construct a standard DataONE access URL, using the base URL of the DataONE coordinating node.
objId <- selectMember(pkg, name="sysmeta@fileName", value='Strix-occidentalis-obs.csv.zip')
newURL <- sprintf("%s/%s/resolve/%s", d1c@cn@baseURL, d1c@cn@APIversion, objId)
newURL
The selectMember method checks the slot specified in the
name argument, in this case sysmeta@fileName,
of each DataObject in the DataPackage pkg
and returns the identifier of any member that matches the specified
value. This package has only one DataObject with the name
Strix-occidentalis-obs.csv.zip, so only one identifier is
returned in objId.
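The URL construction used above can be tried without a DataONE connection. In the sketch below, the base URL, API version, and identifier are assumed placeholder values standing in for d1c@cn@baseURL, d1c@cn@APIversion, and objId. The percent-encoded variant is an optional addition that is not part of the example script.

```r
cnBaseURL <- "https://cn-stage.test.dataone.org/cn"  # assumed value of d1c@cn@baseURL
apiVersion <- "v2"                                    # assumed value of d1c@cn@APIversion
exampleId <- "urn:uuid:00000000-0000-0000-0000-000000000000"  # hypothetical identifier

# Same sprintf pattern as in the example: <base URL>/<API version>/resolve/<identifier>
resolveURL <- sprintf("%s/%s/resolve/%s", cnBaseURL, apiVersion, exampleId)

# Optional variant: percent-encode reserved characters such as ":" in the identifier
encodedURL <- sprintf("%s/%s/resolve/%s", cnBaseURL, apiVersion,
                      utils::URLencode(exampleId, reserved = TRUE))
```

Whether the identifier needs to be encoded depends on the characters it contains and on how the URL will be used, so the second form is shown only as an option.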
Next, the metadata element for the distribution url is updated
in the DataObject that contains the EML metadata in the package
pkg:
metadataId <- selectMember(pkg, name="sysmeta@formatId", value="eml://ecoinformatics.org/eml-2.1.1")
xpathToURL <- "//dataTable/physical/distribution[../objectName/text()=\"Strix-occidentalis-obs.csv.zip\"]/online/url"
pkg <- updateMetadata(pkg, do=metadataId, xpath=xpathToURL, replacement=newURL)
When the modified metadata file is uploaded to DataONE, a new
identifier is required, because the modified metadata replaces the old
version by creating a new object in DataONE and marking the old one as
obsoleted by the new one. For this reason, the call to
updateMetadata assigns a new identifier to the metadata
DataObject in the DataPackage pkg, if
necessary.
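The relationship between the old and new versions of the metadata can be pictured with a small base R sketch. Plain lists stand in for system metadata here, the identifiers are hypothetical, and the field names are chosen to mirror the obsoletes and obsoletedBy slots of SystemMetadata. The sketch shows only the bookkeeping idea, not what the dataone or datapack packages do internally.

```r
# Stand-in for the system metadata of the metadata object that is already in DataONE
oldMeta <- list(identifier = "urn:uuid:old-metadata",
                obsoletes = NA_character_,
                obsoletedBy = NA_character_)

# The updated metadata gets a new identifier and points back at the old version
newMeta <- oldMeta
newMeta$identifier <- "urn:uuid:new-metadata"
newMeta$obsoletes <- oldMeta$identifier

# The old version records which object superseded it
oldMeta$obsoletedBy <- newMeta$identifier
```

Both versions remain identifiable, and the link between them lets a reader of the old version find its successor.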