(Adapted from LBCC)
Data Management

About 50% of the institutions that provide data to a Symbiota portal maintain a primary database for entering and annotating data outside of Symbiota, and the portal displays only a snapshot of their data. Regular synchronization between the portal snapshot and the central database keeps the snapshot up to date within the portal. Many other institutions use a Symbiota portal as their central management database, with record modifications reflected as they are made (a live dataset); these collections need no infrastructure for regularly updating the portal data. As Symbiota continues to add functionality for annotating records and images, the live collections are positioned to realize this added value directly.
Specimen Record Review
All specimen records will require review. Depending on the results of the automated processing steps, the review process will consist of simple approval, minor editing, importing duplicate record data, reprocessing of OCR, reparsing with trained NLP parsers, and/or simply keystroking the label information.
Use Case Scenarios
- Simple Approval: In the best-case scenario, reviewers will simply need to approve the record.
- Minor Editing: Most records will likely need some type of data adjustment before approval.
- Importing Duplicate / Exsiccatae Record Data: In cases where a duplicate or exsiccatae record has already been processed by a partner institution, the reviewer will be able to view a list of pending duplicate records and selectively import data from the best-matching record. Reviewers will also be able to process these records in batches (see the duplicate-matching sketch following this list).
- Reprocessing of OCR: The reviewer will have the ability to rerun OCR on a particular image directly from the review page.
- Reprocessing NLP Parsers: The reviewer will have the ability to reparse the raw text. There may be two alternative parsing algorithms, and one may work better with some label formats than the other. Furthermore, the central parsing algorithms will be able to "learn" how to better parse labels that share the same layout, e.g., labels from the same collector, or from a herbarium that used pre-printed label forms. The reviewer will be able to select label profiles that were specifically trained to map label content onto database fields based on its location or word frequency within the label (see the OCR and parsing sketch following this list).
- Keystroking Label Information: Labels that are handwritten or yield generally poor OCR output will have to be hand-typed into the data entry form. Unfortunately, keystroking will be necessary for many of the older labels; however, these labels tend to contain little information that needs to be entered.
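To make the duplicate-import step concrete, below is a minimal sketch of how pending duplicate candidates might be ranked, assuming records are dictionaries keyed by Darwin Core terms (recordedBy, recordNumber, eventDate). The matching rules and threshold are illustrative assumptions, not Symbiota's actual duplicate-detection logic.

```python
from difflib import SequenceMatcher

def duplicate_candidates(record, pending, threshold=0.85):
    """Rank pending records that may be duplicates/exsiccatae of `record`.

    Matching on collector + collector number + date is illustrative;
    the portal's actual duplicate-detection logic may differ.
    """
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    candidates = []
    for other in pending:
        # Collector number and date are strong signals; require exact matches.
        if record["recordNumber"] != other["recordNumber"]:
            continue
        if record["eventDate"] != other["eventDate"]:
            continue
        # Collector names vary in formatting, so compare them fuzzily.
        score = similarity(record["recordedBy"], other["recordedBy"])
        if score >= threshold:
            candidates.append((score, other))
    # Best-matching record first, so the reviewer can import from it.
    return [rec for score, rec in sorted(candidates, key=lambda c: -c[0])]
```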
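Similarly, the two reprocessing steps can be illustrated together. The sketch below assumes a local Tesseract installation accessed via pytesseract; the fixed regular expressions stand in for the trained label profiles described above, which would instead learn layout- and vocabulary-specific rules.

```python
import re
from PIL import Image
import pytesseract  # assumes Tesseract is installed locally

def rerun_ocr(image_path):
    """Re-OCR a label image; in the portal this step runs server-side."""
    return pytesseract.image_to_string(Image.open(image_path))

# Illustrative field patterns; a trained label profile would learn rules
# from field location or word frequency rather than use fixed regexes.
FIELD_PATTERNS = {
    "recordedBy": re.compile(r"(?:Coll(?:ector)?|Leg)\.?:?\s*(.+)", re.I),
    "recordNumber": re.compile(r"(?:No|#)\.?\s*(\d+)", re.I),
    "eventDate": re.compile(r"\b(\d{1,2}\s+\w+\s+\d{4})\b"),
}

def parse_label(raw_text):
    """Map raw OCR text onto Darwin Core-style database fields."""
    parsed = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(raw_text)
        if match:
            parsed[field] = match.group(1).strip()
    return parsed
```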
Portal and Central Database Synchronization

In addition to regular updates of the data snapshot within the data portals, collections that maintain in-house central databases need the ability to transfer new or edited records that have been processed within the portal. Collections that manage their data directly within the portal have no need for this infrastructure, since their central and portal datasets are one and the same.
- Refresh Portal Snapshots: When the portal features a data snapshot of an herbarium's central database, the snapshot needs to be refreshed at regular intervals. Portals include several built-in tools and services to accomplish this. For more information, visit the Symbiota documentation website.
- Download New Records: Records entered within the portal from images of specimen labels need to be transferred to the collection's central database at regular intervals. Password-protected download modules will aid collection managers in performing regular downloads in the data formats that best match their needs. As an example, collections utilizing Specify as their data management system will be able to download recently reviewed records as a Darwin Core CSV file and import them into their central database using the Specify Workbench (see the first sketch following this list).
- Downloading Recent Edits: Portals can make use of crowdsourcing and community involvement to aid in data cleaning, georeferencing, and error resolution. These edits will need to be downloaded regularly by data managers and integrated into their central database. To ensure that edits not yet transferred to the central database are not overwritten at the next refresh of the data snapshot, the edits are preserved (versioned) in a separate layer from the snapshot and reapplied as needed (see the second sketch following this list).
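The following sketch illustrates one way a downloaded Darwin Core CSV might be reshaped for Specify Workbench import. The column mapping and file names are illustrative assumptions; the actual mapping depends on the collection's Specify schema.

```python
import csv

# Illustrative mapping from Darwin Core terms to Specify Workbench
# columns; the real mapping depends on the collection's configuration.
DWC_TO_SPECIFY = {
    "catalogNumber": "Catalog Number",
    "recordedBy": "Collector",
    "recordNumber": "Collector Number",
    "eventDate": "Date Collected",
    "locality": "Locality Name",
}

def prepare_workbench_rows(dwc_csv_path, out_path):
    """Convert a portal Darwin Core CSV download into a Workbench-ready file."""
    with open(dwc_csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(DWC_TO_SPECIFY.values()))
        writer.writeheader()
        for row in reader:
            writer.writerow({new: row.get(old, "")
                             for old, new in DWC_TO_SPECIFY.items()})
```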
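Finally, the versioned edit layer can be pictured as a simple overlay keyed by a stable record identifier. The sketch below assumes both the snapshot and the edit layer are dictionaries keyed by the Darwin Core occurrenceID; Symbiota's actual versioning model is richer than this.

```python
def reapply_edits(snapshot, edit_layer):
    """Overlay versioned portal edits onto a freshly refreshed snapshot.

    Both arguments map a stable record identifier (e.g., occurrenceID)
    to a dictionary of field values. Edits not yet transferred to the
    central database survive each refresh because they live in their
    own layer and are merged back on top of the new snapshot.
    """
    merged = {}
    for occurrence_id, fields in snapshot.items():
        record = dict(fields)  # start from the refreshed snapshot
        record.update(edit_layer.get(occurrence_id, {}))  # portal edits win
        merged[occurrence_id] = record
    return merged
```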