Linking Images for OCR

  1. Nightly Image Uploads: Images of the labels will be uploaded to the web servers via an FTP drop box. The upload process will create 2-3 web versions of each image (thumbnail, medium, large) for display through the web portals.
  2. Linking Images to Portal Database: During the upload process, the barcode identifier will be obtained either from the image file name or directly from the image using OCR. The barcode identifier will be used to locate specimen records that already exist within the portal database and link the image to them. Previously existing records will be given a processing status of “pending review”. If no specimen record exists yet, the image will be linked to a new specimen record populated only with the barcode identifier, and the new record will be given a processing status of “unprocessed”. If the imaging workflow records the most recent identification, it will be appended to the record at this time.
  3. Automated OCR: Automated scripts will attempt to harvest raw text from each “unprocessed” image. When valid text is returned, it will be stored as a raw text block linked to the specimen record. Processing status will be changed to “OCR processed”.
  4. Automated NLP: Automated scripts will attempt to parse the raw text into Darwin Core-compliant data fields. On success, the parsed data will be appended to the appropriate Symbiota data fields and the processing status will be changed to “NLP parsed”.
  5. Automated Duplicate Record Query: An automated script will further process all records for which the NLP parsing scripts returned collector, collector number, and collection date. This process will use those three fields to search the integrated consortium database for duplicate records that have already been processed at another institution. Potential duplicates will be linked and the processing status will be changed to “pending duplicate”.
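
The linking and status rules in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the portal's actual code: the barcode pattern, the `SpecimenRecord` class, the function names, and the in-memory dictionary standing in for the portal database are all assumptions made for the example. Only the status values come from the workflow above.

```python
import re
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional

# Assumed barcode format (hypothetical): 2-6 letters followed by 5-9 digits,
# e.g. "ASU0012345". Real collections will use their own pattern.
BARCODE_RE = re.compile(r"[A-Z]{2,6}\d{5,9}")


@dataclass
class SpecimenRecord:
    """Hypothetical stand-in for a specimen row in the portal database."""
    barcode: str
    status: str
    images: List[str] = field(default_factory=list)
    raw_text: Optional[str] = None


def barcode_from_filename(filename: str) -> Optional[str]:
    """Return the first barcode-like token in the file name, or None."""
    match = BARCODE_RE.search(Path(filename).stem.upper())
    return match.group(0) if match else None


def link_image(db: Dict[str, SpecimenRecord], filename: str) -> Optional[SpecimenRecord]:
    """Step 2: link an uploaded image to an existing or new specimen record."""
    barcode = barcode_from_filename(filename)
    if barcode is None:
        # The workflow would fall back to reading the barcode from the image via OCR.
        return None
    record = db.get(barcode)
    if record is None:
        # No record yet: create one populated only with the barcode.
        record = SpecimenRecord(barcode=barcode, status="unprocessed")
        db[barcode] = record
    else:
        # Record already existed: flag it for review.
        record.status = "pending review"
    record.images.append(filename)
    return record


def store_ocr_text(record: SpecimenRecord, text: str) -> None:
    """Step 3: attach raw OCR text to an unprocessed record if any text came back."""
    if record.status == "unprocessed" and text.strip():
        record.raw_text = text
        record.status = "OCR processed"
```

Under these assumptions, calling `link_image(db, "asu0099999_lg.jpg")` on a database with no matching record creates a new record for `ASU0099999` with status “unprocessed”, and a later `store_ocr_text` call with non-empty text moves it to “OCR processed”. Records that already existed are set to “pending review” and are skipped by the OCR step.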
