pipelines - images missing in the DRs with the new pipeline #907

Open
sadeghim opened this issue Jul 30, 2024 · 1 comment
@sadeghim
Member

In the new index, occurrences from some DRs (mainly from bioCollect) are missing the imageID/imageIDs/multimedia fields, even though the index.avro for that DR contains records with multimedia. For example, querying the new index for dr17465 with fq=imageID:* finds only 19 records:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":385,
    "params":{
      "q":"dataResourceUid:dr17465",
      "fl":"imageID, imageIDs, multimedia,id",
      "q.op":"AND",
      "fq":"imageID:*",
      "rows":"5",
      "_":"1722295162881"}},
  "response":{"numFound":19,"start":0,"maxScore":4.248656,"numFoundExact":true,"docs":[
      {
        "id":"dfaf107e-69bd-4c93-8e0e-43a214b9a0d9",
        "imageID":"446460ab-a69f-48d7-94f9-f8acc70b524d",
        "multimedia":["Image"],
        "imageIDs":["446460ab-a69f-48d7-94f9-f8acc70b524d"]},
      {
        "id":"d28fc582-3982-4c37-bfb0-a92339b467fc",
        "imageID":"b364a58b-c9da-4803-8bb8-d174d3d657d3",
        "multimedia":["Image"],
        "imageIDs":["b364a58b-c9da-4803-8bb8-d174d3d657d3"]},
      {
        "id":"44779727-5559-48d2-9fad-faf5079b28be",
        "imageID":"d89116ac-de2b-4886-a7fa-41363ee0dfa3",
        "multimedia":["Image"],
        "imageIDs":["d89116ac-de2b-4886-a7fa-41363ee0dfa3"]},
      {
        "id":"2b396900-2c07-4e3b-9ec8-82353b80cda5",
        "imageID":"b99b264f-360c-467c-9151-6f443ebde6b8",
        "multimedia":["Image"],
        "imageIDs":["b99b264f-360c-467c-9151-6f443ebde6b8"]},
      {
        "id":"227ffa3d-ce90-4877-ab46-0efb2b42093a",
        "imageID":"bc9c62ee-1d24-4836-8eb9-0d7fcd484906",
        "multimedia":["Image"],
        "imageIDs":["bc9c62ee-1d24-4836-8eb9-0d7fcd484906"]}]
  }}

By contrast, the same DR in the current index has 11,907 occurrences with images: https://biocache.ala.org.au/occurrences/search?q=dataResourceUid%3Adr17465&qualityProfile=ALA&qc=-_nest_parent_%3A*&fq=multimedia%3A%22Image%22

@adam-collins
Contributor

The problem was traced to a change that reinstates an older requirement: the DwCA multimedia.csv must have a valid, populated format field, e.g. image/jpeg.
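As a rough way to see how many media rows would be dropped by this requirement, the sketch below counts rows whose format value is missing or does not look like a MIME type. It is only an illustration: the class and method names are hypothetical, the column key "format" is an assumption about the multimedia.csv header, "valid" is interpreted here as "type/subtype", and rows are passed in as already-parsed maps rather than read from a real archive.

```java
import java.util.List;
import java.util.Map;

public class FormatFieldCheck {

    // Hypothetical helper: true when the value is present and shaped like
    // "type/subtype" (e.g. image/jpeg). This notion of "valid" is an assumption.
    static boolean hasUsableFormat(String format) {
        if (format == null) {
            return false;
        }
        String f = format.trim();
        int slash = f.indexOf('/');
        return slash > 0 && slash < f.length() - 1;
    }

    // Counts rows whose "format" value is absent, empty, or not MIME-like.
    // The "format" key is an assumed column name.
    static long countMissingFormat(List<Map<String, String>> rows) {
        return rows.stream()
                .filter(row -> !hasUsableFormat(row.get("format")))
                .count();
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only.
        List<Map<String, String>> rows = List.of(
                Map.of("identifier", "img-1", "format", "image/jpeg"),
                Map.of("identifier", "img-2", "format", ""),
                Map.of("identifier", "img-3"));
        System.out.println(countMissingFormat(rows)); // prints 2
    }
}
```

A high count for a DR such as dr17465 would be consistent with the drop in indexed images described above.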

The inferred requirements are:

  1. biocollect DwCA export to populate this field.
  2. A pipelines change that will classify media without a format value as images. See the current logic in pipelines IndexRecordTransform.java:
                if (image.getFormat() != null) {
                  if (image.getFormat().startsWith("image")) {
                    multimedia.add(IMAGE);
                    images.add(image.getIdentifier());
                  }
                  if (image.getFormat().startsWith("audio")) {
                    multimedia.add(SOUND);
                    sounds.add(image.getIdentifier());
                  }
                  if (image.getFormat().startsWith("video")) {
                    multimedia.add(VIDEO);
                    videos.add(image.getIdentifier());
                  }
                }
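
One possible shape for requirement 2 is sketched below. This is not the actual pipelines change: the Media and Result classes are hypothetical stand-ins, the "Sound" and "Video" constant values are assumed (only "Image" appears in the Solr response above), and treating a blank format the same as a null one is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class MediaClassificationSketch {

    static final String IMAGE = "Image";
    static final String SOUND = "Sound"; // assumed value
    static final String VIDEO = "Video"; // assumed value

    // Hypothetical stand-in for the media record used in IndexRecordTransform.
    static final class Media {
        final String identifier;
        final String format;

        Media(String identifier, String format) {
            this.identifier = identifier;
            this.format = format;
        }
    }

    // Hypothetical holder for the lists the original snippet populates.
    static final class Result {
        final List<String> multimedia = new ArrayList<>();
        final List<String> images = new ArrayList<>();
        final List<String> sounds = new ArrayList<>();
        final List<String> videos = new ArrayList<>();
    }

    static Result classify(List<Media> media) {
        Result result = new Result();
        for (Media m : media) {
            String format = m.format;
            // Media with no format value falls back to being treated as an image.
            if (format == null || format.isBlank() || format.startsWith("image")) {
                result.multimedia.add(IMAGE);
                result.images.add(m.identifier);
            } else if (format.startsWith("audio")) {
                result.multimedia.add(SOUND);
                result.sounds.add(m.identifier);
            } else if (format.startsWith("video")) {
                result.multimedia.add(VIDEO);
                result.videos.add(m.identifier);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Media> media = new ArrayList<>();
        media.add(new Media("no-format", null));
        media.add(new Media("a-sound", "audio/mpeg"));
        Result result = classify(media);
        System.out.println(result.images); // prints [no-format]
        System.out.println(result.sounds); // prints [a-sound]
    }
}
```

The fallback only applies when no format is supplied, so media with an explicit audio or video format is still classified as before.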

There is a partially related issue: the images export will be merged and/or joined with the DwCA media file in some way.

  1. images will enable empty dataResourceUid updates by pipelines (see "Uploaded duplicate images never update the dataResourceUid", image-service#209).
  2. The logic behind the images export and DwCA media file merge/join will be documented and reviewed.
