How to efficiently manipulate YUV camera frames on the fly in Android?




















I'm adding black (0) padding around the region of interest (the center) of an NV21 frame that I get from Android CameraPreview callbacks, on a separate thread.



To avoid the overhead of converting to RGB/Bitmap and back, I'm manipulating the NV21 byte array directly, but this involves nested loops, which also make the preview and processing slow.



This is my run() method, which sends frames to the detector after calling blackNonROI:



    public void run() {
        Frame outputFrame;
        ByteBuffer data;
        while (true) {
            synchronized (mLock) {

                while (mActive && (mPendingFrameData == null))
                    try { mLock.wait(); } catch (InterruptedException e) { return; }

                if (!mActive) { return; }

                // Region of Interest
                mPendingFrameData = blackNonROI(mPendingFrameData.array(), mPreviewSize.getWidth(), mPreviewSize.getHeight(), 300, 300);

                outputFrame = new Frame.Builder()
                        .setImageData(mPendingFrameData, mPreviewSize.getWidth(), mPreviewSize.getHeight(), ImageFormat.NV21)
                        .setId(mPendingFrameId)
                        .setTimestampMillis(mPendingTimeMillis)
                        .setRotation(mRotation)
                        .build();

                data = mPendingFrameData;
                mPendingFrameData = null;

            }

            try {
                mDetector.receiveFrame(outputFrame);
            } catch (Throwable t) {
            } finally {
                mCamera.addCallbackBuffer(data.array());
            }
        }
    }


This is the blackNonROI method:



    private ByteBuffer blackNonROI(byte[] yuvData, int width, int height, int roiWidth, int roiHeight) {

        int hozMargin = (width - roiWidth) / 2;
        int verMargin = (height - roiHeight) / 2;

        // top/bottom of center
        for (int x = 0; x < width; x++) {
            for (int y = 0; y < verMargin; y++)
                yuvData[y * width + x] = 0;
            for (int y = height - verMargin; y < height; y++)
                yuvData[y * width + x] = 0;
        }

        // left/right of center
        for (int y = verMargin; y < height - verMargin; y++) {
            for (int x = 0; x < hozMargin; x++)
                yuvData[y * width + x] = 0;
            for (int x = width - hozMargin; x < width; x++)
                yuvData[y * width + x] = 0;
        }

        return ByteBuffer.wrap(yuvData);
    }


Example output frame



Note that I'm not cropping the image; I'm only padding black pixels around the specified center region so that coordinates stay consistent for later processing. This works as intended, but it isn't fast enough and causes lag in the preview and in frame processing.




  1. Can I further improve the byte array update?

  2. Is this the right time/place to call blackNonROI?

  3. Is there any other way or library for doing this more efficiently?

  4. My simple pixel iteration is so slow; how do YUV/Bitmap libraries do complex things so fast? Do they use the GPU?


Edit:



I've replaced both for loops with the following code, and it's much faster now (please refer to greeble31's answer for details):



    // full top padding
    from = 0;
    to = (verMargin-1)*width + width;
    Arrays.fill(yuvData,from,to,(byte)1);

    // full bottom padding
    from = (height-verMargin)*width;
    to = (height-1)*width + width;
    Arrays.fill(yuvData,from,to,(byte)1);

    for(int y=verMargin; y<height-verMargin; y++) {
        // left-middle padding
        from = y*width;
        to = y*width + hozMargin;
        Arrays.fill(yuvData,from,to,(byte)1);

        // right-middle padding
        from = y*width + width-hozMargin;
        to = y*width + width;
        Arrays.fill(yuvData,from,to,(byte)1);
    }










      android image-processing android-camera yuv google-vision














      asked Jan 3 at 16:07 by Neogist, edited Jan 3 at 20:36
























          2 Answers














          1. Yes. To understand why, let's take a look at the bytecode Android Studio produces for your "left/right of center" nested loop:



          (Annotated excerpt from a release build of blackNonROI, AS 3.2.1):



          :goto_27
          sub-int v2, p2, p4 ;for(int y=verMargin; y<height-verMargin; y++)
          if-ge v1, v2, :cond_45
          const/4 v2, 0x0
          :goto_2c
          if-ge v2, p3, :cond_36 ;for (int x = 0; x < hozMargin; x++)
          mul-int v3, v1, p1
          add-int/2addr v3, v2
          .line 759
          aput-byte v0, p0, v3
          add-int/lit8 v2, v2, 0x1
          goto :goto_2c
          :cond_36
          sub-int v2, p1, p3
          :goto_38
          if-ge v2, p1, :cond_42 ;for (int x = width-hozMargin; x < width; x++)
          mul-int v3, v1, p1
          add-int/2addr v3, v2
          .line 761
          aput-byte v0, p0, v3
          add-int/lit8 v2, v2, 0x1
          goto :goto_38
          :cond_42
          add-int/lit8 v1, v1, 0x1
          goto :goto_27
          .line 764
          :cond_45 ;all done with the for loops!


          Without bothering to decipher this whole thing line-by-line, it is clear that each of your small, inner loops is performing:




          • 1 comparison

          • 1 integer multiplication

          • 2 additions (one to compute the index, one to increment the loop counter)

          • 1 store

          • 1 goto


          That's a lot, when you consider that all that you really need this inner loop to do is set a certain number of successive array elements to 0.



          Moreover, some of these bytecodes require multiple machine instructions to implement, so I wouldn't be surprised if you're looking at over 20 cycles, just to do a single iteration of one of the inner loops. (I haven't tested what this code looks like once it's compiled by the Dalvik VM, but I sincerely doubt it is smart enough to optimize the multiplications out of these loops.)



          POSSIBLE FIXES



          You could improve performance by eliminating some redundant calculations. For example, each inner loop is recalculating y * width each time. Instead, you could pre-calculate that offset, store it in a local variable (in the outer loop), and use that when calculating the indices.
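As a rough illustration of that idea, the side-margin loop could be written so that y * width is computed once per row instead of once per pixel. This is only a sketch: the class name RowOffsetHoist and the method name blackSides are hypothetical, and the margins are passed in directly rather than derived from the ROI size.

```java
final class RowOffsetHoist {
    /**
     * Zeroes the left and right margins of every ROI row in the Y plane.
     * Hypothetical helper: margins are supplied by the caller.
     */
    static void blackSides(byte[] yuvData, int width, int height, int hozMargin, int verMargin) {
        for (int y = verMargin; y < height - verMargin; y++) {
            int rowStart = y * width;                    // computed once per row
            int leftEnd = rowStart + hozMargin;          // end of left margin (exclusive)
            int rightStart = rowStart + width - hozMargin;
            int rowEnd = rowStart + width;               // end of row (exclusive)

            for (int i = rowStart; i < leftEnd; i++) {
                yuvData[i] = 0;                          // left margin
            }
            for (int i = rightStart; i < rowEnd; i++) {
                yuvData[i] = 0;                          // right margin
            }
        }
    }
}
```

The inner loops now contain only a comparison, an increment, and a store; whether that is enough depends on the device, so it is worth measuring.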



          When performance is absolutely critical, I will sometimes do this sort of buffer manipulation in native code. If you can be reasonably certain that mPendingFrameData is a DirectByteBuffer, this is an even more attractive option. The disadvantages are 1.) higher complexity, and 2.) less of a "safety net" if something goes wrong/crashes.



          MOST APPROPRIATE FIX



          In your case, the most appropriate solution is probably just to use Arrays.fill(), which is more likely to be implemented in an optimized way.



          Note that the top and bottom blocks are big, contiguous chunks of memory, and can be handled by one Arrays.fill() each:



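              // The fill value must be cast: Arrays.fill(byte[], int, int, byte) does not accept an int literal.
              Arrays.fill(yuvData, 0, verMargin * width, (byte) 0);   //top
              Arrays.fill(yuvData, width * height - verMargin * width, width * height, (byte) 0); //bottom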


          And then the sides could be handled something like this:



              for(int y=verMargin; y<height-verMargin; y++){
                  int offset = y * width;
                  Arrays.fill(yuvData, offset, offset + hozMargin, (byte) 0); //left
                  // fromIndex must not exceed toIndex, so the right margin starts at (row end - hozMargin)
                  Arrays.fill(yuvData, offset + width - hozMargin, offset + width, (byte) 0); //right
              }


          There are more opportunities for optimization here, but we're already at the point of diminishing returns. For example, since the end of each row is adjacent to the start of the next one (in memory), you could actually combine two smaller fill() calls into a larger one that covers both the right side of row N and the left side of row N + 1. And so forth.
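A minimal sketch of that merged-fill idea follows. It assumes a tightly packed Y plane (no row padding) at the start of the array. The class name RoiMask is hypothetical, and the method modifies the array in place instead of returning a ByteBuffer as the question's version does.

```java
import java.util.Arrays;

final class RoiMask {
    /** Zeroes every Y-plane byte outside the centered ROI, merging adjacent margins. */
    static void blackNonRoi(byte[] yuvData, int width, int height, int roiWidth, int roiHeight) {
        if (roiWidth <= 0 || roiWidth > width || roiHeight <= 0 || roiHeight > height
                || yuvData.length < width * height) {
            throw new IllegalArgumentException("ROI must fit inside the frame");
        }
        int hozMargin = (width - roiWidth) / 2;
        int verMargin = (height - roiHeight) / 2;
        int firstRoiRow = verMargin;
        int lastRoiRow = height - verMargin - 1;
        final byte black = 0;

        // Top block plus the left margin of the first ROI row.
        Arrays.fill(yuvData, 0, firstRoiRow * width + hozMargin, black);

        // The right margin of row y and the left margin of row y + 1 are contiguous.
        for (int y = firstRoiRow; y < lastRoiRow; y++) {
            int rightStart = y * width + width - hozMargin;
            Arrays.fill(yuvData, rightStart, rightStart + 2 * hozMargin, black);
        }

        // Right margin of the last ROI row plus the bottom block.
        Arrays.fill(yuvData, lastRoiRow * width + width - hozMargin, width * height, black);
    }
}
```

This roughly halves the number of fill() calls for the side margins. As noted above, the gain over two calls per row is likely to be small.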



          2. Not sure. If your preview is displaying without any corruption/tearing, then it's probably a safe place to call the function from (from a thread safety standpoint), and is therefore probably as good a place as any.



          3 and 4. There could be libraries for doing this task; I don't know of any offhand for Java-based NV21 frames. You'd have to do some format conversions, and I don't think it would be worth it. Using a GPU to do this work is excessive over-optimization, in my opinion, but it may be appropriate for some specialized applications. I'd consider going to JNI (native code) before I'd ever consider using the GPU.



          I think your choice to do the manipulation directly to the NV21, instead of converting to a bitmap, is a good one (considering your needs and the fact that the task is simple enough to avoid needing a graphics library).
































          • You nailed it, greeble31!

            – Neogist
            Jan 3 at 20:16











          • Meanwhile, I was also struggling with optimizations and trying to use Arrays.fill(). Amazingly, I arrived at almost the same solution you proposed. Your answer confirmed my approach and cleared up many things. You deserve thanks for such an answer.

            – Neogist
            Jan 3 at 20:30

































          Obviously, the most efficient way to pass an image for detection would be to pass the ROI rectangle to the detector. All our image processing functions accept a bounding box as a parameter.
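If the detector in use has no bounding-box parameter, a related option is to hand it a cropped copy of the frame and shift the reported coordinates back by the crop origin. The sketch below is one way to crop NV21 data under stated assumptions: rows are tightly packed, and all dimensions and the crop origin are even so that the interleaved VU pairs stay aligned. The class name Nv21Crop is hypothetical.

```java
final class Nv21Crop {
    /**
     * Copies a rectangle out of an NV21 frame into a new, smaller NV21 buffer.
     * Assumes no row padding; width, height, left, top, cropW and cropH must be even.
     */
    static byte[] crop(byte[] src, int width, int height,
                       int left, int top, int cropW, int cropH) {
        if (((width | height | left | top | cropW | cropH) & 1) != 0) {
            throw new IllegalArgumentException("all values must be even");
        }
        if (left < 0 || top < 0 || cropW <= 0 || cropH <= 0
                || left + cropW > width || top + cropH > height) {
            throw new IllegalArgumentException("crop rectangle must lie inside the frame");
        }
        byte[] dst = new byte[cropW * cropH * 3 / 2];

        // Y plane: one byte per pixel, copied row by row.
        for (int row = 0; row < cropH; row++) {
            System.arraycopy(src, (top + row) * width + left, dst, row * cropW, cropW);
        }

        // VU plane: half as many rows, each still `width` bytes wide (interleaved V,U pairs).
        int srcChroma = width * height;
        int dstChroma = cropW * cropH;
        for (int row = 0; row < cropH / 2; row++) {
            System.arraycopy(src, srcChroma + (top / 2 + row) * width + left,
                             dst, dstChroma + row * cropW, cropW);
        }
        return dst;
    }
}
```

A rectangle detected at (x, y) in the cropped frame then corresponds to (x + left, y + top) in the full frame. This allocates a new array per frame, so reusing a destination buffer may be preferable in a real pipeline.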



          If the black margin is used for display, consider using a black overlay mask for preview layout instead of pixel manipulation.



          If pixel manipulation is inevitable, check whether you can limit it to the Y plane. (OK, you already do this!)
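For reference, this is how the NV21 layout separates luma from chroma, which is why writes at y * width + x touch only the Y plane. The helper below is a hypothetical sketch that assumes tightly packed rows and even dimensions.

```java
final class Nv21Layout {
    /** Total bytes in a frame: a full-resolution Y plane plus a half-size interleaved VU plane. */
    static int frameSize(int width, int height) {
        return width * height * 3 / 2;
    }

    /** Index of the luma byte for pixel (x, y). */
    static int yIndex(int x, int y, int width) {
        return y * width + x;
    }

    /** Index of the V byte shared by the 2x2 block containing pixel (x, y). */
    static int vIndex(int x, int y, int width, int height) {
        return width * height + (y / 2) * width + (x & ~1);
    }

    /** Index of the U byte; in NV21 it immediately follows the V byte. */
    static int uIndex(int x, int y, int width, int height) {
        return vIndex(x, y, width, height) + 1;
    }
}
```

Masking only the first width * height bytes therefore leaves the chroma bytes of the masked area untouched. Whether that matters depends on whether the detector reads chroma at all.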



          If your detector works on a downscaled image (as my face recognition engine does), it may be wise to apply the blackout to a resized frame.



          At any rate, keep your loops clean and tidy, remove all recurring calculations. Using Arrays.fill() operations may help significantly, but not dramatically.
































          • The black padding isn't shown in the preview; it's added to the image so that when the detector returns a result rectangle, its coordinates correspond to the actual preview size. Yes, I'm only blacking out Y, hoping that the detector also considers only the Y plane.

            – Neogist
            Jan 5 at 20:44











          • It may be much more efficient to pass a cropped frame and compensate the detection result for the bounding rectangle. I would recommend comparing performance (even as a simulation) to see if the gain is significant to you.

            – Alex Cohn
            Jan 6 at 9:16











          • BTW, if your detector works on a downscaled image (as my face recognition engine does), it may be wise to resize the frame while cropping and/or blacking out pixels.

            – Alex Cohn
            Jan 6 at 9:19











          • Thanks, I'll try your suggestions.

            – Neogist
            Jan 9 at 9:55












          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54025913%2fhow-to-manipulate-on-the-fly-yuv-camera-frame-efficiently-in-android%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          2 Answers
          2






          active

          oldest

          votes








          2 Answers
          2






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1














          1. Yes. To understand why, let's take a look at the bytecode Android Studio produces for your "left/right of center" nested loop:



          (Annotated excerpt from a release build of blackNonROI, AS 3.2.1):



          :goto_27
          sub-int v2, p2, p4 ;for(int y=verMargin; y<height-verMargin; y++)
          if-ge v1, v2, :cond_45
          const/4 v2, 0x0
          :goto_2c
          if-ge v2, p3, :cond_36 ;for (int x = 0; x < hozMargin; x++)
          mul-int v3, v1, p1
          add-int/2addr v3, v2
          .line 759
          aput-byte v0, p0, v3
          add-int/lit8 v2, v2, 0x1
          goto :goto_2c
          :cond_36
          sub-int v2, p1, p3
          :goto_38
          if-ge v2, p1, :cond_42 ;for (int x = width-hozMargin; x < width; x++)
          mul-int v3, v1, p1
          add-int/2addr v3, v2
          .line 761
          aput-byte v0, p0, v3
          add-int/lit8 v2, v2, 0x1
          goto :goto_38
          :cond_42
          add-int/lit8 v1, v1, 0x1
          goto :goto_27
          .line 764
          :cond_45 ;all done with the for loops!


          Without bothering to decipher this whole thing line-by-line, it is clear that each of your small, inner loops is performing:




          • 1 comparison

          • 1 integer multiplication

          • 1 addition

          • 1 store

          • 1 goto


          That's a lot, when you consider that all that you really need this inner loop to do is set a certain number of successive array elements to 0.



          Moreover, some of these bytecodes require multiple machine instructions to implement, so I wouldn't be surprised if you're looking at over 20 cycles, just to do a single iteration of one of the inner loops. (I haven't tested what this code looks like once it's compiled by the Dalvik VM, but I sincerely doubt it is smart enough to optimize the multiplications out of these loops.)



          POSSIBLE FIXES



          You could improve performance by eliminating some redundant calculations. For example, each inner loop is recalculating y * width each time. Instead, you could pre-calculate that offset, store it in a local variable (in the outer loop), and use that when calculating the indices.



          When performance is absolutely critical, I will sometimes do this sort of buffer manipulation in native code. If you can be reasonably certain that mPendingFrameData is a DirectByteBuffer, this is an even more attractive option. The disadvantages are 1.) higher complexity, and 2.) less of a "safety net" if something goes wrong/crashes.



          MOST APPROPRIATE FIX



          In your case, the most appropriate solution is probably just to use Arrays.fill(), which is more likely to be implemented in an optimized way.



          Note that the top and bottom blocks are big, contiguous chunks of memory, and can be handled by one Arrays.fill() each:



          Arrays.fill(yuvData, 0, verMargin * width, 0);   //top
          Arrays.fill(yuvData, width * height - verMargin * width, width * height, 0); //bottom


          And then the sides could be handled something like this:



          for(int y=verMargin; y<height-verMargin; y++){
          int offset = y * width;
          Arrays.fill(yuvData, offset, offset + hozMargin, 0); //left
          Arrays.fill(yuvData, offset + width, offset + width - hozMargin, 0); //right
          }


          There are more opportunities for optimization, here, but we're already at the point of diminishing returns. For example, since the end of each row of is adjacent to the start of the next one (in memory), you could actually combine two smaller fill() calls into a larger one that covers both the right side of row N and the left side of row N + 1. And so forth.



          2. Not sure. If your preview is displaying without any corruption/tearing, then it's probably a safe place to call the function from (from a thread safety standpoint), and is therefor probably as good a place as any.



          3 and 4. There could be libraries for doing this task; I don't know of any offhand, for Java-based NV21 frames. You'd have to do some format conversions, and I don't think it's be worth it. Using a GPU to do this work is excessive over-optimization, in my opinion, but it may be appropriate for some specialized applications. I'd consider going to JNI (native code) before I'd ever consider using the GPU.



          I think your choice to do the manipulation directly to the NV21, instead of converting to a bitmap, is a good one (considering your needs and the fact that the task is simple enough to avoid needing a graphics library).






          share|improve this answer


























          • You nailed it greeble31 !

            – Neogist
            Jan 3 at 20:16











          • Meanwhile, I was too struggling for optimizations and trying to use Arrays.fill(). Amazingly, I came to almost same solution you proposed. Your answer certified my approach and cleared many things. You deserve thanks for such an answer.

            – Neogist
            Jan 3 at 20:30
















          1














          1. Yes. To understand why, let's take a look at the bytecode Android Studio produces for your "left/right of center" nested loop:



          (Annotated excerpt from a release build of blackNonROI, AS 3.2.1):



          :goto_27
          sub-int v2, p2, p4 ;for(int y=verMargin; y<height-verMargin; y++)
          if-ge v1, v2, :cond_45
          const/4 v2, 0x0
          :goto_2c
          if-ge v2, p3, :cond_36 ;for (int x = 0; x < hozMargin; x++)
          mul-int v3, v1, p1
          add-int/2addr v3, v2
          .line 759
          aput-byte v0, p0, v3
          add-int/lit8 v2, v2, 0x1
          goto :goto_2c
          :cond_36
          sub-int v2, p1, p3
          :goto_38
          if-ge v2, p1, :cond_42 ;for (int x = width-hozMargin; x < width; x++)
          mul-int v3, v1, p1
          add-int/2addr v3, v2
          .line 761
          aput-byte v0, p0, v3
          add-int/lit8 v2, v2, 0x1
          goto :goto_38
          :cond_42
          add-int/lit8 v1, v1, 0x1
          goto :goto_27
          .line 764
          :cond_45 ;all done with the for loops!


          Without bothering to decipher this whole thing line-by-line, it is clear that each of your small, inner loops is performing:




          • 1 comparison

          • 1 integer multiplication

          • 1 addition

          • 1 store

          • 1 goto


          That's a lot, when you consider that all that you really need this inner loop to do is set a certain number of successive array elements to 0.



          Moreover, some of these bytecodes require multiple machine instructions to implement, so I wouldn't be surprised if you're looking at over 20 cycles, just to do a single iteration of one of the inner loops. (I haven't tested what this code looks like once it's compiled by the Dalvik VM, but I sincerely doubt it is smart enough to optimize the multiplications out of these loops.)



          POSSIBLE FIXES



          You could improve performance by eliminating some redundant calculations. For example, each inner loop is recalculating y * width each time. Instead, you could pre-calculate that offset, store it in a local variable (in the outer loop), and use that when calculating the indices.



          When performance is absolutely critical, I will sometimes do this sort of buffer manipulation in native code. If you can be reasonably certain that mPendingFrameData is a DirectByteBuffer, this is an even more attractive option. The disadvantages are 1.) higher complexity, and 2.) less of a "safety net" if something goes wrong/crashes.



          MOST APPROPRIATE FIX



          In your case, the most appropriate solution is probably just to use Arrays.fill(), which is more likely to be implemented in an optimized way.



          Note that the top and bottom blocks are big, contiguous chunks of memory, and can be handled by one Arrays.fill() each:



          Arrays.fill(yuvData, 0, verMargin * width, (byte) 0);   //top
          Arrays.fill(yuvData, width * height - verMargin * width, width * height, (byte) 0); //bottom


          And then the sides could be handled something like this:



          for(int y=verMargin; y<height-verMargin; y++){
              int offset = y * width;
              Arrays.fill(yuvData, offset, offset + hozMargin, (byte) 0); //left
              Arrays.fill(yuvData, offset + width - hozMargin, offset + width, (byte) 0); //right
          }


          There are more opportunities for optimization here, but we're already at the point of diminishing returns. For example, since the end of each row is adjacent to the start of the next one (in memory), you could actually combine two smaller fill() calls into a larger one that covers both the right side of row N and the left side of row N + 1. And so forth.
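A minimal sketch of that merged-fill idea might look like the following. The class and method names (RoiMask, blackOutsideRoi) are illustrative, not from the original code. It assumes the ROI fits inside the frame, uses the same symmetric margins as blackNonROI, and touches only the Y plane (the first width * height bytes of the NV21 array). Note that the fill value has to be cast to byte, since the only matching overload is Arrays.fill(byte[], int, int, byte).

```java
import java.util.Arrays;

final class RoiMask {
    /** Zeroes every Y-plane byte outside the centered ROI, using merged fills. */
    static void blackOutsideRoi(byte[] yuvData, int width, int height,
                                int roiWidth, int roiHeight) {
        if (roiWidth <= 0 || roiHeight <= 0 || roiWidth > width || roiHeight > height) {
            throw new IllegalArgumentException("ROI must fit inside the frame");
        }
        int hozMargin = (width - roiWidth) / 2;
        int verMargin = (height - roiHeight) / 2;
        int roiTop = verMargin;                 // first row containing ROI pixels
        int roiBottom = height - verMargin;     // one past the last ROI row

        // Top block plus the left margin of the first ROI row.
        Arrays.fill(yuvData, 0, roiTop * width + hozMargin, (byte) 0);

        // Right margin of row y merged with the left margin of row y + 1.
        for (int y = roiTop; y < roiBottom - 1; y++) {
            int from = y * width + width - hozMargin;
            int to = (y + 1) * width + hozMargin;
            Arrays.fill(yuvData, from, to, (byte) 0);
        }

        // Right margin of the last ROI row plus the bottom block.
        int lastRowRight = (roiBottom - 1) * width + width - hozMargin;
        Arrays.fill(yuvData, lastRowRight, width * height, (byte) 0);
    }
}
```

This halves the number of fill() calls for the side margins. Whether that is measurable on a given device is something to profile rather than assume.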



          2. Not sure. If your preview is displaying without any corruption/tearing, then it's probably a safe place to call the function from (from a thread safety standpoint), and is therefore probably as good a place as any.



          3 and 4. There could be libraries for doing this task; I don't know of any offhand for Java-based NV21 frames. You'd have to do some format conversions, and I don't think it would be worth it. Using a GPU to do this work is excessive over-optimization, in my opinion, but it may be appropriate for some specialized applications. I'd consider going to JNI (native code) before I'd ever consider using the GPU.



          I think your choice to do the manipulation directly to the NV21, instead of converting to a bitmap, is a good one (considering your needs and the fact that the task is simple enough to avoid needing a graphics library).






          • You nailed it, greeble31!

            – Neogist
            Jan 3 at 20:16











          • Meanwhile, I too was struggling with optimizations and trying to use Arrays.fill(). Amazingly, I arrived at almost the same solution you proposed. Your answer confirmed my approach and cleared up many things. You deserve thanks for such an answer.

            – Neogist
            Jan 3 at 20:30






















          edited Jan 3 at 19:16

























          answered Jan 3 at 19:10









          greeble31



























          Obviously, the most efficient way to pass an image for detection would be to pass the ROI rectangle to the detector. All our image processing functions accept a bounding box as a parameter.
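If the detector has no ROI parameter, one way to approximate this is to hand it a cropped NV21 frame and shift the detection result back into full-frame coordinates. The following is only a sketch under stated assumptions: the names Nv21Crop, crop and toFrameCoords are hypothetical, the frame dimensions and the crop rectangle must all be even so that the interleaved VU pairs stay aligned, and it allocates a new buffer per call (a real pipeline would probably reuse one).

```java
final class Nv21Crop {
    /** Copies the given rectangle out of an NV21 frame into a new, smaller NV21 buffer. */
    static byte[] crop(byte[] src, int width, int height,
                       int left, int top, int cropW, int cropH) {
        if (((left | top | cropW | cropH) & 1) != 0) {
            throw new IllegalArgumentException("crop rectangle must use even values");
        }
        if (left < 0 || top < 0 || cropW <= 0 || cropH <= 0
                || left + cropW > width || top + cropH > height) {
            throw new IllegalArgumentException("crop rectangle must fit inside the frame");
        }
        byte[] dst = new byte[cropW * cropH * 3 / 2];

        // Y plane: one byte per pixel, copied row by row.
        for (int r = 0; r < cropH; r++) {
            System.arraycopy(src, (top + r) * width + left, dst, r * cropW, cropW);
        }

        // VU plane: half the rows, same row stride as the Y plane.
        int srcChroma = width * height;
        int dstChroma = cropW * cropH;
        for (int r = 0; r < cropH / 2; r++) {
            System.arraycopy(src, srcChroma + (top / 2 + r) * width + left,
                             dst, dstChroma + r * cropW, cropW);
        }
        return dst;
    }

    /** Maps a point reported in cropped-frame coordinates back to the full frame. */
    static int[] toFrameCoords(int x, int y, int left, int top) {
        return new int[] { x + left, y + top };
    }
}
```

Whether this beats blacking out the margins depends on the detector's cost per pixel, so it is worth measuring both.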



          If the black margin is used for display, consider using a black overlay mask in the preview layout instead of pixel manipulation.



          If pixel manipulation is inevitable, check whether you can limit it to the Y plane. (OK, you already do this!)



          If your detector works on a downscaled image (as my face recognition engine does), it may be wise to apply the blackout to the resized frame.



          At any rate, keep your loops clean and tidy, and remove all recurring calculations. Using Arrays.fill() operations may help significantly, but not dramatically.






          • The black padding isn't shown in the preview; it's added to the image so that when the detector returns a result rectangle, its coordinates correspond to the actual preview size. Yes, I'm only blacking out Y, hoping that the detector also considers only the Y plane.

            – Neogist
            Jan 5 at 20:44











          • It may be much more efficient to pass a cropped frame and compensate the detection result for the bounding rectangle. I would recommend comparing performance (even as a simulation) to see whether the gain is significant to you.

            – Alex Cohn
            Jan 6 at 9:16











          • BTW, if your detector works on a downscaled image (as my face recognition engine does), it may be wise to resize the frame while cropping and/or blacking out pixels.

            – Alex Cohn
            Jan 6 at 9:19











          • Thanks, I'll try your suggestions.

            – Neogist
            Jan 9 at 9:55
























          edited Jan 6 at 9:20

























          answered Jan 5 at 14:42









          Alex Cohn






























