ArangoDB - Slow query performance on cluster












3















I have a query that compares two collections and finds the "missing" documents from one side. Both collections (existing and temp) contain about 250K documents.



FOR existing IN ExistingCollection
LET matches = (
FOR temp IN TempCollection
FILTER temp._key == existing._key
RETURN true
)
FILTER LENGTH(matches) == 0
RETURN existing


When this runs in a single-server environment (DB and Foxx are on the same server/container), this runs like lightning in under 0.5 seconds.



However, when I run this in a cluster (single DB, single Coordinator), even when the DB and Coord are on the same physical host (different containers), I have to add a LIMIT 1000 after the initial FOR existing ... to keep it from timing out! Still, this limited result returns in almost 7 seconds!



Looking at the Execution Plan, I see that there are several REMOTE and GATHER statements after the LET matches ... SubqueryNode. From what I can gather, the problem stems from the separation of the data storage and memory structure used to filter this data.



My question: can this type of operation be done efficiently on a cluster?



I need to detect obsolete (to be deleted) documents, but this is obviously not a feasible solution.










share|improve this question



























    3















    I have a query that compares two collections and finds the "missing" documents from one side. Both collections (existing and temp) contain about 250K documents.



    FOR existing IN ExistingCollection
    LET matches = (
    FOR temp IN TempCollection
    FILTER temp._key == existing._key
    RETURN true
    )
    FILTER LENGTH(matches) == 0
    RETURN existing


    When this runs in a single-server environment (DB and Foxx are on the same server/container), this runs like lightning in under 0.5 seconds.



    However, when I run this in a cluster (single DB, single Coordinator), even when the DB and Coord are on the same physical host (different containers), I have to add a LIMIT 1000 after the initial FOR existing ... to keep it from timing out! Still, this limited result returns in almost 7 seconds!



    Looking at the Execution Plan, I see that there are several REMOTE and GATHER statements after the LET matches ... SubqueryNode. From what I can gather, the problem stems from the separation of the data storage and memory structure used to filter this data.



    My question: can this type of operation be done efficiently on a cluster?



    I need to detect obsolete (to be deleted) documents, but this is obviously not a feasible solution.










    share|improve this question

























      3












      3








      3








      I have a query that compares two collections and finds the "missing" documents from one side. Both collections (existing and temp) contain about 250K documents.



      FOR existing IN ExistingCollection
      LET matches = (
      FOR temp IN TempCollection
      FILTER temp._key == existing._key
      RETURN true
      )
      FILTER LENGTH(matches) == 0
      RETURN existing


      When this runs in a single-server environment (DB and Foxx are on the same server/container), this runs like lightning in under 0.5 seconds.



      However, when I run this in a cluster (single DB, single Coordinator), even when the DB and Coord are on the same physical host (different containers), I have to add a LIMIT 1000 after the initial FOR existing ... to keep it from timing out! Still, this limited result returns in almost 7 seconds!



      Looking at the Execution Plan, I see that there are several REMOTE and GATHER statements after the LET matches ... SubqueryNode. From what I can gather, the problem stems from the separation of the data storage and memory structure used to filter this data.



      My question: can this type of operation be done efficiently on a cluster?



      I need to detect obsolete (to be deleted) documents, but this is obviously not a feasible solution.










      share|improve this question














      I have a query that compares two collections and finds the "missing" documents from one side. Both collections (existing and temp) contain about 250K documents.



      FOR existing IN ExistingCollection
      LET matches = (
      FOR temp IN TempCollection
      FILTER temp._key == existing._key
      RETURN true
      )
      FILTER LENGTH(matches) == 0
      RETURN existing


      When this runs in a single-server environment (DB and Foxx are on the same server/container), this runs like lightning in under 0.5 seconds.



      However, when I run this in a cluster (single DB, single Coordinator), even when the DB and Coord are on the same physical host (different containers), I have to add a LIMIT 1000 after the initial FOR existing ... to keep it from timing out! Still, this limited result returns in almost 7 seconds!



      Looking at the Execution Plan, I see that there are several REMOTE and GATHER statements after the LET matches ... SubqueryNode. From what I can gather, the problem stems from the separation of the data storage and memory structure used to filter this data.



      My question: can this type of operation be done efficiently on a cluster?



      I need to detect obsolete (to be deleted) documents, but this is obviously not a feasible solution.







      arangodb






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Dec 29 '18 at 1:56









      kerrykerry

      33228




      33228
























          1 Answer
          1






          active

          oldest

          votes


















          0














          Your query executes one subquery for each document in the existing collection. Each subquery will require many HTTP roundtrips for setup, the actual querying and shutdown.



          You can avoid subqueries with the following query. It loads all document _key's into RAM - but that should be no problem with your rather small collections.



          LET ExistingCollection = (FOR existing IN c2 RETURN existing._key)
          LET TempCollection = (FOR temp IN c1 RETURN temp._key)
          RETURN MINUS(ExistingCollection, TempCollection)





          share|improve this answer
























          • Excellent! I'm not sure how I missed the MINUS() function, but this does exactly what I need. Thanks!

            – kerry
            Jan 2 at 18:11











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53966083%2farangodb-slow-query-performance-on-cluster%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          0














          Your query executes one subquery for each document in the existing collection. Each subquery will require many HTTP roundtrips for setup, the actual querying and shutdown.



          You can avoid subqueries with the following query. It loads all document _key's into RAM - but that should be no problem with your rather small collections.



          LET ExistingCollection = (FOR existing IN c2 RETURN existing._key)
          LET TempCollection = (FOR temp IN c1 RETURN temp._key)
          RETURN MINUS(ExistingCollection, TempCollection)





          share|improve this answer
























          • Excellent! I'm not sure how I missed the MINUS() function, but this does exactly what I need. Thanks!

            – kerry
            Jan 2 at 18:11
















          0














          Your query executes one subquery for each document in the existing collection. Each subquery will require many HTTP roundtrips for setup, the actual querying and shutdown.



          You can avoid subqueries with the following query. It loads all document _key's into RAM - but that should be no problem with your rather small collections.



          LET ExistingCollection = (FOR existing IN c2 RETURN existing._key)
          LET TempCollection = (FOR temp IN c1 RETURN temp._key)
          RETURN MINUS(ExistingCollection, TempCollection)





          share|improve this answer
























          • Excellent! I'm not sure how I missed the MINUS() function, but this does exactly what I need. Thanks!

            – kerry
            Jan 2 at 18:11














          0












          0








          0







          Your query executes one subquery for each document in the existing collection. Each subquery will require many HTTP roundtrips for setup, the actual querying and shutdown.



          You can avoid subqueries with the following query. It loads all document _key's into RAM - but that should be no problem with your rather small collections.



          LET ExistingCollection = (FOR existing IN c2 RETURN existing._key)
          LET TempCollection = (FOR temp IN c1 RETURN temp._key)
          RETURN MINUS(ExistingCollection, TempCollection)





          share|improve this answer













          Your query executes one subquery for each document in the existing collection. Each subquery will require many HTTP roundtrips for setup, the actual querying and shutdown.



          You can avoid subqueries with the following query. It loads all document _key's into RAM - but that should be no problem with your rather small collections.



          LET ExistingCollection = (FOR existing IN c2 RETURN existing._key)
          LET TempCollection = (FOR temp IN c1 RETURN temp._key)
          RETURN MINUS(ExistingCollection, TempCollection)






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Jan 2 at 10:59









          IngoIngo

          1496




          1496













          • Excellent! I'm not sure how I missed the MINUS() function, but this does exactly what I need. Thanks!

            – kerry
            Jan 2 at 18:11



















          • Excellent! I'm not sure how I missed the MINUS() function, but this does exactly what I need. Thanks!

            – kerry
            Jan 2 at 18:11

















          Excellent! I'm not sure how I missed the MINUS() function, but this does exactly what I need. Thanks!

          – kerry
          Jan 2 at 18:11





          Excellent! I'm not sure how I missed the MINUS() function, but this does exactly what I need. Thanks!

          – kerry
          Jan 2 at 18:11




















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53966083%2farangodb-slow-query-performance-on-cluster%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          MongoDB - Not Authorized To Execute Command

          in spring boot 2.1 many test slices are not allowed anymore due to multiple @BootstrapWith

          Npm cannot find a required file even through it is in the searched directory