Patent Application 18328853 - SYSTEMS AND METHODS FOR OPTIMIZING QUERIES IN A

Title: SYSTEMS AND METHODS FOR OPTIMIZING QUERIES IN A DATA LAKE

Application Information

Invention Title: SYSTEMS AND METHODS FOR OPTIMIZING QUERIES IN A DATA LAKE
Application Number: 18328853
Submission Date: 2025-05-20T00:00:00.000Z
Effective Filing Date: 2023-06-05T00:00:00.000Z
Filing Date: 2023-06-05T00:00:00.000Z
National Class: 707
National Sub-Class: 603000
Examiner Employee Number: 90575
Art Unit: 2168
Tech Center: 2100

Rejection Summary

102 Rejections: 0
103 Rejections: 7

Office Action Text

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Examiner Notes
(1) In the case of amending the claimed invention, Applicant is respectfully requested to indicate the portion(s) of the specification which dictate(s) the structure relied on for proper interpretation and also to verify and ascertain the metes and bounds of the claimed invention. This will assist in expediting compact prosecution. MPEP 714.02 recites: “Applicant should also specifically point out the support for any amendments made to the disclosure. See MPEP § 2163.06. An amendment which does not comply with the provisions of 37 CFR 1.121 (b), (c), (d), and (h) may be held not fully responsive. See MPEP § 714.” Amendments not pointing to specific support in the disclosure may be deemed as not complying with provisions of 37 C.F.R. 1.121 (b), (c), (d), and (h) and therefore held not fully responsive. Generic statements such as "Applicants believe no new matter has been introduced" may be deemed insufficient.
(2) Examiner cites particular columns, paragraphs, figures and line numbers in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.
Remarks

Receipt of Applicant’s Amendment filed on 09/18/2024 is acknowledged.
Response to Arguments
Applicant’s arguments with respect to claims 1 and 19 have been considered but are moot in view of the new ground(s) of rejection (See new reference of GUPTA).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5-11, 15-16, 19 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Park et al. (U.S. Patent No. 11,531,666 B1), further in view of GUPTA et al. (U.S. Pub. No. 2023/0205757 A1).
Regarding claim 1, Sundaram teaches: a system for optimizing queries in a data lake storing data in one or more data lake sources, the one or more data lake sources storing a plurality of data objects of different native format (Fig. 2, paragraph [0032]-[0034], the data pipeline system 200 includes one or more data sources 270 through 272, which provide data to a data ingestion engine service 202; a data storage engine reads the data from the pipelines and stores it in a data lake 214; also see paragraph [0067], one upstream source may have data lake in Databricks Delta in parquet format on amazon S3…another data source may have data stored with Azure blob storage in Avro file format…; noted, data sources 270 through 272 are interpreted as one or more data lake sources; further noted, ‘for’ indicates intended use; Minton v. Nat’l Ass’n of Securities Dealers, Inc., 336 F.3d 1373, 1381, 67 USPQ2d 1614, 1620 (Fed. Cir. 2003) “whereby clause in a method claim is not given weight when it simply expresses the intended result of a process step positively recited.” Examples of claim language, although not exhaustive, that may raise a question as to the limiting effect of the language in a claim are: (A) “adapted to” or “adapted for” clauses; (B) “wherein” clauses; and (C) “whereby” clauses. Therefore intended use limitations are not required to be taught, see MPEP 2111.04 [R-3]), the system comprising: memory separate from storage of the one or more data lake sources, the memory configured to store: data objects of the plurality of data objects ingested from the storage of the one or more data lake sources of the data lake (Fig. 2, paragraph [0032]-[0034], the data pipeline system 200 includes one or more data sources 270 through 272, which provide data to a data ingestion engine service 202; a data storage engine reads the data from the pipelines and stores it in a data lake 214; noted, data sources 270 through 272 are interpreted as one or more data lake sources of the data lake);
a plurality of data partitions storing data from the data objects ingested from the storage of the one or more data lake sources, the plurality of data partitions partitioned based on a key (paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…; also see paragraph [0038], data storage engine may store the keys of partitions that were updated in a changed log in the change log repository).
Sundaram does not explicitly disclose: a partition index, […], comprising entries of values of the key being associated with respective data partitions of the plurality of data partitions, the values of the key each mapped to one or more of the plurality of data partitions.
Park teaches: a partition index, […], comprising entries of values of the key being associated with respective data partitions of the plurality of data partitions (Figs. 2-3, col. 4, line 65-67, col. 5, line 1-25, col. 8, line 20-27, the data system index 100 may generate a set of Bloom filters corresponding to fields (or other collection of fields) of partitions 180A-180C; for a given one of the partitions 180A-180C, a plurality of Bloom filters may be generated for one or more fields to capture the possibility that particular values are found in the field(s); as shown in Fig. 2, a Bloom filter 120 A1 may be generated by the indexing component to represent the first field of the partition 180A), the values of the key each mapped to one or more of the plurality of data partitions (Figs. 2-3 illustrate Bloom filters that map to specific partitions based on the key field; as shown in Fig. 2, a Bloom filter 120 A1 may be generated by the indexing component to represent the first field of the partition 180A; also see Fig. 7, col. 5, line 16-25, a plurality of Bloom filters may be generated per field per partition; one Bloom filter may be generated per plurality of fields; also see col. 13, line 3-38, using the Bloom filters or other probabilistic data structures, a set of candidate partitions and a set of non-candidate partitions may be determined).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a partition index, […], comprising entries of values of the key being associated with respective data partitions of the plurality of data partitions, the values of the key each mapped to one or more of the plurality of data partitions into the data lake system of Sundaram.
Motivation to do so would be to include a partition index, […], comprising entries of values of the key being associated with respective data partitions of the plurality of data partitions, the values of the key each mapped to one or more of the plurality of data partitions to quickly find user data in a very large data set in order to provide a copy of the user data back to the user (Park, col. 3, line 7-8).
Sundaram and Park do not explicitly disclose: said partition index, separate from the one or more data lake sources and the plurality of data partitions.
GUPTA teaches: said partition index, separate from the one or more data lake sources and the plurality of data partitions (Fig. 1 illustrates the key-value store [index] separate from data sources 102 and data lake; in combination with the teaching of Park, Figs. 2-3 illustrates partition of Data Lake 104 comprising partition 180A…180C separate from indexing 110, it teaches as claimed; also see paragraph [0032], the tree data structure may be store in a relational database or key-value store; also see paragraph [0037], store an index in a separate data base or table in memory).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include said partition index, separate from the one or more data lake sources and the plurality of data partitions into the data lake system of Sundaram.
Motivation to do so would be to include said partition index, separate from the one or more data lake sources and the plurality of data partitions to facilitate rapid queries on time range, modified files, changes in content, and other search criteria (GUPTA, paragraph [0037], line 10-12).
Sundaram as modified by Park and GUPTA further teach: a processor, coupled to the memory, the processor configured to execute a query engine, the query engine configured to: receive, through a communication network, from a client device, a query on target data (Park, Fig. 7, col. 12, line 61-64, a query may be received that indicates a value…);
identify, using the partition index separate from the one or more data lake sources and the plurality of data partitions, at least one data partition of the plurality of data partitions in which the target data was stored after ingestion from the storage of the one or more data lake sources (Park, Fig. 7, col. 5, line 16-25, a plurality of Bloom filters may be generated per field per partition; one Bloom filter may be generated per plurality of fields; also see col. 13, line 3-38, using the Bloom filters or other probabilistic data structures, a set of candidate partitions and a set of non-candidate partitions may be determined; to set a data set of customer order data for a particular customer ID, the Bloom filters corresponding to the data set’s partitions may be used to exclude partitions that definitely do not include the customer ID); execute the query on the identified at least one data partition to obtain response data at least in part by reading the response data from the memory of the system (Park, Fig. 7, col. 13, line 21-38, to determine that particular partitions that actually include a particular value, ….may be scanned to identify one or more partitions that actually include the value may be referred to as relevant partitions); and transmit, through the communication network, to the client device, the response data (Park, Fig. 7, col. 13, line 39-50, one or more actions may be performed with respect to the one or more records associated with the value…; the one or more actions may include returning data to a query client).
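For illustration of the mechanism the Park citations above describe (one probabilistic filter per partition, used to exclude partitions that definitely do not hold a queried value before the remaining candidates are scanned), a minimal Python sketch follows. It rests on assumptions: the class and function names, the filter size, the hash construction and the sample field name are hypothetical and are not drawn from Park, Sundaram or GUPTA. The index is held in a structure separate from the partitions themselves.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits & (1 << pos) for pos in self._positions(value))


def build_partition_index(partitions, field):
    """Build one Bloom filter per partition for the chosen key field.

    The returned dict is kept apart from the partition data (hypothetical layout).
    """
    index = {}
    for name, records in partitions.items():
        bloom = BloomFilter()
        for record in records:
            bloom.add(record[field])
        index[name] = bloom
    return index


def query_partitions(partitions, index, field, value):
    """Prune with the index, then scan only the candidate partitions."""
    candidates = [name for name, bloom in index.items() if bloom.might_contain(value)]
    # A candidate may be a false positive, so the scan confirms actual matches.
    return [
        record
        for name in candidates
        for record in partitions[name]
        if record[field] == value
    ]
```

Because a Bloom filter never yields a false negative, pruning on it cannot drop a partition that actually contains the value; the scan of candidates is what removes any false positives.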
Regarding claim 5, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, further teach wherein the processor is configured to identify, using the partition index, the at least one data partition in which the target data is stored by performing: identify at least one value of the key included in the query; and identify the at least one partition in the partition index based on the at least one value of the key included in the query (Park, Fig. 7, col. 12, line 61-64, a query may be received that indicates a value…; also see col. 5, line 16-25, a plurality of Bloom filters may be generated per field per partition; one Bloom filter may be generated per plurality of fields; also see col. 13, line 3-38, using the Bloom filters or other probabilistic data structures, a set of candidate partitions and a set of non-candidate partitions may be determined; to set a data set of customer order data for a particular customer ID, the Bloom filters corresponding to the data set’s partitions may be used to exclude partitions that definitely do not include the customer ID).
Regarding claim 6, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, further teach wherein the processor is configured to: receive, through the communication network from the one or more data lake sources, one or more data objects to be stored in the data lake (Sundaram, Fig. 2, paragraph [0032]-[0034], the data pipeline system 200 includes one or more data sources 270 through 272, which provide data to a data ingestion engine service 202; a data storage engine reads the data from the pipelines and stores it in a data lake 214); and store data from the one or more data objects in at least one data partition of the plurality of data partitions (Sundaram, paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…).
Regarding claim 7, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 6, further teach: wherein the processor is configured to: sort the data from the one or more data objects into the at least one data partition using one or more values of the key in the data (Sundaram, paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…).
Regarding claim 8, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, further teach wherein the processor is configured to partition the data from the plurality of data objects into the plurality of data partitions (Sundaram, paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…).
Regarding claim 9, Sundaram as modified by Park teach all claimed limitations as set forth in rejection of claim 8, further teach wherein the processor is configured to partition the data into the plurality of data partitions by performing: determine the key; and partition the data from the data objects originating from the one or more data lake sources based on the key (Sundaram, paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…; also see paragraph [0038], data storage engine may store the keys of partitions that were updated in a changed log in the change log repository).
Regarding claim 10, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 9, further teach wherein the processor is configured to determine the key by performing: receive user input indicating one or more fields of the data to be used as the key (Park, Fig. 7, col. 12, line 61-64, a query may be received that indicates a value…; also see col. 13, line 3-38, a set of candidate partitions and a set of non-candidate partitions may be determined; to set a data set of customer order data for a particular customer ID, the Bloom filters corresponding to the data set’s partitions may be used to exclude partitions that definitely do not include the customer ID).
Regarding claim 11, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 9, further teach wherein the processor is configured to determine the key by performing: determine at least one field in the data expected to be used in queries received by the system; and use the at least one field as the key (Park, Fig. 7, col. 12, line 61-64, a query may be received that indicates a value…; also see col. 13, line 3-38, a set of candidate partitions and a set of non-candidate partitions may be determined; to set a data set of customer order data for a particular customer ID, the Bloom filters corresponding to the data set’s partitions may be used to exclude partitions that definitely do not include the customer ID; to determine that particular partitions that actually include a particular value, ….may be scanned to identify one or more partitions that actually include the value may be referred to as relevant partitions).
Regarding claim 15, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, further teach wherein the plurality of data partitions are configured to store at least some of the data in a columnar storage format (Sundaram, Fig. 2, paragraph [0032]-[0034], the data pipeline system 200 includes one or more data sources 270 through 272, which provide data to a data ingestion engine service 202; a data storage engine reads the data from the pipelines and stores it in a data lake 214; also see paragraph [0067], one upstream source may have data lake in Databricks Delta in parquet format on amazon S3…another data source may have data stored with Azure blob storage in Avro file format…; noted, ‘parquet’ is interpreted as columnar storage format; it is noted that one of the ordinary skill in the art would know that “Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval”).
Regarding claim 16, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 15, further teach wherein the columnar storage format is APACHE PARQUET (Sundaram, Fig. 2, paragraph [0032]-[0034], the data pipeline system 200 includes one or more data sources 270 through 272, which provide data to a data ingestion engine service 202; a data storage engine reads the data from the pipelines and stores it in a data lake 214; also see paragraph [0067], one upstream source may have data lake in Databricks Delta in parquet format on amazon S3…another data source may have data stored with Azure blob storage in Avro file format…).
As per claim 19, this claim is rejected on grounds corresponding to the same rationales given above for rejected claim 1 and is similarly rejected.
As per claim 22, this claim is rejected on grounds corresponding to the same rationales given above for rejected claim 5 and is similarly rejected.
Claims 2-4 and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Park et al. (U.S. Patent No. 11,531,666 B1) and GUPTA et al. (U.S. Pub. No. 2023/0205757 A1), further in view of Horowitz et al. (U.S. Pub. No. 2012/0254175 A1).
Regarding claim 2, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein the plurality of data partitions are stored in a plurality of shards associated with respective ranges of a shard key.
Horowitz teaches: wherein the plurality of data partitions are stored in a plurality of shards associated with respective ranges of a shard key (paragraph [0006], the partition component is further configured to define the first partition having a minimum key value and a maximum key value…; the database is organized into a plurality of collections includes a continuous range of data from the database, wherein the contiguous range comprises a range of one or more key values associated with the database data; also see paragraph [0007], the partition component is further configured to assign at least any data in the at least one of the plurality of database partitions having associated database key less than the maximum value to the first partition, and assign at least any data in the at least one of the plurality of database partitions having database key values greater than the maximum value to the second partition; the partition component is further configured to identify database partitions having a sequential database key).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein the plurality of data partitions are stored in a plurality of shards associated with respective ranges of a shard key into the data lake system of Sundaram.
Motivation to do so would be to include wherein the plurality of data partitions are stored in a plurality of shards associated with respective ranges of a shard key to minimize overhead associated with maintaining sharded data (Horowitz, paragraph [0003], line 12-14).
Regarding claim 3, Sundaram as modified by Park, GUPTA and Horowitz teach all claimed limitations as set forth in rejection of claim 2, further teach: wherein the key based on which the plurality of data partitions are partitioned is the shard key (Horowitz, paragraph [0006], the partition component is further configured to define the first partition having a minimum key value and a maximum key value…; the database is organized into a plurality of collections includes a continuous range of data from the database, wherein the contiguous range comprises a range of one or more key values associated with the database data; also see paragraph [0007], the partition component is further configured to assign at least any data in the at least one of the plurality of database partitions having associated database key less than the maximum value to the first partition, and assign at least any data in the at least one of the plurality of database partitions having database key values greater than the maximum value to the second partition; the partition component is further configured to identify database partitions having a sequential database key).
Regarding claim 4, Sundaram as modified by Park, GUPTA and Horowitz teach all claimed limitations as set forth in rejection of claim 2, further teach: wherein the processor is configured to perform rebalancing of the plurality of data partitions among the plurality of shards (Horowitz, paragraph [0077]-[0078], rebalancing chunk distribution within a shard cluster….).
As per claims 20-21, these claims are rejected on grounds corresponding to the same rationales given above for rejected claims 2-3 respectively and are similarly rejected.
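The range-based assignment cited from Horowitz for claims 2-3 (data with key values below a boundary assigned to one partition, data at or above it assigned to the next) can be sketched as a sorted list of boundaries searched by bisection. This is a minimal Python illustration only; the class name, the half-open boundary convention and the sample boundaries are assumptions, not taken from Horowitz.

```python
import bisect


class RangeShardMap:
    """Maps a shard-key value to the index of the shard whose key range holds it.

    Assumed convention (hypothetical): upper_bounds[i] is the exclusive upper
    bound of shard i, and the last shard has no upper bound.
    """

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)

    def shard_for(self, key):
        # bisect_right counts the boundaries that are <= key, which is exactly
        # the index of the shard whose half-open range contains the key.
        return bisect.bisect_right(self.upper_bounds, key)

    def num_shards(self):
        return len(self.upper_bounds) + 1
```

With boundaries 100 and 200, for example, keys below 100 fall in shard 0, keys from 100 up to but not including 200 fall in shard 1, and all remaining keys fall in shard 2.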
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Park et al. (U.S. Patent No. 11,531,666 B1) and GUPTA et al. (U.S. Pub. No. 2023/0205757 A1), further in view of Dudami et al. (U.S. Patent No. 10,846,307 B1).
Regarding claim 12, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein the processor is configured to group, in the memory, data from at least some of the plurality of data objects that share a native format.
Dudami teaches: wherein the processor is configured to group, in the memory, data from at least some of the plurality of data objects that share a native format (col. 3, line 53-64, the data lake is a data repository that store large amounts of raw data in the native format of the raw data until that data is needed by other entities; also see col. 4, line 24-40, obtains raw data from the electronic data source 102; by raw data, it is meant data in its native format that has not been transformed into a different format; when the first metadata element matches the raw data 106, the raw data 106 is streamed into a first data storage structure in the data lake 108; by ‘match’, it is meant whether the incoming raw data has all that is required (e.g., all the required fields, elements, or identifier) by metadata…; noted, raw data [native format data object] that ‘match’ are streamed to specific data storage structure; thus it indicates that raw data [native format data object] with shared native format [e.g., all the required fields, elements, or identifier] that are stored together, which reads on wherein the processor is configured to group, in the memory, data from at least some of the plurality of data objects that share a native format as claimed).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein the processor is configured to group, in the memory, data from at least some of the plurality of data objects that share a native format into the data lake system of Sundaram.
Motivation to do so would be to include wherein the processor is configured to group, in the memory, data from at least some of the plurality of data objects that share a native format to overcome the difficulty of efficiently retrieving and performing operations on the data (Dudami, col. 1, line 14-16).
Claims 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Park et al. (U.S. Patent No. 11,531,666 B1) and GUPTA et al. (U.S. Pub. No. 2023/0205757 A1), further in view of Tormasov et al. (U.S. Pub. No. 2020/0174893 A1).
Regarding claim 13, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein one or more of the plurality of data partitions comprise a plurality of files, and the processor is further configured to: determine that the files contain less than a threshold amount of data; and combine the plurality of files into a single file.
Tormasov teaches: wherein one or more of the plurality of data partitions comprise a plurality of files, and the processor is further configured to: determine that the files contain less than a threshold amount of data; and combine the plurality of files into a single file (paragraph [0045], in response to determining that the file size of the file is less than a file-size threshold,…may combine the file with other files into a data blob…).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein one or more of the plurality of data partitions comprise a plurality of files, and the processor is further configured to: determine that the files contain less than a threshold amount of data; and combine the plurality of files into a single file into the data lake system of Sundaram.
Motivation to do so would be to include wherein one or more of the plurality of data partitions comprise a plurality of files, and the processor is further configured to: determine that the files contain less than a threshold amount of data; and combine the plurality of files into a single file so that data is packed into blobs for efficient storage (Tormasov, paragraph [0002]).
Regarding claim 14, Sundaram as modified by Park and Tormasov teach all claimed limitations as set forth in rejection of claim 13, further teach: wherein the threshold amount of data is 100 megabytes (MB) (Tormasov, paragraph [0045], in response to determining that the file size of the file is less than a file-size threshold,…may combine the file with other files into a data blob…; … file-size threshold is 100 MB).
Claims 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Park et al. (U.S. Patent No. 11,531,666 B1) and GUPTA et al. (U.S. Pub. No. 2023/0205757 A1), further in view of Wilson et al. (U.S. Pub. No. 2020/0301941 A1).
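The threshold test and combining step recited in claims 13-14 can be illustrated with a short Python sketch. It is a simplified in-memory model under stated assumptions: files are represented as a mapping of name to bytes, 100 MB is taken as 100 × 1024 × 1024 bytes, and the function name and the name given to the combined file are hypothetical. It is not the blob-packing procedure of Tormasov.

```python
# Assumption: 100 MB interpreted as 100 * 1024 * 1024 bytes.
THRESHOLD_BYTES = 100 * 1024 * 1024


def combine_small_files(files, threshold=THRESHOLD_BYTES, merged_name="combined"):
    """Combine every file smaller than the threshold into a single file.

    files maps a file name to its contents as bytes. Files at or above the
    threshold are returned unchanged. If fewer than two files are below the
    threshold there is nothing to combine and a copy of the input is returned.
    """
    small = {name: data for name, data in files.items() if len(data) < threshold}
    if len(small) < 2:
        return dict(files)
    # Concatenate the small files in name order so the result is deterministic.
    merged = b"".join(small[name] for name in sorted(small))
    result = {name: data for name, data in files.items() if name not in small}
    result[merged_name] = merged
    return result
```

A real system would also need to record where each original file begins inside the combined file; that bookkeeping is omitted here.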
Regarding claim 17, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein the processor is configured to: receive, from the communication network, from the client device, a query for metadata about a set of data, the metadata stored in the partition index.
Wilson teaches: wherein the processor is configured to: receive, from the communication network, from the client device, a query for metadata about a set of data, the metadata stored in the partition index (paragraph [0122], information about files in data lake, including user-specified information (e.g., bucket/directories, meta-data, etc.) and/or information included in the directory structure, files, and/or file names, can be used to build a partition specification…; the database system can use the partitions to improve a data lake query; for example, if the query includes a data aspect used to build the partitions (e.g., a data or time field of a virtual collection that is inherited from the file names), the database can leverage such information in the partition to go to the desired file).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein the processor is configured to: receive, from the communication network, from the client device, a query for metadata about a set of data, the metadata stored in the partition index into the data lake system of Sundaram.
Motivation to do so would be to include wherein the processor is configured to: receive, from the communication network, from the client device, a query for metadata about a set of data, the metadata stored in the partition index to instantiate one or more computing resources to process the query using the virtual collection to generate the response to the query (Wilson, paragraph [0039], line 10-12).
Sundaram as modified by Park, GUPTA and Wilson further teach:
execute the query by reading the metadata from the partition index without accessing a data partition of the plurality of data partitions (Wilson, paragraph [0122], information about files in data lake (including user-specified information (e.g., bucket/directories, meta-data, etc.) and/or information included in the directory structure, files, and/or file names) can be used to build a partition specification…; the database system can use the partitions to improve a data lake query; for example, if the query includes a data aspect used to build the partitions (e.g., a data or time field of a virtual collection that is inherited from the file names), the database can leverage such information in the partition to go to the desired file; also see paragraph [0017], a partition associated with a portion of a query using a range of a field; also see paragraph [0090], identifying partitions; a partition can be determined and/or specified based on a value of a field, a range of values of a field…; also see paragraph [0093], if a query selects or filters documents with an asOfDateTime field in this range, n, the partition specification provides information allowing agent servers to filter or prune data that is read such that the portion of the query can be executed using this particular partition file…);
and transmit, through the communication network, to the client device, the metadata (Wilson, paragraph [0122], information about files in data lake (including user-specified information (e.g., bucket/directories, meta-data, etc.) and/or information included in the directory structure, files, and/or file names) can be used to build a partition specification…; the database system can use the partitions to improve a data lake query; for example, if the query includes a data aspect used to build the partitions (e.g., a data or time field of a virtual collection that is inherited from the file names), the database can leverage such information in the partition to go to the desired file; also see paragraph [0124], the query service nodes can divide and conquer a query by processing data from the files within the customer S3 buckets; the results of the query service nodes can be merged to create a result set; the result set can be returned to the user).
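For illustration only, the limitation discussed above (answering a metadata query from the partition index without accessing any data partition) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Wilson, Sundaram, or the application: the names PartitionEntry, prune, and metadata_query, the integer timestamp field, and the example file paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionEntry:
    """One hypothetical partition-index entry: metadata only, no row data."""
    path: str       # location of the partition file in the lake (assumed)
    min_ts: int     # smallest timestamp value stored in the partition
    max_ts: int     # largest timestamp value stored in the partition
    row_count: int  # number of records stored in the partition


def prune(index, lo, hi):
    """Return the entries whose [min_ts, max_ts] range overlaps [lo, hi]."""
    return [e for e in index if e.max_ts >= lo and e.min_ts <= hi]


def metadata_query(index, lo, hi):
    """Answer a metadata query from the index alone; no partition file is opened."""
    hits = prune(index, lo, hi)
    return {
        "partitions": [e.path for e in hits],
        # Total rows in the candidate partitions (an upper bound on matching rows).
        "row_count": sum(e.row_count for e in hits),
    }


# Hypothetical index contents.
index = [
    PartitionEntry("s3://bucket/2023-01.parquet", 100, 199, 10),
    PartitionEntry("s3://bucket/2023-02.parquet", 200, 299, 20),
    PartitionEntry("s3://bucket/2023-03.parquet", 300, 399, 30),
]
result = metadata_query(index, 150, 250)
```

In this sketch the response is assembled entirely from index entries, so the third partition is pruned and none of the partition files is read.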
Regarding claim 18, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein: at least some of the data objects are stored in respective virtual collections.
Wilson teaches: wherein: at least some of the data objects are stored in respective virtual collections (paragraph [0009], virtual ‘collections’ of distributed data objects can be specified and queried in a manner that is directly analogous to querying collections in a document database system; also see paragraph [0055], the data lake can be implemented as a database with logical organizations of subsets of database data in virtual collections; also see paragraph [0059], the system builds virtual collections within the object data storage…; also see paragraph [0082], the storage configuration file can provide a flexible mapping of data, including combining individual objects, files and/or collections into virtual collections).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein: at least some of the data objects are stored in respective virtual collections into the data lake system of Sundaram.
Motivation to do so would be to include wherein: at least some of the data objects are stored in respective virtual collections to instantiate one or more computing resources to process the query using the virtual collection to generate the response to the query (Wilson, paragraph [0039], lines 10-12).
Sundaram as modified by Park, GUPTA and Wilson further teach:
the processor is further configured to: receive, through the communication network, from the client device, a second query on second target data in a first virtual collection (Wilson, paragraph [0009], virtual ‘collections’ of distributed data objects can be specified and queried in a manner that is directly analogous to querying collections in a document database system; also see paragraph [0127], a query comes in from a user to a query service node; the query service node determines the type of query is a data lake query, and determines the virtual collection for the data lake);
transmit, through the communication network, to a data storage system associated with the first virtual collection, information indicating the second query (Wilson, paragraph [0009], virtual ‘collections’ of distributed data objects can be specified and queried in a manner that is directly analogous to querying collections in a document database system; also see paragraph [0127], a query comes in from a user to a query service node; the query service node determines the type of query is a data lake query, and determines the virtual collection for the data lake; if the query can be divided, the query service node spreads the work across multiple query service nodes; the result(s) from each of the query service node(s) are aggregated to provide a set of results in response to the user query);
receive, through the communication network, from the data storage system, second response data obtained from executing the second query (Wilson, paragraph [0009], virtual ‘collections’ of distributed data objects can be specified and queried in a manner that is directly analogous to querying collections in a document database system; also see paragraph [0127], a query comes in from a user to a query service node; the query service node determines the type of query is a data lake query, and determines the virtual collection for the data lake; if the query can be divided, the query service node spreads the work across multiple query service nodes; the result(s) from each of the query service node(s) are aggregated to provide a set of results in response to the user query);
and transmit, through the communication network, to the client device, the second response data (Wilson, paragraph [0009], virtual ‘collections’ of distributed data objects can be specified and queried in a manner that is directly analogous to querying collections in a document database system; also see paragraph [0127], a query comes in from a user to a query service node; the query service node determines the type of query is a data lake query, and determines the virtual collection for the data lake; if the query can be divided, the query service node spreads the work across multiple query service nodes; the result(s) from each of the query service node(s) are aggregated to provide a set of results in response to the user query; also see paragraph [0124], the query service nodes can divide and conquer a query by processing data from the files within the customer S3 buckets; the results of the query service nodes can be merged to create a result set; the result set can be returned to the user).

Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Park et al. (U.S. Patent No. 11,531,666 B1) and GUPTA et al. (U.S. Pub. No. 2023/0205757 A1), further in view of Rahle (U.S. Pub. No. 2022/0067021 A1).
Regarding claim 23, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein the partition index stores statistics for data stored in the data partition, the statistics determined based on field values stored in the partition; and executing the query on the identified at least one data partition to obtain the response data comprises reading statistics stored in an entry associated with the data partition as at least a portion of the response data.
Rahle teaches: wherein the partition index stores statistics for data stored in the data partition, the statistics determined based on field values stored in the partition (Fig. 2, paragraph [0004], the system partitions time-based data items into segments based on one or more metadata criteria (e.g., product, status, deployment, environment, version, host, etc.); the system may then group the data items by time intervals, and calculate one or more statistical attributes for each of the segments within each of the time windows; this statistical data may be stored in association with the corresponding segment and time interval for access by one or more front end software applications); and executing the query on the identified at least one data partition to obtain the response data comprises reading statistics stored in an entry associated with the data partition as at least a portion of the response data (paragraph [0019], the system allows efficient identification [querying] and presentation [displaying] of statistical information for a particular time period from a large set of temporally ordered events; also see paragraph [0040], using the data insight system and method disclosed herein, statistical attributes for that same 1 minute time window that are stored in a data insights database may be quickly and efficiently accessed).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein the partition index stores statistics for data stored in the data partition, the statistics determined based on field values stored in the partition; and executing the query on the identified at least one data partition to obtain the response data comprises reading statistics stored in an entry associated with the data partition as at least a portion of the response data into the data lake system of Sundaram.
Motivation to do so would be to include wherein the partition index stores statistics for data stored in the data partition, the statistics determined based on field values stored in the partition; and executing the query on the identified at least one data partition to obtain the response data comprises reading statistics stored in an entry associated with the data partition as at least a portion of the response data such that the system allows efficient identification [querying] and presentation [displaying] of statistical information for a particular time period from a large set of temporally ordered events (Rahle, paragraph [0019]).
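For illustration only, the claim 23 limitation (statistics computed from field values, stored in a partition-index entry, and read back as part of the response) can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of Rahle or of the application: the names Stats, build_stats, merge_stats, and answer_from_index, the partition identifiers, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    """Hypothetical per-partition statistics kept in a partition-index entry."""
    count: int
    minimum: int
    maximum: int
    total: int


def build_stats(values):
    """Compute statistics once, from the field values stored in a partition."""
    return Stats(len(values), min(values), max(values), sum(values))


def merge_stats(entries):
    """Combine the statistics of several partitions without rereading their data."""
    return Stats(
        count=sum(s.count for s in entries),
        minimum=min(s.minimum for s in entries),
        maximum=max(s.maximum for s in entries),
        total=sum(s.total for s in entries),
    )


def answer_from_index(stats_index, partition_ids):
    """Serve an aggregate query by reading only the stored index entries."""
    return merge_stats([stats_index[p] for p in partition_ids])


# Build time: the statistics are derived from the field values in each partition.
partitions = {"p1": [3, 7, 5], "p2": [10, 2]}
stats_index = {pid: build_stats(vals) for pid, vals in partitions.items()}

# Query time: only stats_index is consulted; `partitions` is not touched again.
merged = answer_from_index(stats_index, ["p1", "p2"])
mean = merged.total / merged.count
```

Storing count and total rather than a precomputed mean is a design choice of this sketch: it lets entries from several partitions be merged exactly.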

Claim 24 is rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Park et al. (U.S. Patent No. 11,531,666 B1) and GUPTA et al. (U.S. Pub. No. 2023/0205757 A1), further in view of Barber et al. (U.S. Pub. No. 2013/0325901 A1).
Regarding claim 24, Sundaram as modified by Park and GUPTA teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: the at least one identified data partition stores a first portion of data in a first storage format and a second portion of data in a second storage format different from the first storage format.
Barber teaches: the at least one identified data partition stores a first portion of data in a first storage format and a second portion of data in a second storage format different from the first storage format (paragraph [0043], assign the data values 120 to partitions based on a representation or encoding format; assigns the data values to a first partition and a second partition 118; the first partition includes data values encoded in a first encoding format and the second partition may include unencoded data values; assign data values to a third partition which includes data values encoded in a second encoding format).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the at least one identified data partition stores a first portion of data in a first storage format and a second portion of data in a second storage format different from the first storage format into the data lake system of Sundaram.
Motivation to do so would be to include the at least one identified data partition stores a first portion of data in a first storage format and a second portion of data in a second storage format different from the first storage format to overcome the issue that a particular encoding configuration can have a negative impact on query performance (Barber, paragraph [0003], lines 10-11).
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEN HOANG whose telephone number is (571)272-8401. The examiner can normally be reached M-F 7:30am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Charles Rones can be reached at (571)272-4085. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KEN HOANG/ Examiner, Art Unit 2168
