Fine Grained ID List Size


Overview

With very large databases, some queries go through a lot of work to build huge ID lists for filter components with many matching IDs. For example, a search for “(&(objectclass=inetorgperson)(uid=foo))” may build a huge idlist for “objectclass=inetorgperson” only to throw it away to intersect it with “uid=foo”. In these cases, it would be useful to be able to tell the indexing code to use a different idlistscanlimit for certain indexes, or use no idlist at all. In the above case, it would be useful to tell the indexing code to skip building an idlist for objectclass=inetorgperson, but still use the default idlistscanlimit for other objectclass searches (e.g. objectclass=groupOfNames).

Another example of this problem is https://github.com/389ds/389-ds-base/issues/811 - if there are several million IDs for each of the objectclass= filter components, being able to skip id list generation for the objectclass values would make that query very fast.

Unfortunately, nsslapd-idlistscanlimit is a blunt instrument - it applies to all indexes in the database. We currently have no way to specify a different idlistscanlimit for different types of search filters.

In these cases, we have a search filter which has an attribute with a large number of matching values, and another attribute with a very small number of matching values. We need to have the ability to control the max ID list size for a given index, including the type of index (eq, sub), and even the equality value - a different max ID list size for specific values of objectclass - for example, in the above, a size of 0 for objectclass=inetorgperson, but use the default for objectclass=groupOfNames.

Use Cases

Reduce large ID lists that are thrown away

If there are 10 million entries that match objectclass=inetorgperson, then a search for (&(objectclass=inetorgperson)(uid=foo)) will build an ID list containing 10m IDs for objectclass=inetorgperson, then throw it away to intersect it with uid=foo. If there are many simultaneous searches like this, the amount of CPU and RAM resources consumed by the server is great, and the performance suffers.

Allow certain very large ID lists

In some cases, the user may want to allow a very large ID list for some searches, in order that they are indexed and fast, but not increase the nsslapd-idlistscanlimit for all searches. For example, an application wants to be able to search for users by pattern (substring), and wants that search to be as fast as possible (i.e. indexed, not look through).

The user wants the search for (sn=sm*) to be very fast (indexed), and so wants to have a very high idlistscanlimit for the “sn” “sub” index (there may be hundreds of thousands of users that match “sm*” - “smith”, “small”, etc.), but wants to have a very low idlistscanlimit for other searches.

Design

We cannot reuse or re-purpose nsslapd-idlistscanlimit, so the proposed solution is to add a new allowed attribute to the nsIndex objectclass: nsIndexIDListScanLimit

nsIndexIDListScanLimit: limit=NNN [type=eq[,sub,...]] [flags=AND[,XXX..]] [values=val[,val...]]

This is a multi-valued, DirectoryString syntax attribute, which is allowed (optional) for the nsIndex objectclass.

because “abc” is not integer syntax.

the commas in the DN values are replaced with ”\2C

For example, in the objectclass case, you might want to have a maxsize of 0 for inetorgperson:

dn: cn=objectclass,cn=index,...
objectclass: nsIndex
nsIndexType: eq
nsIndexIDListScanLimit: limit=0 type=eq values=inetorgperson

This means, build no ID list (“0”) for “objectclass=inetorgperson”, but use the default nsslapd-idlistscanlimit value for other objectclass=$value searches.

Or, to apply the limit only to the use of objectclass=inetorgperson in AND clauses

dn: cn=objectclass,cn=index,...
objectclass: nsIndex
nsIndexType: eq
nsIndexIDListScanLimit: limit=0 type=eq flags=AND values=inetorgperson

This means, build no ID list (“0”) for “objectclass=inetorgperson” when used in an AND clause like (&(objectclass=inetorgperson)(uid=someuser)), but use the default nsslapd-idlistscanlimit value for other objectclass=$value searches, like (objectclass=inetorgperson) or in an OR clause like (|(objectclass=inetorgperson)(objectclass=posixAccount))

Or, for a substring example:

dn: cn=givenName,cn=index,...
objectclass: nsIndex
nsIndexType: sub
nsIndexIDListScanLimit: limit=-1 type=sub values=ab*

This means, use an unlimited max ID list size for (givenname=ab*) but use the default for other givenname substring searches.

Implementation

The struct attrinfo used to store per-attribute index information has been extended with a new member ai_idlistinfo which holds the parsed nsIndexIDListSizeLimit values for the index. This is implemented as a DataList object. ai_idlistinfo is only created if there is at least one valid nsIndexIDListSizeLimit. There is much additional parsing code added to ldbm_attr.c. The list of values is first checked for syntax, then converted to index keys (normalized) and stored as a Slapi_ValueSet. This allows fast value lookup during the indexing.

NOTE: There is no result checking for dynamic index setting over LDAP. The client has no way to know if the change was valid or not. The only way to tell currently is to look at the errors log for errors. Tests should be implemented to examine the errors log.

The server already had support for using a different idlistscanlimit (called allidslimit in the index code) for each lookup. An additional function was inserted into index_read_ext_allids() - index_get_allids(). When an index lookup happens, the values from the search filter are converted to keys in the same format. If the index specified nsIndexIDListSizeLimit with values, each index value is looked up in the Slapi_ValueSet to find a match. The index type (eq, sub) are also compared, and the flags. Each match is scored, and the idlistscanlimit corresponding to the best match is returned.

Major configuration options and enablement

If nsIndexIDListScanLimit is not specified for an index, the nsslapd-idlistscanlimit value for the database is used.

The nsIndexIDListScanLimit setting is fully dynamic - live changes take place immediately.

Replication

There should be no impact on replication.

Updates and Upgrades

There should be no impact on update/upgrade.

Dependencies

This will not change any dependencies.

External Impact

Documentation. We will need to document the new attribute in the various guides to which it is applicable, and in the Release Notes. We will need to advertise this to other groups heavily using 389 such as DogTag and FreeIPA.

Last modified on 1 March 2024