I've been working lately on some performance improvement of one of the queries that we have exposed as oData endpoint (it's a CalcView that basically calls SQL Function).
My data has following structure:
-- 75m records CREATE COLUMN TABLE PERSON_SEARCH ( "PERSON_ID" NVARCHAR(40), "FIRST_NAME" NVARCHAR(500) FUZZY SEARCH INDEX ON, "LAST_NAME" NVARCHAR(500) FUZZY SEARCH INDEX ON ); CREATE INDEX idx_person_first_name ON PERSON_SEARCH (FIRST_NAME); CREATE INDEX idx_person_last_name ON PERSON_SEARCH (LAST_NAME); -- 23m records CREATE COLUMN TABLE PERSON ( "ID" NVARCHAR(40), "FIRST_NAME" NVARCHAR(500), "LAST_NAME" NVARCHAR(500), "YEAR_OF_BIRTH" SMALLINT, PRIMARY KEY ("ID") ); -- 23m records CREATE COLUMN TABLE PERSON_TO_ADDRESS ( "ADDRESS_ID" NVARCHAR(40) NOT NULL, "PERSON_ID" NVARCHAR(40) NOT NULL ); CREATE INDEX idx_person_to_address_address_id ON PERSON_TO_ADDRESS (ADDRESS_ID); CREATE INDEX idx_person_to_address_person_id ON PERSON_TO_ADDRESS (PERSON_ID); -- 26m records CREATE COLUMN TABLE ADDRESS_DETAILS ( "ID" NVARCHAR(40) NOT NULL, "COUNTRY_CODE" NVARCHAR(2) FUZZY SEARCH INDEX ON, "CITY" NVARCHAR(255) FUZZY SEARCH INDEX ON, "ZIP" NVARCHAR(255) FUZZY SEARCH INDEX ON, "STREET" NVARCHAR(1000) FUZZY SEARCH INDEX ON, PRIMARY KEY ("ID") ); CREATE INDEX idx_address_details_street ON ADDRESS_DETAILS (STREET); CREATE INDEX idx_address_details_city ON ADDRESS_DETAILS (CITY); CREATE INDEX idx_address_details_country_code ON ADDRESS_DETAILS (COUNTRY_CODE); CREATE INDEX idx_address_details_zip ON ADDRESS_DETAILS (ZIP); -- 29m records CREATE COLUMN TABLE PERSON_FUNCTION ( "PERSON_ID" NVARCHAR(40) NOT NULL, "FUNCTION_ID" NVARCHAR(255), "FUNTION_DESCRIPTION" NVARCHAR(255)); CREATE INDEX idx_person_function_person_id ON PERSON_FUNCTION (PERSON_ID);
and there is an additional VIEW that does some concatenation etc for table with following structure:
named V_PERSON_SCORE that returns PERSON_ID, SCORE and SCORE_CATEGORY (depending on some aggregations - possible values are LOW, MEDIUM, HIGH or NULL) -- 65k records CREATE COLUMN TABLE PERSON_SCORE ( "PERSON_ID" NVARCHAR(36) NOT NULL, "PARTIAL_SCORE" DOUBLE CS_DOUBLE NOT NULL );
and my query looks like following:
SELECT DISTINCT person.id AS id, person.first_name AS first_name, person.last_name AS last_name, address.city AS city, address.country_code AS country_code, address.street AS street, address.zip AS zip, person.year_of_birth AS year_of_birth, function.function_description AS function_description, CASE WHEN ROUND(MAX(COALESCE(p_score.score,0)), 3) < 0.200 THEN 0 ELSE 1 END AS per_score, MAX(TO_DECIMAL(SCORE(),3,2)*100) AS score FROM PERSON_SEARCH AS person_history INNER JOIN PERSON AS person ON person.id = person_history.person_id INNER JOIN PERSON_TO_ADDRESS AS person_to_address ON person_to_address.entity_id = person.id INNER JOIN ADDRESS_DETAILS AS address ON address.id = person_to_address.address_id INNER JOIN PERSON_FUNCTION AS function ON function.person_id = person.id LEFT JOIN V_PERSON_SCORE AS p_score ON person.id = p_score.node_id WHERE CONTAINS(person_history.first_name, :ip_first_name, FUZZY(0.7, 'similarCalculationMode=substringsearch')) AND CONTAINS(person_history.last_name, :ip_last_name, FUZZY(0.7, 'similarCalculationMode=substringsearch')) AND CONTAINS(address.street, :ip_street, FUZZY(0.7, 'similarCalculationMode=symmetricsearch')) AND CONTAINS(address.city, :ip_city, FUZZY(0.7, 'similarCalculationMode=symmetricsearch')) AND CONTAINS(address.country_code, :ip_country_code) AND CONTAINS(address.zip, :ip_zip) AND COALESCE(person.year_of_birth, 1) BETWEEN 0 AND 9999 AND CONTAINS(function.function_id, :ip_functions) GROUP BY person.id, person.first_name, person.last_name, address.city, address.country_code, address.street, address.zip, person.year_of_birth, function.function_description, p_score.score, SCORE() ORDER BY SCORE() DESC;
ip_functions - BOSS, OWNER, EMPLOYEE, RETIREE - which are the functions that we look for. passed in format 'BOSS OR OWNER'
I've added indexes on all recommended columns that were listed by plan generated by Analyze SQL and Save Plan (WebIDE).
Generally speaking performance looks very good when we give any of parameters: ip_first_name, ip_last_name, ip_street, ip_zip but when those are missing and we give anything from lt_p_score or country_code performance looks very bad.
- ip_city as 'Berlin', ip_functions as 'EMPLOYEE OR BOSS OR OWNER' there are 152497 records returned in 1687ms.
- ip_zip as '60306', ip_functions as 'EMPLOYEE OR BOSS OR OWNER' there are 29 results returned in 154ms.
- ip_last_name as 'Mark', ip_functions as 'EMPLOYEE OR BOSS OR OWNER'there are 201561 records returned in 1925ms.
and those times (even if not perfect are fine for me).
Problem starts when:
- country_code as 'DE', ip_functions as 'EMPLOYEE OR BOSS OR OWNER' there are 4282557 records returned in 33283ms.
- country_code as 'DE', ip_functions as 'EMPLOYEE' there are 3142952 records returned in 32100ms
which are quite problematic in the point of view of the user.
Is there any way to improve the performance of my statement?
I was thinking on creating one table that will hold all of the data and will be refreshed from time to time (1x per 1h for example) but it seems completely redundant and there probably is a better way to do it, just i don't see it ...
Would be really grateful for any help or feedback about what could be improved here.