Translation of article BaseGroup Labs — Нечеткие запросы к реляционным базам данных (Nikolay Palkin)
[D R A F T]
Mechanisms of fuzzy queries (fuzzy queries, flexible queries) to relational databases are based on the Zadeh theory of
fuzzy sets, first proposed in 1984 and subsequently developed in works of Dubois, Prada.
Why do we need it
Most of the data processed in the modern information systems are definite and has numerical character. However, often there are inaccuracies and uncertainties in the queries to database formulated by human. Usual Internet user got a lot of references to documents, ordered by relevance (or correspondence) at search system request. Because of textual information originally inherit lack of clarity and uncertainty, the causes of which is a semantic ambiguity of language, existence of synonyms, etc.
There is different situation with information systems databases, or with clear databases (Crisp Databases). For example, you need to extract following information from database:
"Got list of young workers with low salary"
"Find proposal for cheap flat close to city"
Here the expression “Young”, “Low”, “Cheap”, “Close to” are vague, imprecise, even though salary is determined and the distance from apartment to center — with an accuracy of up to a kilometer. The cause of it is that, in real life we operate uncertain, inaccurate categories. Such queries is impossible to execute using SQL language. So here as aid comes concept of fuzzy queries.
Lets demonstrate the limitations of definite queries in following example. Let it be required to obtain information about managers on sales who is not older than 25 years and have annual transactions amount exceeded 200 000 on particular region. This request can be written in SQL language as follows:
SELECT FIO FROM Managers
WHERE (Managers.Age <= 25 AND Managers.Sum > 200000 AND Managers.RegionID = 1)
So, 26-years old sales manager with annual sum of sales in 400 000, or 19-years old sales manager with total of 198 000 will not appear in result query, but their characteristics almost meet the requirements.
Fuzzy queries help to cope with such problems.
How does it work
The mechanism of the fuzzy queries based on the theory of fuzzy sets, basic details of which outlined in article
BaseGroup Labs — Нечеткая логика – математические основы
(soon, you can find my translation here — Fuzzy logic – mathematical foundations).
Look at the most common methods of generation of new linguistic terms on the basis of the basic terms-set.
This is useful for a variety of semantic structures that enhance or weaken statements, such as: “a very high price”, “about middle age” and etc.
For this purpose there are linguistic modifiers (linguistic hedges) for strengthening or weakening a statement.
“Very” is enhance modifier, weaken modifiers are: “more-or-less”, “approximately”, “almost”, and other fuzzy sets described by membership functions of the form:
For example, let’s formalize unclear concept of “the age of the employee of the company”. This will be the name of the corresponding linguistic variable. Let’s define its domain of definition
X = [18; 70] and three linguistic terms: “Young” (“Молодой”), “Middle” (“Средний”), “Upper middle” (“Выше среднего”). At last we must build membership functions for each linguistic term.
We chose trapezoidal membership functions with such coordinates:
- "Young" = [18, 18, 28, 34],
- "Middle" = [28, 35, 45, 50],
- "Upper middle" = [42, 53, 60, 60].
Now we can calculate level of appertaining some 30-years old employee to each of fuzzy sets:
- MF["Upper middle"](30)=0.
Main requirement for the membership functions — the value of membership function should be greater than zero for at least for one of the linguistic term.
In conclusion, we define the operation of fuzzy negation (NOT):
Information above is enough to build and run fuzzy queries.
Go back to example with sales managers.
For simplicity we suppose that all required information is located in one table with the following fields:
ID— employee number
SUM— annual sales amount
Linguistic variable “age” was created before. Let’s define other linguistic variable for field SUM with its domain of definition
X = [0; 600000] and linguistic terms “Small”, “Middle”, “Big” and build membership functions in the same way:
- "Small" = [0, 0, 0, 200000],
- "Middle" = [90000, 180000, 265000, 330000],
- "Big" = [300000, 420000, 600000, 600000].
To such kind of tables we can do fuzzy requests. For example, we can get list of young sales managers with big annual sales amount. So, on SQL-like language it will look like:
SELECT * FROM Managers WHERE (Age = "Молодой" AND Sum = "Большая")
After calculating agregated function value of beloning to MF for each record, we’ll get result of fuzzy query:
Records 1, 2, 5 are omitted by query because their value of membership functions equals zero. There is no records which exactly correspond to our query (
MF=1). 28-years old sales manager with 398 000 annual sales amount matches query and got 0.82 value of membership function. Practically, some gain-value is used for membership function. When value of membership function is grater than gain-value, its record adds to result.
Similar crisp (not fuzzy) request could be formulated as:
SELECT * FROM Managers WHERE (Age <= 28 AND Sum >= 420000)
Its result is empty. But if we expand bounds of age in query, we are in risk to miss other employees with little bigger or little smaller age. So we can conclude, that fuzzy queries gives us ability to expand boundaries of search according to initially defined constraints (chosen by human).
When we use fuzzy modificators, we can format more difficult queries like:
SELECT * FROM Managers WHERE (Age = "more-or-less Middle" AND Sum = "Middle")
And got result:
Often, we must operate not by linguistic variables, but fuzzy analogs of crisp values. For that reason exists “NEAR” relation (Example: price is near 20). To implement such relation we must build fuzzy set with respective membership function, but this time, on some relative range (Example:
[-5; 5]) — to avoid context depending. When we calculate value of membership function for relation “Near Q” (where Q — is some crisp number) we must scale it to relevant range.
Let’s illustrate it on table with data about securities. Let us suppose that table consists of next fields:
PRICE— security price
RATIO— price-to-earnings ratio
AYIELD— average yield, %
Let us suppose that we must find securities for buy at track $150 with yield 15% and price-to-earnings ratio 11. This is equivalent to next query:
SELECT * FROM some_table
WHERE ((PRICE<=150) AND (RATIO=11) AND (AYIELD=15))
Result of such query will be empty.
So, let’s formulate such query in fuzzy aproach using relation “NEAR”:
SELECT * FROM some_table
WHERE ((PRICE = "Near 150") AND (RATIO= "Near 11") AND (AYIELD="Near 15"))
Let’s build fuzzy set for relation “NEAR” in relative range
[-5;5]. It’ll be trapeze with next coordinates
Let’s calculate value of fuzzy query “price near 250” for price 380. As a preliminary we defined domains for each linguistic variable:
PRICE -- [0; 1000],
RATIO – [0; 20],
AYIELD – [0; 20].
130 as difference between
250 we scale to range
[-5;5]. We got value
Than we apply fuzzy relation NEAR to each field PRICE, RATIO and AYIELD, calculate agregated value of membership function (using fuzzy “AND” operator), so we’ll got next result:
Disadvantage of fuzzy queries is subjectivity membership functions.
Application of fuzzy queries
Fuzzy queries prospectively to use in areas where the selection of information from databases is by using quality criteria and fuzzy formulated conditions, for example, Direct Marketing.
In direct marketing it is very important stage of selection of the target audience, which will apply different instruments of direct marketing. For example, a direct mail advertising, used in promotion of goods and services to organizations and private persons.
However, maximizing effect of direct mail requires careful selection of the recipients.
If selection of recipients will be liberal, unjustified expenses on direct marketing would increase, if it is too strict – will be lost number of potential clients.
For example, the company announces the promo action among its customers about new services using direct mailing. Marketing service has established, that the most interesting new kind of services will be man of middle age, fathers of families with an annual income above the average.
To get a list of recipients, there will be made the following query: select all of the males in the age from 40 to 55 years old, with at least 1 child, the annual income from 20 to 30 thousand dollars.
The exact criteria of the query can eliminate many potential clients: a 39-years old man, father of three children, with an income of 31 thousand is brought in the result of the query, though it is a potential consumer of the new service.
Similarly you can use fuzzy queries when choosing the tourist services, the selection of objects of the real estate.
Fuzzy query tool allows you to agree on the formal and informal requirements to the circle of potential clients and set intervals selection of potential customers as fuzzy sets. In such a case, clients, which do not meet any of the criteria can be selected from the database, if they have good results on other criteria s.
Author: Николай Паклин
Translated to English by Nazar Gerasymchuk (for Never Late)
- Дюбуа Д., Прад Г. Теория возможностей. Приложения к представлению знаний в информатике – М.: Радио и связь, 1990.
- Ribeiro R.A., Moreira A.M. Fuzzy Query Interface for a Business Database // International Journal of Human-Computers Studies, Vol. 58 (2003), PP. 363-391.
- Dubois D., Prade H. Using Fuzzy Sets in Database Systems: Why and How? // Proc. of 1996 Workshop on Flexible Query-Answering systems (FQAS’96), Denmark, May 22-24, 1996, PP. 89-103.
- Смолко Д.С.,Черноруцкий И.Г. Система поддержки принятия решения для портфеля ценных бумаг // Сборник докладов I Международной конференции по мягким вычислениям и измерениям (SCM-98), Санкт-Петербург, 1998, том 2, С. 231-234.