
Introduction

Each of the 94 members of the Legislative Assembly of the State of São Paulo (Alesp) is entitled to a General Office Expense Allowance and a Lodging Allowance, jointly referred to as the "office allowance." This right was granted by Article 11 of Resolution 783, of July 1, 1997 [1]. It is a monthly amount owed by the State so that representatives can be reimbursed for office operation and maintenance costs, lodging, and other expenses inherent to the full exercise of parliamentary activity.

These legally authorized expenses are grouped into 11 categories, including printing materials and services, consulting, fuel, car rental, lodging, and meals. In 2022, considering the maximum office-allowance limit of 1,250 São Paulo State Fiscal Units (Ufesp) [1] and the Ufesp value of R$ 31.97 [2], the monthly reimbursement ceiling per representative was R$ 39,962.50.

That year, the total amount committed to the office allowance was R$ 26,652,243.51 [3]. The sum was 24.43% higher than in 2021 (R$ 21,419,316.88 [3]) and lower than the amount recorded for 2023 (R$ 28,607,099.96 [3]). If this budget is indeed reached this year, it will be the first time the figure tops R$ 28.5 million since 2018.

These public funds are scrutinized by oversight bodies such as the State Court of Accounts and the São Paulo Public Prosecutor's Office, which often open proceedings to verify the legitimacy of reimbursements. One example is Investigative Proceeding 29.0001.0246360.2021-54 [4], initiated on May 5, 2022, concerning alleged misuse of the office allowance by Representative Murilo Felix, who supposedly paid rent on properties belonging to political allies that were never used. Another example is Criminal Case 0037174-14.2021.8.26.0000 [5], which points, among other issues, to reimbursement of expenses never incurred by Representative Rogério Nogueira.

Against this backdrop, the present work aims to be an instrument for evaluating expenses and detecting anomalies through unsupervised machine learning. Its objective is not to state categorically whether a given expense is fraudulent; its scope is to serve as a first-look tool by clustering the values.

Method

Exploratory analysis

The first step was to fetch data from Alesp's Open Data Portal [6], where XML files dating back to 2002 contain fields indicating the reference period ("Year," "Month") as well as information on both the representative ("Registration," "Representative") and the expense ("Vendor," "CNPJ," "Type," "Value"). To avoid ideological bias, representatives' names were ignored. Given the temporal nature of the expenses, "Year" and "Month" were used solely to deflate the amounts to December 31, 2022 according to the Broad Consumer Price Index (IPCA) [7]. The time dimension of the expenses was thus discarded.

Only meal and lodging expenses from 2018 to 2022 were included. Vendors with fewer than 20 expenses in the five-year span were excluded, because clustering requires a meaningful number of observations per vendor.
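This filtering step can be sketched with pandas (synthetic records; the column names are illustrative, not the portal's actual XML field names):

```python
import pandas as pd

# Synthetic stand-in for the Alesp expense records; real field names differ.
df = pd.DataFrame({
    "cnpj": ["11.111.111/0001-11"] * 25 + ["22.222.222/0001-22"] * 5,
    "value": [100.0] * 25 + [50.0] * 5,
})

# Amounts would additionally be deflated to December 31, 2022 using the
# IPCA series before clustering (omitted here).

# Keep only vendors with at least 20 expenses in the window, since
# clustering needs a meaningful number of points per CNPJ.
counts = df.groupby("cnpj")["value"].transform("size")
df_filtered = df[counts >= 20].reset_index(drop=True)
```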

K-Means clustering algorithm

A K-Means clustering algorithm was implemented to process these records. In short, K-Means partitions a data set into a predefined number of non-overlapping clusters [8]. Each data point is assigned to the cluster whose centroid (mean) is nearest.

Given a set of observations \(x = \{x_1, x_2, \ldots, x_n\}\), the algorithm divides the \(n\) observations into \(k\) (\(k \le n\)) sets \(S = \{S_1, S_2, \ldots, S_k\}\) so as to minimize the within-cluster sum of squares,

\[ \sum_{i = 1}^{k}\sum_{x \in S_i} \lVert x - \mu_i \rVert^2 \]

where

  • \(k\): number of clusters
  • \(S_i\): cluster \(i\)
  • \(x\): data point
  • \(\mu_i\): mean (centroid) of the points in \(S_i\)

Because our data set is univariate and the goal is anomaly detection,

  1. points are laid out according to their values;
  2. given the predefined number of clusters, centroids are calculated by minimizing squared distances;
  3. points near centroids form clusters;
  4. points outside clusters are considered anomalies.

However, K-Means imposes some requirements: prior determination of the number of clusters, a centroid-initialization method that seeks the global minimum rather than local minima, convergence criteria, and validation of results. These were addressed, respectively, with the elbow method, K-Means++, centroid-movement comparison, and validation via the silhouette method and the Davies-Bouldin index.

Elbow method

The algorithm must know the number of clusters in advance. The elbow method [9] provides that number by iterating over candidate numbers of clusters and computing, for each, the sum of squared distances between the data points and their centroids.

The logic is that increasing the number of clusters reduces intra-cluster distances, because points lie closer to their centroids. At some point the reduction becomes marginal: in a graph, the line drops sharply at first and then levels out, forming an "elbow." That point is an estimate of the ideal number of clusters.

Because the elbow method rests on visual inspection of a graph alone, it forgoes statistical support for its robustness. Schubert [10] shows the method applied to data sets with more or less visually cohesive clusters, in which the results are similar even for uniform data or for a single normal distribution. Problems associated with the elbow graph include the lack of a meaningful angle metric and the effect of axis scaling, both of which can alter a human reader's perception of the "elbow."

To mitigate such problems, a less subjective method could be used, such as the Variance Ratio Criterion (VRC). While the elbow method relies on the sum of squared distances between each point and its cluster centroid, the VRC measures the ratio of between-cluster dispersion to within-cluster dispersion [11]. Since our data set suggests neither uniformity nor a normal distribution, the elbow method was chosen.
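As an illustration, the elbow curve can be computed with a minimal univariate K-Means. This is a sketch on synthetic data: the quantile seeding and the second-difference heuristic for locating the bend are assumptions, not the article's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic univariate "expenses" with three clear price levels
x = np.sort(np.concatenate([rng.normal(50, 2, 40),
                            rng.normal(200, 5, 40),
                            rng.normal(800, 10, 40)]))

def kmeans_1d(x, k, tol=1e-4, max_iter=100):
    """Minimal Lloyd's algorithm for 1-D data, seeded at quantiles."""
    c = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(max_iter):
        labels = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        new_c = np.array([x[labels == i].mean() if np.any(labels == i) else c[i]
                          for i in range(k)])
        if np.max(np.abs(new_c - c)) < tol:
            break
        c = new_c
    labels = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
    return c, labels, float(np.sum((x - c[labels]) ** 2))

ks = list(range(2, 11))
inertias = [kmeans_1d(x, k)[2] for k in ks]

# One heuristic for locating the elbow: the k with the largest
# second difference of the inertia curve (sharpest bend).
best_k = ks[int(np.argmax(np.diff(inertias, 2))) + 1]
```

On this clearly separated example the bend falls at three clusters; on real expense data the curve is flatter and the choice correspondingly less certain.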

K-Means++

Defining the number of clusters, however, does not guarantee that the algorithm will find the best points to serve as centroids. When random initialization is used, in which the initial centroids are drawn at random from the data set, points very close to each other may be chosen. The algorithm's high sensitivity to initialization can then lead to a local rather than a global minimum, producing suboptimal partitions [12].

To overcome this limitation, this study used an initialization method called K-Means++ [13], in which each successive centroid is sampled with probability proportional to its squared distance from the centroids already chosen. Spreading the initial centroids apart in this way reduces the chance that K-Means converges to a suboptimal solution.

Given a set of points \(D\) and the set \(C\) of centroids selected so far, the probability of choosing point \(x\) as the next centroid is calculated by

$$ P(x) = \frac{D(x)^2}{\sum_{x^{\prime} \in D}D(x^{\prime})^2} $$

where \(D(x)\): distance between point \(x\) and the nearest centroid in \(C\).
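The sampling step can be sketched as follows (synthetic points; the first centroid is fixed here for reproducibility rather than drawn uniformly at random):

```python
import numpy as np

rng = np.random.default_rng(42)
points = np.array([1.0, 1.1, 1.2, 5.0, 5.1, 20.0])
centroids = np.array([1.0])  # first centroid (normally chosen at random)

# D(x): distance from each point to its nearest already-selected centroid
d = np.min(np.abs(points[:, None] - centroids[None, :]), axis=1)

# P(x) = D(x)^2 / sum over x' of D(x')^2
prob = d ** 2 / np.sum(d ** 2)

# The isolated point at 20.0 is by far the most likely next seed
next_centroid = rng.choice(points, p=prob)
```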

With centroids initialized, each point is assigned to the nearest centroid. These points form clusters. Considering point \(x\) and a set of centroids \(C\), the cluster label \(l\) to which \(x\) belongs is computed by

$$ l(x) = \arg \min_{c \in C}\Vert x - c \Vert $$

Next, each centroid is recalculated as the mean of all points assigned to it,

$$ c_i = \frac{1}{\vert S_i \vert}\sum_{x \in S_i} x $$

where \(S_i\): set of all points assigned to centroid \(i\).

At each centroid update iteration, inertia is computed. For a univariate set,

$$ \sum_{i=1}^{n}{\Vert {x_i} - {c_{l(x_i)}}\Vert}^2 $$

where \(c_{l(x_i)}\): centroid of the cluster to which \(x_i\) was assigned.

Advanced convergence criteria

In addition to K-Means++ initialization, the algorithm adopts a convergence criterion based on comparing centroid movement between iterations. Let \(c_i^{(t)}\) be the position of centroid \(i\) at iteration \(t\); the algorithm converges if

$$ \max_{i}\Vert c_i^{(t)} - c_i^{(t - 1)} \Vert < tol $$

where,

  • \(\Vert c_i^{(t)} - c_i^{(t - 1)} \Vert\): Euclidean distance moved by centroid \(i\) between iterations
  • \(tol\): specified tolerance
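The assignment, update, and convergence steps can be sketched for univariate data (synthetic values; the min/max seeding is a simplification, not the K-Means++ initialization):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(60, 3, 50), rng.normal(400, 10, 50)])

k, tol, max_iter = 2, 1e-4, 100
c = np.array([x.min(), x.max()])  # simple deterministic seeding for this sketch

for _ in range(max_iter):
    labels = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)   # l(x)
    new_c = np.array([x[labels == i].mean() for i in range(k)])   # c_i update
    moved = np.max(np.abs(new_c - c))                             # centroid movement
    c = new_c
    if moved < tol:                                               # convergence test
        break

labels = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
inertia = float(np.sum((x - c[labels]) ** 2))
```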

Validation by silhouette method

Validation of the results obtained from these techniques was performed, first, with the silhouette method [14]. This technique measures how similar a point is to its own cluster compared with the other clusters, from

$$ s_i = \frac{{b_i} - {a_i}}{\max({a_i},{b_i})} $$

where,

  • \(a_i\): average distance from \(i\) to all other intra-cluster points
  • \(b_i\): smallest average distance from \(i\) to all points in different clusters

The silhouette method returns results in the range -1 to 1. If the value is

  • close to -1: the point is clustered incorrectly;
  • close to 0: the point is between two clusters, so clustering can be improved;
  • close to 1: the point is well clustered.
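A minimal computation of the mean silhouette for univariate data (synthetic values; distances taken as absolute differences):

```python
import numpy as np

x = np.array([1.0, 1.2, 0.9, 10.0, 10.3, 9.8])  # two tight 1-D clusters
labels = np.array([0, 0, 0, 1, 1, 1])

def mean_silhouette(x, labels):
    """Average of s_i = (b_i - a_i) / max(a_i, b_i) over all points."""
    s = []
    for i in range(len(x)):
        same = labels == labels[i]
        same[i] = False  # a_i excludes the point itself
        a = np.abs(x[same] - x[i]).mean()
        b = min(np.abs(x[labels == l] - x[i]).mean()
                for l in np.unique(labels) if l != labels[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

score = mean_silhouette(x, labels)
```

On this well-separated example the score lands near 1, the "well clustered" end of the scale.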

Validation by Davies-Bouldin index

While the silhouette method compares a single point with the clusters, the Davies-Bouldin index [15], the second measure used in validating results, observes cluster cohesion and separation, following the logic that an adequate cluster is internally dense while distant from the other clusters.

The closer the index is to 0, the better the clustering; the result is obtained by

$$ \frac{1}{k}\sum_{i=1}^{k}\max_{i \ne j}\bigg(\frac{{S_i}+{S_j}}{M_{ij}}\bigg) $$

where,

  • \(k\): number of clusters
  • \(i\), \(j\): different clusters
  • \(S_i\), \(S_j\): internal dispersion of clusters \(i\) and \(j\), respectively
  • \(M_{ij}\): distance between clusters \(i\) and \(j\)
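A minimal computation of the index for univariate data (synthetic values; dispersion \(S_i\) taken as the mean absolute distance to the centroid):

```python
import numpy as np

x = np.array([1.0, 1.2, 0.9, 10.0, 10.3, 9.8])  # two tight 1-D clusters
labels = np.array([0, 0, 0, 1, 1, 1])

def davies_bouldin(x, labels):
    """(1/k) * sum_i max_{j != i} (S_i + S_j) / M_ij for 1-D data."""
    ks = np.unique(labels)
    c = np.array([x[labels == k].mean() for k in ks])       # centroids
    s = np.array([np.abs(x[labels == k] - c[i]).mean()      # S_i: dispersion
                  for i, k in enumerate(ks)])
    ratios = [max((s[i] + s[j]) / abs(c[i] - c[j])
                  for j in range(len(ks)) if j != i)
              for i in range(len(ks))]
    return float(np.mean(ratios))

index = davies_bouldin(x, labels)
```

Dense, well-separated clusters such as these yield an index close to 0.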

Results

An exploratory analysis was performed to understand the data and its dispersion. In the observed five-year period, there were 4,453 expense records across 86 unique CNPJ numbers, totaling R$ 1,784,601.08 after inflation adjustment. Each expense had an average value of R$ 400.76, but with a coefficient of variation of 241.41%, indicating significant data dispersion relative to the mean.

It was also noted that the mean is higher than the third quartile, indicating that values concentrate at the lower end of the range. The set thus presents a longer right tail than left, corroborated by a skewness of 5.21, while a kurtosis of 32.67 indicates a much sharper peak than a normal distribution.

| Measure | Value |
| --- | --- |
| Count | 4,453 |
| Mean (R$) | 400.763773 |
| Standard deviation (R$) | 967.469752 |
| Minimum (R$) | 6.49 |
| 1st Quartile (R$) | 55.75 |
| 2nd Quartile (R$) | 123.14 |
| 3rd Quartile (R$) | 276.18 |
| Maximum (R$) | 10,250.41 |
| Coefficient of variation (%) | 241.40648 |
| Skewness | 5.21061 |
| Kurtosis | 32.66851 |

Expenses were grouped by company to maintain spending behavior within the variability of values for each CNPJ. The present K-Means algorithm implementation processed information for each establishment following these parameters:

| Parameter | Value |
| --- | --- |
| Minimum number of clusters | 2 |
| Number of clusters used | 2 to 10, selected by elbow method |
| Maximum iterations | 100 |
| Tolerance for convergence | 0.0001 |
| Percentile for anomaly detection | 95 |

The result was 262 anomalies totaling R$ 197,697.24, or 11.08% of the total expense value. Anomalies are understood as patterns in data that do not conform to a well-defined notion of normal behavior [16]; in the context of this work, they are expense values that do not fit into the clusters created by the algorithm. By definition, not every anomaly can be treated as fraud: some anomalies sit in the middle of a company's expenses rather than among the highest values in the set. Such between-cluster anomalies are treated here as false positives.
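One plausible reading of the 95th-percentile parameter above can be sketched as follows (synthetic data; the centroids are fixed for brevity rather than fitted, and the exact rule used in the original implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(7)
# Two typical price levels for a vendor, plus one clearly atypical expense
x = np.concatenate([rng.normal(80, 3, 60), rng.normal(300, 8, 39), [1500.0]])

# Assume centroids and labels from a fitted K-Means (fixed here for brevity)
c = np.array([80.0, 300.0])
labels = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
dist = np.abs(x - c[labels])

# Flag points whose distance to their assigned centroid exceeds the
# 95th percentile of all such distances for this vendor
threshold = np.percentile(dist, 95)
anomalies = x[dist > threshold]
```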

Given the elbow-method selection and the K-Means++ initialization, there is considerable variability in the number of clusters: across the set of 86 companies it ranges from 2 to 10. These values were validated using the two aforementioned instruments:

  1. the silhouette method, whose acceptable results lie between 0.5 and 1 on a scale from -1 to 1;
  2. the Davies-Bouldin index, whose ideal results lie between 0 and 0.5 on a scale from 0 to 1.


Given the set of 86 companies, all recorded ideal results for the silhouette method (values between 0.577 and 0.918); 79 showed ideal results for the Davies-Bouldin index (values between 0.166 and 0.489), while seven showed below-ideal results (values between 0.508 and 0.573).

With expenses clustered, anomalies detected by the algorithm, and the applied methods validated, a final analysis considered as anomalies subject to oversight-body inquiry only those whose values exceed the largest non-anomalous value of the last cluster. This discards anomalies positioned between clusters; the result was 46 anomalies across 32 companies, totaling R$ 44,348.88.

Commented code
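The commented implementation itself is not reproduced in this extract. As a substitute, the following condensed sketch ties the pieces above together on synthetic data; quantile seeding stands in for K-Means++, the second-difference heuristic stands in for visual elbow inspection, and all names are illustrative.

```python
import numpy as np

def kmeans_1d(x, k, tol=1e-4, max_iter=100):
    """Lloyd's algorithm on 1-D data (quantile seeding as a stand-in
    for K-Means++ in this condensed sketch)."""
    c = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(max_iter):
        labels = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        new_c = np.array([x[labels == i].mean() if np.any(labels == i) else c[i]
                          for i in range(k)])
        if np.max(np.abs(new_c - c)) < tol:
            break
        c = new_c
    labels = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
    return c, labels, float(np.sum((x - c[labels]) ** 2))

def detect_anomalies(x, k_range=range(2, 11), pct=95):
    """Elbow-selected k (largest second difference of inertia), then
    flag points beyond the pct-th percentile of centroid distance."""
    ks = [k for k in k_range if k < len(np.unique(x))]
    inertias = [kmeans_1d(x, k)[2] for k in ks]
    best_k = ks[int(np.argmax(np.diff(inertias, 2))) + 1] if len(ks) > 2 else ks[0]
    c, labels, _ = kmeans_1d(x, best_k)
    dist = np.abs(x - c[labels])
    return x[dist > np.percentile(dist, pct)], best_k

# Synthetic expenses for a single CNPJ (invented for illustration)
rng = np.random.default_rng(0)
expenses = np.concatenate([rng.normal(60, 2, 30),
                           rng.normal(250, 5, 30),
                           rng.normal(900, 10, 30)])
anomalies, k = detect_anomalies(expenses)
```

In the study, this per-vendor routine would run once for each of the 86 CNPJs, with the validation metrics computed on every fitted clustering.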

References

  1. Legislative Assembly of the State of São Paulo [Alesp]. 1997. Resolution n. 783, of July 1, 1997. Amends Resolution n. 776, of October 14, 1996, which implemented the new administrative structure, creates the Quality Center and institutes the office allowance. Available at: https://www.al.sp.gov.br/repositorio/legislacao/resolucao.alesp/1997/original-resolucao.alesp-783-01.07.1997.html. Accessed: March 19, 2023.

  2. São Paulo State Government Treasury and Planning Secretariat. 2023. Indices. Available at: https://portal.fazenda.sp.gov.br/Paginas/Indices.aspx. Accessed: March 26, 2023.

  3. São Paulo State Government Treasury and Planning Secretariat. 2023. Budget and financial execution. Available at: https://www.fazenda.sp.gov.br/sigeolei131/paginas/flexconsdespesa.aspx. Accessed: March 19, 2023.

  4. São Paulo Public Prosecutor's Office. 2022. Electronic Information System. Available at: https://www.mpsp.mp.br/sei-sistema-eletronico-de-informacoes. Accessed: March 26, 2023.

  5. São Paulo State Court of Justice. 2023. E-SAJ. Available at: https://esaj.tjsp.jus.br/esaj/portal.do?servico=190090. Accessed: September 24, 2023.

  6. Legislative Assembly of the State of São Paulo. 2023. Open Data Portal. Available at: https://www.al.sp.gov.br/dados-abertos/recurso/21. Accessed: March 26, 2023.

  7. Brazilian Institute of Geography and Statistics. IPCA. Available at: https://www.ibge.gov.br/estatisticas/economicas/precos-e-custos/9256-indice-nacional-de-precos-ao-consumidor-amplo.html?=&t=series-historicas. Accessed: March 26, 2023.

  8. MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, Los Angeles, LA, United States, Proceedings… p. 281-297.

  9. Joshi, K.D.; Nalwade, P.S. 2012. Modified K-Means for better initial cluster centres. International Journal of Computer Science and Mobile Computing 7: 219-223.

  10. Schubert, E. 2023. Stop using the elbow criterion for k-means and how to choose the number of clusters instead. SIGKDD Explorations Newsletter 25: 36-42.

  11. Caliński, T.; Harabasz, J. 1974. A dendrite method for cluster analysis. Communications in Statistics 3: 1-27.

  12. Morissette, L.; Chartier, S. 2013. The K-Means clustering technique: General considerations and implementation in Mathematica. Tutorials in Quantitative Methods for Psychology 9: 15-24.

  13. Arthur, D.; Vassilvitskii, S. 2007. K-Means++: The advantages of careful seeding. Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms: 1027-1035.

  14. Rousseeuw, P.J. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20: 53-65.

  15. Davies, D.L.; Bouldin, D.W. 1979. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 2: 224–227.

  16. Chandola, V.; Banerjee, A.; Kumar, V. 2009. Anomaly detection: A survey. ACM Computing Surveys 41: 1-58.