Metric similarity analysis of compositional data for fuzzy search of relevant elastomeric mixture formulations
- Alexander A. Rybanov, Volzhsky Polytechnic Institute (Branch) of Volgograd State Technical University (Volzhsky, Russia)
- Victor F. Kablov, Volzhsky Polytechnic Institute (Branch) of Volgograd State Technical University (Volzhsky, Russia)
The article addresses the task of searching for and ranking rubber compound formulations with similar compositions in databases. The aim of the research is to develop and comparatively analyze similarity metrics adapted to quantify the proximity of multi-component compositional data represented as normalized vectors of ingredient weight fractions. The work gives a formal statement of the problem of identifying relevant formulations as the maximization of a comprehensive similarity function that accounts for both qualitative composition (presence of ingredients) and quantitative proportions. Four metrics are adapted for this purpose: the weighted Jaccard and Dice coefficients, Hellinger similarity, and cosine similarity. A theoretical analysis of their properties is supplemented by empirical validation on a real industrial database of 6,096 unique formulations. The scientific novelty of the study lies in the systematic adaptation of metric analysis to analogue search over compositional materials science data, and in showing that the considered similarity measures fall into distinct clusters. Unlike existing approaches that focus on binary representations of composition or on property prediction, the presented methodology directly targets precise search by both composition and proportions. The results show near-functional equivalence of the weighted Jaccard and Dice coefficients (correlation coefficient r = 0.991), which form one cluster of measures sensitive to the full set of components. Hellinger similarity and cosine similarity are strongly correlated (r = 0.883) and form a second cluster of measures that assess the structural similarity of proportions, with Hellinger similarity being more sensitive to variations in the fractions of minor ingredients.
On this basis, practical recommendations are given for combining one metric from each cluster when building search systems. The resulting metric framework provides a formal basis for intelligent analogue search, automated component selection, and shorter development times for new formulations in industry.
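As an illustration of the four measures named in the abstract, the following is a minimal Python sketch under stated assumptions. It uses common textbook definitions (weighted Jaccard as the sum of minima over the sum of maxima, weighted Dice as twice the sum of minima over the sum of both totals, Hellinger similarity as one minus the Hellinger distance, and the usual cosine similarity); the exact adaptations used in the article are not given in the abstract and may differ. The function names, the equal 0.5/0.5 weighting in `rank_formulations`, and the example ingredient amounts are hypothetical and chosen only for demonstration.

```python
import math


def normalize(formulation):
    """Convert ingredient amounts (e.g. phr) into weight fractions summing to 1."""
    total = sum(formulation.values())
    if total <= 0:
        raise ValueError("formulation must contain positive amounts")
    return {name: amount / total for name, amount in formulation.items() if amount > 0}


def weighted_jaccard(p, q):
    """Sum of component-wise minima divided by sum of component-wise maxima."""
    keys = set(p) | set(q)
    num = sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    den = sum(max(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


def weighted_dice(p, q):
    """Twice the sum of component-wise minima divided by the sum of both totals."""
    keys = set(p) | set(q)
    num = 2.0 * sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    den = sum(p.values()) + sum(q.values())
    return num / den if den else 0.0


def hellinger_similarity(p, q):
    """One minus the Hellinger distance, computed via the Bhattacharyya coefficient."""
    keys = set(p) | set(q)
    bc = sum(math.sqrt(p.get(k, 0.0) * q.get(k, 0.0)) for k in keys)
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 1.0 - math.sqrt(max(0.0, 1.0 - bc))


def cosine_similarity(p, q):
    """Cosine of the angle between the two fraction vectors."""
    dot = sum(p[k] * q[k] for k in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def rank_formulations(query, database, top_k=5):
    """Rank stored formulations by an assumed equal-weight mix of one metric per cluster."""
    q = normalize(query)
    scored = []
    for name, formulation in database.items():
        p = normalize(formulation)
        # Assumption: equal weights; the abstract only recommends combining the clusters.
        score = 0.5 * weighted_jaccard(q, p) + 0.5 * hellinger_similarity(q, p)
        scored.append((name, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical formulations, amounts in parts per hundred rubber (phr).
query = {"NR": 100, "carbon black": 50, "sulfur": 2, "ZnO": 5}
database = {
    "close_variant": {"NR": 100, "carbon black": 45, "sulfur": 2.5, "ZnO": 5},
    "different_base": {"SBR": 100, "silica": 60, "sulfur": 1.5},
}
ranking = rank_formulations(query, database)
```

Under these particular definitions, and for vectors normalized to sum to 1, the weighted Dice value D and the weighted Jaccard value J satisfy D = 2J / (1 + J), a monotonic relation. This is consistent with the near-functional equivalence of the two coefficients reported above, although it does not by itself reproduce the article's analysis.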
rubber compound formulations, search for analogues, similarity metrics, weighted Jaccard coefficient, Dice coefficient, Hellinger similarity, cosine similarity, compositional data, database, materials science
2026-06-05